

araleza commented Jul 1, 2025

Some people use the --train_batch_size command-line option without realizing that, because of image bucketing, many of their training steps run with a batch size of (e.g.) just 1. Many people probably don't know what image bucketing is, and so are unaware that there's a problem at all. Underfilled buckets can significantly damage batched training, reducing image quality.

This change introduces a clear warning when buckets are underfilled for the current batch size, where an underfilled bucket is one containing fewer images than half the batch size. For example:

[Screenshot: example of the underfilled-bucket warning output]

So with a batch size of 5, each bucket must contain at least 3 images to avoid triggering the warning. When training without batching (batch size 1), the warning never appears.
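The underfill rule above reduces to a one-line check. This is a minimal sketch, assuming the threshold is "strictly fewer images than half the batch size"; the function name `is_underfilled` is hypothetical and not taken from the PR's code. Comparing `2 * count` against `batch_size` keeps the test in integers, so odd batch sizes need no rounding.

```python
def is_underfilled(image_count: int, batch_size: int) -> bool:
    """Return True if a bucket holds fewer images than half the batch size.

    Hypothetical helper illustrating the rule described in this PR;
    2 * count < batch_size is equivalent to count < batch_size / 2.
    """
    return 2 * image_count < batch_size


if __name__ == "__main__":
    batch_size = 5
    for count in range(1, 7):
        status = "warn" if is_underfilled(count, batch_size) else "ok"
        print(f"bucket with {count} image(s), batch size {batch_size}: {status}")
```

With a batch size of 1, no non-empty bucket satisfies `2 * count < 1`, which matches the statement that the warning doesn't appear when batching isn't used.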

For those who don't know: a batch can only be formed from images within a single bucket. If a bucket contains only one image, the batch size for that training step will be 1, regardless of the --train_batch_size setting. (Gradient accumulation doesn't suffer from this limitation.)
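To illustrate why a small bucket forces a small batch, here is a minimal sketch of per-bucket batching. It assumes each bucket's images are simply split in order into chunks of `batch_size`; any per-epoch shuffling the real trainer does is omitted, and the bucket resolutions and filenames are hypothetical.

```python
from typing import Dict, List, Tuple

Bucket = Tuple[int, int]  # (width, height) bucket resolution


def batch_sizes_per_step(buckets: Dict[Bucket, List[str]], batch_size: int) -> List[int]:
    """Return the batch size of every training step in one epoch.

    Batches never mix buckets, so each bucket is chunked independently
    and its last chunk may be smaller than batch_size.
    """
    sizes: List[int] = []
    for images in buckets.values():
        for start in range(0, len(images), batch_size):
            sizes.append(len(images[start:start + batch_size]))
    return sizes


if __name__ == "__main__":
    buckets = {
        (1024, 1024): [f"square_{i}.png" for i in range(10)],
        (832, 1216): ["portrait_0.png"],  # a bucket holding a single image
    }
    print(batch_sizes_per_step(buckets, batch_size=5))  # [5, 5, 1]
```

The single-image bucket produces a step with a batch size of 1 even though `batch_size` is 5.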

This output is useful even for people who already understand bucketing, because it prints (the end of) an example image filename from each underfilled bucket. Some users notice that they have a bucket containing (e.g.) a single image, but can't tell which image it is. Searching by image size doesn't help either, because images are clipped down to multiples of the bucket resolution, so a bucket's dimensions usually don't match any original image size.
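The sketch below shows one way such a warning could be assembled, naming the tail of one example filename per underfilled bucket. The function name, message wording, and 40-character tail length are all hypothetical and do not reproduce the PR's actual output; it uses only Python's standard `logging` module.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("bucket_check")


def underfill_warnings(
    buckets: Dict[Tuple[int, int], List[str]],
    batch_size: int,
    tail_chars: int = 40,
) -> List[str]:
    """Build one warning line per underfilled bucket (hypothetical format)."""
    messages: List[str] = []
    for index, (reso, images) in enumerate(buckets.items()):
        if not images or 2 * len(images) >= batch_size:
            continue  # skip empty or adequately filled buckets
        example = images[0]
        # Keep only the end of the path so long directory prefixes don't hide the filename.
        tail = example if len(example) <= tail_chars else "..." + example[-tail_chars:]
        messages.append(
            f"bucket {index} {reso}: {len(images)} image(s) is less than half of "
            f"batch size {batch_size}; example: {tail}"
        )
    return messages


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    demo = {
        (1024, 1024): [f"/data/train/square_{i}.png" for i in range(8)],
        (640, 1536): ["/data/train/very/long/folder/name/tall_banner_01.png"],
    }
    for msg in underfill_warnings(demo, batch_size=5):
        logger.warning(msg)
```

Printing the end of the path rather than the start is a design choice: in deep dataset directories the distinguishing part is usually the filename itself.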

araleza (author) commented Jul 1, 2025

I've based this change on the SD3 branch, but I expect it to work for SDXL too, so it should probably go to main if you decide to merge it.

blackmagic24

Isn't the same true whenever the number of images in a bucket is not exactly divisible by the batch size?
The last step for that bucket will always be incomplete, correct?

Example from your screenshot: with a batch size of 5, bucket 20 has 6 images, so it yields one batch of 5 and one batch of 1.

Is that serious?

kohya-ss (owner)

> Isn’t it the same when the number of images in a bucket is not exactly divisible by the batch size?
> The last step will always be incomplete. Correct?

Your understanding is correct.

However, since the image selected for the last step is random each epoch, I don't think there will be a serious impact if you train for a reasonable number of epochs.
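The owner's point can be illustrated with a small simulation. This is a minimal sketch, assuming each bucket's images are shuffled uniformly at random every epoch before being split into batches; the filenames, seed, and epoch count are hypothetical. Over many epochs, each image in a 6-image bucket ends up alone in the short final batch about equally often, so no single image is consistently trained at batch size 1.

```python
import random
from collections import Counter
from typing import List


def last_batch_images(images: List[str], batch_size: int, rng: random.Random) -> List[str]:
    """Shuffle a bucket for one epoch and return the images in its final batch."""
    order = images[:]  # copy so the caller's list is left untouched
    rng.shuffle(order)
    remainder = len(order) % batch_size
    # If the bucket divides evenly, the final batch is a full one.
    return order[-remainder:] if remainder else order[-batch_size:]


if __name__ == "__main__":
    rng = random.Random(0)
    bucket = [f"img_{i}.png" for i in range(6)]  # 6 images, batch size 5, as in the example
    counts = Counter()
    for _ in range(600):  # simulate 600 epochs
        counts.update(last_batch_images(bucket, batch_size=5, rng=rng))
    print(sorted(counts.items()))  # each image should appear close to 100 times
```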
