Add Flash, cuDNN, Efficient attention for Flux #2045
base: sd3
Conversation
Can this be applied to SDXL as well? I see there is some SDPA mentioned in
If you wouldn't mind, could you please explain, for those less familiar with the subject, what impact setting `apply_t5_attn_mask` to True or False has on the final results of Flux LoRA training?
Flash Attention gives a phenomenal speed boost compared to cuDNN. But what are the potential side effects of `apply_t5_attn_mask=False`?
@iqddd Flux was supposedly trained without attention masks, so the padding was trained into the model. So perhaps the proper way is to not use attention masks (which is the default when the variable is not set).
@rockerBOO May I ask what led you to this conclusion? Is it based on your own reasoning, or are there specific facts or sources supporting it?
Maybe it's just a rumor, but they mentioned it being something they looked into for https://huggingface.co/lodestones/Chroma . I don't have proof, as it hasn't been discussed publicly. I have another idea that might help with the masking: we could apply the mask after flash attention completes. Not a performance improvement, but it might offer a middle ground.
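That "mask after attention" idea could be sketched roughly like this (a hypothetical helper, not the PR's actual code): run SDPA without an attention mask so the Flash kernel stays eligible, then zero the outputs at the padded T5 positions.

```python
import torch
import torch.nn.functional as F

def attn_then_mask(q, k, v, pad_mask):
    """Run unmasked SDPA, then zero outputs at padded positions.

    q, k, v:  (batch, heads, seq, head_dim)
    pad_mask: (batch, seq) bool, True = real token, False = padding
    """
    # No attn_mask argument, so the Flash kernel remains selectable.
    out = F.scaled_dot_product_attention(q, k, v)
    # Broadcast the mask over heads and head_dim, zeroing padded tokens.
    return out * pad_mask[:, None, :, None]
```

Note this only hides the padded outputs; the real tokens still attend to padding inside the kernel, which is the trade-off that makes it a middle ground rather than true masking.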
From HF (https://huggingface.co/lodestones/Chroma):
Well, the author claims that it's obvious, but to be honest, it's not at all obvious to me. :) If that's really the case, though, it's quite significant. Many inference environments actually cut off the padding part to speed up processing:
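As a hedged illustration of that trimming (hypothetical helper names, not taken from any specific inference stack): given the T5 attention mask, the padded tail of the embedding sequence can simply be sliced off before it reaches attention.

```python
import torch

def trim_t5_padding(t5_emb, attn_mask):
    """Drop trailing padded positions from T5 text embeddings.

    t5_emb:    (1, seq, dim) embeddings for a single prompt
    attn_mask: (1, seq) with 1 for real tokens, 0 for padding
    """
    n = int(attn_mask.sum())   # number of real tokens
    return t5_emb[:, :n, :]    # shorter sequence -> cheaper attention
```

If Flux really learned to rely on the padded positions, trimming like this would change the conditioning the model sees, which is why the question above matters.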
Using `sdpa_kernel`, it will pick the attention kernel based on what is available. `sdpa_kernel` is still in beta, so it would probably need some testing. I am currently trying Flash Attention 2.

- Flash Attention does not support attention masks (`apply_t5_attn_mask = false`).
- cuDNN may have issues, but you can enable it via `TORCH_CUDNN_SDPA_ENABLED=1`.
- Supported kernels may depend on your PyTorch and CUDA versions.
In priority order: