rockerBOO
Contributor

Using sdpa_kernel, PyTorch will pick the attention kernel based on what is available. sdpa_kernel is still in beta, so it would probably need some testing. I am currently trying Flash Attention 2.

Flash Attention does not support attention masks (so it requires apply_t5_attn_mask = false).
cuDNN may have issues, but you can enable it via TORCH_CUDNN_SDPA_ENABLED=1.

Supported kernels may depend on your PyTorch and CUDA versions.

In priority order:

FLASH_ATTENTION: The flash attention backend for scaled dot product attention.
CUDNN_ATTENTION: The cuDNN backend for scaled dot product attention.
EFFICIENT_ATTENTION: The efficient attention backend for scaled dot product attention.
MATH: The math backend for scaled dot product attention.

@6DammK9

6DammK9 commented Apr 22, 2025

Can it be applied to SDXL as well? I see SDPA mentioned in sdxl_original_unet.py, but not this kind of implementation.
Currently I'm doing multi-GPU full finetuning and suffering through NCCL issues (VRAM overhead, CPU-intensive all_reduce; DeepSpeed was tested and stalls heavily), so this may help.

@rockerBOO
Contributor Author

@6DammK9 I have added SDPABackend for SD and SDXL in #2061, which is based on the main branch. Flash Attention probably won't work there, though, because we are using an attention mask (Flash Attention doesn't support attention masks).

@iqddd

iqddd commented May 20, 2025

If you wouldn't mind, could you please explain, for those of us less familiar with the subject, what impact apply_t5_attn_mask = True/False has on the final results of Flux LoRA training?

@iqddd

iqddd commented May 20, 2025

Flash Attention gives a phenomenal speed boost compared to cuDNN. But what are the potential side effects of apply_t5_attn_mask=False?

@rockerBOO
Contributor Author

@iqddd Flux was supposedly trained without attention masks, so the padding was trained into the model. So maybe the proper way is to not use attention masks (which is the default when the variable is not set).

@iqddd

iqddd commented Jun 4, 2025

@rockerBOO May I ask what led you to this conclusion? Is it based on your own reasoning, or are there specific facts or sources supporting it?

@rockerBOO
Contributor Author

It may just be a rumor, but they mentioned it as something they looked into for https://huggingface.co/lodestones/Chroma . I don't have proof, as it hasn't been discussed publicly.

I have another idea that might help with masking: apply the mask after flash attention completes. It's not a performance improvement for the attention itself, but it might offer a middle ground.
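One possible reading of this post-hoc masking idea, as a minimal hedged sketch (all shapes and the token_mask are hypothetical, not from the PR): run SDPA without attn_mask so the flash kernel stays eligible, then zero the outputs at padded positions afterwards.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch 2, 4 heads, 6 tokens, head dim 8.
q = torch.randn(2, 4, 6, 8)
k = torch.randn(2, 4, 6, 8)
v = torch.randn(2, 4, 6, 8)

# 1 = real token, 0 = padding (hypothetical per-sequence mask).
token_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                           [1, 1, 0, 0, 0, 0]], dtype=torch.bool)

# No attn_mask passed, so the flash backend remains usable...
out = F.scaled_dot_product_attention(q, k, v)

# ...then zero the outputs at the padded query positions.
# Broadcasting (B, 1, L, 1) against (B, H, L, E).
out = out * token_mask[:, None, :, None]
```

Note this only zeroes the outputs at padded positions; padded key tokens still contribute to the attention of real tokens, so it is a partial substitute for a true attention mask, not an equivalent.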

@iqddd

iqddd commented Jun 5, 2025

@rockerBOO

From HF (https://huggingface.co/lodestones/Chroma):

It might not be obvious, but BFL had some oversight during pre-training where they forgot to mask both T5 and MMDiT tokens. So, for example, a short sentence like "a cat sat on a mat" actually looks like this in both T5 and MMDiT: a cat sat on a mat <pad> <pad> ... <pad>

Well, the author claims that it's obvious, but to be honest, it’s not at all obvious to me. :) If that's really the case, though, it's quite significant. Many inference environments actually cut off the padding part to speed up processing:
tokenized = self.tokenizer(texts, truncation=False, add_special_tokens=False)["input_ids"].
During training, I always used a mask shaped like "11111...00000" (text) + "11111..." (image), which seems to correspond to inference with the padding trimmed, right?
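The mask shape described above (ones for real text tokens, zeros for text padding, ones for all image tokens) could be built like this minimal sketch (the sequence lengths are hypothetical):

```python
import torch

txt_len, img_len = 8, 4   # hypothetical text / image sequence lengths
n_real_txt = 5            # real (non-padded) text tokens

# Text part: 1 for real tokens, 0 for padding -> 11111000
txt_mask = torch.zeros(txt_len, dtype=torch.bool)
txt_mask[:n_real_txt] = True

# Image part: all tokens attended -> 1111
img_mask = torch.ones(img_len, dtype=torch.bool)

# Joint text+image mask: 11111000 1111
joint_mask = torch.cat([txt_mask, img_mask])
```

This is the key-side mask shape; how it is expanded into the 2D attn_mask passed to scaled_dot_product_attention depends on the trainer's implementation.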
