rockerBOO
Contributor

Using sdpa_kernel, PyTorch will pick the attention kernel based on what is available. sdpa_kernel is still in beta, so it would probably need some testing. I am currently trying Flash Attention 2.

Flash Attention does not support attention masks (so it requires apply_t5_attn_mask = false).
cuDNN may have issues, but you can enable it via TORCH_CUDNN_SDPA_ENABLED=1.

Supported kernels may depend on your PyTorch and CUDA versions.

In priority order:

FLASH_ATTENTION: The flash attention backend for scaled dot product attention.
CUDNN_ATTENTION: The cuDNN backend for scaled dot product attention.
EFFICIENT_ATTENTION: The efficient attention backend for scaled dot product attention.
MATH: The math backend for scaled dot product attention.

@6DammK9

6DammK9 commented Apr 22, 2025

Can it be applied to SDXL as well? I see SDPA mentioned in sdxl_original_unet.py, but not this kind of implementation.
Currently I'm doing multi-GPU full finetuning and suffering through NCCL issues (VRAM overhead, CPU-intensive all_reduce; DeepSpeed was tested and stalls heavily), so this may help.

@rockerBOO
Contributor Author

@6DammK9 I have added SDPABackend for SD and SDXL in #2061, which is based on the main branch. Flash Attention probably won't work there, though, because we are using an attention mask (Flash Attention doesn't support attention masks).

@iqddd

iqddd commented May 20, 2025

If you wouldn't mind, could you please explain, for those of us less familiar with the subject, what impact apply_t5_attn_mask = True/False has on the final results of Flux LoRA training?

@iqddd

iqddd commented May 20, 2025

Flash Attention gives a phenomenal speed boost compared to cuDNN. But what are the potential side effects of apply_t5_attn_mask=False?

@rockerBOO
Contributor Author

@iqddd Flux was supposedly trained without attention masks, so the padding was trained into the model. So maybe the proper way is to not use attention masks (which is the default when the variable is not set).

@iqddd

iqddd commented Jun 4, 2025

@rockerBOO May I ask what led you to this conclusion? Is it based on your own reasoning, or are there specific facts or sources supporting it?

@rockerBOO
Contributor Author

It may just be a rumor, but they mentioned it as something they looked into for https://huggingface.co/lodestones/Chroma . I don't have proof, as it hasn't been discussed publicly.

I have another idea that might help with masking: apply the mask after flash attention completes. It's not a performance improvement for the attention itself, but it might offer a middle ground.
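One possible reading of this post-hoc masking idea, as a minimal hedged sketch (all shapes and the token_mask are hypothetical, not from the PR): run SDPA without attn_mask so the flash kernel stays eligible, then zero the outputs at padded positions afterwards.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch 2, 4 heads, 6 tokens, head dim 8.
q = torch.randn(2, 4, 6, 8)
k = torch.randn(2, 4, 6, 8)
v = torch.randn(2, 4, 6, 8)

# 1 = real token, 0 = padding (hypothetical per-sequence mask).
token_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                           [1, 1, 0, 0, 0, 0]], dtype=torch.bool)

# No attn_mask passed, so the flash backend remains usable...
out = F.scaled_dot_product_attention(q, k, v)

# ...then zero the outputs at the padded query positions.
# Broadcasting (B, 1, L, 1) against (B, H, L, E).
out = out * token_mask[:, None, :, None]
```

Note this only zeroes the outputs at padded positions; padded key tokens still contribute to the attention of real tokens, so it is a partial substitute for a true attention mask, not an equivalent.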

@iqddd

iqddd commented Jun 5, 2025

@rockerBOO

From HF (https://huggingface.co/lodestones/Chroma):

It might not be obvious, but BFL had some oversight during pre-training where they forgot to mask both T5 and MMDiT tokens. So, for example, a short sentence like "a cat sat on a mat" actually looks like this in both T5 and MMDiT: a cat sat on a mat <pad> <pad> ... <pad>

Well, the author claims that it's obvious, but to be honest, it’s not at all obvious to me. :) If that's really the case, though, it's quite significant. Many inference environments actually cut off the padding part to speed up processing:
tokenized = self.tokenizer(texts, truncation=False, add_special_tokens=False)["input_ids"].
During training, I always used a mask shaped like "11111...00000" (text) + "11111..." (image), which seems to correspond to inference with the padding trimmed, right?
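The mask shape described above (ones for real text tokens, zeros for text padding, ones for all image tokens) could be built like this minimal sketch (the sequence lengths are hypothetical):

```python
import torch

txt_len, img_len = 8, 4   # hypothetical text / image sequence lengths
n_real_txt = 5            # real (non-padded) text tokens

# Text part: 1 for real tokens, 0 for padding -> 11111000
txt_mask = torch.zeros(txt_len, dtype=torch.bool)
txt_mask[:n_real_txt] = True

# Image part: all tokens attended -> 1111
img_mask = torch.ones(img_len, dtype=torch.bool)

# Joint text+image mask: 11111000 1111
joint_mask = torch.cat([txt_mask, img_mask])
```

This is the key-side mask shape; how it is expanded into the 2D attn_mask passed to scaled_dot_product_attention depends on the trainer's implementation.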
