Replies: 1 comment 2 replies
@amodab01 hey Amogh! Good to hear from you, and thank you for raising these questions! I found the paper lacking in detail, which led to a few of these improvised solutions, guided by my own intuition. Let me know if they make any sense.

The problem that led to 1. is that the first query block of compression length would have nothing to attend to, so I decided to add a few memory tokens, an idea from an old paper from Lample et al., also recently used by Hymba at Nvidia.

For 2., this actually came about while I was examining figure 2 in the paper (which is the one displayed on the repo readme). You may notice on the right-hand side, in the middle diagram labeled "Selected Attention Mask", that the rightmost block is actually included (with causal masking). I took this to mean they always include this block for whatever reason (perhaps under some condition where the sliding window length is low you need to cover all the local tokens).

Let me know your thoughts, and if we could address these issues in other ways, I'm happy to build out all the options and just make them hparam configurable.
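To make 1. concrete, here is a rough sketch (not the repo's actual code; the module name, parameter names, and defaults are made up for illustration) of prepending a few learned memory key / values to the compressed kv so the first query block always has something to attend to:

```python
import torch
from torch import nn

# hypothetical module illustrating the memory-token workaround: a few learned
# key / value pairs (Lample et al.; also used in Nvidia's Hymba) are prepended
# to the compressed kv so the very first query block, which otherwise has no
# compressed blocks in its causal past, still has something to attend to
class CompressedKVWithMemory(nn.Module):
    def __init__(self, heads = 8, dim_head = 64, num_mem_kv = 4):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head) * 0.02)

    def forward(self, ck, cv):
        # ck, cv: (batch, heads, num_compressed_blocks, dim_head)
        batch = ck.shape[0]
        mem_k = self.mem_k.unsqueeze(0).expand(batch, -1, -1, -1)
        mem_v = self.mem_v.unsqueeze(0).expand(batch, -1, -1, -1)
        # prepend the memory tokens along the sequence dimension
        return torch.cat((mem_k, ck), dim = -2), torch.cat((mem_v, cv), dim = -2)
```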
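And a sketch of the figure 2 interpretation for 2.: when building the fine / selected attention mask, the query's own diagonal block is always allowed (with causal masking) on top of whatever top-k selection picked. The signature follows flex attention's mask_mod convention, but the block size and the selected-blocks lookup here are assumptions for illustration, not the repo's actual values:

```python
import torch

FINE_BLOCK_SIZE = 16  # assumed fine selection block size

batch, heads, seq_len = 1, 8, 256
num_blocks = seq_len // FINE_BLOCK_SIZE

# placeholder for the result of top-k block selection:
# selected[b, h, i, j] is True if query block i selected kv block j
selected = torch.zeros(batch, heads, num_blocks, num_blocks, dtype = torch.bool)

def fine_mask(b, h, q_idx, kv_idx):
    causal = kv_idx <= q_idx
    q_block, kv_block = q_idx // FINE_BLOCK_SIZE, kv_idx // FINE_BLOCK_SIZE
    # the diagonal block (rightmost block in figure 2's selected mask) is always
    # kept, regardless of whether it made it into the top-k selection
    block_diagonal = q_block == kv_block
    return causal & (block_diagonal | selected[b, h, q_block, kv_block])
```

The real implementation is more involved, but this is the gist of why `block_diagonal` shows up in the mask.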
Thanks for working through this so quickly @lucidrains! I had a couple of questions:

1. `is_mem_kv` in the flex `compress_mask` represents some sort of memory tokens at the start of the compressed kv which are always computed. I'm not able to find the part of the paper that mentions using these tokens.
2. `block_diagonal` in the flex `fine_mask` will allow tokens corresponding to a particular block even if that block is not selected in `topk`. I'm again not able to find the part of the paper that mentions using this.

Any help regarding these implementations would be greatly appreciated!
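For reference, my current reading of the first point is roughly the following flex-style sketch, where the leading compressed kv positions are always-visible memory tokens (the constants and names here are my guesses, not the actual values in the repo):

```python
NUM_MEM_KV = 4            # assumed number of prepended memory tokens
COMPRESS_BLOCK_SIZE = 16  # assumed compression block size

def compress_mask(b, h, q_idx, kv_idx):
    # the leading memory tokens are attendable from every query position
    is_mem_kv = kv_idx < NUM_MEM_KV
    # a real compressed block is only visible once it lies fully in the causal past
    block_end = (kv_idx - NUM_MEM_KV + 1) * COMPRESS_BLOCK_SIZE - 1
    return is_mem_kv | (block_end <= q_idx)
```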