Replies: 1 comment 2 replies
@amodab01 hey Amogh! Good to hear from you, and thank you for raising these questions! I found the paper lacking in detail, which led to a few of these improvised solutions, guided by my own intuition. Let me know if they make any sense.

The problem that led to 1. is that the first query block of compression length would have nothing to attend to, so I decided to add a few memory tokens, an idea from an old paper from Lample et al., also recently used by Hymba at Nvidia.

For 2., this actually came about while I was examining figure 2 in the paper (which is the one displayed on the repo readme). You may notice on the right-hand side, in the middle diagram labeled "Selected Attention Mask", that the rightmost block is actually included (with causal masking). I took this to mean they always include this block for whatever reason (perhaps under some condition where the sliding window length is low you need to cover all the local tokens).

Let me know your thoughts, and if we could address these issues in other ways, I'm happy to build out all the options and just make them hparam configurable.
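To make 1. concrete, here is a rough sketch (not the repo's actual code; the module name, parameter names, and defaults are made up for illustration) of prepending a few learned memory key / values to the compressed kv so the first query block always has something to attend to:

```python
import torch
from torch import nn

# hypothetical module illustrating the memory-token workaround: a few learned
# key / value pairs (Lample et al.; also used in Nvidia's Hymba) are prepended
# to the compressed kv so the very first query block, which otherwise has no
# compressed blocks in its causal past, still has something to attend to
class CompressedKVWithMemory(nn.Module):
    def __init__(self, heads = 8, dim_head = 64, num_mem_kv = 4):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head) * 0.02)

    def forward(self, ck, cv):
        # ck, cv: (batch, heads, num_compressed_blocks, dim_head)
        batch = ck.shape[0]
        mem_k = self.mem_k.unsqueeze(0).expand(batch, -1, -1, -1)
        mem_v = self.mem_v.unsqueeze(0).expand(batch, -1, -1, -1)
        # prepend the memory tokens along the sequence dimension
        return torch.cat((mem_k, ck), dim = -2), torch.cat((mem_v, cv), dim = -2)
```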
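And a sketch of the figure 2 interpretation for 2.: when building the fine / selected attention mask, the query's own diagonal block is always allowed (with causal masking) on top of whatever top-k selection picked. The signature follows flex attention's mask_mod convention, but the block size and the selected-blocks lookup here are assumptions for illustration, not the repo's actual values:

```python
import torch

FINE_BLOCK_SIZE = 16  # assumed fine selection block size

batch, heads, seq_len = 1, 8, 256
num_blocks = seq_len // FINE_BLOCK_SIZE

# placeholder for the result of top-k block selection:
# selected[b, h, i, j] is True if query block i selected kv block j
selected = torch.zeros(batch, heads, num_blocks, num_blocks, dtype = torch.bool)

def fine_mask(b, h, q_idx, kv_idx):
    causal = kv_idx <= q_idx
    q_block, kv_block = q_idx // FINE_BLOCK_SIZE, kv_idx // FINE_BLOCK_SIZE
    # the diagonal block (rightmost block in figure 2's selected mask) is always
    # kept, regardless of whether it made it into the top-k selection
    block_diagonal = q_block == kv_block
    return causal & (block_diagonal | selected[b, h, q_block, kv_block])
```

The real implementation is more involved, but this is the gist of why `block_diagonal` shows up in the mask.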
Thanks for working through this so quickly @lucidrains! I had a couple of questions:

1. `is_mem_kv` in the flex `compress_mask` represents some sort of memory tokens at the start of the compressed kv which are always computed. I'm not able to find the part of the paper that mentions using these tokens.
2. `block_diagonal` in the flex `fine_mask` will allow tokens corresponding to a particular block even if that block is not selected in `topk`. I'm again not able to find the part of the paper that mentions using this.

Any help regarding these implementations would be greatly appreciated!
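For reference, my current reading of the first point is roughly the following flex-style sketch, where the leading compressed kv positions are always-visible memory tokens (the constants and names here are my guesses, not the actual values in the repo):

```python
NUM_MEM_KV = 4            # assumed number of prepended memory tokens
COMPRESS_BLOCK_SIZE = 16  # assumed compression block size

def compress_mask(b, h, q_idx, kv_idx):
    # the leading memory tokens are attendable from every query position
    is_mem_kv = kv_idx < NUM_MEM_KV
    # a real compressed block is only visible once it lies fully in the causal past
    block_end = (kv_idx - NUM_MEM_KV + 1) * COMPRESS_BLOCK_SIZE - 1
    return is_mem_kv | (block_end <= q_idx)
```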