
todo #1

@lucidrains

Description

  • compressed attention needs to use overlapping segments (see the overlapping-segments sketch after this list)
  • make sure compress and fine block sizes can be different; deal with the importance score as explained in the paper
  • handle < block size sequence lengths, make sure it doesn't break
    • handle any block size < num selected blocks
    • handle no blocks
  • flex attention for starters
    • add for sliding windows
    • fine attention
    • add the mask function for compressed attention, assuming that at some future date attention logits can be extracted
  • replace einx get_at
  • allow for ablating this extra block causal diagonal in fine attention (nevermind, it is necessary for the first block)
  • make it possible to customize the MLP for compressing key / value
  • add attention pool as a type of compression module, even if they said "MLP" (see the attention-pool sketch after this list)
  • figure out relative positions from query to each compressed key, perhaps using the midpoint of the set of keys (given some recent literature, it is probably best not to have relative positions for the compressed pathway)
  • gqa
  • in gqa, allow each query head to select different sets of keys / values; averaging the importance score across grouped query heads seems suboptimal, but allow for both to see if some tradeoff can be made (see the gqa selection sketch after this list)
  • experiments - https://api.wandb.ai/links/lucidrains/d7or7n5n
  • a few tests to make sure the flex and cpu versions line up
  • inference pathways
    • sliding windows
    • block causal
    • make rotary embed torch lib support gqa
    • rotary for fine key / values
    • some kv cache management
      • keep the cache on cpu and load only the selected kv segments onto the gpu for selective attn (see the offloaded cache sketch after this list)
      • memmap
    • assert cache and without yield same result
    • fix each grouped query head seeing different kv segments
    • replace slow get_at in fine attn inference
    • keep a running seq to be compressed, as well as all compressed so far; think about compressed sliding windows for even longer context
    • computing the importance score + loading sparse kv blocks into mem
  • wire up flex fine mask, see which one is faster, play around with BlockMask if slow, then move on
  • offer another type of gating by importance score, but on the fine attention output; default to this one, as the flex fine mask is not compatible
  • figure out whether they used some soft topk or gating with the importance scores; just use one-hot straight-through on the compressed attention probs and gate the selected keys and values (see the straight-through sketch after this list)
  • revise the importance score for different compress vs fine block sizes based on dialogue
  • build out the triton kernel
    • figure out how best to deal with skinny-matrix tl.dot; at this point in time, i'm expanding the dims to the minimum of 16 to carry out tl.dot with two 3d tensors. if any triton / cuda kernel expert has a better suggestion, let me know
    • parallelize across sequence for backwards pass, make it optional
    • forwards
    • backwards
      • dv
      • dk
      • dq
      • swap q and kv loops
      • figure out why dk is intermittently failing
    • take care of gqa
      • forwards
      • backwards
    • fix nan issue at higher batch sizes when fine selection is turned on
    • allow for query_heads_share_selected_kv; autodetect when indices and mask have the query number of heads, and pass in a flag
    • make the block causal diagonal optional and prep an encoder nsa variant
    • take care of block sizes for both m and n less than fine block size
    • flag in function that deduplicates selected indices
    • debug nan issue with grouped query heads (4 / 2) and 4 selected fine blocks
    • make sure triton nsa tolerates any seq length, for generation
    • just move the head dim to always be the first dimension
    • seek a code review from triton experts (can't find any, expertise is still too rare)
  • improvisations
    • generalize to multi-level hierarchical sparse attention
    • add an encoder variant for long context video
    • try adding fused talking heads on the gqa, since they are all loaded in
    • deduplicate some of the computation between fine attention block causal + sliding window causal, or try to carry them out in parallel
    • offer a variant where one does attn softclamping to 40-50 and removes the lse / maximum altogether, saving a ton of complexity (see the softclamp sketch after this list)
    • 2d / 3d versions
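
Below are a few rough sketches for some of the items above. First, the overlapping segments for the compressed pathway: a minimal sketch assuming keys shaped (batch, heads, seq, dim); segment_len, segment_stride, and the mean pooling are placeholders standing in for the real compression module.

```python
import torch.nn.functional as F

def overlapping_segments(keys, segment_len = 32, segment_stride = 16):
    # keys: (batch, heads, seq, dim) -> (batch, heads, num_segments, dim)
    seq = keys.shape[-2]

    # right-pad the sequence so the last window is complete
    if seq < segment_len:
        padding = segment_len - seq
    else:
        padding = -(seq - segment_len) % segment_stride

    keys = F.pad(keys, (0, 0, 0, padding))

    # unfold into overlapping windows: (batch, heads, num_segments, dim, segment_len)
    windows = keys.unfold(2, segment_len, segment_stride)

    # placeholder compression - the real module would be the compress MLP or attention pool
    return windows.mean(dim = -1)
```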
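
An attention pool as the compression module, per the item above: a minimal sketch assuming the keys / values are already chunked into segments of shape (batch, heads, num_segments, segment_len, dim); the single learned logit per token is an assumption.

```python
from torch import nn

class AttentionPoolCompress(nn.Module):
    # compresses each segment of keys (or values) down to a single token
    def __init__(self, dim):
        super().__init__()
        self.to_attn_logits = nn.Linear(dim, 1, bias = False)

    def forward(self, segments):
        # segments: (batch, heads, num_segments, segment_len, dim)
        attn = self.to_attn_logits(segments).softmax(dim = -2)  # weights over segment_len
        return (segments * attn).sum(dim = -2)                  # (batch, heads, num_segments, dim)
```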
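
For the gqa selection item, a minimal sketch of the two selection modes, assuming importance scores of shape (batch, query_heads, q_len, num_blocks); query_heads_share_selected_kv mirrors the flag named in the triton section, everything else is illustrative.

```python
from einops import reduce

def select_fine_blocks(importance, num_grouped_queries, num_selected, query_heads_share_selected_kv = True):
    # importance: (batch, query_heads, q_len, num_blocks)
    if query_heads_share_selected_kv:
        # average importance across the query heads in each group so they share kv blocks
        importance = reduce(importance, 'b (h g) i n -> b h i n', 'mean', g = num_grouped_queries)

    # otherwise each query head keeps its own scores and selects its own kv blocks
    return importance.topk(num_selected, dim = -1).indices
```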
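
For the kv cache management item, a minimal sketch of keeping the fine kv blocks on cpu (pinned memory here, a memmap would work the same way) and only moving the selected blocks onto the gpu at decode time; the class and method names are hypothetical, this is the offloaded cache sketch referenced above.

```python
import torch

class OffloadedFineKVCache:
    # stores per-block fine keys / values on cpu, loads only the selected blocks to gpu
    def __init__(self, device = 'cuda'):
        self.k_blocks = []   # each: (heads, block_size, dim), cpu / pinned
        self.v_blocks = []
        self.device = device

    def append(self, k_block, v_block):
        self.k_blocks.append(k_block.cpu().pin_memory())
        self.v_blocks.append(v_block.cpu().pin_memory())

    def load_selected(self, block_indices):
        # block_indices: iterable of selected block ids from the importance scores
        k = torch.stack([self.k_blocks[i] for i in block_indices])
        v = torch.stack([self.v_blocks[i] for i in block_indices])
        return k.to(self.device, non_blocking = True), v.to(self.device, non_blocking = True)
```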
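
For the one-hot straight-through item, a minimal sketch: topk on the compressed attention probs gives a hard selection in the forward pass while gradients flow through the soft scores, and the gathered fine keys / values are gated by the selected scores. All shapes and names here are assumptions.

```python
import torch

def straight_through_select(compressed_attn_probs, fine_k, fine_v, num_selected = 4):
    # compressed_attn_probs: (b, h, i, num_blocks) -- the importance scores
    # fine_k, fine_v:        (b, h, num_blocks, block_size, d)
    scores = compressed_attn_probs
    sel = scores.topk(num_selected, dim = -1).indices                   # (b, h, i, sel)

    # hard one-hot selection in the forward pass, soft scores in the backward pass
    hard = torch.zeros_like(scores).scatter(-1, sel, 1.)
    gates = hard + scores - scores.detach()

    # gather the selected fine kv blocks for every query position
    i = sel.shape[2]
    block_size, dim = fine_k.shape[-2:]
    idx = sel[..., None, None].expand(-1, -1, -1, -1, block_size, dim)  # (b, h, i, sel, bs, d)

    sel_k = fine_k[:, :, None].expand(-1, -1, i, -1, -1, -1).gather(3, idx)
    sel_v = fine_v[:, :, None].expand(-1, -1, i, -1, -1, -1).gather(3, idx)

    # gate the selected keys / values by the straight-through gate values
    sel_gates = gates.gather(-1, sel)[..., None, None]                  # (b, h, i, sel, 1, 1)
    return sel_k * sel_gates, sel_v * sel_gates
```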
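
Finally, for the softclamp improvisation, a minimal sketch: tanh softclamping bounds the attention logits, so exp can be taken without subtracting a running maximum (fine in fp32; the value of 50. and dropping the lse entirely follow the item above, and causal masking is omitted for brevity).

```python
import torch

def softclamp(t, value = 50.):
    # smoothly clamps logits to (-value, value)
    return (t / value).tanh() * value

def softclamped_attend(q, k, v):
    # q: (b, h, i, d), k / v: (b, h, j, d)
    scale = q.shape[-1] ** -0.5
    sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * scale
    sim = softclamp(sim)

    # logits are bounded, so exp cannot overflow in fp32 and no max / lse bookkeeping is needed
    attn = sim.exp()
    attn = attn / attn.sum(dim = -1, keepdim = True)
    return torch.einsum('b h i j, b h j d -> b h i d', attn, v)
```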
