Description
We explored and discussed several ideas and want to write them down here for tracking. Community developers are welcome to try out the unfinished items.
- (Good first issue) Skip the `len` operation and get the batch size directly from the forward batch: FA3 speed up: skip len operation and get batch size directly from forward batch #5969 @lifuhuang
- GQA head packing: https://github.com/Dao-AILab/flash-attention/blob/main/hopper/flash_attn_interface.py#L658 Change it to True and run a benchmark.
- Split-KV, aka Flash Decoding: already enabled; it is indeed faster in low-batch, long-context scenarios. Benchmarks will be attached.
- PDL (Programmatic Dependent Launch): Dao-AILab/flash-attention@000090d
- (Won't do) Prepare Scheduler Metadata: Dao-AILab/flash-attention@fa60e7c (per Tri Dao's note, it only saves about 2us; we can keep an eye on it, but adopting it is not recommended)
- For Llama models, we observed that speculative decoding with top-k > 1 is slightly slower than the FlashInfer backend; we need comprehensive profiling to optimize it @MrAta
- Replace the pad operation with a copy: Optimize a pad operation to accelerate 25us #5945
- Remove is_fa3_supported from fa3 kernel: Remove unecessary is_fa3_supported check #6112
- Remove pad operation for all decode cases: optimize pad operations in fa3 to accelarate 100+us #6077
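To make the Split-KV (Flash Decoding) item above concrete, here is a minimal NumPy sketch, not the FA3 kernel itself: each KV split computes a partial output plus its log-sum-exp, and the partials are then merged in a numerically stable reduction. All function names are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Reference single-query attention. q: (d,), k/v: (n, d)."""
    s = k @ q / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ v

def flash_decode(q, k, v, num_splits=4):
    """Split-KV: compute attention over KV chunks independently,
    then combine partial outputs weighted by their log-sum-exp."""
    d = q.shape[-1]
    outs, lses = [], []
    for ks, vs in zip(np.array_split(k, num_splits),
                      np.array_split(v, num_splits)):
        s = ks @ q / np.sqrt(d)
        m = s.max()
        p = np.exp(s - m)            # stable partial softmax numerator
        denom = p.sum()
        outs.append(p @ vs / denom)  # partial output, locally normalized
        lses.append(m + np.log(denom))  # log-sum-exp of this split
    lses = np.array(lses)
    w = np.exp(lses - lses.max())    # re-weight splits by their true mass
    w /= w.sum()
    return np.einsum('s,sd->d', w, np.stack(outs))
```

Because the splits are independent, a GPU kernel can assign them to different thread blocks, which is exactly why Split-KV helps at low batch size and long context: it recovers parallelism that a single long KV scan would leave idle.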
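The GQA head packing item relies on the fact that query heads sharing a KV head can be folded into the query-length dimension, so each KV head is loaded once. A NumPy sketch of the equivalence, with illustrative names rather than the flash-attention API:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Per-head GQA. q: (hq, sq, d), k/v: (hk, sk, d), hq a multiple of hk.
    Query head h attends with KV head h // (hq // hk)."""
    hq, sq, d = q.shape
    g = hq // k.shape[0]
    out = np.empty_like(q)
    for h in range(hq):
        s = q[h] @ k[h // g].T / np.sqrt(d)
        p = np.exp(s - s.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        out[h] = p @ v[h // g]
    return out

def gqa_attention_packed(q, k, v):
    """Pack the g query heads of each KV group into the query rows,
    turning GQA into hk independent MHA-style calls."""
    hq, sq, d = q.shape
    hk = k.shape[0]
    g = hq // hk
    qp = q.reshape(hk, g * sq, d)                       # pack heads into rows
    s = np.einsum('hqd,hkd->hqk', qp, k) / np.sqrt(d)
    p = np.exp(s - s.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    out = np.einsum('hqk,hkd->hqd', p, v)
    return out.reshape(hq, sq, d)                       # unpack back to heads
```

Packing does not change the math; it changes the shape the kernel sees, which improves tensor-core utilization when the per-group query tile would otherwise be thin.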
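For the pad-related items, the underlying idea is that padding allocates a fresh tensor and launches an extra kernel on every call, whereas writing into a preallocated buffer is a single slice copy. A NumPy sketch with hypothetical helper names, not the actual SGLang code:

```python
import numpy as np

def pad_alloc(x, target_len):
    """Pad path: allocates a new array each call."""
    return np.pad(x, ((0, target_len - x.shape[0]), (0, 0)))

def copy_into(buf, x):
    """Copy path: reuse a preallocated buffer and overwrite in place."""
    n = x.shape[0]
    buf[:n] = x     # copy the live rows
    buf[n:] = 0     # clear the tail (only needed if stale data matters)
    return buf
```

On GPU the same pattern removes a device allocation and a kernel launch from the hot decode path, which is where the tens-of-microseconds savings cited in the issues come from.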