Description
Functionality
- Basic FA3 support, including MHA models (Llama, Qwen, etc.), CUDA Graph, and Sliding Window (Gemma). Enable FA3 as the attention backend with `--attention-backend fa3`: #4680 @hebiao064 @qingquansong
- Support Page Size > 1: Support Page Size > 1 for FA3 #4832 @hebiao064
- Support MLA for DeepSeek-like models: [Feature] Support FA3 backend for MLA #4831 @Fridge003
- Support Speculative Decoding: PR1, PR2, PR3, PR4 @qingquansong @hebiao064 @zcnrex
- Figure out how to build FA3 into SGLang: [feat] add fa3 in sgl-kernel #4902 @yinfan98
- Add E2E tests like `sglang/test/srt/test_triton_attention_backend.py`: Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 #4760 @yubofredwang
- Support Multimodal: [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct #5103 @zcnrex @mickqian @yizhang2077
- Support FP8: Fix loading KV quantization scale; Enable modelopt kv cache #4686 @yundai424
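The Page Size > 1 item above means each physical KV-cache page holds several tokens, so a logical token index must be mapped through a per-sequence page table to a physical slot. A minimal sketch of that mapping, with assumed names (`token_to_slot`, `page_table`) that are illustrative rather than SGLang's actual code:

```python
# Illustrative sketch of paged KV-cache indexing when page_size > 1.
# `token_to_slot` and `page_table` are hypothetical names, not SGLang internals.
def token_to_slot(page_table: list[int], token_idx: int, page_size: int) -> int:
    """Map a logical token index to a physical KV-cache slot."""
    page = page_table[token_idx // page_size]  # which physical page holds the token
    offset = token_idx % page_size             # position within that page
    return page * page_size + offset

# A sequence whose tokens occupy physical pages 7 and 2, with page_size=4.
page_table = [7, 2]
assert token_to_slot(page_table, 0, 4) == 28  # 7*4 + 0
assert token_to_slot(page_table, 5, 4) == 9   # 2*4 + 1
```

With page_size=1 this degenerates to a direct token-to-slot table, which is why page_size > 1 needs dedicated handling in the attention backend.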
Documentation and Benchmark:
- Add docs about attention backends; we may need a comprehensive page with a support matrix such as: #4865
- Benchmark Attention Backends with different features and models #5172 @zhyncs @hebiao064 Shivam (In Review, will be tracked offline)
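For the benchmark item above, the comparison boils down to timing the same workload under each backend and reporting summary statistics. A minimal stdlib-only timing harness as a sketch (this is an assumed helper, not SGLang's actual benchmark script):

```python
# Minimal latency-benchmark sketch (hypothetical harness, not SGLang's
# bench tooling): time repeated calls after a warmup and report mean/worst.
import time
import statistics

def bench(fn, iters: int = 50, warmup: int = 5) -> tuple[float, float]:
    """Return (mean_ms, worst_ms) over `iters` timed calls of `fn`."""
    for _ in range(warmup):
        fn()  # warmup iterations are excluded from the measurement
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # milliseconds
    return statistics.mean(samples), max(samples)

mean_ms, worst_ms = bench(lambda: sum(range(1000)))
assert mean_ms > 0 and worst_ms >= mean_ms
```

In practice the timed function would be a fixed prompt/batch served once per backend (e.g. `fa3` vs. `flashinfer`), holding model and batch shape constant.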
Perf Optimization and Accuracy Problems
- Fix CUDA Graph accuracy problem for Page Size > 1: Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed #4855 @qingquansong
- Optimize Decoding by removing the `item()` device sync: [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 #4745 @hebiao064
- Optimize Prefill by removing the `item()` device sync: [Fix] avoid stream sync and torch compile in prefill for fa3 backend #4932 @Fridge003
- Optimize Draft Decode and Target Verify CUDA Graph latency: Refactor and Optimize FA3 Code #5090 @hebiao064
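The two `item()` items above target the same pattern: calling `.item()` on a GPU tensor forces a device-to-host copy and a stream synchronization, stalling the CPU until all queued kernels finish. A sketch of the before/after shape (function names here are illustrative, not the actual FA3 backend code):

```python
# Illustrative sketch of the .item() device-sync pattern; the function
# names are assumptions, not SGLang's FA3 backend code.
import torch

def seq_lens_sum_with_sync(seq_lens: torch.Tensor) -> int:
    # .item() copies the scalar to the host and, on GPU, blocks the CPU
    # until the stream drains -- the sync the PRs above remove.
    return int(seq_lens.sum().item())

def seq_lens_sum_async(seq_lens: torch.Tensor) -> torch.Tensor:
    # Keeping the result as a tensor lets downstream kernels consume it
    # on-device, so no host round-trip is required.
    return seq_lens.sum()

seq_lens = torch.tensor([3, 5, 2])
assert seq_lens_sum_with_sync(seq_lens) == 10
assert int(seq_lens_sum_async(seq_lens)) == 10
```

The fix is typically to precompute such scalars during metadata preparation or keep them as device tensors, rather than syncing inside the hot decode/prefill path.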
Success Criteria:
- Latency should be on par with vLLM's FlashAttention3 backend and SGLang's FlashInfer implementation
- Accuracy should be on par with vLLM's FlashAttention3 backend and SGLang's FlashInfer implementation
Other issues we surfaced but did not scope into this task:
- FlashInfer accuracy is poor for Gemma 2 models
- VSCode Test Explorer is broken due to a circular dependency: Fix circular imports in gptq.py and unblock test explorer #4736 @hebiao064