Description
Functionality
- Basic FA3 support, including MHA models (Llama, Qwen, etc.), CUDA Graph, and Sliding Window (Gemma). Enable FA3 as the attention backend with `--attention-backend fa3`: #4680 @hebiao064 @qingquansong
- Support Page Size > 1: Support Page Size > 1 for FA3 #4832 @hebiao064
- Support MLA for DeepSeek-like models: [Feature] Support FA3 backend for MLA #4831 @Fridge003
- Support Speculative Decoding: PR1, PR2, PR3, PR4 @qingquansong @hebiao064 @zcnrex
- Figure out how to build FA3 into SGLang: [feat] add fa3 in sgl-kernel #4902 @yinfan98
- Add E2E tests like `sglang/test/srt/test_triton_attention_backend.py`: Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 #4760 @yubofredwang
- Support Multimodal: [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct #5103 @zcnrex @mickqian @yizhang2077
- Support FP8: Fix loading KV quantization scale; Enable modelopt kv cache #4686 @yundai424
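The Page Size > 1 item above means each physical KV-cache page holds several tokens, so a logical token index must be mapped through a per-sequence page table to a physical slot. A minimal sketch of that mapping, with assumed names (`token_to_slot`, `page_table`) that are illustrative rather than SGLang's actual code:

```python
# Illustrative sketch of paged KV-cache indexing when page_size > 1.
# `token_to_slot` and `page_table` are hypothetical names, not SGLang internals.
def token_to_slot(page_table: list[int], token_idx: int, page_size: int) -> int:
    """Map a logical token index to a physical KV-cache slot."""
    page = page_table[token_idx // page_size]  # which physical page holds the token
    offset = token_idx % page_size             # position within that page
    return page * page_size + offset

# A sequence whose tokens occupy physical pages 7 and 2, with page_size=4.
page_table = [7, 2]
assert token_to_slot(page_table, 0, 4) == 28  # 7*4 + 0
assert token_to_slot(page_table, 5, 4) == 9   # 2*4 + 1
```

With page_size=1 this degenerates to a direct token-to-slot table, which is why page_size > 1 needs dedicated handling in the attention backend.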
Documentation and Benchmark:
- Add docs about attention backends; we may need a comprehensive page with a support matrix such as: #4865
- Benchmark Attention Backends with different features and models #5172 @zhyncs @hebiao064 Shivam (In Review, will be tracked offline)
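For the benchmark item above, the comparison boils down to timing the same workload under each backend and reporting summary statistics. A minimal stdlib-only timing harness as a sketch (this is an assumed helper, not SGLang's actual benchmark script):

```python
# Minimal latency-benchmark sketch (hypothetical harness, not SGLang's
# bench tooling): time repeated calls after a warmup and report mean/worst.
import time
import statistics

def bench(fn, iters: int = 50, warmup: int = 5) -> tuple[float, float]:
    """Return (mean_ms, worst_ms) over `iters` timed calls of `fn`."""
    for _ in range(warmup):
        fn()  # warmup iterations are excluded from the measurement
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # milliseconds
    return statistics.mean(samples), max(samples)

mean_ms, worst_ms = bench(lambda: sum(range(1000)))
assert mean_ms > 0 and worst_ms >= mean_ms
```

In practice the timed function would be a fixed prompt/batch served once per backend (e.g. `fa3` vs. `flashinfer`), holding model and batch shape constant.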
Perf Optimization and Accuracy Problems
- Fix CUDA Graph accuracy problem for Page Size > 1: Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed #4855 @qingquansong
- Optimize Decoding by removing the `item()` device sync: [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 #4745 @hebiao064
- Optimize Prefill by removing the `item()` device sync: [Fix] avoid stream sync and torch compile in prefill for fa3 backend #4932 @Fridge003
- Optimize Draft Decode and Target Verify CUDA Graph latency: Refactor and Optimize FA3 Code #5090 @hebiao064
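The two `item()` items above target the same pattern: calling `.item()` on a GPU tensor forces a device-to-host copy and a stream synchronization, stalling the CPU until all queued kernels finish. A sketch of the before/after shape (function names here are illustrative, not the actual FA3 backend code):

```python
# Illustrative sketch of the .item() device-sync pattern; the function
# names are assumptions, not SGLang's FA3 backend code.
import torch

def seq_lens_sum_with_sync(seq_lens: torch.Tensor) -> int:
    # .item() copies the scalar to the host and, on GPU, blocks the CPU
    # until the stream drains -- the sync the PRs above remove.
    return int(seq_lens.sum().item())

def seq_lens_sum_async(seq_lens: torch.Tensor) -> torch.Tensor:
    # Keeping the result as a tensor lets downstream kernels consume it
    # on-device, so no host round-trip is required.
    return seq_lens.sum()

seq_lens = torch.tensor([3, 5, 2])
assert seq_lens_sum_with_sync(seq_lens) == 10
assert int(seq_lens_sum_async(seq_lens)) == 10
```

The fix is typically to precompute such scalars during metadata preparation or keep them as device tensors, rather than syncing inside the hot decode/prefill path.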
Success Criteria:
- Latency should be on par with vLLM's FlashAttention3 backend and SGLang's FlashInfer implementation
- Accuracy should be on par with vLLM's FlashAttention3 backend and SGLang's FlashInfer implementation
Other issues we surfaced but did not scope into this task:
- FlashInfer accuracy is poor for Gemma 2 models
- VSCode Test Explorer is broken due to a circular dependency: Fix circular imports in gptq.py and unblock test explorer #4736 @hebiao064