Fix swa eagle verify accuracy for Triton backend #9279
Conversation
Summary of Changes
Hello @ispobock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses an issue in the Triton backend's handling of sliding window attention (SWA) by introducing a mechanism to correctly apply custom masks. The core change involves adding an offset to accurately read custom masks within the sliding window, which is essential for proper attention calculation. This fix significantly improves model accuracy, as evidenced by an increase in the MMLU score from 0.777 to 0.816, ensuring more reliable and performant SWA operations.
Highlights
- **Sliding Window Attention (SWA) Offset Handling:** Introduced `window_kv_offsets` to correctly calculate the start position for custom masks within sliding window layers, ensuring proper attention mask application in the Triton backend.
- **Triton Backend Integration:** Modified `TritonAttnBackend` to manage and pass the new `window_kv_offsets` through various forward and CUDA graph initialization paths, integrating it into the attention computation flow.
- **Attention Kernel Logic Update:** Updated the `extend_attention.py` Triton kernel to use the `window_kv_offset` when loading custom masks, which is crucial for accurate SWA behavior.
- **Decode Attention Batch Calculation:** Corrected the batch size calculation in the `decode_attention.py` kernels by using `q.shape[0]` instead of `kv_indptr.shape[0] - 1`, ensuring the correct batch dimension for attention operations.
- **Accuracy Improvement:** The changes improve model accuracy, raising the MMLU score from 0.777 to 0.816.
Code Review
This pull request addresses a bug in the Triton backend for sliding window attention when used with EAGLE speculative decoding. The core of the fix is to correctly calculate and propagate the key-value cache offset (`window_kv_offsets`) for sliding windows. This offset is then used in the Triton attention kernel to accurately index the custom attention mask, which is crucial for the correctness of speculative decoding verification.
The changes are well-structured:
- In `python/sglang/srt/layers/attention/triton_backend.py`, `window_kv_offsets` is added to the `ForwardMetadata` and plumbed through the various initialization functions, including those for CUDA graphs. The offset is correctly computed by `update_sliding_window_buffer`.
- In `python/sglang/srt/layers/attention/triton_ops/extend_attention.py`, the Triton kernel `_fwd_kernel` is updated to accept and use this offset when calculating the indices for the custom mask.
- A minor but welcome refactoring in `python/sglang/srt/layers/attention/triton_ops/decode_attention.py` makes the batch size calculation more direct and robust.
The implementation appears correct and addresses the issue described in the pull request, as supported by the significant accuracy improvement shown in the MMLU benchmark. The code is clean and the changes are consistent. I have no further recommendations.
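The batch-size refactoring can be illustrated with a toy example. The shapes and values below are hypothetical; `q` and `kv_indptr` merely mimic the kernel inputs:

```python
import numpy as np

# Hypothetical illustration of the decode_attention.py batch-size change.
# In the ragged KV layout, kv_indptr holds batch+1 prefix sums of the
# per-request KV lengths, so both expressions agree when kv_indptr
# covers exactly the batch; reading q.shape[0] is simply more direct.
q = np.zeros((3, 8, 64))             # assumed [batch, num_heads, head_dim]
kv_indptr = np.array([0, 5, 9, 12])  # prefix sums of per-request KV lengths

batch_from_q = q.shape[0]
batch_from_indptr = kv_indptr.shape[0] - 1
print(batch_from_q, batch_from_indptr)  # → 3 3
```

Deriving the batch from `q` also avoids depending on how `kv_indptr` happens to be sized, which matters when auxiliary buffers are preallocated (e.g. for CUDA graph capture).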
Motivation
For sliding window layers, we should add an offset when reading the custom mask within the sliding window.
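As a hypothetical illustration of where such an offset comes from (the formula below is an assumption for exposition, not code from this PR):

```python
# Assumed relation: the window KV offset for a request is the number of
# its earliest tokens that fell out of the sliding window, i.e. the
# shift between window-local and full-sequence KV indices.
def window_kv_offset(seq_len: int, window_size: int) -> int:
    return max(0, seq_len - window_size)

print(window_kv_offset(10, 4))  # → 6: the window keeps tokens 6..9
print(window_kv_offset(3, 4))   # → 0: the sequence still fits in the window
```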
Accuracy Test
main branch:
this PR: