FA cute #9428

hyhieu · 2025-08-21T05:02:24Z

Motivation

Integrate Flash Attention's implementation in CuTe DSL.

Modifications

Copy the code into sglang/srt/layers/attention/cute_ops
Create a new backend blackwell_prefill_attention_backend.py
Allow --prefill-attention-backend to take the value "fa-cute"
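
For readers unfamiliar with how such a flag is typically wired up, here is a minimal sketch of a command-line option that accepts "fa-cute". Only the flag name and the "fa-cute" value come from this PR; the parser, the `build_parser` helper, and the placeholder choice "other-backend" are hypothetical and do not reflect the project's actual argument-handling code.

```python
import argparse

# Hypothetical choice list: only "fa-cute" is taken from this PR,
# "other-backend" is a placeholder for whatever else the server accepts.
PREFILL_BACKEND_CHOICES = ["fa-cute", "other-backend"]


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing a prefill attention backend flag."""
    parser = argparse.ArgumentParser(description="toy server arguments")
    parser.add_argument(
        "--prefill-attention-backend",
        type=str,
        default=None,
        choices=PREFILL_BACKEND_CHOICES,
        help="Attention backend used for the prefill phase.",
    )
    return parser


# argparse converts the dashes in the flag name to underscores.
args = build_parser().parse_args(["--prefill-attention-backend", "fa-cute"])
print(args.prefill_attention_backend)  # fa-cute
```

Restricting the option with `choices` means an unsupported backend name is rejected at parse time rather than failing later during model setup.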

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

gemini-code-assist

Summary of Changes

Hello @hyhieu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new, highly optimized Flash Attention implementation tailored for NVIDIA Blackwell (SM100) GPUs. It introduces a dedicated prefill attention backend that utilizes a newly added Flash Attention kernel built with NVIDIA's CuTe DSL. This enhancement aims to significantly boost performance for prefill operations, particularly with paged KV cache, by leveraging low-level hardware optimizations.
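
As background on what "paged KV cache" means for addressing, the sketch below maps a token position within one sequence to a physical page and an offset inside that page. This illustrates the general idea only, not the layout used by the kernel in this PR: `locate_token` and the example page table are hypothetical, and the page size of 128 is taken from the Highlights in this summary.

```python
from typing import List, Tuple

PAGE_SIZE = 128  # page size this summary reports for the fa-cute backend


def locate_token(
    page_table: List[int], token_pos: int, page_size: int = PAGE_SIZE
) -> Tuple[int, int]:
    """Map a logical token position to (physical_page, offset_in_page).

    page_table[i] holds the physical page that stores the i-th logical
    page of this sequence's keys and values.
    """
    if token_pos < 0:
        raise IndexError("token position must be non-negative")
    logical_page, offset = divmod(token_pos, page_size)
    if logical_page >= len(page_table):
        raise IndexError("token position is beyond the pages allocated to this sequence")
    return page_table[logical_page], offset


# A sequence whose three logical pages live in physical pages 7, 3 and 9.
page_table = [7, 3, 9]
print(locate_token(page_table, 130))  # (3, 2)
```

Because pages need not be contiguous in memory, the attention kernel has to perform this kind of lookup (in some form) when it reads keys and values for a sequence.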

Highlights

New Blackwell-Optimized Attention Backend: Introduces a new fa-cute attention backend specifically designed for NVIDIA Blackwell (SM100) GPUs, leveraging their advanced hardware capabilities.
CuTe DSL-based Flash Attention Kernel: Integrates a highly optimized Flash Attention kernel built using NVIDIA's CuTe DSL, enabling significant performance improvements for attention computations.
Dedicated Prefill Attention Backend: Adds a dedicated BlackwellPrefillAttentionBackend to handle prefill operations efficiently, utilizing the new Flash Attention kernel and supporting paged KV cache.
Automatic Page Size Configuration: Ensures compatibility and optimal performance by automatically setting the page_size to 128 when the fa-cute backend is selected, aligning with the kernel's requirements.
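
The automatic page size configuration described in the last highlight could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `ToyServerArgs`, `apply_backend_constraints`, and the default page size of 1 are all hypothetical; only the "fa-cute" value and the page size of 128 come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional

FA_CUTE_PAGE_SIZE = 128  # value stated in the summary above


@dataclass
class ToyServerArgs:
    """Hypothetical stand-in for the server's argument object."""

    prefill_attention_backend: Optional[str] = None
    page_size: int = 1  # assumed default, for illustration only


def apply_backend_constraints(args: ToyServerArgs) -> ToyServerArgs:
    """Force the page size the fa-cute backend is said to require."""
    if args.prefill_attention_backend == "fa-cute":
        args.page_size = FA_CUTE_PAGE_SIZE
    return args


args = apply_backend_constraints(
    ToyServerArgs(prefill_attention_backend="fa-cute", page_size=16)
)
print(args.page_size)  # 128
```

Overriding the value in one place keeps the backend and the KV cache allocator in agreement, at the cost of silently replacing a page size the user may have set explicitly; logging the override is a common mitigation.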

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request integrates Flash Attention's implementation in CuTe DSL, adding a new attention backend fa-cute for Blackwell GPUs. The changes are well-structured, including the new backend, supporting kernel code, and necessary updates to the model runner and server arguments. I have a couple of minor suggestions to remove some dead code and a debug print statement to improve code clarity and maintainability.

python/sglang/srt/layers/attention/blackwell_prefill_attention_backend.py

python/sglang/srt/model_executor/cuda_graph_runner.py

yiakwy-xpu-ml-framework-team · 2025-08-21T08:18:30Z

Hi @hyhieu, really nice job integrating the new attention backend. Do we have any performance benchmarks against the latest Triton (triton_kernels) and CUTLASS implementations?

root and others added 22 commits August 11, 2025 00:19

Blackwell files.

c1d6993

Can trigger the bad assertion

ca7d206

ref impl

fc6ccee

now require FlashAttentionForwardSm100

84a7ccb

copied in the code from flash-attn

30af6a2

fix padding

2f47cc6

can pass unit tests

dec1b18

fix garbage collector

77a2128

separate test code into a new file. with this, no need to hack the ji…

109c51d

…t_executor.py

fixed imports

7097820

sync

2df4547

Merge branch 'main' into hieu/fa4

2e2e967

paged

a4fe336

kind of understood what paged is doing

1d5acc7

passed all paged tests

ab1db6e

comment and format

8faccb1

more tests

a836a86

trying. still fail because of shape mismatch

9ff07a1

non-cuda graph case works.

84367d6

cuda graph. why doesnt this fail?

7cc9d52

all work.

de4a470

fa4 -> fa-cute.

4c2f68c

hyhieu requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock, HaiShaw, ch-wan and BBuf as code owners August 21, 2025 05:02

hyhieu requested review from kushanam and Edwardf0t1 as code owners August 21, 2025 05:02

gemini-code-assist bot reviewed Aug 21, 2025

python/sglang/srt/layers/attention/blackwell_prefill_attention_backend.py (outdated)

python/sglang/srt/model_executor/cuda_graph_runner.py (outdated)

address Gemini comments

aca9688

zhyncs self-assigned this Aug 21, 2025

zhyncs added the high priority label Aug 21, 2025

zhyncs changed the title from "FA cute." to "FA cute" Aug 21, 2025

zhyncs assigned ispobock and Qiaolin-Yu Aug 21, 2025
