[new feat] ascend backend support fia fusion kernel #8328

ZhengdQin · 2025-07-24T14:03:54Z

Motivation

In this PR, we use the NPU fusion kernel npu_fused_infer_attention_score (FIA) in the Qwen2.5-7b, deepseek-v2-lite and deepseek-v3 models. This fusion kernel is suitable for graph mode. To activate it, export ASCEND_USE_FIA=true.

Modifications

Ascend Backend: support npu_fused_infer_attention_score kernel
Add unittest: test_ascend_tp_fia_bf16.py and test_ascend_mla_fia_w8a8int8.py

Only paged attention is supported, and currently ONLY with page size 128.
You can activate this feature by setting export ASCEND_USE_FIA=true and passing --attention-backend ascend.
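
The exact spelling of the value matters, because boolean environment flags are typically compared against a small set of accepted strings. A minimal sketch, assuming a parser that accepts only `true` or `1` (case-insensitive); `read_bool_env` is a hypothetical helper for illustration and is not taken from this PR:

```python
import os


def read_bool_env(name: str, default: str = "false") -> bool:
    """Hypothetical helper: only 'true' or '1' (any case) count as enabled."""
    value = os.environ.get(name, default)
    return value.strip().lower() in ("true", "1")


# Correctly spelled value: the flag is recognized.
os.environ["ASCEND_USE_FIA"] = "true"
fia_enabled = read_bool_env("ASCEND_USE_FIA")

# Misspelled value: under this assumption the flag is silently treated as off.
os.environ["ASCEND_USE_FIA"] = "ture"
fia_enabled_with_typo = read_bool_env("ASCEND_USE_FIA")
```

Under that assumption a misspelled value does not raise an error; the feature simply stays disabled.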

Memory Management Advancement

We modify the AscendMLAPagedTokenToKVPool class to split the KV buffer into k_buffer and v_buffer, which removes the split op in MLA attention.
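
The idea can be sketched in plain Python. This is an illustration only, not the pool's real code: the class names, the tiny dimensions, and the assumption that a fused row stores a latent part followed by a rope part are all hypothetical.

```python
KV_LORA_RANK = 4      # illustrative sizes, not the model's real dimensions
QK_ROPE_HEAD_DIM = 2


class CombinedPool:
    """Hypothetical pool: one fused row per token, split on every read."""

    def __init__(self, num_tokens: int) -> None:
        width = KV_LORA_RANK + QK_ROPE_HEAD_DIM
        self.kv_buffer = [[0.0] * width for _ in range(num_tokens)]

    def write(self, loc: int, latent: list, rope: list) -> None:
        self.kv_buffer[loc] = list(latent) + list(rope)

    def read(self, loc: int) -> tuple:
        row = self.kv_buffer[loc]
        # The split is paid on every attention call.
        return row[:KV_LORA_RANK], row[KV_LORA_RANK:]


class SplitPool:
    """Hypothetical pool: two buffers, so reads need no split."""

    def __init__(self, num_tokens: int) -> None:
        self.k_buffer = [[0.0] * KV_LORA_RANK for _ in range(num_tokens)]
        self.v_buffer = [[0.0] * QK_ROPE_HEAD_DIM for _ in range(num_tokens)]

    def write(self, loc: int, latent: list, rope: list) -> None:
        self.k_buffer[loc] = list(latent)
        self.v_buffer[loc] = list(rope)

    def read(self, loc: int) -> tuple:
        return self.k_buffer[loc], self.v_buffer[loc]


latent, rope = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
combined, split = CombinedPool(8), SplitPool(8)
combined.write(3, latent, rope)
split.write(3, latent, rope)
same_result = combined.read(3) == split.read(3)
```

Both pools return the same two parts; the second one moves the work from read time to write time.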

Testing Framework

Unit tests for the Ascend attention backend have been added and can be found at: /test/srt/ascend/test_ascend_tp_fia_bf16.py and /test/srt/ascend/test_ascend_mla_fia_w8a8int8.py

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Accuracy and performance result

python -m unittest test_npu_mla_backend.TestNpuMlaBackend.test_gsm8k

Pre-commit check

gemini-code-assist

Summary of Changes

Hello @ZhengdQin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the SGLang framework's hardware compatibility by introducing comprehensive support for DeepSeek-v3 models on Ascend NPU devices. It achieves this through the integration of a new NPU-optimized attention backend, custom fusion kernels for key computational and communication patterns, and specialized memory management and quantization techniques tailored for NPU architecture. The changes aim to unlock high-performance inference for DeepSeek-v3 on NPU, particularly with dynamic w8a8 precision.

Highlights

NPU DeepSeek-v3 Support: Implemented a new npumla attention backend specifically for DeepSeek-v3 models on Ascend NPU devices, enabling dynamic w8a8 precision and leveraging 16 NPUs.
Custom NPU Fusion Kernels: Introduced a suite of NPU-optimized fusion kernels for attention, quantization, and Mixture-of-Experts (MoE) operations, including _npu_fused_infer_attention_score, _npu_dequant_swiglu_quant, npu_grouped_matmul, npu_moe_distribute_combine, npu_moe_distribute_dispatch, npu_moe_re_routing, npu_moe_init_routing_v2, npu_moe_finalize_routing, npu_add_rms_norm, npu_dynamic_quant, npu_quant_matmul, and npu_moe_gating_top_k.
Enhanced MoE Dispatching: Developed NpuDeepEPMoE and NpuDeepEPDispatcher to optimize MoE operations on NPU, utilizing specialized NPU kernels for routing, dispatch, and combine phases, including support for dynamic quantization scales.
Paged Attention and Memory Management: Integrated paged attention support for NPU with a fixed page size of 128, and introduced MLATokenToKVPool for efficient KV cache buffer management on NPU, including NPU-specific allocation kernels.
W8A8 Int8 Quantization: Extended W8A8 Int8 quantization support to NPU, including dynamic quantization and NPU-specific handling for RMSNorm and quantized matrix multiplications.
Comprehensive Testing: Added new unit tests (test_npu_mla_backend.py) to validate the functionality and performance of the NPU MLA backend, specifically for DeepSeek-v3 with w8a8 quantization on NPU.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The code changes introduce the ability to use Ascend NPU for the deepseek-v3 model. I've identified a few issues, including one critical bug that would prevent the code from running, some high-severity issues related to code correctness, and several medium-severity issues concerning code clarity and maintainability. Addressing these will improve the quality and robustness of the new NPU support.

gemini-code-assist · 2025-07-24T14:07:08Z

python/sglang/srt/layers/attention/npumla_backend.py

+            self.attn_backends.append(
+                NpuMLABackend(
+                    model_runner,
+                    skip_prefill=True,
+                    kv_indptr_buf=self.kv_indptr[i],
+                    kv_last_page_len_buf=None,
+                )
+            )


The NpuMLABackend constructor is being called with kv_indptr_buf and kv_last_page_len_buf arguments, but its __init__ method does not accept them. This will raise a TypeError at runtime.

The __init__ signature for NpuMLABackend is __init__(self, model_runner: ModelRunner, skip_prefill: bool = False).

Please remove the extra arguments from the call to NpuMLABackend.

            self.attn_backends.append(
                NpuMLABackend(
                    model_runner,
                    skip_prefill=True,
                )
            )
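
The failure mode described in this comment is ordinary Python behaviour and can be reproduced without any NPU code. A minimal sketch, where `DemoBackend` is a hypothetical stand-in that only copies the signature quoted above:

```python
class DemoBackend:
    """Hypothetical stand-in using the signature quoted in the review."""

    def __init__(self, model_runner, skip_prefill: bool = False) -> None:
        self.model_runner = model_runner
        self.skip_prefill = skip_prefill


def try_construct(**kwargs) -> str:
    """Return 'ok' or the TypeError message raised by the constructor."""
    try:
        DemoBackend("runner", **kwargs)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


# Passing a keyword the constructor does not declare fails at call time.
with_extra = try_construct(skip_prefill=True, kv_indptr_buf=None)

# Dropping the extra keyword, as suggested, lets the call succeed.
without_extra = try_construct(skip_prefill=True)
```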

gemini-code-assist · 2025-07-24T14:07:08Z

python/sglang/srt/utils.py

+def support_triton(attn_backend: str) -> bool:
+    return attn_backend not in ["torch_native", "intel_amx", "npumla"]


The function support_triton is redefined here. It was previously defined at line 199. This can lead to unexpected behavior depending on import order. The old implementation should be removed, and this new one should be located where the old one was to maintain code organization.

The old implementation is:
return backend not in ["torch_native", "intel_amx", "ascend"]

The new implementation correctly includes "npumla". Please consolidate the definitions.
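
Why a duplicate definition is risky can be shown in isolation: in one module, the later `def` silently replaces the earlier one. The sketch below reuses the two bodies quoted in this comment; `support_triton_merged` is a hypothetical name for a consolidated version that keeps both backend names.

```python
def support_triton(backend: str) -> bool:
    # Earlier definition, as quoted in the review.
    return backend not in ["torch_native", "intel_amx", "ascend"]


ascend_before = support_triton("ascend")


def support_triton(attn_backend: str) -> bool:  # noqa: F811 (deliberate redefinition)
    # Later definition: it replaces the earlier one for all subsequent callers.
    return attn_backend not in ["torch_native", "intel_amx", "npumla"]


ascend_after = support_triton("ascend")


def support_triton_merged(backend: str) -> bool:
    # Hypothetical consolidated version listing both backend names once.
    return backend not in ["torch_native", "intel_amx", "ascend", "npumla"]
```

With only the later definition in effect, `"ascend"` is no longer excluded, which is the kind of silent behaviour change the comment warns about.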

gemini-code-assist · 2025-07-24T14:07:08Z

python/sglang/srt/layers/attention/npumla_backend.py

+            if False:
+                block_kv_indices = self.indices_updater_decode.update(
+                    forward_batch.req_pool_indices,
+                    forward_batch.seq_lens,
+                    forward_batch.seq_lens_sum,
+                    init_metadata_replay=False,
+                )


This if False: block contains dead code. It seems to be a placeholder or a disabled feature. To improve code clarity and maintainability, please remove this unreachable block.

            else:
                max_seqlen_pad = (
                    forward_batch.seq_lens.max().item() + PAGE_SIZE - 1
                ) // PAGE_SIZE
                block_kv_indices = torch.full(
                    (bs, max_seqlen_pad),
                    -1,
                    dtype=torch.int32,
                    device=forward_batch.seq_lens.device,
                )
                create_flashmla_kv_indices(
                    bs,
                    self.req_to_token,
                    forward_batch.req_pool_indices,
                    forward_batch.seq_lens,
                    None,
                    block_kv_indices,
                    self.req_to_token.stride(0),
                    max_seqlen_pad,
                )
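
The padded page table that this code fills can be sketched in plain Python. This is a hedged illustration, not the PR's implementation: `build_block_table` is a hypothetical function, and it assumes each request's tokens occupy whole, page-aligned blocks of the pool, so the page id is the first slot of each page divided by the page size.

```python
PAGE_SIZE = 128


def build_block_table(req_to_token, req_pool_indices, seq_lens, page_size=PAGE_SIZE):
    """Return a (batch, max_pages) table of page ids, padded with -1."""
    # Ceiling division: pages needed by the longest sequence in the batch.
    max_pages = (max(seq_lens) + page_size - 1) // page_size
    table = [[-1] * max_pages for _ in seq_lens]
    for row, (req_idx, seq_len) in enumerate(zip(req_pool_indices, seq_lens)):
        num_pages = (seq_len + page_size - 1) // page_size
        for page in range(num_pages):
            first_slot = req_to_token[req_idx][page * page_size]
            table[row][page] = first_slot // page_size
    return table


# Request 0 holds 130 tokens: a full page at slots 256..383, then two slots of page 5.
# Request 1 holds 5 tokens at the start of page 0.
req_to_token = [
    list(range(256, 384)) + [640, 641],
    [0, 1, 2, 3, 4],
]
block_table = build_block_table(req_to_token, [0, 1], [130, 5])
```

Rows shorter than the longest sequence keep the `-1` padding, matching the `torch.full(..., -1, ...)` initialisation in the suggestion above.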

gemini-code-assist · 2025-07-24T14:07:08Z

python/sglang/srt/layers/attention/npumla_backend.py

+        if False:
+            attn_weights = (
+                torch.matmul(q, k.transpose(1, 2)) * layer.scaling
+            )  # (bs, n, n)
+            # assert attention_mask is not None
+            # if attention_mask is not None:
+            #     attn_weights += attention_mask
+
+            attn_weights = torch.nn.functional.softmax(
+                attn_weights, dim=-1, dtype=torch.float32
+            ).to(q.dtype)
+
+            # v = v[..., :self.kv_lora_rank]
+            attn_ouput = torch.matmul(attn_weights, v)  # (bs, n, v_dim)
+            # attn_ouput = attn_ouput.transpose(1,2).contiguous()


This if False: block appears to contain a reference implementation and is currently dead code. Please remove it to keep the codebase clean and reduce confusion.

        else:
            bs = forward_batch.batch_size
            if use_gqa:
                attn_ouput = torch.empty(
                    bs_qlen, q_heads, v_dim, device=q.device, dtype=q.dtype
                )
                q_len_offset = 0
                for q_len in forward_batch.seq_len:
                    attn_ouput[q_len_offset : q_len_offset + q_len] = (
                        torch.ops.npu.npu_fused_infer_attention_score(
                            q[None, q_len_offset : q_len_offset + q_len],
                            k[None, q_len_offset : q_len_offset + q_len],
                            v[None, q_len_offset : q_len_offset + q_len],
                            num_heads=q_heads,
                            num_key_value_heads=k_heads,
                            input_layout="BSND",  # todo, TND not supports q_heads!=k_heads
                            atten_mask=self.attn_mask.unsqueeze(0),
                            sparse_mode=3,
                            scale=layer.scaling,
                            next_tokens=0,
                        )[0]
                    )
                    q_len_offset += q_len
            else:  # MHA
                if q_dim != v_dim:
                    q_nope, q_rope = q.split(
                        [self.v_head_dim, self.qk_rope_head_dim], dim=-1
                    )
                    k_nope, k_rope = k.split(
                        [self.v_head_dim, self.qk_rope_head_dim], dim=-1
                    )
                    attn_ouput, _ = torch.ops.npu.npu_fused_infer_attention_score(
                        q_nope,
                        k_nope,
                        v,
                        query_rope=q_rope,
                        key_rope=k_rope,
                        num_heads=q_heads,
                        input_layout="TND",
                        atten_mask=self.attn_mask,
                        sparse_mode=3,
                        actual_seq_lengths=self.forward_metadata.seq_lens_list_cumsum,
                        actual_seq_lengths_kv=self.forward_metadata.seq_lens_list_cumsum,
                        scale=layer.scaling,
                        next_tokens=0,
                    )
                else:
                    attn_ouput, _ = torch.ops.npu.npu_fused_infer_attention_score(
                        q,
                        k,
                        v,
                        num_heads=q_heads,
                        input_layout="TND",
                        atten_mask=self.attn_mask,
                        sparse_mode=3,
                        actual_seq_lengths=self.forward_metadata.seq_lens_list_cumsum,
                        actual_seq_lengths_kv=self.forward_metadata.seq_lens_list_cumsum,
                        scale=layer.scaling,
                        next_tokens=0,
                    )
                    attn_ouput = attn_ouput[..., : layer.v_head_dim]
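
The suggestion relies on two ways of addressing sequences packed along one token axis: a running offset (the `q_len_offset` loop) and cumulative lengths (`seq_lens_list_cumsum`). A minimal sketch of the relationship, assuming the cumulative list holds the end position of each sequence; the helper names are hypothetical:

```python
from itertools import accumulate


def cumulative_seq_lens(seq_lens):
    """Running totals: entry i is where sequence i ends in the packed axis."""
    return list(accumulate(seq_lens))


def split_packed_tokens(tokens, seq_lens):
    """Recover per-sequence slices from a packed token list."""
    ends = cumulative_seq_lens(seq_lens)
    starts = [0] + ends[:-1]
    return [tokens[start:end] for start, end in zip(starts, ends)]


seq_lens = [3, 1, 2]
packed = ["a0", "a1", "a2", "b0", "c0", "c1"]
cumsum = cumulative_seq_lens(seq_lens)
per_sequence = split_packed_tokens(packed, seq_lens)
```

Passing the cumulative list once lets a kernel locate every sequence boundary in a single call, whereas the offset loop issues one call per sequence.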

gemini-code-assist · 2025-07-24T14:07:08Z

python/sglang/srt/layers/attention/npumla_backend.py

+            # k_nope, k_rope = k_cache.split(
+            #     [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
+            # )  # todo, there is a bug


This commented-out code includes a todo about a bug. Since it's commented out and seems to be replaced by the following lines, it should be removed to avoid confusion and keep the code clean.

gemini-code-assist · 2025-07-24T14:07:09Z

python/sglang/srt/layers/attention/npumla_backend.py

+            if False:
+                req_to_token = forward_batch.req_to_token_pool.req_to_token
+                attn_output = q_nope.new_zeros((b, s, n, k_dim))
+                for seq_idx in range(b):
+                    seq_len_kv = forward_batch.seq_len[seq_idx]
+                    req_pool_idx = forward_batch.req_pool_indices[seq_idx]
+                    per_req_tokens = req_to_token[req_pool_idx, :seq_len_kv]
+                    k = k_cache[per_req_tokens]
+                    v = v_cache[per_req_tokens]
+                    attn_ouput_idx, _ = torch.ops.npu.npu_fused_infer_attention_score(
+                        q_nope[seq_idx].unsqueeze(0),
+                        k.unsqueeze(0),
+                        v.unsqueeze(0),
+                        num_heads=n,
+                        num_key_value_heads=k_heads,
+                        input_layout="BSND",
+                        atten_mask=None,
+                        sparse_mode=0,
+                        scale=layer.scaling,
+                    )
+                    attn_ouput[seq_idx] = attn_ouput_idx.squeeze(0)


This if False: block appears to contain a reference implementation and is currently dead code. Please remove it to improve readability and maintainability.

            else:
                seq_len_kv = forward_batch.seq_lens
                attn_ouput, _ = torch.ops.npu.npu_fused_infer_attention_score(
                    q_nope,
                    k_cache.view(-1, PAGE_SIZE, k_heads * k_dim),
                    v_cache.view(-1, PAGE_SIZE, k_heads * k_dim),
                    num_heads=n,
                    num_key_value_heads=k_heads,
                    input_layout="BSND",
                    atten_mask=None,
                    block_size=PAGE_SIZE,
                    block_table=self.forward_metadata.block_kv_indices,
                    actual_seq_lengths_kv=self.forward_metadata.seq_lens_list,
                    scale=layer.scaling,
                )
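
The paged call above views the cache as pages of `PAGE_SIZE` tokens and passes a block table instead of gathering tokens per request. How a logical token position maps to a cache slot under that layout can be sketched as follows; `locate_token` and `flat_slot` are hypothetical helpers, and the mapping is the usual paged-attention convention rather than something stated in this PR:

```python
PAGE_SIZE = 128


def locate_token(block_table_row, position, page_size=PAGE_SIZE):
    """Map a token position within a sequence to (page id, offset in page)."""
    page_slot, offset = divmod(position, page_size)
    page_id = block_table_row[page_slot]
    if page_id < 0:
        raise IndexError(f"position {position} falls in a padded table entry")
    return page_id, offset


def flat_slot(block_table_row, position, page_size=PAGE_SIZE):
    """Index of the token in the flat (un-paged) cache."""
    page_id, offset = locate_token(block_table_row, position, page_size)
    return page_id * page_size + offset


row = [2, 5, -1]            # two real pages, one padding entry
first = flat_slot(row, 0)   # start of page 2
later = flat_slot(row, 129) # second token of page 5
```

Because the lookup is only an index computation, the kernel can read each request's pages in place, which is what makes the per-request gather in the disabled branch unnecessary.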

gemini-code-assist · 2025-07-24T14:07:09Z

python/sglang/srt/layers/moe/ep_moe/layer.py

@@ -53,6 +56,7 @@
 _is_npu = is_npu()
 _is_fp8_fnuz = is_fp8_fnuz()
 _use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip
+_is_npu = is_npu()


The variable _is_npu is defined twice in this file. This line is redundant as it's already defined on line 56. Please remove this duplicate definition to avoid confusion.

gemini-code-assist · 2025-07-24T14:07:09Z

python/sglang/srt/layers/quantization/w8a8_int8.py

@@ -43,6 +43,7 @@
 _is_cuda = is_cuda()
 _is_cpu_amx_available = cpu_has_amx_support()
 _is_cpu = is_cpu()
+_is_npu = is_npu()
 if _is_cuda:
    from sgl_kernel import int8_scaled_mm
 _is_npu = is_npu()


The variable _is_npu is defined twice. This line is redundant as it's already defined on line 46. Please remove this duplicate definition.

gemini-code-assist · 2025-07-24T14:07:09Z

python/sglang/srt/layers/rotary_embedding.py

+        if (
+            1
+        ):  # get_bool_env_var("SGLANG_ENABLE_TORCH_COMPILE"): TODO: Not support fusion ops
            return self.forward_native(positions, query, key, offsets)


Using if (1): to temporarily disable a code path is unconventional and can be confusing. For better readability and to adhere to common Python style, please use if True: instead. If this is a temporary change for debugging, it should be reverted to the original logic or removed before merging.

Suggested change

-        if (
-            1
-        ):  # get_bool_env_var("SGLANG_ENABLE_TORCH_COMPILE"): TODO: Not support fusion ops
+        if (
+            True
+        ):  # get_bool_env_var("SGLANG_ENABLE_TORCH_COMPILE"): TODO: Not support fusion ops
             return self.forward_native(positions, query, key, offsets)

gemini-code-assist · 2025-07-24T14:07:09Z

python/sglang/srt/models/deepseek_v2.py

 _is_fp8_fnuz = is_fp8_fnuz()
 _use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip
 _is_cpu_amx_available = cpu_has_amx_support()
 _is_cpu = is_cpu()
 _device_sm = get_device_sm()
+_is_npu = is_npu()


The variable _is_npu is defined twice. This line is redundant as it's already defined on line 119. Please remove this duplicate definition to improve code clarity.

iforgetmyname · 2025-07-25T04:41:34Z

Hi mate, thank you for your contribution to helping sglang run well on Ascend NPU. However, we have a better support plan and would like to discuss it with you. Please contact zl19940307@163.com if you are interested.

ZhengdQin · 2025-07-28T06:15:19Z

> Hi mate, thank you for your contribution to helping sglang run well on Ascend NPU. However, we have a better support plan and would like to discuss it with you. Please contact zl19940307@163.com if you are interested.

Hi there, in this PR, we have provided a complete "best practice" for the DeepSeek network on NPU, and submitted a CI accuracy baseline. We look forward to future communications and joint contributions.

Alcanderian · 2025-07-29T11:12:18Z

python/sglang/srt/layers/rotary_embedding.py

+        if _is_npu:
+            device = "npu"


Remove this, because line 644 does the same thing.

Thanks, we have fixed it.

Alcanderian · 2025-07-29T11:14:18Z

python/sglang/srt/layers/attention/utils.py

@@ -92,3 +93,63 @@ def create_flashmla_kv_indices_triton(
            data // PAGED_SIZE,
            mask=mask_out,
        )
+
+
+def create_flashinfer_kv_indices(


We only support the PA scenario and no longer need this function, so we have removed it.

Alcanderian · 2025-07-29T11:14:28Z

python/sglang/srt/layers/attention/utils.py

+        ] = data
+
+
+def create_flashmla_kv_indices(


please add ut

Thanks for the comment. create_npumla_kv_indices is different from create_flashmla_kv_indices_triton: it is implemented with torch native ops, so it can run on CPU and NPU directly, and it only supports a PAGE_SIZE of 128.

Alcanderian · 2025-07-29T11:27:56Z

python/sglang/srt/layers/moe/ep_moe/layer.py

+        n_routed_experts_per_rank=0,
+    ):
+        world_size = get_tensor_model_parallel_world_size()
+        if world_size > 1 and n_routed_experts_per_rank >= 1:


what if world_size = 1 and n_routed_experts_per_rank >= 1?

Thanks, we have fixed it.

Alcanderian · 2025-07-29T11:31:46Z

python/sglang/srt/layers/moe/ep_moe/layer.py

-            params_dtype=params_dtype,
-            weight_loader=self.weight_loader,
-        )
+        kwargs = {


We should modify the keyword arg name of W8A8Int8MoEMethod

Thank you. We will immediately open an issue to resolve this problem, just as we discussed yesterday.

Alcanderian · 2025-07-29T12:05:56Z

python/sglang/srt/mem_cache/memory_pool.py

@@ -766,6 +766,16 @@ def set_mla_kv_buffer_triton(
    )


+def set_mla_kv_buffer_npu(


We have AscendTokenToKVPool

Thanks for the comment. For the npumla backend, using AscendTokenToKVPool is not very convenient, because in MLATokenToKVPool the allocation and usage of KV caches are independent across layers. If we used a contiguous KV cache buffer, additional slice operations might be required.

Alcanderian · 2025-07-29T12:15:55Z

python/sglang/srt/models/deepseek_v2.py

                layer = self.layers[i]
                hidden_states, residual = layer(
                    positions, hidden_states, forward_batch, residual, zero_allocator
                )
+            else:
+                with get_global_expert_distribution_recorder().with_current_layer(i):


revert this change

Thanks, we have reverted this.

Alcanderian · 2025-07-29T12:17:37Z

python/sglang/srt/speculative/eagle_worker.py

@@ -249,6 +249,16 @@ def init_attention_backend(self):
                self.topk,
                self.speculative_num_steps,
            )
+        elif self.server_args.attention_backend == "npumla":


Split into another PR

Thanks, we have reverted the files related to MTP.

Alcanderian · 2025-07-29T12:18:11Z

python/sglang/srt/two_batch_overlap.py

revert the changes of this file

Thanks, we have reverted this file.

Alcanderian · 2025-07-29T12:18:32Z

python/sglang/srt/utils.py

-    )
-except:
-    is_intel_amx_backend_available = False
+    return backend not in ["torch_native", "intel_amx", "ascend", "npumla"]


Thanks, we have fixed it.

ssshinigami · 2025-08-25T12:01:16Z

What is the motivation for these changes? Could you please measure performance with and without FIA? I believe it is slower.

ZhengdQin · 2025-08-25T13:13:51Z

> What is the motivation for these changes? Could you please measure performance with and without FIA? I believe it is slower.

The motivation is:

A later PR will add torch compile support, which needs the FIA kernel.
FIA is faster in some cases, especially when the sequence is very long.
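
A with/without comparison of the kind requested can be structured as a small timing harness. This is a generic sketch, not the measurement used in this PR: `time_callable` and `compare` are hypothetical helpers, and the two workloads are CPU stand-ins. On an accelerator, each timed call would also need a device synchronization before the clock is read, since kernels typically launch asynchronously.

```python
import statistics
import time


def time_callable(fn, *, warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock seconds of fn() after a few warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline, candidate, **kwargs) -> dict:
    """Time both callables under identical settings and report the ratio."""
    base_s = time_callable(baseline, **kwargs)
    cand_s = time_callable(candidate, **kwargs)
    speedup = base_s / cand_s if cand_s > 0 else float("inf")
    return {"baseline_s": base_s, "candidate_s": cand_s, "speedup": speedup}


# Stand-in workloads; real runs would call the two attention paths instead.
def baseline_path():
    return sum(i * i for i in range(20_000))


def candidate_path():
    return sum(i * i for i in range(10_000))


report = compare(baseline_path, candidate_path, warmup=1, repeats=5)
```

Repeating the comparison across several sequence lengths would show whether the long-sequence advantage claimed above holds for a given setup.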

ZhengdQin · 2025-08-26T03:58:59Z

> What is the motivation for these changes? Could you please measure performance with and without FIA? I believe it is slower.

FIA result: sglang/test/srt/ascend/test_ascend_mla_fia_w8a8int8.py (tp == 2)

Original result: sglang/test/srt/ascend/test_ascend_mla_w8a8int8.py (change tp = 2)

Alisehen · 2025-08-26T16:15:43Z

I think there is a small typo in the description: it should be export ASCEND_USE_FIA=true, not ture.

ZhengdQin requested review from merrymercy, Ying1123, zhyncs, rkooo567, kssteven418, ispobock, ByronHsu, zhaochenyang20, hnyls2002, xiezhq-hermann, HaiShaw, ch-wan and BBuf as code owners July 24, 2025 14:03

gemini-code-assist bot reviewed Jul 24, 2025

ZhengdQin force-pushed the npu_dpsk_v3 branch from 50517fc to bcb6918 Compare July 25, 2025 04:09

ZhengdQin force-pushed the npu_dpsk_v3 branch from bcb6918 to e52a745 Compare July 29, 2025 10:16

ZhengdQin requested a review from kushanam as a code owner July 29, 2025 10:16

Alcanderian reviewed Jul 29, 2025

ZhengdQin force-pushed the npu_dpsk_v3 branch from e52a745 to c0717d1 Compare July 30, 2025 18:10

ZhengdQin requested a review from yizhang2077 as a code owner July 30, 2025 18:10

ZhengdQin force-pushed the npu_dpsk_v3 branch 7 times, most recently from fe49ec2 to 900e3d3 Compare August 1, 2025 03:17

ZhengdQin force-pushed the npu_dpsk_v3 branch 2 times, most recently from f22df63 to 5d96e6a Compare August 21, 2025 07:17

ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Aug 21, 2025

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

5d96e6a

ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Aug 21, 2025

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

75d00f1

ZhengdQin force-pushed the npu_dpsk_v3 branch from 5d96e6a to 75d00f1 Compare August 21, 2025 10:05

ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Aug 21, 2025

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

8273eed

ZhengdQin force-pushed the npu_dpsk_v3 branch from 8dfe992 to 8273eed Compare August 21, 2025 12:00

ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Aug 21, 2025

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

15f3f67

ZhengdQin force-pushed the npu_dpsk_v3 branch from 8919ef2 to 15f3f67 Compare August 21, 2025 12:47

ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Aug 22, 2025

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

08c981b

ZhengdQin force-pushed the npu_dpsk_v3 branch from 4adcc81 to 08c981b Compare August 22, 2025 04:11

[new feat] ascend backend support fia fusion kernel; (sgl-project#8328)

aa2e01e

ZhengdQin force-pushed the npu_dpsk_v3 branch from 08c981b to aa2e01e Compare August 22, 2025 04:12

iforgetmyname and others added 2 commits August 22, 2025 14:35

Merge branch 'main' into npu_dpsk_v3

04384cb

Merge branch 'main' into npu_dpsk_v3

c5131db

Alcanderian approved these changes Aug 25, 2025

iforgetmyname approved these changes Aug 25, 2025

Alcanderian added the ready-to-merge (The PR is ready to merge after the CI is green.) and npu labels Aug 25, 2025

sgl-project deleted a comment from ssshinigami Aug 25, 2025

iforgetmyname mentioned this pull request Aug 25, 2025

[Feature] Support NPUGraph for DeepSeek on Ascend NPU #9355

Merged

4 tasks

zhyncs changed the title ~~[new feat] ascend backend support fia fusion kernel;~~ [new feat] ascend backend support fia fusion kernel Aug 25, 2025

zhyncs assigned Alcanderian Aug 25, 2025

zhyncs added the high priority label Aug 25, 2025

Merge branch 'main' into npu_dpsk_v3

73de484

zhyncs merged commit f92b729 into sgl-project:main Aug 26, 2025
97 of 98 checks passed
