[BUG] MLA example is broken for split-kv and larger Q #2274

@divchenko

Describe the bug
Currently the example initializes Q with fairly small values (mean -1, stddev 1). If I initialize Q with somewhat larger values (e.g. stddev 100), split-kv stops working. Larger Q values are typical for LLMs.
It looks like there is an overflow in the LSE computation.
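
To illustrate the kind of overflow suspected here, below is a minimal Python sketch of how per-split LSE values are typically merged in split-kv attention. It is not the code from the CUTLASS example: the function names (`exp_sat`, `combine_naive`, `combine_stable`, `merge_outputs`) are hypothetical, and the assumption is that each split produces its own LSE and a locally normalized output. The naive form exponentiates each LSE directly and saturates to inf once the logits get large, which would be consistent with the `max diff inf` seen for LSE; the stable form subtracts the maximum first.

```python
import math

def exp_sat(x):
    """exp that saturates to +inf instead of raising, mimicking GPU float behavior."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

def combine_naive(lses):
    # Direct log(sum_i exp(lse_i)): overflows as soon as any lse_i is large.
    total = sum(exp_sat(l) for l in lses)
    return math.log(total) if math.isfinite(total) else math.inf

def combine_stable(lses):
    # Subtract the max so every exponent is <= 0 and cannot overflow.
    m = max(lses)
    return m + math.log(sum(math.exp(l - m) for l in lses))

def merge_outputs(lses, outs):
    # Each split's output is rescaled by exp(lse_i - lse_total);
    # the weights are all in [0, 1] and sum to 1.
    lse = combine_stable(lses)
    weights = [math.exp(l - lse) for l in lses]
    return lse, sum(w * o for w, o in zip(weights, outs))

if __name__ == "__main__":
    big = [1000.0, 999.0]          # LSE magnitudes plausible with large Q (assumed values)
    print(combine_naive(big))      # inf
    print(combine_stable(big))     # finite
    print(merge_outputs(big, [2.0, 4.0]))
```

For small LSE values both variants agree, so a bug of this shape would only surface once Q is scaled up. Whether the example's reduction kernel actually has this form is an assumption that needs checking against the source.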

Steps/Code to reproduce bug
Apply the following patch https://gist.github.com/divchenko/10f1991a7a197b706b5c46aaca1a9bd2 to
commit f535c33 (HEAD, tag: v3.9.1).
Then run without split-kv (--split_kv=1); verification passes:

./77_blackwell_mla_2sm_fp16 --verify --split_kv=1
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 1 Gen None #SM 148
 [OK]  128x128 fp16 persistent          : 156.768 TFLOPS/s 1.26077 TB/s
 [OK]  128x128 fp16 individual          : 162.176 TFLOPS/s 1.30426 TB/s

But when split-kv is enabled (--split_kv=2), verification fails for both O and LSE:

./77_blackwell_mla_2sm_fp16 --verify --split_kv=2
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 2 Gen None #SM 148
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 persistent          : 70.6905 TFLOPS/s 0.568513 TB/s
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 individual          : 65.94 TFLOPS/s 0.530308 TB/s

Expected behavior
Verification should also pass for larger Q values when split-kv is enabled.

Environment details
B200
NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8
