[BUG] MLA example is broken for split-kv and larger Q #2274

**Describe the bug**
The example currently initializes Q with fairly small values (mean -1, stddev 1). If I initialize Q with somewhat larger values (e.g. stddev 100), split-kv stops working. Larger Q values are typical for LLMs.
It looks like there is an overflow in the LSE computation.
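
To illustrate the kind of overflow suspected here, below is a minimal sketch in plain Python. It is not the CUTLASS kernel or its reduction code; the function names (`lse`, `merge_lse_naive`, `merge_lse_stable`) and the score values are hypothetical and chosen only for illustration. It assumes that split-kv produces one partial LSE per split and that a final step combines them. Combining by exponentiating the partial LSEs directly overflows once the scores are large, while subtracting the maximum first does not.

```python
import math

def lse(scores):
    """Numerically stable log-sum-exp over a list of attention scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def _exp_or_inf(x):
    """math.exp raises OverflowError for large inputs; map that to +inf."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

def merge_lse_naive(partials):
    """Combine per-split LSEs by exponentiating them directly (can overflow)."""
    total = sum(_exp_or_inf(p) for p in partials)
    return math.log(total)  # math.log(inf) is inf

def merge_lse_stable(partials):
    """Combine per-split LSEs after subtracting their maximum."""
    m = max(partials)
    return m + math.log(sum(math.exp(p - m) for p in partials))

# Hypothetical large scores, as might arise when Q has a large stddev.
scores = [900.0, 905.0, 890.0, 910.0]
full = lse(scores)                          # reference without splitting
partials = [lse(scores[:2]), lse(scores[2:])]  # two KV splits

naive = merge_lse_naive(partials)
stable = merge_lse_stable(partials)
print("reference:", full)
print("naive merge:", naive)
print("stable merge:", stable)
```

In this sketch the naive merge yields `inf` while the stable merge agrees with the unsplit reference. Whether the example's split-kv reduction fails for this exact reason is an assumption, not something confirmed by this report.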

**Steps/Code to reproduce bug**
Apply the patch https://gist.github.com/divchenko/10f1991a7a197b706b5c46aaca1a9bd2 to
commit f535c33634b640a4c0bee131f2f6e9f81877a18c (HEAD, tag: v3.9.1).
Then run without split-kv (`--split_kv=1`); verification passes:

```
./77_blackwell_mla_2sm_fp16 --verify --split_kv=1
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 1 Gen None #SM 148
 [OK]  128x128 fp16 persistent          : 156.768 TFLOPS/s 1.26077 TB/s
 [OK]  128x128 fp16 individual          : 162.176 TFLOPS/s 1.30426 TB/s
```

But when split-kv is enabled (`--split_kv=2`), verification fails:
```
./77_blackwell_mla_2sm_fp16 --verify --split_kv=2
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 2 Gen None #SM 148
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 persistent          : 70.6905 TFLOPS/s 0.568513 TB/s
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 individual          : 65.94 TFLOPS/s 0.530308 TB/s
```

**Expected behavior**
Verification should also pass for larger Q values when split-kv is enabled.

**Environment details**
GPU: B200
NVIDIA-SMI 570.124.06, Driver Version: 570.124.06, CUDA Version: 12.8
