Open
Description
Describe the bug
Currently the example initializes Q with fairly small values (mean -1, stddev 1). If I initialize Q with somewhat larger values (e.g. stddev 100), split-kv stops working. Larger Q values are typical for LLMs.
It looks like there is an overflow in the LSE computation.
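To illustrate the kind of overflow suspected here, below is a minimal Python sketch of how per-split LSE values might be merged in a split-kv reduction. It is not taken from the CUTLASS example: the function names (`combine_lse_naive`, `combine_lse_stable`, `combine_outputs`) and the input values are hypothetical, and whether the kernel actually combines splits this way is an assumption. The naive form exponentiates each partial LSE directly, so it blows up once the logits are large; subtracting the maximum first keeps every exponent at or below zero.

```python
import math

def combine_lse_naive(lse_parts):
    # Hypothetical naive merge: log(sum(exp(lse_i))).
    # exp() overflows for large lse_i; map that to inf, as a GPU would.
    try:
        return math.log(sum(math.exp(x) for x in lse_parts))
    except OverflowError:
        return math.inf

def combine_lse_stable(lse_parts):
    # Stable merge: factor out the maximum so every exponent is <= 0.
    m = max(lse_parts)
    if m == -math.inf:
        return -math.inf  # all splits empty
    return m + math.log(sum(math.exp(x - m) for x in lse_parts))

def combine_outputs(o_parts, lse_parts):
    # Rescale each split's partial output by exp(lse_i - lse_total) and sum.
    lse = combine_lse_stable(lse_parts)
    weights = [math.exp(x - lse) for x in lse_parts]
    dim = len(o_parts[0])
    out = [sum(w * o[d] for w, o in zip(weights, o_parts)) for d in range(dim)]
    return out, lse

# Large partial LSEs, as might arise when Q has stddev 100 (illustrative values).
big = [1000.0, 999.0]
print(combine_lse_naive(big))   # overflows to inf
print(combine_lse_stable(big))  # finite
```

With small inputs both merges agree, which would be consistent with the default initialization passing verification while larger Q values fail only when split-kv is enabled.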
Steps/Code to reproduce bug
Apply the following patch https://gist.github.com/divchenko/10f1991a7a197b706b5c46aaca1a9bd2 to
commit f535c33 (HEAD, tag: v3.9.1)
Then run without split-kv:
./77_blackwell_mla_2sm_fp16 --verify --split_kv=1
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 1 Gen None #SM 148
[OK] 128x128 fp16 persistent : 156.768 TFLOPS/s 1.26077 TB/s
[OK] 128x128 fp16 individual : 162.176 TFLOPS/s 1.30426 TB/s
But when split-kv is enabled, it fails:
./77_blackwell_mla_2sm_fp16 --verify --split_kv=2
###### B 64 MLA H 128 D_rope 64 D_latent 512 Q 1 K 256 Gen None Split 2 Gen None #SM 148
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 persistent : 70.6905 TFLOPS/s 0.568513 TB/s
failed O: max diff 6.26562 mean 1.15391
failed LSE: max diff inf mean inf
Reference check failed
[FAIL] 128x128 fp16 individual : 65.94 TFLOPS/s 0.530308 TB/s
Expected behavior
Verification should pass for larger Q values.
Environment details (please complete the following information):
B200
NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8
Additional context