9 changes: 5 additions & 4 deletions docs/backend/server_arguments.md
@@ -166,10 +166,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s

## Kernel backend

-| Arguments | Description | Defaults |
-|----------|-------------|---------|
-| `attention_backend` | This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend. | None |
-| `sampling_backend` | Specifies the backend used for sampling. | None |
+| Arguments | Description | Defaults |
+|-------------------------|-------------|---------|
+| `attention_backend` | This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend. | None |
+| `sampling_backend` | Specifies the backend used for sampling. | None |
+| `disable_flash_attn_for_mm` | By default, FlashAttention3 is used for all non-causal attention of multimodal transformers, which improves performance but may lead to minor accuracy variations. Set this flag to disable it. | False |

## Constrained Decoding

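For reference, a launch command exercising the arguments documented in this table might look like the sketch below. The model path and the sampling backend value are illustrative assumptions, and the dashed spelling of the new flag assumes the usual underscore-to-dash conversion applied to server arguments.

```bash
# Illustrative only: launch an SGLang server with explicit kernel backend choices.
# The model path is a placeholder; `fa3` comes from the attention_backend options
# listed in the table, while `flashinfer` for sampling_backend is an assumption.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --attention-backend fa3 \
  --sampling-backend flashinfer \
  --disable-flash-attn-for-mm
```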