Commit da0f6a2

Update spec decoding
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
1 parent bc79df2 commit da0f6a2

File tree: 3 files changed, +30 -12 lines

- docs/inference-optimization/img/llm-inference-ttft-latency.png (448 KB, new binary image)
- docs/inference-optimization/llm-inference-metrics.md
- docs/inference-optimization/speculative-decoding.md
docs/inference-optimization/llm-inference-metrics.md (2 additions & 0 deletions)
@@ -35,6 +35,8 @@ Key metrics to measure latency:
```

Total latency directly affects perceived responsiveness. A fast TTFT followed by slow token generation still leads to a poor experience.
+
+![llm-inference-ttft-latency.png](./img/llm-inference-ttft-latency.png)

Acceptable latency depends on the use case. For example, a chatbot might require a TTFT under 500 milliseconds to feel responsive, while a code completion tool may need TTFT below 100 milliseconds for a seamless developer experience. In contrast, if you're generating long reports that are reviewed once a day, then even a 30-second total latency may be perfectly acceptable. The key is to match latency targets to the pace and expectations of the task at hand.

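Not part of the commit above, but as an illustration of measuring these metrics: the sketch below times TTFT and total latency against an OpenAI-compatible streaming endpoint. The `base_url`, model name, and API key are placeholder assumptions; any client that streams tokens can be timed the same way.

```python
# Sketch: measure TTFT and total latency for one streaming request.
# base_url, model name, and api_key are placeholders for your own endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first generated token arrives
end = time.perf_counter()

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Total latency: {(end - start) * 1000:.0f} ms")
```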
docs/inference-optimization/speculative-decoding.md (28 additions & 12 deletions)
@@ -12,17 +12,27 @@ import LinkList from '@site/src/components/LinkList';

# Speculative decoding

-Speculative decoding is an inference-time optimization that speeds up autoregressive generation by combining a fast “draft” model with the target model.
+LLMs are powerful, but their text generation is slow. The main bottleneck lies in autoregressive decoding, where tokens are generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. And while GPUs are optimized for parallelism, this sequential nature leaves compute resources underutilized during inference.

-The core drivers behind this approach:
+What if you could parallelize parts of the generation process, even if not all of it?

-- Some tokens are easier to predict than others and can be handled by a smaller model.
-- In LLM decoding, a sequential token-by-token generation process, the main bottleneck is memory bandwidth, not compute. Speculative decoding leverages spare compute capacity (due to underutilized parallelism in accelerators) to predict multiple tokens at once.
+That’s where speculative decoding comes in.

-The roles of the two models:
+## What is speculative decoding?

-- **Draft model**: A smaller, faster model (like a distilled version of the target model) proposes a draft sequence of tokens.
-- **Target model**: The main model verifies the draft’s tokens and decides which to accept.
+Speculative decoding is an inference-time optimization that combines two models:
+
+- **Draft model:** A smaller, faster model (like a distilled version of the target model) proposes a draft sequence of tokens. A core driver behind this is that some tokens are easier to predict than others and can be handled by a smaller model.
+- **Target model:** The original, larger model verifies all of the draft’s tokens at once and decides which to accept.
+
+The draft model delivers fast guesses, and the target model ensures accuracy. This shifts the generation loop from purely sequential to partially parallel, improving hardware utilization and reducing latency.
+
+Two key metrics in speculative decoding:
+
+- **Acceptance rate**: The proportion of draft tokens accepted by the target model. A low acceptance rate limits the speedup and can become a major bottleneck.
+- **Speculative token count**: The number of tokens the draft model proposes at each step. Most inference frameworks let you configure this value when speculative decoding is enabled.
+
+## How it works

Here’s the step-by-step process:

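The hunk above ends where the doc's step-by-step walkthrough begins. As an illustration of that draft-propose-then-verify loop (not taken from this commit), here is a minimal greedy sketch: it accepts the longest prefix where draft and target agree rather than using the rejection-sampling rule from the papers linked in the doc, it skips KV caching, EOS handling, and batching, and the model pairing (distilgpt2 drafting for gpt2-large) is an arbitrary assumption chosen so the snippet runs.

```python
# Minimal, unoptimized sketch of greedy speculative decoding.
# distilgpt2 (draft) and gpt2-large (target) share a tokenizer, which this requires.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids          # (1, L)
    stop_at = ids.shape[1] + max_new_tokens
    while ids.shape[1] < stop_at:
        # 1. Draft model proposes k tokens, one at a time (cheap, no KV cache here).
        draft_ids = ids
        for _ in range(k):
            next_draft = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_draft], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]                # the k drafted tokens

        # 2. Target model scores context + all drafted tokens in ONE forward pass,
        #    giving its greedy prediction at every drafted position plus one extra.
        logits = target(draft_ids).logits
        target_pred = logits[:, ids.shape[1] - 1 :, :].argmax(-1)   # (1, k + 1)

        # 3. Accept the longest prefix where draft and target agree, then append
        #    the target's own next token (already computed, so it comes for free).
        n_accept = 0
        while n_accept < k and proposed[0, n_accept] == target_pred[0, n_accept]:
            n_accept += 1
        ids = torch.cat(
            [ids, proposed[:, :n_accept], target_pred[:, n_accept : n_accept + 1]],
            dim=-1,
        )
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("Speculative decoding speeds up inference by"))
```

Each iteration costs one target forward pass but can emit up to k + 1 tokens, which is where the speedup comes from when the acceptance rate is high.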
@@ -34,17 +44,23 @@ Here’s the step-by-step process:

![spec-decoding.png](./img/spec-decoding.png)

-Key benefits of speculative decoding:
+## Benefits and limitations
+
+Key benefits of speculative decoding include:
+
+- **Parallel verification:** Since verification doesn’t depend on previous verifications, it’s faster than generation (which is sequential).
+- **High acceptance for easy tokens:** The draft model can often get the next few tokens correct, which speeds up generation.
+- **Better use of hardware:** Because verification uses hardware resources that would otherwise be idle, overall throughput improves.

-- **Parallel verification**: Since verification doesn’t depend on previous verifications, it’s faster than generation (which is sequential).
-- **High acceptance for easy tokens**: The draft model can often get the next few tokens correct, which speeds up generation.
-- **Better use of hardware**: Because verification uses hardware resources that would otherwise be idle, overall throughput improves.
+However, speculative decoding has its own costs:

-However, speculative decoding has its own costs. Because both the draft model and the target model need to be loaded into memory, it increases overall VRAM usage. This reduces the available memory for other tasks (e.g., batch processing), which can limit throughput, especially under high load or when serving large models.
+- **Increased memory usage**: Both the draft model and the target model need to be loaded into memory, which increases overall VRAM usage. This reduces the memory available for other tasks (e.g., batch processing) and can limit throughput, especially under high load or when serving large models.
+- **Wasted compute on rejection**: If many draft tokens are rejected (low acceptance rate), compute is wasted on both drafting and verification.

<LinkList>
## Additional resources
* [Looking back at speculative decoding](https://research.google/blog/looking-back-at-speculative-decoding/)
+* [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/pdf/2401.15077)
* [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192)
* [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
* [Blockwise Parallel Decoding for Deep Autoregressive Models](https://arxiv.org/abs/1811.03115)
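The new “Speculative token count” bullet notes that most inference frameworks expose this setting. For illustration only (not part of the commit), this is roughly what enabling speculative decoding with a separate draft model looks like in vLLM; the model names are placeholders and the exact parameter names depend on the vLLM version (newer releases group them under a `speculative_config` dict), so check the docs for your release.

```python
# Sketch: speculative decoding in vLLM with a smaller draft model.
# Model names are placeholders; argument names follow older vLLM releases and
# may differ in newer versions, which use a `speculative_config` dict instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",               # target model
    speculative_model="facebook/opt-125m",   # draft model
    num_speculative_tokens=5,                # speculative token count per step
)

outputs = llm.generate(
    ["Speculative decoding speeds up LLM inference by"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

A larger speculative token count gives a bigger potential speedup per target pass but wastes more work when acceptance is low, which is why frameworks expose it as a tunable.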

0 commit comments
