docs: Update spec decoding #20

Merged · 1 commit · Jul 22, 2025

2 changes: 2 additions & 0 deletions docs/inference-optimization/llm-inference-metrics.md
@@ -35,6 +35,8 @@ Key metrics to measure latency:
```

Total latency directly affects perceived responsiveness. A fast TTFT followed by slow token generation still leads to a poor experience.

![llm-inference-ttft-latency.png](./img/llm-inference-ttft-latency.png)

Acceptable latency depends on the use case. For example, a chatbot might require a TTFT under 500 milliseconds to feel responsive, while a code completion tool may need a TTFT below 100 milliseconds for a seamless developer experience. In contrast, if you're generating long reports that are reviewed once a day, even a 30-second total latency may be perfectly acceptable. The key is to match latency targets to the pace and expectations of the task at hand.
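
One practical way to check these targets is to time a streaming request: TTFT is the delay until the first streamed token arrives, and total latency is the time until the stream ends. The sketch below assumes an OpenAI-compatible endpoint served locally; the base URL, model name, and prompt are placeholders, not part of the original text.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
ttft = None
chunks = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a one-paragraph summary of LLM latency metrics."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks += 1

total = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s | total latency: {total:.3f}s | streamed chunks: {chunks}")
```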

40 changes: 28 additions & 12 deletions docs/inference-optimization/speculative-decoding.md
@@ -12,17 +12,27 @@ import LinkList from '@site/src/components/LinkList';

# Speculative decoding

Speculative decoding is an inference-time optimization that speeds up autoregressive generation by combining a fast “draft” model with the target model.
LLMs are powerful, but their text generation is slow. The main bottleneck lies in autoregressive decoding, where each token is generated one at a time. This sequential loop causes high latency, as each step depends on the previous token. Moreover, although GPUs are optimized for parallelism, this one-token-at-a-time process leaves much of their compute underutilized during inference.

The core drivers behind this approach:
What if you could parallelize parts of the generation process, even if not all of it?

- Some tokens are easier to predict than others and can be handled by a smaller model.
- In LLM decoding, a sequential token-by-token generation process, the main bottleneck is memory bandwidth, not compute. Speculative decoding leverages spare compute capacity (due to underutilized parallelism in accelerators) to predict multiple tokens at once.
That’s where speculative decoding comes in.

The roles of the two models:
## What is speculative decoding?

- **Draft model**: A smaller, faster model (like a distilled version of the target model) proposes a draft sequence of tokens.
- **Target model**: The main model verifies the draft’s tokens and decides which to accept.
Speculative decoding is an inference-time optimization that combines two models:

- **Draft model:** A smaller, faster model (such as a distilled version of the target model) proposes a draft sequence of tokens. A core driver behind this is that some tokens are easier to predict than others and can be handled reliably by a smaller model.
- **Target model:** The original, larger model verifies the drafted tokens in a single pass and decides which to accept.

The draft model delivers fast guesses, and the target model ensures accuracy. This method helps shift the generation loop from purely sequential to partially parallel, improving hardware utilization and reducing latency.

Two key metrics in speculative decoding:

- **Acceptance rate**: The fraction of drafted tokens that the target model accepts. A low acceptance rate limits the speedup and can become the main bottleneck.
- **Speculative token count**: The number of tokens the draft model proposes at each step. Most inference frameworks let you configure this value when speculative decoding is enabled (see the sketch below).
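
To make this concrete, here is a minimal sketch of enabling speculative decoding in vLLM's offline `LLM` API. The model pair is only an example, and the exact argument names have changed across vLLM releases (newer versions take a `speculative_config` dict), so treat this as illustrative rather than copy-paste configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative only: argument names vary across vLLM versions.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",             # target model (example)
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # draft model (example)
    num_speculative_tokens=5,                              # speculative token count per step
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```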

## How it works

Here’s the step-by-step process:

@@ -34,17 +44,23 @@

![spec-decoding.png](./img/spec-decoding.png)
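
To make the loop concrete, below is a minimal greedy sketch of the propose-and-verify cycle written with Hugging Face `transformers`. The model pair (`distilgpt2` drafting for `gpt2`) and the `speculative_step` helper are illustrative choices, not from the original text; the sketch skips KV caching and the rejection-sampling rule needed for non-greedy decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # small, fast draft model
target = AutoModelForCausalLM.from_pretrained("gpt2")       # larger target model
K = 4  # speculative token count: tokens drafted per step

@torch.no_grad()
def speculative_step(ids):
    # 1. Draft model proposes K tokens, one at a time (cheap because the model is small).
    draft_ids = ids
    for _ in range(K):
        next_id = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    proposed = draft_ids[:, ids.shape[1]:]

    # 2. Target model scores the whole draft in a single forward pass.
    logits = target(draft_ids).logits
    target_choice = logits[:, ids.shape[1] - 1 : -1].argmax(-1)  # target's pick at each drafted position

    # 3. Accept drafted tokens until the first disagreement; always keep one target token,
    #    so every step produces at least one new token.
    n_accept = 0
    while n_accept < K and proposed[0, n_accept] == target_choice[0, n_accept]:
        n_accept += 1
    if n_accept < K:
        bonus = target_choice[:, n_accept : n_accept + 1]  # target's correction at the mismatch
    else:
        bonus = logits[:, -1].argmax(-1, keepdim=True)     # all drafts accepted: free extra token
    return torch.cat([ids, proposed[:, :n_accept], bonus], dim=-1)

ids = tok("Speculative decoding speeds up generation because", return_tensors="pt").input_ids
for _ in range(8):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```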

Key benefits of speculative decoding:
## Benefits and limitations

Key benefits of speculative decoding include:

- **Parallel verification:** The target model scores all drafted positions in a single forward pass, so verification is much faster than generating the same tokens one by one.
- **High acceptance for easy tokens:** The draft model can often get the next few tokens correct, which speeds up generation.
- **Better use of hardware:** Because verification uses hardware resources that would otherwise be idle, overall throughput improves.

- **Parallel verification**: Since verification doesn’t depend on previous verifications, it’s faster than generation (which is sequential).
- **High acceptance for easy tokens**: The draft model can often get the next few tokens correct, which speeds up generation.
- **Better use of hardware**: Because verification uses hardware resources that would otherwise be idle, overall throughput improves.
However, speculative decoding has its own costs.

However, speculative decoding has its own costs. Because both the draft model and the target model need to be loaded into memory, it increases overall VRAM usage. This reduces the available memory for other tasks (e.g., batch processing), which can limit throughput, especially under high load or when serving large models.
- **Increased memory usage**: Because both the draft model and the target model need to be loaded into memory, it increases overall VRAM usage. This reduces the available memory for other tasks (e.g., batch processing), which can limit throughput, especially under high load or when serving large models.
- **Wasted compute on rejection**: If many draft tokens are rejected (low acceptance rate), compute is wasted on both drafting and verification.
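
How much rejection hurts can be estimated with the analysis from the Leviathan et al. paper linked below: if each drafted token is accepted independently with probability α and γ tokens are drafted per step, the expected number of tokens produced per target-model pass is (1 − α^(γ+1)) / (1 − α). The quick calculation below (the parameter choices are just illustrative) shows how fast the benefit shrinks:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens per target-model pass for acceptance rate alpha < 1
    and gamma drafted tokens (Leviathan et al., 2023)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.9, 0.7, 0.3):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=5):.2f} tokens per pass")
# alpha=0.9: 4.69, alpha=0.7: 2.94, alpha=0.3: 1.43 — low acceptance approaches
# plain decoding (1 token per pass) while still paying for the draft model.
```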

<LinkList>
## Additional resources
* [Looking back at speculative decoding](https://research.google/blog/looking-back-at-speculative-decoding/)
* [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/pdf/2401.15077)
* [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192)
* [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
* [Blockwise Parallel Decoding for Deep Autoregressive Models](https://arxiv.org/abs/1811.03115)