In practice, applying prefix caching in a distributed serving setup still has challenges. For example:

- How can a new request be routed to the worker that already has the right prefix cached?
- How does the router know what’s in each worker’s cache?

Different open-source projects are exploring their own approaches to prefix cache-aware routing:

**Worker-reported prefix status**

[Dynamo](https://github.com/ai-dynamo/dynamo) has workers actively report which prefixes they’ve cached. The router then uses this real-time data to make smart routing decisions.

**Router-predicted cache status**

[SGLang](https://github.com/sgl-project/sglang) maintains an approximate radix tree for each worker based on past requests. This helps the router predict which worker is most likely to have the needed prefix, without constant updates from the workers.

**Hybrid efforts**

- The Gateway API Inference Extension project is [exploring multiple strategies to implement a routing algorithm on EPP](https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/498):
  - **Prefix affinity consistent hashing**: Group requests with similar prefixes to the same worker.
  - **Approximate prefix cache on the router**: Let the router maintain an approximate lookup cache of the prefix caches on all the backend servers (a rough sketch of this idea follows below).
  - **Accurate prefix cache on the router**: Gather KV cache information reported by model servers.
- The [llm-d](https://github.com/llm-d/llm-d) project uses a component called Inference Scheduler to implement filtering and scoring algorithms, making routing decisions based on a combination of factors such as cache availability, prefill/decode status, SLAs, and load.
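
To make the router-side approaches above more concrete, here is a minimal, illustrative sketch in Python. It is not the actual Dynamo, SGLang, or EPP implementation; it only keeps an approximate, block-hashed view of what each worker is likely to have cached (based purely on what the router has already sent it) and routes each request to the worker with the longest expected prefix hit, breaking ties by load:

```python
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV-cache block; illustrative value


def prefix_block_hashes(tokens: list[int]) -> list[int]:
    """Hash the prompt into cumulative block-sized prefixes, mimicking how
    inference engines key KV-cache blocks by the prefix they extend."""
    hashes, prefix = [], ()
    usable = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        prefix += tuple(tokens[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes


class ApproximatePrefixRouter:
    """Keeps a router-side guess of each worker's prefix cache.

    The router never queries workers; it only remembers what it has already
    routed to them, so its view is approximate (workers may have evicted
    blocks in the meantime). A real scheduler would also age out entries and
    blend this score with other signals such as queue depth or SLOs.
    """

    def __init__(self, workers: list[str]) -> None:
        self.cached: dict[str, set[int]] = {w: set() for w in workers}
        self.inflight: dict[str, int] = defaultdict(int)

    def route(self, prompt_tokens: list[int]) -> str:
        hashes = prefix_block_hashes(prompt_tokens)

        def score(worker: str) -> tuple[int, int]:
            hits = 0
            for h in hashes:  # count consecutive leading blocks likely cached
                if h in self.cached[worker]:
                    hits += 1
                else:
                    break
            return (hits, -self.inflight[worker])  # prefer hits, then low load

        best = max(self.cached, key=score)
        self.cached[best].update(hashes)  # optimistically record new blocks
        self.inflight[best] += 1
        return best


# Example: the second request shares a prefix, so it lands on the same worker.
router = ApproximatePrefixRouter(["worker-a", "worker-b"])
system_prompt = list(range(64))  # pretend token IDs for a shared system prompt
first = router.route(system_prompt + [101, 102])
second = router.route(system_prompt + [201, 202])
assert first == second
```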
import LinkList from '@site/src/components/LinkList';
import Button from '@site/src/components/Button';

# Prefix caching
Prefix caching (also known as prompt caching or context caching) is one of the most effective techniques to reduce latency and cost in LLM inference. It's especially useful in production workloads with repeated prompt structures, such as chat systems, AI agents, and RAG pipelines.

The idea is simple: By caching the KV cache of an existing query, a new query that shares the same prefix can skip recomputing that part of the prompt. Instead, it directly reuses the cached results.

Prefix caching is different from simple semantic caching, where the full input and output text are stored in a database and only an exact match (or a sufficiently similar query) can hit the cache and return a stored response immediately.

## How does prefix caching work?
1. During prefill, the model performs a forward pass over the entire input and builds up a key-value (KV) cache for attention computation.
2. During decode, the model generates output tokens one by one, reusing the cached states from the prefill stage. Attention at each step only needs the keys and values of the previous tokens, and the resulting KV pairs for each new token are appended to the cache in GPU memory.
3. For a new request with a matching prefix, you can skip the forward pass for the cached part and directly resume from the last token of the prefix.

:::important
This works only when the prefix is exactly identical, including whitespace and formatting. Even a single character difference breaks the match from that point on.
:::

For example, consider a chatbot with this system prompt:

> You are a helpful AI writer. Please write in a professional manner.

This prompt doesn’t change from one conversation to the next. Instead of recalculating it every time, you store its KV cache once. Then, when new messages come in, you reuse this stored prefix cache, only processing the new part of the prompt.
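
As a minimal sketch of how this looks in practice, the example below uses vLLM's offline `LLM` API with automatic prefix caching enabled via the `enable_prefix_caching` flag (available in recent vLLM releases; the model name is just a placeholder). The two prompts share the long system prompt, so only their short user-specific suffixes need a fresh prefill:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model supported by vLLM works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.7, max_tokens=128)

system_prompt = (
    "You are a helpful AI writer. Please write in a professional manner.\n"
    "Follow the style guide: short sentences, active voice, no jargon.\n"
)

prompts = [
    system_prompt + "User: Draft a product update email.\nAssistant:",
    system_prompt + "User: Summarize this week's release notes.\nAssistant:",
]

# The shared system prompt is prefilled once and its KV cache should be
# reused; only the user-specific suffix of each prompt is recomputed.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```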
## What is the difference between KV caching and prefix caching?

KV caching stores the intermediate attention states (keys and values) of each token in GPU memory. The term originally described caching within a **single inference request**, where it is especially critical for speeding up the decoding stage.

LLMs work autoregressively during decode as they output the next token based on the previously generated tokens (i.e. reusing their KV cache). Without the KV cache, the model needs to recompute everything for the previous tokens in each decode step (and the context grows with every step), which would be a huge waste of resources.

When extending this caching concept across **multiple requests**, it’s more accurate to call it prefix caching. Since the KV cache of a token depends only on the tokens that come before it, different requests with identical prefixes can reuse the same cache for the prefix tokens and avoid recomputing them.
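
To make the distinction concrete, here is a toy, framework-free sketch (a single attention head with random weights, and embedding vectors standing in for tokens). The per-request KV cache is the pair of `K`/`V` arrays that grows during the loop; prefix caching is simply handing the `K`/`V` rows of a shared prefix to a second request so that only its new tokens are computed:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                              # toy hidden size
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))


def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V


def run(tokens, kv_cache=None):
    """Process `tokens`, reusing and extending `kv_cache` if one is given."""
    K, V = kv_cache if kv_cache else (np.zeros((0, D)), np.zeros((0, D)))
    out = None
    for x in tokens[K.shape[0]:]:   # skip positions already covered by the cache
        K = np.vstack([K, x @ Wk])  # append this token's key and value:
        V = np.vstack([V, x @ Wv])  # this growing pair *is* the KV cache
        out = attend(x @ Wq, K, V)
    return out, (K, V)


prefix = [rng.standard_normal(D) for _ in range(100)]   # shared "system prompt"
req_a = prefix + [rng.standard_normal(D)]
req_b = prefix + [rng.standard_normal(D)]

_, (K, V) = run(req_a)                          # full prefill for the first request
_, _ = run(req_b, kv_cache=(K[:100], V[:100]))  # prefix reuse: only 1 new token computed
```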
## How to structure prompts for maximum cache hits

Prefix caching only helps when prompts are consistent. Here are some best practices to maximize cache hit rates:

- **Front-load static content**: Place any constant or rarely changing information at the beginning of your prompt. This could include system messages, context, or instructions that remain the same across multiple queries. Move dynamic or user-specific content to the end of your prompt.
- **Batch similar requests**: Group together queries (especially when serving multiple users or agents) that share the same prefix so that cached results can be reused efficiently.
- **Avoid dynamic elements in the prefix**: Don’t insert timestamps, request IDs, or any other per-request variables early in the prompt. These lower your cache hit rate.
- **Use deterministic serialization**: Make sure your context or memory serialization (e.g. JSON) is stable in key ordering and structure. Non-deterministic serialization leads to cache misses even if the content is logically the same.
- **Monitor and analyze cache hit rates**: Regularly review your cache performance to identify opportunities for optimization.
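
For instance, a minimal sketch of the first and fourth practices above (the prompt template and helper below are illustrative, not from any particular framework):

```python
import json

# Static, shared content goes first so every request hits the same cached prefix.
SYSTEM_PROMPT = (
    "You are a helpful AI writer. Please write in a professional manner.\n"
    "Style guide: short sentences, active voice, cite sources.\n"
)


def build_prompt(user_message: str, user_profile: dict) -> str:
    # Deterministic serialization: stable key order so logically identical
    # profiles always produce byte-identical text.
    profile_json = json.dumps(user_profile, sort_keys=True, separators=(",", ":"))

    # Anti-pattern: inserting a timestamp or request ID here would change the
    # prefix on every call and defeat caching.
    return (
        SYSTEM_PROMPT
        + f"User profile: {profile_json}\n"
        + f"User: {user_message}\n"   # dynamic, per-request content goes last
        + "Assistant:"
    )


# Both prompts share an identical, cacheable prefix up to the user message.
p1 = build_prompt("Draft a launch announcement.", {"tier": "pro", "lang": "en"})
p2 = build_prompt("Summarize the Q3 roadmap.", {"lang": "en", "tier": "pro"})
assert p1.split("User: ")[0] == p2.split("User: ")[0]
```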
Prefix caching can reduce compute and latency by an order of magnitude in some use cases.

- Anthropic offers [prompt caching](https://www.anthropic.com/news/prompt-caching) for Claude models, with up to 90% cost savings and 85% latency reduction for long prompts (see the sketch after this list).
- Google Gemini [discounts cached tokens](https://ai.google.dev/gemini-api/docs/caching?lang=python) and charges for storage separately.
- Frameworks like vLLM, TensorRT-LLM, and SGLang support automatic prefix caching for different open-source LLMs.
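
As referenced above, Anthropic's API marks the cacheable part of the prompt explicitly with `cache_control`. The sketch below follows the documented prompt caching usage; the model name is only an example and SDK details may differ by version:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

STYLE_GUIDE = "..." * 2000  # stand-in for a long, rarely changing document

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI writer.\n\n" + STYLE_GUIDE,
            # Mark the large static block as cacheable; subsequent calls that
            # send the identical block read it from the cache at a discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Draft a launch announcement."}],
)
print(response.content[0].text)
print(response.usage)  # reports cache creation / cache read token counts
```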
In agent workflows, the benefit is even more pronounced. Some use cases have input-to-output token ratios of 100:1, making the cost of reprocessing large prompts disproportionately high.
## Limitations
For applications with long, repetitive prompts, prefix caching can significantly reduce both latency and cost. Over time, however, the KV cache can grow quite large. GPU memory is finite, and storing long prefixes across many users can eat up space quickly, so you’ll need cache eviction strategies or memory tiering.
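
A minimal sketch of the kind of eviction policy this implies (illustrative only; real engines such as vLLM evict at KV-block granularity with their own policies) is a least-recently-used map from prefix keys to cached entries, capped by a memory budget:

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Evicts the least recently used prefixes once a size budget is exceeded.

    Sizes are abstract "units" standing in for GPU memory; real engines evict
    at KV-block granularity, but the policy shape is the same.
    """

    def __init__(self, capacity_units: int) -> None:
        self.capacity = capacity_units
        self.used = 0
        self.entries: OrderedDict = OrderedDict()

    def get(self, prefix_key: str):
        if prefix_key not in self.entries:
            return None                            # cache miss
        self.entries.move_to_end(prefix_key)       # mark as recently used
        return self.entries[prefix_key][0]

    def put(self, prefix_key: str, kv_blocks, size_units: int) -> None:
        if prefix_key in self.entries:
            self.used -= self.entries.pop(prefix_key)[1]
        self.entries[prefix_key] = (kv_blocks, size_units)
        self.used += size_units
        # Evict oldest entries, but never the one we just inserted.
        while self.used > self.capacity and len(self.entries) > 1:
            _, (_, freed) = self.entries.popitem(last=False)
            self.used -= freed


cache = LRUPrefixCache(capacity_units=100)
cache.put("system-prompt-v1", kv_blocks="<kv tensor handle>", size_units=60)
cache.put("system-prompt-v2", kv_blocks="<kv tensor handle>", size_units=60)
assert cache.get("system-prompt-v1") is None       # evicted to make room for v2
assert cache.get("system-prompt-v2") is not None
```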
The open-source community is actively working on distributed serving strategies. See [prefix cache-aware routing](./prefix-caching-cache-aware-routing) for details.

---

Optimizing LLM prefix caching requires flexible customization in your LLM serving and infrastructure stack. At Bento, we provide the infrastructure for dedicated and customizable LLM deployments with fast auto-scaling and scaling-to-zero capabilities to ensure resource efficiency.

<div style={{ margin: '3rem 0' }}>

[<Button>Talk to us</Button>](https://l.bentoml.com/contact-us-llm-inference-handbook)

</div>
<LinkList>
## Additional resources
* [Prompt Cache: Modular Attention Reuse for Low-Latency Inference](https://arxiv.org/abs/2311.04934)
* [Prompt Caching in Claude](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
* [Design Around the KV-Cache](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus)
</LinkList>