
Commit e32b6d9

Merge pull request #25 from Sherlock113/docs/kv-cache-offloading
docs: Add KV cache offloading doc
2 parents 04f0512 + 5e2f6f5 commit e32b6d9

File tree

5 files changed: +106 −7 lines

docs/inference-optimization/data-tensor-pipeline-expert-hybrid-parallelism.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 9
+sidebar_position: 10
 description: Understand the differences between data, tensor, pipeline, expert and hybrid parallelisms.
 keywords:
 - LLM inference optimization, LLM inference optimization techniques
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
---
sidebar_position: 9
description: Learn how KV cache offloading improves LLM inference by reducing GPU memory usage, lowering latency, and cutting compute costs.
keywords:
- KV cache offloading, KV cache, KV caching, LMCache
- Distributed inference, distributed LLM inference
- Inference optimization
- LLM inference optimization, LLM inference optimization techniques
- Speed up LLM inference
---

import LinkList from '@site/src/components/LinkList';

# KV cache offloading

KV cache offloading is the process of moving attention key/value data from GPU memory to lower-cost storage like CPU memory or disk. It frees up GPU resources while preserving the ability to resume inference without recomputation. This helps scale LLM workloads efficiently by balancing performance and memory usage.

## Why does the KV cache become a bottleneck in LLM inference?

LLMs rely heavily on the KV cache to speed up inference. The cache stores attention keys and values for every token in the input sequence, allowing the model to reuse them in future steps instead of recalculating them. Although this saves a significant amount of compute resources and delivers faster inference, it comes with a steep memory cost.

As context windows increase, **the KV cache size grows linearly with sequence length**. This can quickly exhaust available GPU memory, especially in long-context scenarios. Since GPU memory is limited, the KV cache often becomes a bottleneck for running applications that require extended context.

In fact, **not all KV cache data needs to stay in GPU memory at all times**. In many real-world applications, users may not interact with the LLM continuously. For example, a user might pause while typing or leave and return hours later. In such cases, their KV cache remains in GPU memory, even though it’s not actively being used. Similarly, when multiple users or agents access the same conversation, document, or session at different times, the same KV cache might sit idle on the GPU between interactions, and recomputing the same content for every visit would waste GPU resources.

This results in inefficient memory usage, as valuable GPU memory is tied up by inactive sessions instead of being used to serve new requests. Over time, this limits how many concurrent users the system can support and reduces overall throughput.

To solve these problems, KV cache offloading moves inactive or less frequently accessed cache data from GPU memory to lower-cost, higher-capacity storage such as CPU RAM, local SSDs, or remote object storage. When a user resumes interaction or another user accesses the same content, the cache can be reloaded into GPU memory on demand. This avoids costly recomputation while freeing up GPU resources for active workloads.
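
Conceptually, a serving stack that offloads the KV cache keeps a small hot tier in GPU memory and demotes idle entries to a larger, slower tier, promoting them back when they are accessed again. The toy Python sketch below illustrates that two-tier idea with LRU demotion; it is not tied to any particular serving engine, and real systems track KV blocks rather than whole sessions.

```python
from collections import OrderedDict
from typing import Optional


class TwoTierKVCache:
    """Toy two-tier KV cache: a bounded 'GPU' tier with LRU demotion to a 'CPU' tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu_tier: "OrderedDict[str, bytes]" = OrderedDict()  # hot, size-limited
        self.cpu_tier: dict = {}                                  # cold, larger and cheaper

    def put(self, session_id: str, kv_blob: bytes) -> None:
        self.gpu_tier[session_id] = kv_blob
        self.gpu_tier.move_to_end(session_id)
        while len(self.gpu_tier) > self.gpu_capacity:
            # Demote the least recently used entry instead of throwing it away.
            victim, blob = self.gpu_tier.popitem(last=False)
            self.cpu_tier[victim] = blob

    def get(self, session_id: str) -> Optional[bytes]:
        if session_id in self.gpu_tier:
            self.gpu_tier.move_to_end(session_id)  # refresh recency
            return self.gpu_tier[session_id]
        if session_id in self.cpu_tier:
            # Promote on access: reload into the GPU tier, demoting others if needed.
            self.put(session_id, self.cpu_tier.pop(session_id))
            return self.gpu_tier[session_id]
        return None  # cache miss: the prefix has to be recomputed
```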

## When should you offload the KV cache for LLMs?

KV cache offloading is especially useful when:

- You’re deploying LLMs with long context windows, which can cause the KV cache to quickly exceed GPU memory.
- Multiple users or agents need to interact with the same underlying content or context across sessions. For example, developers working in an IDE with LLM integration often interact with the same code snippet repeatedly.
- Your deployment is memory-constrained or you need to optimize for infrastructure cost.
- You’re scaling inference across many distributed workers where GPU resources are limited.
- Your workloads include intermittent or idle user sessions, where keeping the KV cache in GPU memory would be wasteful.

## Benefits of KV cache offloading

Offloading the KV cache offers several important advantages for scaling and optimizing LLM inference:

- **Better resource utilization.** By moving inactive or shared KV data out of GPU memory, you can free up space for new requests. This allows the same GPU to serve more concurrent users or longer input sequences without hitting memory limits.
- **Lower compute costs.** GPU memory is expensive and limited. Offloading allows workloads to take advantage of cheaper storage (e.g., CPU RAM or disk), reducing the need to over-provision high-end GPUs just to manage cache.
- **Reduced latency.** Offloading allows the model to skip redundant KV computations during inference, especially for overlapping context in multi-turn interactions. This significantly reduces time to first token (TTFT) and overall latency. NVIDIA reports that KV cache offloading can [deliver up to 14× faster TTFT](https://developer.nvidia.com/blog/nvidia-gh200-superchip-accelerates-inference-by-2x-in-multiturn-interactions-with-llama-models/) for large input sequences compared to recalculating the KV cache from scratch.

## Performance trade-offs in KV cache offloading

While KV cache offloading can significantly improve memory efficiency and throughput, the speed of the offloading target is critical. If the storage tier (e.g., CPU RAM or disk) is too slow, the overhead of transferring KV data back to the GPU may negate the benefits, especially in latency-sensitive applications.

In short, make sure the cost of transferring data is lower than recomputing the cache from scratch. This is often the case in long, multi-turn conversations, where reusing previous context is crucial and recomputation would be expensive.
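
A quick back-of-the-envelope comparison makes the trade-off concrete: reloading the cache should take less time than recomputing it. The numbers below are placeholder assumptions (effective transfer bandwidth and prefill throughput vary widely by hardware and model), so treat this as a sketch rather than a benchmark.

```python
# Rough comparison: reload an offloaded KV cache vs. recompute it via prefill.
# All figures are illustrative assumptions; measure them on your own stack.

kv_cache_gb = 2.0                # cached context size (see the sizing formula below)
transfer_bandwidth_gbps = 25.0   # assumed effective CPU RAM -> GPU bandwidth, GB/s
prefill_tokens = 4096            # tokens that would otherwise be recomputed
prefill_throughput_tps = 10_000  # assumed prefill throughput, tokens/s

reload_time_ms = kv_cache_gb / transfer_bandwidth_gbps * 1000
recompute_time_ms = prefill_tokens / prefill_throughput_tps * 1000

print(f"Reload from CPU RAM:   {reload_time_ms:.0f} ms")    # ~80 ms with these assumptions
print(f"Recompute via prefill: {recompute_time_ms:.0f} ms")  # ~410 ms with these assumptions
# Offloading only pays off while reload time stays well below recompute time;
# a slow disk or network tier can easily flip this comparison.
```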

## How to calculate the KV cache size

When offloading the KV cache, it’s useful to understand how much memory it actually consumes.

In transformer-based LLMs, each attention layer stores two vectors (a key and a value) for every token in the input sequence. Each layer contains multiple attention heads, and all heads typically have the same dimension.

To estimate how much memory the KV cache consumes, use the following formula:

```bash
KV Cache Size (GB) = 2 × B × S × L × H × D × (Q / 8) / (1024^3)
```

:::note
If you already know the model’s hidden dimension, you can simplify the formula by replacing `H × D` with it.
:::

- 2: Accounts for both the key and the value vector stored per token
- B: Batch size (number of sequences processed in parallel)
- S: Sequence length (number of tokens per input)
- L: Number of transformer layers
- H: Number of attention heads per layer (for grouped-query or multi-query attention models, use the number of KV heads instead)
- D: Dimension of each attention head
- Q: Bit precision per value (e.g., 16 for FP16, 32 for FP32); dividing by 8 converts bits to bytes
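
As a quick sanity check, here is the same formula as a small Python helper. The example dimensions are assumptions that roughly match a 7B-class model (32 layers, 32 heads, head dimension 128) at FP16; substitute your own model’s values.

```python
def kv_cache_size_gb(batch_size: int, seq_len: int, num_layers: int,
                     num_heads: int, head_dim: int, precision_bits: int = 16) -> float:
    """KV cache size in GB: 2 x B x S x L x H x D x (Q / 8) / 1024^3."""
    total_bytes = 2 * batch_size * seq_len * num_layers * num_heads * head_dim * (precision_bits / 8)
    return total_bytes / 1024 ** 3

# Batch 1, 4,096-token context, 32 layers, 32 heads, head dim 128, FP16 -> 2.0 GB
print(kv_cache_size_gb(1, 4096, 32, 32, 128, 16))
```

At batch size 32, the same 4,096-token context already needs 64 GB of KV cache, which is why long-context, high-concurrency serving quickly runs into GPU memory limits.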

## Offloading the KV cache with LMCache

[LMCache](https://github.com/LMCache/LMCache) is an LLM serving engine extension designed to optimize LLM inference by reducing TTFT and increasing throughput, especially for long-context workloads. It supports reusing KV caches for repeated input content (not just prefixes) across different engine instances.

By storing KV caches in multiple tiers of memory, including GPU, CPU DRAM, and local disk, LMCache significantly reduces redundant computation. This improves response time and saves GPU cycles, making it ideal for workloads like multi-turn QA, RAG, and document-level reasoning.

In benchmarks, combining LMCache with vLLM has resulted in 3×–10× latency reductions across various use cases.

Several open-source projects have already integrated LMCache to support efficient KV cache offloading and reuse:

- [llm-d](https://www.redhat.com/en/about/press-releases/red-hat-launches-llm-d-community-powering-distributed-gen-ai-inference-scale) offloads KV cache data with LMCache from GPU memory to more cost-effective and abundant storage such as CPU memory and network disks.
- [KServe](https://kserve.github.io/website/0.15/modelserving/v1beta1/llm/huggingface/kv_cache_offloading/) integrates LMCache to reduce inference costs and meet SLOs for both latency and throughput at scale.
- [vLLM](https://docs.vllm.ai/en/latest/examples/others/lmcache.html) uses LMCache for CPU offloading, cache sharing between requests, and disaggregated prefilling, enabling better memory management and improved resource efficiency.

LMCache currently supports offloading KV cache data to a variety of storage backends, ranging from local options like CPU memory and the local file system to distributed systems such as Mooncake and Valkey.
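
To make the vLLM integration concrete, the sketch below configures LMCache as vLLM’s KV connector and offloads KV blocks to CPU memory. It follows the vLLM LMCache example linked above, but the connector name, environment variables, and config fields change between releases, so treat it as an illustrative sketch and check the vLLM and LMCache documentation for your versions; the model name is only a placeholder.

```python
import os

# LMCache settings must be in place before vLLM/LMCache initialize.
# Names follow the vLLM LMCache CPU-offloading example; verify them for your release.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU memory backend
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU cache budget in GB

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route KV cache transfers through the LMCache connector.
kv_transfer_config = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=kv_transfer_config,
    gpu_memory_utilization=0.8,
)

outputs = llm.generate(
    ["Explain KV cache offloading in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```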

<LinkList>
## Additional resources
* [LMCache Documentation](https://docs.lmcache.ai/)
* [NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models](https://developer.nvidia.com/blog/nvidia-gh200-superchip-accelerates-inference-by-2x-in-multiturn-interactions-with-llama-models/)
* [5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse](https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/)
</LinkList>

docs/inference-optimization/offline-batch-inference.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 10
+sidebar_position: 11
 description: Run predictions at scale with offline batch inference for efficient, non-real-time processing.
 keywords:
 - Offline batch inference, batch inference, batch LLM inference, batch requests, batch processing, LLM inference batching

docs/inference-optimization/prefix-caching-cache-aware-routing.md renamed to docs/inference-optimization/prefix-aware-routing.md

Lines changed: 4 additions & 4 deletions
@@ -4,24 +4,24 @@ description: Challenges in applying prefix caching
 keywords:
 - Prefix caching, prompt caching, context caching
 - KV cache, KV caching
-- Prefix cache-aware routing
+- Prefix aware routing
 - Distributed inference, distributed LLM inference
 - Inference optimization
 - Dynamo, SGLang, vLLM, llm-d
 - LLM inference optimization, LLM inference optimization techniques
 - Speed up LLM inference
 ---

-# Prefix cache-aware routing
+# Prefix-aware routing

-In practice, applying prefix caching in a distributed way still has challenges. For example:
+In practice, applying [prefix caching](./prefix-caching) in a distributed way still has challenges. For example:

 - How can a new request be routed to the worker that already has the right prefix cached?
 - How does the router know what’s in each worker’s cache?

 ![prefix-caching-aware-routing.png](./img/prefix-caching-aware-routing.png)

-Different open-source projects are exploring their own approaches to prefix cache-aware routing:
+Different open-source projects are exploring their own approaches to prefix-aware routing:

 - **Worker-reported prefix status**

docs/inference-optimization/prefix-caching.md

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ In agent workflows, the benefit is even more pronounced. Some use cases have inp

 For applications with long, repetitive prompts, prefix caching can significantly reduce both latency and cost. Over time, however, your KV cache size can be quite large. GPU memory is finite, and storing long prefixes across many users can eat up space quickly. You’ll need cache eviction strategies or memory tiering.

-The open-source community is actively working on distributed serving strategies. See [prefix cache-aware routing](./prefix-caching-cache-aware-routing) for details.
+The open-source community is actively working on distributed serving strategies. See [prefix-aware routing](./prefix-caching-cache-aware-routing) for details.

 ---

0 commit comments
