
docs: Add some internal links for SEO #24

Merged
merged 1 commit, Jul 28, 2025

2 changes: 1 addition & 1 deletion docs/inference-optimization/pagedattention.md
@@ -15,7 +15,7 @@ import LinkList from '@site/src/components/LinkList';

[PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is a memory-efficient approach to implementing the attention mechanism in LLMs.

-When an LLM is generating a response, it needs to remember past information (i.e. the KV cache) for every token it generates. Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant contiguous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.
+When an LLM is generating a response, it needs to [remember past information (i.e. the KV cache) for every token it generates](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference). Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant contiguous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.

PagedAttention breaks this big chunk into smaller blocks, kind of like pages in a book. In other words, the KV cache is stored in non-contiguous blocks. It then uses a lookup table to keep track of these blocks. The LLM only loads the blocks it needs, instead of loading everything at once.
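
To make the bookkeeping concrete, here is a minimal, framework-agnostic sketch of the block-table idea (illustrative names and sizes, not vLLM's actual implementation): physical KV blocks live in one shared pool, and each sequence keeps a small table mapping its logical token positions to whichever physical blocks it was allocated.

```python
import numpy as np

BLOCK_SIZE = 16    # tokens per KV block (illustrative)
NUM_BLOCKS = 1024  # physical blocks in the shared KV pool
HEAD_DIM = 128     # per-token KV vector size (illustrative)

# One shared pool of physical blocks; real engines store keys and values per layer and head.
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)
free_blocks = list(range(NUM_BLOCKS))

class SequenceKVCache:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self):
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_kv(self, kv_vector):
        # Allocate a new physical block only when the current one is full,
        # so at most one block per sequence is partially empty.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(free_blocks.pop())
        block_id = self.block_table[self.num_tokens // BLOCK_SIZE]
        kv_pool[block_id, self.num_tokens % BLOCK_SIZE] = kv_vector
        self.num_tokens += 1

    def gather_kv(self):
        # Attention reads only the blocks this sequence actually uses.
        blocks = kv_pool[self.block_table]
        return blocks.reshape(-1, HEAD_DIM)[: self.num_tokens]
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, over-reservation is limited to at most one partially filled block per sequence.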

2 changes: 1 addition & 1 deletion docs/inference-optimization/speculative-decoding.md
@@ -13,7 +13,7 @@ import LinkList from '@site/src/components/LinkList';

# Speculative decoding

-LLMs are powerful, but their text generation is slow. The main bottleneck lies in auto-regressive decoding, where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.
+LLMs are powerful, but their text generation is slow. The main bottleneck lies in [auto-regressive decoding](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference), where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.
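
Written as a loop, the dependency is easy to see. The sketch below assumes a `model` callable that returns per-position logits; it is not tied to any specific framework. Each iteration needs a full forward pass that cannot start until the previous token has been chosen.

```python
import numpy as np

def sample(logits):
    # Greedy choice for illustration; real systems use temperature, top-p, etc.
    return int(np.argmax(logits))

def generate(model, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)        # one forward pass per generated token
        next_id = sample(logits[-1])  # the next token depends on all prior tokens
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Speculative decoding keeps this loop but lets a cheaper draft model propose several tokens at once, which the target model then verifies in a single forward pass.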

What if you could parallelize parts of the generation process, even if not all of it?

2 changes: 1 addition & 1 deletion docs/llm-inference-basics/openai-compatible-api.md
@@ -40,7 +40,7 @@ OpenAI-compatible APIs address these challenges by providing:
- **Seamless migration**: Move between providers or self-hosted deployments with minimal disruption.
- **Consistent integration**: Maintain compatibility with tools and frameworks that rely on the OpenAI API schema (e.g., `chat/completions`, `embeddings` endpoints).

-Many inference backends (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
+Many [inference backends](../getting-started/choosing-the-right-inference-framework) (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
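
For example, pointing the official `openai` Python client at a self-hosted endpoint is usually just a matter of changing `base_url`; the URL, API key, and model name below are placeholders for whatever your backend exposes.

```python
from openai import OpenAI

# Point the standard client at a self-hosted, OpenAI-compatible server
# (e.g., a local vLLM, SGLang, or BentoML deployment).
client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="not-needed-for-local",       # many self-hosted servers ignore this value
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server is serving
    messages=[{"role": "user", "content": "Explain the KV cache in one sentence."}],
)
print(response.choices[0].message.content)
```

Switching to a different backend or model then only requires changing `base_url` and `model`, not the surrounding client code.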

## How to call an OpenAI-compatible API

@@ -36,10 +36,10 @@ Key benefits of self-hosting include:
- **Data privacy and compliance**: LLMs are widely used in modern applications like RAG and AI agents. These systems often require frequent access to sensitive data (e.g., customer details, medical records, financial information). Sending that data to an external provider is often not acceptable for organizations in regulated industries with compliance and privacy requirements. Self-hosting LLMs ensures your data always stays within your secure environment.
- **Advanced customization and optimization**: With self-hosting, you can tailor your inference process to meet specific needs, such as:
  - Adjusting latency and throughput trade-offs precisely.
-  - Implementing advanced optimizations like prefill-decode disaggregation, prefix caching, KV cache-aware routing.
-  - Optimizing for long contexts or batch-processing scenarios.
+  - Implementing advanced optimizations like [prefill-decode disaggregation](../inference-optimization/prefill-decode-disaggregation), [prefix caching](../inference-optimization/prefix-caching), and [speculative decoding](../inference-optimization/speculative-decoding).
+  - Optimizing for long contexts or [batch-processing](../inference-optimization/static-dynamic-continuous-batching) scenarios.
  - Enforcing structured decoding to ensure outputs follow strict schemas.
-  - Fine-tuning models using proprietary data to achieve competitive advantages.
+  - [Fine-tuning models](../getting-started/llm-fine-tuning) using proprietary data to achieve competitive advantages.
- **Predictable performance and control**: When you self-host your LLMs, you have complete control over how your system behaves and performs. You’re not at the mercy of external API rate limits or sudden policy changes that might impact your application’s performance and availability.

## Comparison summary
@@ -27,6 +27,6 @@ Training is computationally intensive, often requiring expensive GPU or TPU clusters

## Inference: Using the model in real-time

-LLM inference means applying the trained model to new data to make predictions. Unlike training, inference happens continuously and in real-time, responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.
+LLM inference means applying the trained model to new data to make predictions. Unlike training, inference [happens continuously and in real-time](./what-is-llm-inference), responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.

Inference compute needs are ongoing and can become very high, especially as user interactions and traffic grow. Each inference request consumes computational resources such as GPUs. While a single inference step is far less compute-intensive than training, the cumulative demand over time can lead to significant operational expenses.