
Commit 04f0512

Merge pull request #24 from Sherlock113/docs/internal-linking
docs: Add some internal links for SEO
2 parents: 5ef516c + e0932f1

File tree

5 files changed (+7, -7 lines)


docs/inference-optimization/pagedattention.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ import LinkList from '@site/src/components/LinkList';
  [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is a memory-efficient approach to implementing the attention mechanism in LLMs.

- When an LLM is generating a response, it needs to remember past information (i.e. the KV cache) for every token it generates. Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant continuous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.
+ When an LLM is generating a response, it needs to [remember past information (i.e. the KV cache) for every token it generates](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference). Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant continuous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.

  PagedAttention breaks this big chunk into smaller blocks, kind of like pages in a book. In other words, the KV cache is stored in non-contiguous blocks. It then uses a lookup table to keep track of these blocks. The LLM only loads the blocks it needs, instead of loading everything at once.
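To make the block-and-lookup-table idea in the paragraph above concrete, here is a minimal, self-contained Python sketch. The class name, block size, and allocation policy are illustrative assumptions, not vLLM's actual implementation.

```python
# A minimal sketch of the block-table idea behind PagedAttention.
# Class name, block size, and allocation policy are assumptions, not vLLM's code.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed)


class BlockTable:
    """Maps a sequence's logical KV-cache blocks to physical blocks in memory."""

    def __init__(self, free_blocks):
        self.free_blocks = list(free_blocks)   # pool of free physical block ids
        self.logical_to_physical = []          # the lookup table for one sequence

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the previous one is full,
        # so memory grows in small, non-contiguous chunks instead of one
        # giant pre-reserved region.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

    def physical_block_for(self, token_position):
        # The attention kernel uses this lookup to find where a token's KV lives.
        return self.logical_to_physical[token_position // BLOCK_SIZE]


table = BlockTable(free_blocks=range(1024))
for t in range(40):                      # generate 40 tokens
    table.append_token(t)
print(len(table.logical_to_physical))    # 3 blocks cover 40 tokens at 16 per block
```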

docs/inference-optimization/speculative-decoding.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ import LinkList from '@site/src/components/LinkList';
  # Speculative decoding

- LLMs are powerful, but their text generation is slow. The main bottleneck lies in auto-regressive decoding, where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.
+ LLMs are powerful, but their text generation is slow. The main bottleneck lies in [auto-regressive decoding](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference), where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.

  What if you could parallelize parts of the generation process, even if not all of it?
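As a rough illustration of where that parallelism can come from, here is a toy, self-contained Python sketch of the draft-and-verify loop used by speculative decoding. Both "models" are stand-in functions and the acceptance rule is simplified; this is not any real backend's API.

```python
# Toy sketch of speculative decoding's draft-and-verify loop.
# draft_next and target_next are stand-ins for a small draft model and a large
# target model; in practice these are neural networks, not arithmetic rules.

def draft_next(context):
    # Cheap draft model: quickly guesses the next token.
    return (context[-1] + 1) % 50

def target_next(context):
    # Expensive target model: the authoritative next-token choice.
    return (context[-1] + 1) % 50

def speculative_step(tokens, k=4):
    # 1. The draft model proposes k tokens, one at a time (fast and cheap).
    context = list(tokens)
    draft = []
    for _ in range(k):
        t = draft_next(context)
        draft.append(t)
        context.append(t)

    # 2. The target model checks all k proposals; in a real system this is a
    #    single parallel forward pass instead of k sequential decoding steps.
    accepted = []
    context = list(tokens)
    for t in draft:
        if t == target_next(context):              # proposal matches the target
            accepted.append(t)
            context.append(t)
        else:
            accepted.append(target_next(context))  # fall back to target's token
            break

    # 3. The output matches what the target model alone would produce, but
    #    several tokens may be committed per verification pass.
    return tokens + accepted

print(speculative_step([1, 2, 3]))   # [1, 2, 3, 4, 5, 6, 7]
```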

docs/llm-inference-basics/openai-compatible-api.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ OpenAI-compatible APIs address these challenges by providing:
  - **Seamless migration**: Move between providers or self-hosted deployments with minimal disruption.
  - **Consistent integration**: Maintain compatibility with tools and frameworks that rely on the OpenAI API schema (e.g., `chat/completions`, `embeddings` endpoints).

- Many inference backends (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
+ Many [inference backends](../getting-started/choosing-the-right-inference-framework) (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.

  ## How to call an OpenAI-compatible API
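As a hedged example of the pattern described in the paragraph above, the official `openai` Python client can be pointed at a self-hosted, OpenAI-compatible server. The `base_url`, `api_key`, and model name below are placeholder assumptions for a local deployment; substitute your own values.

```python
# Calling a self-hosted OpenAI-compatible endpoint with the official openai client.
# base_url, api_key, and model are placeholders for a local vLLM/SGLang-style server.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # your self-hosted endpoint (assumed)
    api_key="not-needed-for-local",        # many local servers ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server loads
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
)
print(response.choices[0].message.content)
```

Because only `base_url` and `model` change, the same client code works against hosted APIs or a self-hosted backend.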

docs/llm-inference-basics/serverless-vs-self-hosted-llm-inference.md

Lines changed: 3 additions & 3 deletions
@@ -36,10 +36,10 @@ Key benefits of self-hosting include:
  - **Data privacy and compliance**: LLMs are widely used in modern applications like RAG and AI agents. These systems often require frequent access to sensitive data (e.g., customer details, medical records, financial information). This is often not an acceptable option for organizations in regulated industries with compliance and privacy requirements. Self-hosting LLMs makes sure your data always stays within your secure environment.
  - **Advanced customization and optimization**: With self-hosting, you can tailor your inference process to meet specific needs, such as:
    - Adjusting latency and throughput trade-offs precisely.
-   - Implementing advanced optimizations like prefill-decode disaggregation, prefix caching, KV cache-aware routing.
-   - Optimizing for long contexts or batch-processing scenarios.
+   - Implementing advanced optimizations like [prefill-decode disaggregation](../inference-optimization/prefill-decode-disaggregation), [prefix caching](../inference-optimization/prefix-caching), and [speculative decoding](../inference-optimization/speculative-decoding).
+   - Optimizing for long contexts or [batch-processing](../inference-optimization/static-dynamic-continuous-batching) scenarios.
    - Enforcing structured decoding to ensure outputs follow strict schemas
-   - Fine-tuning models using proprietary data to achieve competitive advantages.
+   - [Fine-tuning models](../getting-started/llm-fine-tuning) using proprietary data to achieve competitive advantages.
  - **Predictable performance and control**: When you self-host your LLMs, you have complete control over how your system behaves and performs. You’re not at the mercy of external API rate limits or sudden policy changes that might impact your application’s performance and availability.

  ## Comparison summary

docs/llm-inference-basics/training-inference-differences.md

Lines changed: 1 addition & 1 deletion
@@ -27,6 +27,6 @@ Training is computationally intensive, often requiring expensive GPU or TPU clus
  ## Inference: Using the model in real-time

- LLM inference means applying the trained model to new data to make predictions. Unlike training, inference happens continuously and in real-time, responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.
+ LLM inference means applying the trained model to new data to make predictions. Unlike training, inference [happens continuously and in real-time](./what-is-llm-inference), responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.

  Inference compute needs are ongoing and can become very high, especially as user interactions and traffic grow. Each inference request consumes computational resources such as GPUs. While each inference step may be smaller than training in isolation, the cumulative demand over time can lead to significant operational expenses.
