From e0932f192ee6bc15cfc1633e40dfff46b7b60cb5 Mon Sep 17 00:00:00 2001
From: Sherlock113
Date: Mon, 28 Jul 2025 10:30:37 +0800
Subject: [PATCH] Add some internal links for SEO

Signed-off-by: Sherlock113
---
 docs/inference-optimization/pagedattention.md                | 2 +-
 docs/inference-optimization/speculative-decoding.md          | 2 +-
 docs/llm-inference-basics/openai-compatible-api.md           | 2 +-
 .../serverless-vs-self-hosted-llm-inference.md               | 6 +++---
 docs/llm-inference-basics/training-inference-differences.md  | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/inference-optimization/pagedattention.md b/docs/inference-optimization/pagedattention.md
index bfd5772..481999a 100644
--- a/docs/inference-optimization/pagedattention.md
+++ b/docs/inference-optimization/pagedattention.md
@@ -15,7 +15,7 @@ import LinkList from '@site/src/components/LinkList';
 
 [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is a memory-efficient approach to implementing the attention mechanism in LLMs.
 
-When an LLM is generating a response, it needs to remember past information (i.e. the KV cache) for every token it generates. Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant continuous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.
+When an LLM is generating a response, it needs to [remember past information (i.e. the KV cache) for every token it generates](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference). Normally, the KV cache takes up a big chunk of memory because it’s stored as one giant continuous block. This can lead to memory fragmentation or wasted space because you need to reserve a big block even if you don’t fill it fully.
 
 PagedAttention breaks this big chunk into smaller blocks, kind of like pages in a book. In other words, the KV cache is stored in non-contiguous blocks. It then uses a lookup table to keep track of these blocks. The LLM only loads the blocks it needs, instead of loading everything at once.
 
diff --git a/docs/inference-optimization/speculative-decoding.md b/docs/inference-optimization/speculative-decoding.md
index 697b911..706652f 100644
--- a/docs/inference-optimization/speculative-decoding.md
+++ b/docs/inference-optimization/speculative-decoding.md
@@ -13,7 +13,7 @@ import LinkList from '@site/src/components/LinkList';
 
 # Speculative decoding
 
-LLMs are powerful, but their text generation is slow. The main bottleneck lies in auto-regressive decoding, where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.
+LLMs are powerful, but their text generation is slow. The main bottleneck lies in [auto-regressive decoding](../llm-inference-basics/how-does-llm-inference-work#the-two-phases-of-llm-inference), where each token is generated one at a time. This sequential loop leads to high latency, as each step depends on the previous token. Additionally, while GPUs are optimized for parallelism, this sequential nature leads to underutilized compute resources during inference.
 
 What if you could parallelize parts of the generation process, even if not all of it?
 
diff --git a/docs/llm-inference-basics/openai-compatible-api.md b/docs/llm-inference-basics/openai-compatible-api.md
index 50bfe04..16143ab 100644
--- a/docs/llm-inference-basics/openai-compatible-api.md
+++ b/docs/llm-inference-basics/openai-compatible-api.md
@@ -40,7 +40,7 @@ OpenAI-compatible APIs address these challenges by providing:
 - **Seamless migration**: Move between providers or self-hosted deployments with minimal disruption.
 - **Consistent integration**: Maintain compatibility with tools and frameworks that rely on the OpenAI API schema (e.g., `chat/completions`, `embeddings` endpoints).
 
-Many inference backends (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
+Many [inference backends](../getting-started/choosing-the-right-inference-framework) (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
 
 ## How to call an OpenAI-compatible API
 
diff --git a/docs/llm-inference-basics/serverless-vs-self-hosted-llm-inference.md b/docs/llm-inference-basics/serverless-vs-self-hosted-llm-inference.md
index d6c5de8..4e02cce 100644
--- a/docs/llm-inference-basics/serverless-vs-self-hosted-llm-inference.md
+++ b/docs/llm-inference-basics/serverless-vs-self-hosted-llm-inference.md
@@ -36,10 +36,10 @@ Key benefits of self-hosting include:
 - **Data privacy and compliance**: LLMs are widely used in modern applications like RAG and AI agents. These systems often require frequent access to sensitive data (e.g., customer details, medical records, financial information). This is often not an acceptable option for organizations in regulated industries with compliance and privacy requirements. Self-hosting LLMs makes sure your data always stays within your secure environment.
 - **Advanced customization and optimization**: With self-hosting, you can tailor your inference process to meet specific needs, such as:
   - Adjusting latency and throughput trade-offs precisely.
-  - Implementing advanced optimizations like prefill-decode disaggregation, prefix caching, KV cache-aware routing.
-  - Optimizing for long contexts or batch-processing scenarios.
+  - Implementing advanced optimizations like [prefill-decode disaggregation](../inference-optimization/prefill-decode-disaggregation), [prefix caching](../inference-optimization/prefix-caching), and [speculative decoding](../inference-optimization/speculative-decoding).
+  - Optimizing for long contexts or [batch-processing](../inference-optimization/static-dynamic-continuous-batching) scenarios.
   - Enforcing structured decoding to ensure outputs follow strict schemas
-  - Fine-tuning models using proprietary data to achieve competitive advantages.
+  - [Fine-tuning models](../getting-started/llm-fine-tuning) using proprietary data to achieve competitive advantages.
 - **Predictable performance and control**: When you self-host your LLMs, you have complete control over how your system behaves and performs. You’re not at the mercy of external API rate limits or sudden policy changes that might impact your application’s performance and availability.
 
 ## Comparison summary
diff --git a/docs/llm-inference-basics/training-inference-differences.md b/docs/llm-inference-basics/training-inference-differences.md
index 336286b..b7b773b 100644
--- a/docs/llm-inference-basics/training-inference-differences.md
+++ b/docs/llm-inference-basics/training-inference-differences.md
@@ -27,6 +27,6 @@ Training is computationally intensive, often requiring expensive GPU or TPU clus
 
 ## Inference: Using the model in real-time
 
-LLM inference means applying the trained model to new data to make predictions. Unlike training, inference happens continuously and in real-time, responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.
+LLM inference means applying the trained model to new data to make predictions. Unlike training, inference [happens continuously and in real-time](./what-is-llm-inference), responding immediately to user input or incoming data. It is the phase where the model is actively "in use." Better-trained and more finely-tuned models typically provide more accurate and useful inference.
 
 Inference compute needs are ongoing and can become very high, especially as user interactions and traffic grow. Each inference request consumes computational resources such as GPUs. While each inference step may be smaller than training in isolation, the cumulative demand over time can lead to significant operational expenses.