Commit fb0abc8

Merge pull request #16 from Sherlock113/docs/update-handbook-content
docs: Update metrics, hardware and openai-api content
2 parents 776dcea + 7704942 commit fb0abc8

3 files changed: +67 −8 lines changed

docs/inference-optimization/llm-inference-metrics.md

Lines changed: 36 additions & 4 deletions
@@ -18,14 +18,45 @@ Before exploring optimization techniques, let’s understand the key metrics the
 
 ## Latency
 
-Latency measures how quickly a model responds to a request. For a single request, latency is the time from sending the request to receiving the final token on the user end. It’s crucial for user experience, especially in real-time applications.
+Latency measures how quickly a model responds to a request. It’s crucial for user experience, especially in interactive, real-time applications.
 
-There are two key metrics to measure latency:
+Key metrics to measure latency:
 
-- **Time to First Token (TTFT)**: The time it takes to generate the first token after sending a request. It reflects how fast the model can start responding. Different applications usually have different expectations for TTFT. For example, when summarizing a long document, users are usually willing to wait longer for the first token since the task is more demanding.
+- **Time to First Token (TTFT)**: The time it takes to generate the first token after sending a request. It reflects how fast the model can start responding.
 - **Time per Output Token (TPOT)**: Also known as Inter-Token Latency (ITL), TPOT measures the time between generating each subsequent token. A lower TPOT means the model can produce tokens faster, leading to higher tokens per second.
 
-In streaming scenarios where users see text appear word-by-word (like ChatGPT's interface), TPOT determines how smooth the experience feels. It should be fast enough to keep pace with human reading speed.
+In streaming scenarios where users see text appear word-by-word (like ChatGPT's interface), TPOT determines how smooth the experience feels. The system should ideally keep up with or exceed human reading speed.
+
+- **Token Generation Time**: The time between receiving the first and the final token. This measures how long it takes the model to stream out the full response.
+- **Total Latency (E2EL)**: The time from sending the request to receiving the final token on the user end. Note that:
+
+```bash
+Total Latency = TTFT + Token Generation Time
+```
+
+Total latency directly affects perceived responsiveness. A fast TTFT followed by slow token generation still leads to a poor experience.
+
+Acceptable latency depends on the use case. For example, a chatbot might require a TTFT under 500 milliseconds to feel responsive, while a code completion tool may need a TTFT below 100 milliseconds for a seamless developer experience. In contrast, if you’re generating long reports that are reviewed once a day, even a 30-second total latency may be perfectly acceptable. The key is to match latency targets to the pace and expectations of the task at hand.
+
+### Understanding mean, median, and P99 latency
+
+When analyzing LLM performance, especially latency, it’s not enough to look at just one number. Metrics like mean, median, and P99 each tell a different part of the story.
+
+- **Mean (Average)**: The sum of all values divided by the number of values. The mean gives a general sense of average performance, but it can be skewed by extreme values (outliers). For example, if the TTFT of one request is unusually slow, it inflates the mean.
+- **Median**: The middle value when all values are sorted. The median shows what a "typical" user experiences. It’s more stable and resistant to outliers than the mean. If your median TTFT is 30 seconds, most users are seeing very slow first responses, which might be unacceptable for real-time use cases.
+- **P99 (99th Percentile)**: The value below which 99% of requests fall. P99 reveals worst-case performance for the slowest 1% of requests. This is important when users expect consistency, or when your SLAs guarantee fast responses for 99% of cases. If your P99 TTFT is nearly 100 seconds, it suggests a small but significant portion of users face very long waits.
+
+:::note
+You may also see P90 or P95, which show the 90th and 95th percentile latencies, respectively. These are useful for understanding near-worst-case performance and are often used in cases where P99 may be too strict or sensitive to noise.
+:::
+
+Together, these metrics give you a complete view:
+
+- **Mean** helps monitor trends over time.
+- **Median** reflects the experience of the majority of users.
+- **P99** captures tail latency, which can make or break user experience in production.
+
+You’ll often see these metrics in LLM performance benchmarks, such as mean TTFT, median TPOT, and P99 E2EL, to capture different aspects of latency and user experience.
 
 ## Throughput
 
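The latency breakdown added above (TTFT, TPOT, Token Generation Time, and E2EL) can be measured directly from a streaming request. The following is an illustrative sketch, not part of this commit, that assumes an OpenAI-compatible endpoint at a placeholder `base_url` and a placeholder model name:

```python
import time

from openai import OpenAI

# Placeholder endpoint and model; point these at your own OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
arrival_times = []

stream = client.chat.completions.create(
    model="your-model",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)

for chunk in stream:
    # Record the arrival time of every streamed chunk that carries content.
    if chunk.choices and chunk.choices[0].delta.content:
        arrival_times.append(time.perf_counter())

ttft = arrival_times[0] - start                          # Time to First Token
generation_time = arrival_times[-1] - arrival_times[0]   # Token Generation Time
e2el = arrival_times[-1] - start                         # Total Latency = TTFT + Token Generation Time
tpot = generation_time / max(len(arrival_times) - 1, 1)  # approximate Time per Output Token

print(f"TTFT={ttft:.3f}s  TPOT={tpot * 1000:.1f}ms  E2EL={e2el:.3f}s")
```

Chunk boundaries do not always map one-to-one to tokens, so the TPOT computed this way is an approximation; dedicated benchmarking harnesses report these metrics more precisely.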
@@ -105,4 +136,5 @@ Using a serverless API can abstract away these optimizations, leaving you with l
 * [NVIDIA NIM LLMs Benchmarking - Metrics](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html)
 * [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)
 * [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/pdf/2411.00136)
+* [Throughput is Not All You Need](https://hao-ai-lab.github.io/blogs/distserve/)
 </LinkList>
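As a companion to the mean/median/P99 section added in this file, here is a small illustrative snippet (not from the commit) that aggregates hypothetical per-request TTFT samples into the statistics discussed above:

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds; note the single slow outlier.
ttft = [0.21, 0.25, 0.24, 0.30, 0.22, 0.27, 0.26, 0.95, 0.23, 0.29]

print(f"mean:   {statistics.mean(ttft):.3f}s")    # pulled up by the 0.95 s outlier
print(f"median: {statistics.median(ttft):.3f}s")  # closer to the typical experience
print(f"P95:    {percentile(ttft, 95):.3f}s")
print(f"P99:    {percentile(ttft, 99):.3f}s")     # tail latency
```

With only ten samples, P95 and P99 both land on the single outlier; at production scale the tail percentiles separate and are typically the numbers SLAs are written against.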

docs/llm-inference-basics/cpu-vs-gpu-vs-tpu.md

Lines changed: 19 additions & 1 deletion
@@ -7,6 +7,8 @@ keywords:
 - TPUs
 ---
 
+import LinkList from '@site/src/components/LinkList';
+
 # Where is LLM inference run?
 
 When deploying LLMs into production, choosing the right hardware is crucial. Different hardware types offer varied levels of performance and cost-efficiency. The three primary options are CPUs, GPUs, and TPUs. Understanding their strengths and weaknesses helps you optimize your inference workloads effectively.
@@ -29,4 +31,20 @@ TPUs are behind some of the most advanced AI applications today: agents, recomme
 
 ## Choosing the right hardware for your LLM inference
 
-Selecting the appropriate hardware requires you to understand your model size, inference volume, latency requirements, cost constraints, and available infrastructure. GPUs remain the most popular choice due to their versatility and broad support, while TPUs offer compelling advantages for certain specialized scenarios, and CPUs still have a place for lightweight, budget-conscious workloads.
+Selecting the appropriate hardware requires you to understand your model size, inference volume, latency requirements, cost constraints, and available infrastructure. GPUs remain the most popular choice due to their versatility and broad support, while TPUs offer compelling advantages for certain specialized scenarios, and CPUs still have a place for lightweight, budget-conscious workloads.
+
+## Choosing the deployment environment
+
+The deployment environment shapes everything from latency and scalability to privacy and cost. Each environment suits different operational needs for enterprises.
+
+- **Cloud**: The cloud is the most popular environment for LLM inference today. It offers on-demand access to high-performance GPUs and TPUs, along with a rich ecosystem of managed services, autoscaling, and monitoring tools.
+- **On-Prem**: On-premises deployment means running LLM inference on your own infrastructure, typically within a private data center. It offers full control over data, performance, and compliance, but requires more operational overhead.
+- **Edge**: In edge deployments, the model runs directly on user devices or local edge nodes, closer to where data is generated. This reduces network latency and increases data privacy, especially for time-sensitive or offline use cases. Edge inference usually uses smaller, optimized models due to limited compute resources.
+
+More details will be covered in the [infrastructure and management](../infrastructure-and-operations) chapter.
+
+<LinkList>
+## Additional resources
+* [How to Beat the GPU CAP Theorem in AI Inference](https://www.bentoml.com/blog/how-to-beat-the-gpu-cap-theorem-in-ai-inference)
+* [State of AI Inference Infrastructure Survey Highlights](https://www.bentoml.com/blog/2024-ai-infra-survey-highlights)
+</LinkList>
docs/llm-inference-basics/openai-compatible-api.md

Lines changed: 12 additions & 3 deletions
@@ -2,9 +2,9 @@
 sidebar_position: 6
 description: Learn the concept of OpenAI-compatible API and why you need it.
 keywords:
-- OpenAI-compatible API
+- OpenAI-compatible API, OpenAI-compatible endpoint
+- OpenAI API, OpenAI compatibility
 - ChatGPT
-- OpenAI
 ---
 
 import LinkList from '@site/src/components/LinkList';
@@ -26,14 +26,22 @@ As a result, it sees rapid adoption and ecosystem growth across various industri
 
 ## Why does compatibility matter?
 
-While OpenAI’s APIs helped kickstart the AI application development, their widespread adoption created ecosystem lock-in. Many developer tools and frameworks now only support the OpenAI API schema. Switching models or providers often requires rewriting significant parts of your application logic.
+While OpenAI’s APIs helped kickstart AI application development, their widespread adoption created ecosystem lock-in. Many developer tools, frameworks, and SDKs are now built specifically around the OpenAI schema. That becomes a problem if you want to:
+
+- Switch to a different model
+- Move to a self-hosted deployment
+- Try a new inference provider
+
+In these cases, rewriting application logic to fit a new API can be tedious and error-prone.
 
 OpenAI-compatible APIs address these challenges by providing:
 
 - **Drop-in replacement**: Swap out OpenAI’s hosted API for your own self-hosted or open-source model, without changing your application code.
 - **Seamless migration**: Move between providers or self-hosted deployments with minimal disruption.
 - **Consistent integration**: Maintain compatibility with tools and frameworks that rely on the OpenAI API schema (e.g., `chat/completions`, `embeddings` endpoints).
 
+Many inference backends (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
+
 ## How to call an OpenAI-compatible API
 
 Here’s a quick example of how easy it is to point your existing OpenAI client to a self-hosted or alternative provider’s endpoint:
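The code sample referenced here sits in the unchanged portion of the file and is not shown in this diff. As a rough sketch of the pattern, the OpenAI SDK is typically redirected by overriding `base_url`; the endpoint URL, API key, and model ID below are placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted, OpenAI-compatible server
# (for example, one exposed by vLLM or BentoML). Placeholder values, not real ones.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="na",  # many self-hosted servers ignore the API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server exposes
    messages=[{"role": "user", "content": "What is an OpenAI-compatible API?"}],
)
print(response.choices[0].message.content)
```

Because only the client construction changes, application code that already targets the `chat/completions` endpoint keeps working.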
@@ -78,4 +86,5 @@ If you’re already using OpenAI’s SDKs or REST interface, you can simply redi
 <LinkList>
 ## Additional resources
 * [OpenAI documentation](https://platform.openai.com/docs/quickstart?api-mode=chat)
+* [Examples: Serving LLMs with OpenAI-compatible APIs](https://github.com/bentoml/BentoVLLM)
 </LinkList>
