Commit fb0abc8

Merge pull request #16 from Sherlock113/docs/update-handbook-content
docs: Update metrics, hardware and openai-api content
2 parents 776dcea + 7704942 commit fb0abc8

3 files changed: +67 −8 lines changed

docs/inference-optimization/llm-inference-metrics.md

Lines changed: 36 additions & 4 deletions
@@ -18,14 +18,45 @@ Before exploring optimization techniques, let’s understand the key metrics the
 
 ## Latency
 
-Latency measures how quickly a model responds to a request. For a single request, latency is the time from sending the request to receiving the final token on the user end. It’s crucial for user experience, especially in real-time applications.
+Latency measures how quickly a model responds to a request. It’s crucial for user experience, especially in interactive, real-time applications.
 
-There are two key metrics to measure latency:
+Key metrics to measure latency:
 
-- **Time to First Token (TTFT)**: The time it takes to generate the first token after sending a request. It reflects how fast the model can start responding. Different applications usually have different expectations for TTFT. For example, when summarizing a long document, users are usually willing to wait longer for the first token since the task is more demanding.
+- **Time to First Token (TTFT)**: The time it takes to generate the first token after sending a request. It reflects how fast the model can start responding.
 - **Time per Output Token (TPOT)**: Also known as Inter-Token Latency (ITL), TPOT measures the time between generating each subsequent token. A lower TPOT means the model can produce tokens faster, leading to higher tokens per second.
 
-In streaming scenarios where users see text appear word-by-word (like ChatGPT's interface), TPOT determines how smooth the experience feels. It should be fast enough to keep pace with human reading speed.
+In streaming scenarios where users see text appear word-by-word (like ChatGPT's interface), TPOT determines how smooth the experience feels. The system should ideally keep up with or exceed human reading speed.
+
+- **Token Generation Time**: The time between receiving the first and the final token. This measures how long it takes the model to stream out the full response.
+- **Total Latency (E2EL)**: The time from sending the request to receiving the final token on the user end. Note that:
+
+```bash
+Total Latency = TTFT + Token Generation Time
+```
+
+Total latency directly affects perceived responsiveness. A fast TTFT followed by slow token generation still leads to a poor experience.
+
+Acceptable latency depends on the use case. For example, a chatbot might require a TTFT under 500 milliseconds to feel responsive, while a code completion tool may need a TTFT below 100 milliseconds for a seamless developer experience. In contrast, if you’re generating long reports that are reviewed once a day, even a 30-second total latency may be perfectly acceptable. The key is to match latency targets to the pace and expectations of the task at hand.
+
+### Understanding mean, median, and P99 latency
+
+When analyzing LLM performance, especially latency, it’s not enough to look at just one number. Metrics like mean, median, and P99 each tell a different part of the story.
+
+- **Mean (Average)**: The sum of all values divided by the number of values. The mean gives a general sense of average performance, but it can be skewed by extreme values (outliers). For example, if the TTFT of one request is unusually slow, it inflates the mean.
+- **Median**: The middle value when all values are sorted. The median shows what a "typical" user experiences. It’s more stable and resistant to outliers than the mean. If your median TTFT is 30 seconds, most users are seeing very slow first responses, which might be unacceptable for real-time use cases.
+- **P99 (99th Percentile)**: The value below which 99% of requests fall. P99 reveals worst-case performance for the slowest 1% of requests. This is important when users expect consistency, or when your SLAs guarantee fast responses for 99% of cases. If your P99 TTFT is nearly 100 seconds, it suggests a small but significant portion of users face very long waits.
+
+:::note
+You may also see P90 or P95, which show the 90th and 95th percentile latencies, respectively. These are useful for understanding near-worst-case performance and are often used in cases where P99 may be too strict or sensitive to noise.
+:::
+
+Together, these metrics give you a complete view:
+
+- **Mean** helps monitor trends over time.
+- **Median** reflects the experience of the majority of users.
+- **P99** captures tail latency, which can make or break user experience in production.
+
+You’ll often see these metrics in LLM performance benchmarks, such as mean TTFT, median TPOT, and P99 E2EL, to capture different aspects of latency and user experience.
 
 ## Throughput
 
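The latency breakdown added above (TTFT, TPOT, Token Generation Time, and E2EL) can be measured directly from a streaming request. The following is an illustrative sketch, not part of this commit, that assumes an OpenAI-compatible endpoint at a placeholder `base_url` and a placeholder model name:

```python
import time

from openai import OpenAI

# Placeholder endpoint and model; point these at your own OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
arrival_times = []

stream = client.chat.completions.create(
    model="your-model",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)

for chunk in stream:
    # Record the arrival time of every streamed chunk that carries content.
    if chunk.choices and chunk.choices[0].delta.content:
        arrival_times.append(time.perf_counter())

ttft = arrival_times[0] - start                          # Time to First Token
generation_time = arrival_times[-1] - arrival_times[0]   # Token Generation Time
e2el = arrival_times[-1] - start                         # Total Latency = TTFT + Token Generation Time
tpot = generation_time / max(len(arrival_times) - 1, 1)  # approximate Time per Output Token

print(f"TTFT={ttft:.3f}s  TPOT={tpot * 1000:.1f}ms  E2EL={e2el:.3f}s")
```

Chunk boundaries do not always map one-to-one to tokens, so the TPOT computed this way is an approximation; dedicated benchmarking harnesses report these metrics more precisely.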
@@ -105,4 +136,5 @@ Using a serverless API can abstract away these optimizations, leaving you with l
 * [NVIDIA NIM LLMs Benchmarking - Metrics](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html)
 * [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)
 * [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/pdf/2411.00136)
+* [Throughput is Not All You Need](https://hao-ai-lab.github.io/blogs/distserve/)
 </LinkList>
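As a companion to the mean/median/P99 section added in this file, here is a small illustrative snippet (not from the commit) that aggregates hypothetical per-request TTFT samples into the statistics discussed above:

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds; note the single slow outlier.
ttft = [0.21, 0.25, 0.24, 0.30, 0.22, 0.27, 0.26, 0.95, 0.23, 0.29]

print(f"mean:   {statistics.mean(ttft):.3f}s")    # pulled up by the 0.95 s outlier
print(f"median: {statistics.median(ttft):.3f}s")  # closer to the typical experience
print(f"P95:    {percentile(ttft, 95):.3f}s")
print(f"P99:    {percentile(ttft, 99):.3f}s")     # tail latency
```

With only ten samples, P95 and P99 both land on the single outlier; at production scale the tail percentiles separate and are typically the numbers SLAs are written against.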

docs/llm-inference-basics/cpu-vs-gpu-vs-tpu.md

Lines changed: 19 additions & 1 deletion
@@ -7,6 +7,8 @@ keywords:
 - TPUs
 ---
 
+import LinkList from '@site/src/components/LinkList';
+
 # Where is LLM inference run?
 
 When deploying LLMs into production, choosing the right hardware is crucial. Different hardware types offer varied levels of performance and cost-efficiency. The three primary options are CPUs, GPUs, and TPUs. Understanding their strengths and weaknesses helps you optimize your inference workloads effectively.
@@ -29,4 +31,20 @@ TPUs are behind some of the most advanced AI applications today: agents, recomme
 
 ## Choosing the right hardware for your LLM inference
 
-Selecting the appropriate hardware requires you to understand your model size, inference volume, latency requirements, cost constraints, and available infrastructure. GPUs remain the most popular choice due to their versatility and broad support, while TPUs offer compelling advantages for certain specialized scenarios, and CPUs still have a place for lightweight, budget-conscious workloads.
+Selecting the appropriate hardware requires you to understand your model size, inference volume, latency requirements, cost constraints, and available infrastructure. GPUs remain the most popular choice due to their versatility and broad support, while TPUs offer compelling advantages for certain specialized scenarios, and CPUs still have a place for lightweight, budget-conscious workloads.
+
+## Choosing the deployment environment
+
+The deployment environment shapes everything from latency and scalability to privacy and cost. Each environment suits different operational needs for enterprises.
+
+- **Cloud**: The cloud is the most popular environment for LLM inference today. It offers on-demand access to high-performance GPUs and TPUs, along with a rich ecosystem of managed services, autoscaling, and monitoring tools.
+- **On-Prem**: On-premises deployment means running LLM inference on your own infrastructure, typically within a private data center. It offers full control over data, performance, and compliance, but requires more operational overhead.
+- **Edge**: In edge deployments, the model runs directly on user devices or local edge nodes, closer to where data is generated. This reduces network latency and increases data privacy, especially for time-sensitive or offline use cases. Edge inference usually uses smaller, optimized models due to limited compute resources.
+
+More details will be covered in the [infrastructure and management](../infrastructure-and-operations) chapter.
+
+<LinkList>
+## Additional resources
+* [How to Beat the GPU CAP Theorem in AI Inference](https://www.bentoml.com/blog/how-to-beat-the-gpu-cap-theorem-in-ai-inference)
+* [State of AI Inference Infrastructure Survey Highlights](https://www.bentoml.com/blog/2024-ai-infra-survey-highlights)
+</LinkList>
docs/llm-inference-basics/openai-compatible-api.md

Lines changed: 12 additions & 3 deletions
@@ -2,9 +2,9 @@
 sidebar_position: 6
 description: Learn the concept of OpenAI-compatible API and why you need it.
 keywords:
-- OpenAI-compatible API
+- OpenAI-compatible API, OpenAI-compatible endpoint
+- OpenAI API, OpenAI compatibility
 - ChatGPT
-- OpenAI
 ---
 
 import LinkList from '@site/src/components/LinkList';
@@ -26,14 +26,22 @@ As a result, it sees rapid adoption and ecosystem growth across various industri
 
 ## Why does compatibility matter?
 
-While OpenAI’s APIs helped kickstart the AI application development, their widespread adoption created ecosystem lock-in. Many developer tools and frameworks now only support the OpenAI API schema. Switching models or providers often requires rewriting significant parts of your application logic.
+While OpenAI’s APIs helped kickstart AI application development, their widespread adoption created ecosystem lock-in. Many developer tools, frameworks, and SDKs are now built specifically around the OpenAI schema. That becomes a problem if you want to:
+
+- Switch to a different model
+- Move to a self-hosted deployment
+- Try a new inference provider
+
+In these cases, rewriting application logic to fit a new API can be tedious and error-prone.
 
 OpenAI-compatible APIs address these challenges by providing:
 
 - **Drop-in replacement**: Swap out OpenAI’s hosted API for your own self-hosted or open-source model, without changing your application code.
 - **Seamless migration**: Move between providers or self-hosted deployments with minimal disruption.
 - **Consistent integration**: Maintain compatibility with tools and frameworks that rely on the OpenAI API schema (e.g., `chat/completions`, `embeddings` endpoints).
 
+Many inference backends (e.g., vLLM and SGLang) and model serving frameworks (e.g., BentoML) now provide OpenAI-compatible endpoints out of the box. This makes it easy to switch between different models without changing client code.
+
 ## How to call an OpenAI-compatible API
 
 Here’s a quick example of how easy it is to point your existing OpenAI client to a self-hosted or alternative provider’s endpoint:
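The code sample referenced here sits in the unchanged portion of the file and is not shown in this diff. As a rough sketch of the pattern, the OpenAI SDK is typically redirected by overriding `base_url`; the endpoint URL, API key, and model ID below are placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted, OpenAI-compatible server
# (for example, one exposed by vLLM or BentoML). Placeholder values, not real ones.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="na",  # many self-hosted servers ignore the API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server exposes
    messages=[{"role": "user", "content": "What is an OpenAI-compatible API?"}],
)
print(response.choices[0].message.content)
```

Because only the client construction changes, application code that already targets the `chat/completions` endpoint keeps working.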
@@ -78,4 +86,5 @@ If you’re already using OpenAI’s SDKs or REST interface, you can simply redi
 <LinkList>
 ## Additional resources
 * [OpenAI documentation](https://platform.openai.com/docs/quickstart?api-mode=chat)
+* [Examples: Serving LLMs with OpenAI-compatible APIs](https://github.com/bentoml/BentoVLLM)
 </LinkList>
