docs/getting-started/tool-integration/function-calling.md (+1 -1)
@@ -11,7 +11,7 @@ import LinkList from '@site/src/components/LinkList';
Function calling is like giving an LLM a toolbox with specific tools it can use to help you. When you ask the model to do something that requires one of these tools, it can "call" or use that tool to get the job done.
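To make this concrete, here is a minimal sketch of exposing a tool to a model through an OpenAI-compatible chat completions API. The `get_weather` tool, its schema, and the model name are placeholders for illustration, not something defined in these docs.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

# Describe the tool so the model knows when and how to "call" it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model decides it needs the tool, it returns a tool call instead of plain text.
print(response.choices[0].message.tool_calls)
```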
docs/getting-started/tool-integration/model-context-protocol.md (+1 -1)
@@ -17,7 +17,7 @@ MCP uses a client-server architecture with the following components:
- **MCP servers**: The connectors that expose different capabilities and data sources. Each server can connect to various backends like databases, third-party APIs, GitHub repositories, local files, or any other data source. Multiple servers can run simultaneously on your local machine or connect to remote services.
- **MCP protocol**: This is the transport layer that enables communication between the host and servers, regardless of how many servers are connected.
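As a rough sketch of the server side, this is roughly what a tiny MCP server looks like with the FastMCP helper from the official Python SDK; the server name and the `add` tool are made up for illustration.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")  # hypothetical server name

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""  # exposed to MCP hosts as a callable tool
    return a + b

if __name__ == "__main__":
    # Talks to the host over stdio, one common MCP transport.
    mcp.run(transport="stdio")
```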
docs/inference-optimization/data-tensor-pipeline-expert-hybrid-parallelism.md (+7 -7)
@@ -14,25 +14,25 @@ Parallelism strategies are essential for achieving high-performance computing in
Data parallelism is a common technique to accelerate computation. In this approach, the model's weights are replicated across multiple GPU devices, and the global batch of input data is divided into smaller microbatches. Each device processes only the microbatch assigned to it, and all devices do this in parallel. This delivers faster execution by allowing larger batches to be processed simultaneously.
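A toy NumPy sketch of the idea, with a single linear layer standing in for the model and two simulated devices (all shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))            # full weights, copied to every replica
global_batch = rng.standard_normal((8, 16))

num_devices = 2
microbatches = np.array_split(global_batch, num_devices)  # one shard per device

# Each replica runs its own microbatch; in practice these run in parallel on separate GPUs.
outputs = [mb @ W for mb in microbatches]
y = np.concatenate(outputs)                 # gather the results back into one batch
assert y.shape == (8, 4)
```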
## Tensor parallelism
Tensor parallelism slices individual layers of the model into smaller blocks. These blocks are computed independently and in parallel across different devices. For example, during matrix multiplication, different slices of the matrix can be processed simultaneously on different GPUs.
This approach delivers faster computation and allows serving LLMs that do not fit into the memory of a single device. However, because it involves extra communication between devices, you need to balance the performance gain against this overhead.
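Continuing the toy NumPy example, a column-sharded matrix multiplication gives the flavor of tensor parallelism; the concatenation at the end stands in for the all-gather communication step:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W = rng.standard_normal((16, 4))

# Column-parallel sharding: each device holds one vertical slice of the weight matrix.
num_devices = 2
W_shards = np.array_split(W, num_devices, axis=1)

partials = [x @ shard for shard in W_shards]  # computed in parallel across GPUs
y = np.concatenate(partials, axis=1)          # the communication (all-gather) step
assert np.allclose(y, x @ W)
```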
## Pipeline parallelism
[Pipeline parallelism](https://arxiv.org/pdf/1811.06965) divides the model’s layers into sequential chunks, each assigned to a separate device. Data flows through these chunks like an assembly line, with the output of one device becoming the input for the next. For instance, in a four-way pipeline, each device processes a quarter of the model’s layers.
However, because each device depends on the output of the previous one, some devices may be idle at times, which leads to resource underutilization. To reduce these idle periods, the input batch can be split into smaller microbatches. Each microbatch flows through the pipeline in turn, and gradients are accumulated at the end. This microbatching improves GPU utilization, though it does not completely eliminate idle time.
Note that pipeline parallelism can increase the total latency for each request because of communication between different pipeline stages.
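A toy sketch of the flow, with four linear "stages" and arbitrary shapes; a real schedule interleaves microbatches so stages work concurrently instead of looping sequentially like this:

```python
import numpy as np

rng = np.random.default_rng(0)
stages = [rng.standard_normal((16, 16)) for _ in range(4)]  # one stage per device
global_batch = rng.standard_normal((8, 16))

microbatches = np.array_split(global_batch, 4)  # smaller chunks reduce idle time

outputs = []
for mb in microbatches:
    for stage_weights in stages:                # output of one stage feeds the next
        mb = np.maximum(mb @ stage_weights, 0.0)
    outputs.append(mb)

y = np.concatenate(outputs)
assert y.shape == (8, 16)
```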
@@ -41,7 +41,7 @@ communication between different pipeline stages.
Expert parallelism is a specialized parallelism strategy used in Mixture of Experts (MoE) models. In these models, only a subset of the model’s experts is activated for each token. Instead of duplicating all experts across every device (e.g., GPU), expert parallelism splits the experts themselves across different devices.
Each GPU holds the full weights of only some experts, not all. This means each GPU processes only the tokens routed to the experts stored on it. In contrast, applying tensor parallelism to MoE models simply slices the weight matrices of all experts and distributes those slices across all devices.
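A toy top-1 routing sketch (four experts across two devices, arbitrary shapes) illustrates the dispatch pattern; the all-to-all exchange that moves tokens between devices is only hinted at in the comments:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))

# Four experts, two devices: each device stores the full weights of two experts.
experts = [rng.standard_normal((16, 16)) for _ in range(4)]
expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}

# A toy router assigns each token to one expert (top-1 gating).
expert_ids = rng.integers(0, 4, size=len(tokens))

outputs = np.empty_like(tokens)
for expert_id, W in enumerate(experts):
    device = expert_to_device[expert_id]   # the GPU that owns this expert
    mask = expert_ids == expert_id
    # In a real system, the selected tokens are sent to `device` via an all-to-all exchange.
    outputs[mask] = tokens[mask] @ W
```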
@@ -53,15 +53,15 @@ For certain models, relying on a single parallelism strategy is often not enough
A typical hybrid setup might look like this (combining data parallelism and tensor parallelism):
If you have 8 GPUs, you could apply tensor parallelism across the first four GPUs (TP=4), then replicate that setup across the remaining four using data parallelism (DP=2).
Note that this is only one of many possible combinations, each with its own advantages and disadvantages. In the above example, tensor parallelism introduces communication overhead between GPUs, especially during inference. Therefore, using a high TP degree doesn't always translate to better performance.
An alternative configuration is to reduce tensor parallelism and increase data parallelism. For example, you can set TP=2 and DP=4:
This reduces cross-GPU communication, which may help lower latency during inference. However, there’s a catch: model weights consume a large portion of GPU memory, especially for large models. Lowering tensor parallelism means fewer GPUs share the model, leaving less room for KV cache. This can degrade inference optimizations like prefix caching.
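As a rough bookkeeping sketch (no real distributed setup) of how the two example configurations group 8 GPUs:

```python
def hybrid_groups(num_gpus: int, tp: int, dp: int) -> list[list[int]]:
    """Group GPU ranks into dp replicas, each a tp-way tensor-parallel group."""
    assert tp * dp == num_gpus, "TP degree times DP degree must equal the GPU count"
    return [list(range(r * tp, (r + 1) * tp)) for r in range(dp)]

# TP=4, DP=2: two model replicas, each sharded across four GPUs.
print(hybrid_groups(8, tp=4, dp=2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# TP=2, DP=4: four replicas, each sharded across two GPUs.
print(hybrid_groups(8, tp=2, dp=4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```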