
Commit 673c965

fix: resolve issue with image path
1 parent 47ae220 commit 673c965

29 files changed: +21 −20 lines changed

docs/getting-started/tool-integration/function-calling.md

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ import LinkList from '@site/src/components/LinkList';

Function calling is like giving an LLM a toolbox with specific tools it can use to help you. When you ask the model to do something that requires one of these tools, it can "call" or use that tool to get the job done.

-![function-calling-diagram.png](/img/docs/function-calling-diagram.png)
+![function-calling-diagram.png](./img/function-calling-diagram.png)
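
To make the toolbox analogy concrete, here is a minimal, generic sketch in Python. The tool schema, the `get_weather` function, and the stubbed model response are all hypothetical; a real application would send the schema to a chat completion API and parse the tool call the model returns.

```python
# Hypothetical tool definition in the JSON-schema style most chat APIs accept.
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

def get_weather(city: str) -> str:
    # Stand-in for a real weather API call.
    return f"Sunny, 22°C in {city}"

# Stubbed model output: the model decides it needs the tool and emits a call.
model_response = {"tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}}

# The application executes the requested tool and feeds the result back to the model.
call = model_response["tool_call"]
if call["name"] == "get_weather":
    tool_result = get_weather(**call["arguments"])
    print(tool_result)  # -> "Sunny, 22°C in Paris"
```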

Here is a specific example:

docs/getting-started/tool-integration/model-context-protocol.md

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ MCP uses a client-server architecture with the following components:

- **MCP servers**: The connectors that expose different capabilities and data sources. Each server can connect to various backends like databases, third-party APIs, GitHub repositories, local files, or any other data source. Multiple servers can be running simultaneously on your local machine or connected to remote services.
- **MCP protocol**: This is the transport layer that enables communication between the host and servers, regardless of how many servers are connected.

-![mcp-architecture.png](/img/docs/mcp-architecture.png)
+![mcp-architecture.png](./img/mcp-architecture.png)
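
As a rough mental model of these roles, here is a toy sketch in plain Python. It does not use the real MCP SDK or wire protocol; `FileServer`, `GitHubServer`, and `Host` are illustrative stand-ins for the server and host components, and the GitHub backend is stubbed.

```python
# Toy stand-ins for MCP roles; the real protocol uses JSON-RPC over a transport layer.
class FileServer:
    """Illustrative 'MCP server' exposing a local-files capability."""
    capabilities = {"read_file"}

    def handle(self, request: dict) -> str:
        with open(request["path"], encoding="utf-8") as f:
            return f.read()

class GitHubServer:
    """Illustrative 'MCP server' for a third-party API backend (stubbed)."""
    capabilities = {"list_issues"}

    def handle(self, request: dict) -> str:
        return f"(stub) issues for {request['repo']}"

class Host:
    """The AI application: routes each request to a server that exposes the capability."""
    def __init__(self, servers):
        self.servers = servers

    def request(self, capability: str, **params) -> str:
        for server in self.servers:
            if capability in server.capabilities:
                return server.handle(params)
        raise LookupError(f"No connected server exposes {capability!r}")

host = Host([FileServer(), GitHubServer()])
print(host.request("list_issues", repo="octocat/hello-world"))
```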

When your AI assistant needs to access external data or tools, here's what happens at a high level:

docs/inference-optimization/data-tensor-pipeline-expert-hybrid-parallelism.md

Lines changed: 7 additions & 7 deletions

@@ -14,25 +14,25 @@ Parallelism strategies are essential for achieving high-performance computing in

Data parallelism is a common technique to accelerate computation. In this approach, the model's weights are replicated across multiple GPU devices, and the global batch of input data is divided into smaller microbatches. Each device processes only the microbatch assigned to it, and this happens in parallel across all devices. This delivers faster execution by allowing larger batches to be processed simultaneously.

-![dp.png](/img/docs/dp.png)
+![dp.png](./img/dp.png)
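
A minimal sketch of this batch splitting, assuming PyTorch and two visible GPUs; the model is a placeholder linear layer rather than a real LLM, and the replicas run sequentially here even though they would run in parallel in practice.

```python
import torch
import torch.nn as nn

devices = ["cuda:0", "cuda:1"]  # assumes two GPUs are available

# Replicate the same (toy) model weights on every device.
base = nn.Linear(16, 4)
replicas = [nn.Linear(16, 4).to(d) for d in devices]
for rep in replicas:
    rep.load_state_dict(base.state_dict())

# Split the global batch into one microbatch per device.
global_batch = torch.randn(8, 16)
microbatches = torch.chunk(global_batch, chunks=len(devices), dim=0)

# Each replica processes only its own microbatch (in parallel in practice).
outputs = [rep(mb.to(d)) for rep, mb, d in zip(replicas, microbatches, devices)]
result = torch.cat([o.cpu() for o in outputs], dim=0)  # shape: (8, 4)
```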
## Tensor parallelism

Tensor parallelism slices individual layers of the model into smaller blocks. These blocks are computed independently and in parallel across different devices. For example, during matrix multiplication, different slices of the matrix can be processed simultaneously on different GPUs.

-![tp-inference.png](/img/docs/tp-inference.png)
+![tp-inference.png](./img/tp-inference.png)

This approach delivers faster computation and allows serving LLMs that do not fit into the memory of a single device. However, because it involves extra communication between devices, you need to balance the performance gain against this overhead.
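
The matrix-multiplication example can be sketched with NumPy: the weight matrix is split column-wise into two shards, each shard's product is computed independently (on separate GPUs in a real setup), and the partial results are gathered. The shapes and the two-way split are arbitrary illustrative choices.

```python
import numpy as np

x = np.random.randn(4, 8)        # activations: (batch, hidden)
w = np.random.randn(8, 6)        # weight matrix of one layer

# Column-parallel split: each "device" owns half of the output columns.
w_shard_0, w_shard_1 = np.split(w, 2, axis=1)

# Each shard's matmul is independent, so the two can run on different GPUs.
y0 = x @ w_shard_0               # (4, 3) on device 0
y1 = x @ w_shard_1               # (4, 3) on device 1

# Gathering the shards (an all-gather in practice) reproduces the full result.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ w)
```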
## Pipeline parallelism

[Pipeline parallelism](https://arxiv.org/pdf/1811.06965) divides the model’s layers into sequential chunks, each assigned to a separate device. Data flows through these chunks like an assembly line, with the output of one device becoming the input for the next. For instance, in a four-way pipeline, each device processes a quarter of the model’s layers.

-![pp-diagram.png](/img/docs/pp-diagram.png)
+![pp-diagram.png](./img/pp-diagram.png)

However, because each device depends on the output of the previous one, some devices may be idle at times, which means resource underutilization. To reduce these idle periods, the input batch can be split into smaller microbatches. Each microbatch flows through the pipeline one by one, and gradients are accumulated at the end. This microbatching improves GPU utilization, though it does not completely eliminate idle time.

-![pp-batching.png](/img/docs/pp-batching.png)
+![pp-batching.png](./img/pp-batching.png)

Note that pipeline parallelism can increase the total latency for each request because of communication between different pipeline stages.
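
A toy sketch of this assembly-line flow and microbatching, using four hypothetical stages (plain Python functions standing in for quarters of the model) and sequential execution; a real pipeline would overlap microbatches across devices.

```python
# Four "stages", each standing in for a quarter of the model's layers.
stages = [lambda x, k=k: [v + k for v in x] for k in range(4)]

def run_pipeline(batch, num_microbatches=4):
    # Split the input batch into microbatches that flow through the stages.
    size = len(batch) // num_microbatches
    microbatches = [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]

    outputs = []
    for mb in microbatches:           # one by one, like an assembly line
        for stage in stages:          # output of one stage feeds the next
            mb = stage(mb)
        outputs.extend(mb)
    return outputs

print(run_pipeline(list(range(8))))  # each element has passed through all 4 stages
```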
@@ -41,7 +41,7 @@ communication between different pipeline stages.

Expert parallelism is a specialized parallelism strategy used in Mixture of Experts (MoE) models. In these models, only a subset of the model’s experts is activated for each token. Instead of duplicating all experts across every device (e.g., GPU), expert parallelism splits the experts themselves across different devices.

-![ep-inference.png](/img/docs/ep-inference.png)
+![ep-inference.png](./img/ep-inference.png)

Each GPU holds the full weights of only some experts, not all. This means that each GPU processes only the tokens assigned to the experts stored on that GPU. In contrast, if you apply tensor parallelism for MoE models, it simply slices the weight matrices of all experts and distributes these slices across all devices.
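
A minimal sketch of this token routing, assuming four hypothetical experts spread over two devices modeled as plain dictionaries; a real MoE layer would pick experts with a learned gate rather than the fixed assignment used here.

```python
# Two "devices", each holding the full weights of only two of the four experts.
device_of_expert = {0: "gpu0", 1: "gpu0", 2: "gpu1", 3: "gpu1"}

# Tokens paired with the expert the (stubbed) router picked for each of them.
routed_tokens = [("tok_a", 0), ("tok_b", 3), ("tok_c", 2), ("tok_d", 1)]

# Each device processes only the tokens routed to the experts it stores.
work_per_device = {"gpu0": [], "gpu1": []}
for token, expert in routed_tokens:
    work_per_device[device_of_expert[expert]].append((token, expert))

print(work_per_device)
# {'gpu0': [('tok_a', 0), ('tok_d', 1)], 'gpu1': [('tok_b', 3), ('tok_c', 2)]}
```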

@@ -53,15 +53,15 @@ For certain models, relying on a single parallelism strategy is often not enough

A typical hybrid setup might look like this (combining data parallelism and tensor parallelism):

-![dp+tp.png](/img/docs/dptp.png)
+![dp+tp.png](./img/dptp.png)

If you have 8 GPUs, you could apply tensor parallelism across the first four GPUs (TP=4), then replicate that setup to the remaining ones using data parallelism (DP=2).

Note that this is only one of several possible combinations, each with its own advantages and disadvantages. In the above example, tensor parallelism introduces communication overhead between GPUs, especially during inference. Therefore, using a high TP degree doesn't always translate to better performance.

An alternative configuration is to reduce tensor parallelism and increase data parallelism. For example, you can set TP=2 and DP=4:

-![dp4tp2.png](/img/docs/dp4tp2.png)
+![dp4tp2.png](./img/dp4tp2.png)

This reduces cross-GPU communication, which may help lower latency during inference. However, there’s a catch: model weights consume a large portion of GPU memory, especially for large models. Lowering tensor parallelism means fewer GPUs share the model, leaving less room for KV cache. This can degrade inference optimizations like prefix caching.
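
One way to reason about such layouts is to enumerate the GPU groups explicitly. The sketch below is illustration only: it assumes 8 GPUs numbered 0 through 7 and just prints the groupings, it does not launch any inference engine.

```python
def hybrid_groups(num_gpus: int, tp: int):
    """Partition GPU ids into TP groups; each group holds one model replica (DP)."""
    assert num_gpus % tp == 0, "TP degree must divide the GPU count"
    dp = num_gpus // tp
    return [list(range(r * tp, (r + 1) * tp)) for r in range(dp)]

# TP=4, DP=2: two replicas, each sharded over four GPUs.
print(hybrid_groups(8, tp=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# TP=2, DP=4: four replicas, less cross-GPU traffic but more weight memory per GPU.
print(hybrid_groups(8, tp=2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```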

File renamed without changes.
File renamed without changes.
