
[GCU] Enable gcu CI #3190

Merged · 2 commits · Aug 13, 2025

24 changes: 16 additions & 8 deletions .github/workflows/ci_gcu.yml
@@ -29,7 +29,9 @@ jobs:
REPO_NAME="${FULL_REPO##*/}"
BASE_BRANCH="${{ github.base_ref }}"
# Clean the repository directory before starting
- docker run --rm --net=host -v $(pwd):/workspace -w /workspace \
+ docker run --rm --net=host -v $(pwd):/workspace \
+ -v ${{ github.workspace }}/../../..:${{ github.workspace }}/../../.. \
+ -w /workspace \
-e "REPO_NAME=${REPO_NAME}" \
-e "BASE_BRANCH=${BASE_BRANCH}" \
${docker_image} /bin/bash -c '
@@ -40,6 +42,7 @@ jobs:
'
git config --global user.name "FastDeployCI"
git config --global user.email "fastdeploy_ci@example.com"
+ source ${{ github.workspace }}/../../../proxy
git clone ${REPO} ${REPO_NAME} -b ${BASE_BRANCH}
cd FastDeploy
if [ "${{ github.event_name }}" = "pull_request" ]; then
@@ -50,6 +53,9 @@ jobs:
git checkout ${{ github.sha }}
git log -n 3 --oneline
fi
echo "Copy models..."
sudo mkdir -p ci_models && sudo cp -r /work/deps/ERNIE-4.5-21B-A3B-Paddle ci_models
echo "Copy models done."
- name: Run CI unittest
env:
@@ -71,19 +77,21 @@
echo "PARENT_DIR:$PARENT_DIR"
echo "Install drivers..."
cd /work/deps
- bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
+ sudo bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
cd -
- docker run --rm --network=host --ipc=host -it --privileged \
- -v $(pwd):/workspace -w /workspace \
- -v "/home:/home" \
- -v "/work:/work" \
- -e "MODEL_PATH=/work/models" \
+ echo "Create docker..."
+ docker run --rm --network=host --ipc=host --privileged \
+ -v $(pwd):/workspace \
+ -v /home:/home \
+ -v /work:/work \
+ -w /workspace \
+ -e "MODEL_PATH=./ci_models" \
-e "http_proxy=$(git config --global --get http.proxy)" \
-e "https_proxy=$(git config --global --get https.proxy)" \
-e "FD_API_PORT=${FD_API_PORT}" \
-e "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}" \
-e "FD_METRICS_PORT=${FD_METRICS_PORT}" \
${docker_image} /bin/bash -c "
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
bash scripts/run_ci_gcu.sh
@@ -76,6 +76,8 @@ def __init__(
kv_num_heads: int,
num_heads: int,
head_dim: int,
+ encoder_block_shape_q: int = -1,
+ decoder_block_shape_q: int = -1,
):
"""
GCUFlashAttnBackend __init__
@@ -94,7 +96,7 @@ def __init__(
self.head_dim = head_dim
self.scaling = 1.0 / (self.head_dim**0.5)
self.num_layers = fd_config.model_config.num_hidden_layers
- self.position_ids_base = paddle.arange(self.max_seq_len)
+ self.position_ids_base = np.arange(self.max_seq_len)

# TODO(zhengjun): Need to adapt the allocation logic and
# temporarily allocate according to fixed size
@@ -74,6 +74,8 @@ def __init__(
kv_num_heads: int,
num_heads: int,
head_dim: int,
+ encoder_block_shape_q: int = -1,
+ decoder_block_shape_q: int = -1,
):
"""
GCUMemEfficientAttnBackend __init__
@@ -92,7 +94,7 @@ def __init__(
self.head_dim = head_dim
self.scaling = 1.0 / (self.head_dim**0.5)
self.num_layers = fd_config.model_config.num_hidden_layers
- self.position_ids_base = paddle.arange(self.max_seq_len)
+ self.position_ids_base = np.arange(self.max_seq_len)

# TODO(zhengjun): Need to adapt the allocation logic and
# temporarily allocate according to fixed size
8 changes: 3 additions & 5 deletions fastdeploy/worker/gcu_model_runner.py
@@ -295,7 +295,7 @@ def get_attr_from_request(request, attr, default_value=None):

if self.speculative_method in ["mtp"]:
self.proposer.insert_prefill_inputs(req_dicts)
- self.share_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer[:num_running_requests]
+ self.share_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer

def _dummy_prefill_inputs(self, num_tokens: int, batch_size: int, expected_decode_len: int):
"""Set dummy prefill inputs to share_inputs"""
@@ -675,7 +675,7 @@ def initialize_attn_backend(self) -> None:
)
self.share_inputs["decoder_batch_ids"] = paddle.full([int(decode_max_tile_size)], 0, dtype="int32")
self.share_inputs["decoder_tile_ids_per_batch"] = paddle.full([int(decode_max_tile_size)], 0, dtype="int32")
self.share_inputs["decoder_num_blocks_cpu"] = paddle.full([1], 0, dtype="int32").pin_memory()

Collaborator: Why was this deleted outright? After the deletion it becomes a GPU tensor. If pinned memory is not used, shouldn't .cpu() still be appended?

Collaborator: This tensor is ultimately consumed by the get_block_shape_and_split_kv_block kernel.

Contributor (author): Done

self.share_inputs["decoder_num_blocks_cpu"] = paddle.full([1], 0, dtype="int32").cpu()
self.share_inputs["max_len_tensor_cpu"] = paddle.full([8], 0, dtype="int32").cpu()

# Get the attention backend
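
As an aside to the thread above, here is a minimal sketch of the three tensor placements under discussion. It is illustrative only (the variable names are hypothetical), not code from this PR:

import paddle

# Default placement: on the active device (e.g. GPU/GCU). This is what the
# reviewer warned a bare deletion of .pin_memory() would produce.
decoder_num_blocks_device = paddle.full([1], 0, dtype="int32")

# Page-locked (pinned) host memory, the original choice: supports fast,
# asynchronous host <-> device copies.
decoder_num_blocks_pinned = paddle.full([1], 0, dtype="int32").pin_memory()

# Plain pageable host memory, the resolution adopted in this PR.
decoder_num_blocks_cpu = paddle.full([1], 0, dtype="int32").cpu()
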
@@ -1062,9 +1062,7 @@ class at the server level, which is too granular for ModelRunner.

self._update_chunked_prefill(model_forward_batch)
self._add_cache(model_forward_batch)
- self.seq_lens_this_time_buffer[:num_running_requests].copy_(
-     self.share_inputs["seq_lens_this_time"][:num_running_requests], False
- )
+ self.seq_lens_this_time_buffer.copy_(self.share_inputs["seq_lens_this_time"], False)

Collaborator: What is the reason for this change?

Contributor (author):

1. This mainly pairs with the change at line 300, so that the complete buffer is propagated:
   self.share_inputs["seq_lens_this_time"] = self.seq_lens_this_time_buffer
2. Why real_bsz is not adopted on GCU for now: the AttentionBackend and the pre- and post-processing operators (update_inputs_gcu, set_value_by_flags_and_idx_gcu, etc.) do some work based on the shape of seq_lens_this_time, so this should be reworked in a unified pass.
3. Possible impact of the real_bsz change on GCU:
   - The scheduling system presumably has to guarantee that the num_running_requests requests scheduled in this round are packed at the front of the whole task list?
   - Custom operators and any other code that consumes seq_lens_this_time now face changed constraints and need to be audited and reworked?

return None

def _add_cache(self, model_forward_batch) -> None:
78 changes: 52 additions & 26 deletions scripts/run_ci_gcu.sh
@@ -1,33 +1,39 @@
- #!/bin/bash
+ #!/usr/bin/env bash
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
echo "$DIR"
echo "Current directory: ${DIR}"

- # Kill everything once, up front
- ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
- ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
- lsof -t -i :8188 | xargs kill -9 || true
+ function stop_processes() {
+     ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
+     ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
+     lsof -t -i :8188 | xargs kill -9 || true
+ }

- export model_path=${MODEL_PATH}/paddle/ERNIE-4.5-21B-A3B-Paddle
+ echo "Clean up processes..."
+ stop_processes
+ echo "Clean up completed."
+
+ export model_path=${MODEL_PATH}/ERNIE-4.5-21B-A3B-Paddle

echo "pip install requirements"
python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
echo "uninstall org"
python -m pip uninstall paddlepaddle -y
python -m pip uninstall paddle-custom-gcu -y
python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
python -m pip install --pre paddle-custom-gcu==3.0.0.dev20250801 -i https://www.paddlepaddle.org.cn/packages/nightly/gcu/
echo "build whl"
bash build.sh 1 || exit 1

unset http_proxy
unset https_proxy
unset no_proxy

- # Start the service
rm -rf log/*
rm -f core*
- # pkill -9 python  # the pipeline does not run this
- # Clear the message queue

+ # Empty the message queue
ipcrm --all=msg
echo "Start server..."
python -m fastdeploy.entrypoints.openai.api_server \
--model ${model_path} \
--port 8188 \
@@ -38,21 +44,40 @@ python -m fastdeploy.entrypoints.openai.api_server \
--max-num-seqs 8 \
--quantization wint4 > server.log 2>&1 &

- sleep 60
- # Liveness probe
- TIMEOUT=$((5 * 60))
- INTERVAL=10  # check interval (seconds)
+ echo "Waiting 90 seconds..."
+ sleep 90
+
+ if grep -q "Failed to launch worker processes" server.log; then
+     echo "Failed to launch worker processes..."
+     stop_processes
+     cat server.log
+     cat log/workerlog.0
+     exit 1
+ fi
+
+ if grep -q "Traceback (most recent call last):" server.log; then
+     echo "Some errors occurred..."
+     stop_processes
+     cat server.log
+     cat log/workerlog.0
+     exit 1
+ fi
+
+ # Health check
+ TIMEOUT=$((11 * 60))
+ INTERVAL=30  # Check interval (seconds)
ENDPOINT="http://0.0.0.0:8188/health"
START_TIME=$(date +%s)  # Record the start timestamp
- echo "Start the server health check, maximum waiting time: ${TIMEOUT}"
+ echo "Start the server health check, maximum waiting time: ${TIMEOUT} seconds..."
while true; do
# Used to calculate the time cost
CURRENT_TIME=$(date +%s)
ELAPSED=$((CURRENT_TIME - START_TIME))

# Timeout
if [ $ELAPSED -ge $TIMEOUT ]; then
echo -e "\n服务启动超时:经过 $((TIMEOUT/60)) 分钟服务仍未启动!"
echo -e "\nServer start timeout: After $((TIMEOUT/60)) minutes, the service still doesn't start!"
stop_processes
cat server.log
cat log/workerlog.0
exit 1
@@ -61,26 +86,27 @@ while true; do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -m 2 "$ENDPOINT" || true)

if [ "$HTTP_CODE" = "200" ]; then
echo -e "\n服务启动成功!耗时 ${ELAPSED}"
echo -e "\nThe server was successfully launched! Totally takes $((ELAPSED+90)) seconds."
break
else
sleep $INTERVAL
fi
done

cat server.log
echo -e "\n"

- # Run inference against the service
+ echo "Start inference..."
python test/ci_use/GCU/run_ernie.py
exit_code=$?
- echo exit_code is ${exit_code}
+ echo -e "exit_code is ${exit_code}.\n"

- ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
- ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
- lsof -t -i :8188 | xargs kill -9 || true
+ echo "Stop server..."
+ stop_processes
+ echo "Stop server done."

if [ ${exit_code} -ne 0 ]; then
echo "log/workerlog.0"
echo "Exit with error, please refer to log/workerlog.0"
cat log/workerlog.0
exit 1
fi
14 changes: 10 additions & 4 deletions test/ci_use/GCU/run_ernie.py
@@ -15,18 +15,24 @@
import openai

ip = "0.0.0.0"
service_http_port = "8188" # 服务配置的
service_http_port = "8188"
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")

- # Non-streaming chat
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": "The largest ocean is"},
],
temperature=1,
top_p=0,
- max_tokens=64,
+ max_tokens=256,
stream=False,
)
- print(response)
+ print(f"response is: {response}", flush=True)
+
+ generate_context = response.choices[0].message.content
+ print(f"\ngenerate_context is: {generate_context}", flush=True)
+
+ assert "pacific ocean" in generate_context.lower(), "The answer was incorrect!"
+
+ print("Test successfully!", flush=True)