[Feature] Models api #3073

Open

Yzc216 wants to merge 38 commits into develop from modelAPI
Changes from 15 commits

Commits (38)
1c2e05a
add v1/models interface related
Yzc216 Jul 29, 2025
f421568
add model parameters
Yzc216 Jul 29, 2025
5446d4a
default model verification
Yzc216 Jul 29, 2025
a57341c
unit test
Yzc216 Jul 29, 2025
81ce789
check model err_msg
Yzc216 Jul 29, 2025
2cba65d
unit test
Yzc216 Jul 30, 2025
a1fe0fc
Merge branch 'develop' into modelAPI
Yzc216 Jul 30, 2025
0cf44da
type annotation
Yzc216 Jul 30, 2025
a214565
model parameter in response
Yzc216 Jul 30, 2025
bba5c0e
modify document description
Yzc216 Jul 30, 2025
ca08ee1
modify document description
Yzc216 Jul 30, 2025
5fb3957
Merge branch 'develop' into modelAPI
Yzc216 Jul 30, 2025
be13064
Merge remote-tracking branch 'upstream/develop' into modelAPI
Yzc216 Aug 6, 2025
b775512
Merge branch 'develop' into modelAPI
Yzc216 Aug 6, 2025
c1039f4
Merge branch 'develop' into modelAPI
Yzc216 Aug 7, 2025
08ef40f
unit test
Yzc216 Aug 7, 2025
7a5b686
Merge branch 'develop' into modelAPI
Yzc216 Aug 7, 2025
8653948
Merge branch 'develop' into modelAPI
Yzc216 Aug 7, 2025
a1090d6
Merge branch 'develop' into modelAPI
Yzc216 Aug 7, 2025
eaea619
Merge remote-tracking branch 'upstream/develop' into modelAPI
Yzc216 Aug 11, 2025
5ad7d0f
verification
Yzc216 Aug 11, 2025
99515ef
Merge branch 'develop' into modelAPI
Yzc216 Aug 11, 2025
748abc4
Merge branch 'develop' into modelAPI
Yzc216 Aug 11, 2025
ce47277
verification update
Yzc216 Aug 11, 2025
e2940c0
model_name
Yzc216 Aug 11, 2025
e5f4890
Merge branch 'develop' into modelAPI
Yzc216 Aug 11, 2025
24984ee
Merge branch 'develop' into modelAPI
Yzc216 Aug 11, 2025
bcf252e
Merge branch 'develop' into modelAPI
Yzc216 Aug 11, 2025
2205ec3
Merge branch 'develop' into modelAPI
Yzc216 Aug 12, 2025
2cc63c1
Merge branch 'develop' into modelAPI
Yzc216 Aug 12, 2025
4359200
Merge branch 'develop' into modelAPI
Yzc216 Aug 14, 2025
ac1de94
Merge branch 'develop' into modelAPI
Yzc216 Aug 15, 2025
25e2250
Merge branch 'develop' into modelAPI
LiqinruiG Aug 19, 2025
8a282da
Merge branch 'develop' into modelAPI
LiqinruiG Aug 19, 2025
7c51a55
Merge branch 'develop' into modelAPI
LiqinruiG Aug 19, 2025
ca68e24
pre-commit
LiqinruiG Aug 19, 2025
6472ef7
update test case
LiqinruiG Aug 19, 2025
ab550e7
resolve conflict
LiqinruiG Aug 20, 2025
5 changes: 3 additions & 2 deletions docs/parameters.md
@@ -34,7 +34,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
| ```use_cudagraph``` | `bool` | Whether to use cuda graph, default: False |
```graph_optimization_config``` | `str` | Parameters related to graph optimization can be configured, with default values of'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
|```graph_optimization_config``` | `str` | Parameters related to graph optimization can be configured, with default values of'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
| ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
| ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
@@ -44,7 +44,8 @@ When using FastDeploy to deploy models (including offline inference and service
| ```dynamic_load_weight``` | `int` | Whether to enable dynamic weight loading, default: 0 |
| ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
| ```enable_logprob``` | `bool` | Whether to enable return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.If logrpob is not used, this parameter can be omitted when starting |

| ```served_model_name```| `str`| The model name used in the API. If not specified, the model name will be the same as the --model argument |
| ```revision``` | `str` | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

During FastDeploy inference, GPU memory is occupied by ```model weights```, ```preallocated KVCache blocks``` and ```model computation intermediate activation values```. The preallocated KVCache blocks are determined by ```num_gpu_blocks_override```, with ```block_size``` (default: 64) as its unit, meaning one block can store KVCache for 64 Tokens.
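For context on the two new rows above: `served_model_name` only changes the name exposed through the API, and it falls back to the `--model` value when unset. A minimal sketch of that resolution rule (the helper name is illustrative; the actual fallback lives in the API server's lifespan hook added later in this PR):

```python
from typing import Optional


def resolve_served_model_name(model: str, served_model_name: Optional[str] = None) -> str:
    """Name reported by /v1/models: --served-model-name when given, otherwise --model."""
    return served_model_name if served_model_name is not None else model


# resolve_served_model_name("/path/to/ERNIE-4.5-0.3B")          -> "/path/to/ERNIE-4.5-0.3B"
# resolve_served_model_name("/path/to/ERNIE-4.5-0.3B", "ernie") -> "ernie"
```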
4 changes: 3 additions & 1 deletion docs/zh/parameters.md
@@ -32,7 +32,7 @@
| ```static_decode_blocks``` | `int` | 推理过程中,每条请求强制从Prefill的KVCache分配对应块数给Decode使用,默认2|
| ```reasoning_parser``` | `str` | 指定要使用的推理解析器,以便从模型输出中提取推理内容 |
| ```use_cudagraph``` | `bool` | 是否使用cuda graph,默认False |
```graph_optimization_config``` | `str` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
|```graph_optimization_config``` | `str` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
| ```enable_custom_all_reduce``` | `bool` | 开启Custom all-reduce,默认False |
| ```splitwise_role``` | `str` | 是否开启splitwise推理,默认值mixed, 支持参数为["mixed", "decode", "prefill"] |
| ```innode_prefill_ports``` | `str` | prefill 实例内部引擎启动端口 (仅单机PD分离需要),默认值None |
@@ -42,6 +42,8 @@
| ```dynamic_load_weight``` | `int` | 是否动态加载权重,默认0 |
| ```enable_expert_parallel``` | `bool` | 是否启用专家并行 |
| ```enable_logprob``` | `bool` | 是否启用输出token返回logprob。如果未使用 logrpob,则在启动时可以省略此参数。 |
| ```served_model_name``` | `str` | API 中使用的模型名称,如果未指定,模型名称将与--model参数相同 |
| ```revision``` | `str` | 自动下载模型时,用于指定模型的Git版本,分支名或tag |

## 1. KVCache分配与```num_gpu_blocks_override```、```block_size```的关系?

10 changes: 10 additions & 0 deletions fastdeploy/engine/args_utils.py
@@ -48,6 +48,10 @@ class EngineArgs:
"""
The name or path of the model to be used.
"""
served_model_name: Optional[str] = None
"""
The name of the model being served.
"""
revision: Optional[str] = "master"
"""
The revision for downloading models.
@@ -358,6 +362,12 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
default=EngineArgs.model,
help="Model name or path to be used.",
)
model_group.add_argument(
"--served-model-name",
type=nullable_str,
default=EngineArgs.served_model_name,
help="Served model name",
)
model_group.add_argument(
"--revision",
type=nullable_str,
46 changes: 44 additions & 2 deletions fastdeploy/entrypoints/openai/api_server.py
@@ -36,9 +36,11 @@
CompletionResponse,
ControlSchedulerRequest,
ErrorResponse,
ModelList,
)
from fastdeploy.entrypoints.openai.serving_chat import OpenAIServingChat
from fastdeploy.entrypoints.openai.serving_completion import OpenAIServingCompletion
from fastdeploy.entrypoints.openai.serving_models import ModelPath, OpenAIServingModels
from fastdeploy.metrics.metrics import (
EXCLUDE_LABELS,
cleanup_prometheus_files,
@@ -105,6 +107,13 @@ async def lifespan(app: FastAPI):
else:
pid = os.getpid()
api_server_logger.info(f"{pid}")

if args.served_model_name is not None:
served_model_names = args.served_model_name
else:
served_model_names = args.model
model_paths = [ModelPath(name=served_model_names, model_path=args.model)]

engine_client = EngineClient(
args.model,
args.tokenizer,
@@ -119,8 +128,24 @@
args.enable_logprob,
)
app.state.dynamic_load_weight = args.dynamic_load_weight
chat_handler = OpenAIServingChat(engine_client, pid, args.ips)
completion_handler = OpenAIServingCompletion(engine_client, pid, args.ips)
model_handler = OpenAIServingModels(
model_paths,
args.max_model_len,
args.ips,
)
app.state.model_handler = model_handler
chat_handler = OpenAIServingChat(
engine_client,
app.state.model_handler,
pid,
args.ips,
)
completion_handler = OpenAIServingCompletion(
engine_client,
app.state.model_handler,
pid,
args.ips,
)
engine_client.create_zmq_client(model=pid, mode=zmq.PUSH)
engine_client.pid = pid
app.state.engine_client = engine_client
@@ -235,6 +260,23 @@ async def create_completion(request: CompletionRequest):
return StreamingResponse(content=generator, media_type="text/event-stream")


@app.get("/v1/models")
async def list_models() -> Response:
"""
List all available models.
"""
if app.state.dynamic_load_weight:
status, msg = app.state.engine_client.is_workers_alive()
if not status:
return JSONResponse(content={"error": "Worker Service Not Healthy"}, status_code=304)

models = await app.state.model_handler.list_models()
if isinstance(models, ErrorResponse):
return JSONResponse(content=models.model_dump(), status_code=models.code)
elif isinstance(models, ModelList):
return JSONResponse(content=models.model_dump())


@app.get("/update_model_weight")
def update_model_weight(request: Request) -> Response:
"""
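Once a server is running, the new endpoint can be smoke-tested with a plain GET; the host and port below are placeholders for illustration, not defaults guaranteed by this PR:

```python
import requests  # sketch only: assumes an api_server instance reachable at this address

resp = requests.get("http://localhost:8188/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json()["data"]:
    # each entry follows ModelInfo: id, object, created, owned_by, root, max_model_len, permission
    print(model["id"], model.get("max_model_len"))
```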
32 changes: 32 additions & 0 deletions fastdeploy/entrypoints/openai/protocol.py
@@ -18,6 +18,7 @@

import json
import time
import uuid
from typing import Any, Dict, List, Literal, Optional, Union

from pydantic import BaseModel, Field, model_validator
@@ -55,6 +56,37 @@ class UsageInfo(BaseModel):
prompt_tokens_details: Optional[PromptTokenUsageInfo] = None


class ModelPermission(BaseModel):
id: str = Field(default_factory=lambda: f"modelperm-{str(uuid.uuid4().hex)}")
object: str = "model_permission"
created: int = Field(default_factory=lambda: int(time.time()))
allow_create_engine: bool = False
allow_sampling: bool = True
allow_logprobs: bool = True
allow_search_indices: bool = False
allow_view: bool = True
allow_fine_tuning: bool = False
organization: str = "*"
group: Optional[str] = None
is_blocking: bool = False


class ModelInfo(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "FastDeploy"
root: Optional[str] = None
parent: Optional[str] = None
max_model_len: Optional[int] = None
permission: list[ModelPermission] = Field(default_factory=list)


class ModelList(BaseModel):
object: str = "list"
data: list[ModelInfo] = Field(default_factory=list)


class FunctionCall(BaseModel):
"""
Function call.
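The JSON shape returned by `/v1/models` follows directly from these Pydantic models; a small sketch with placeholder values:

```python
from fastdeploy.entrypoints.openai.protocol import ModelInfo, ModelList, ModelPermission

model_list = ModelList(
    data=[
        ModelInfo(
            id="my-served-model",   # placeholder served name
            root="/path/to/model",  # placeholder local path
            max_model_len=8192,     # placeholder context length
            permission=[ModelPermission()],
        )
    ]
)

# Serializes to {"object": "list", "data": [{"id": "my-served-model", "object": "model", ...}]}
print(model_list.model_dump_json(indent=2))
```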
20 changes: 17 additions & 3 deletions fastdeploy/entrypoints/openai/serving_chat.py
@@ -18,13 +18,14 @@
import time
import traceback
import uuid
from typing import List, Optional
from typing import List, Optional, Union

import aiozmq
import msgpack
import numpy as np
from aiozmq import zmq

from fastdeploy.entrypoints.engine_client import EngineClient
from fastdeploy.entrypoints.openai.protocol import (
ChatCompletionRequest,
ChatCompletionResponse,
@@ -39,6 +40,7 @@
PromptTokenUsageInfo,
UsageInfo,
)
from fastdeploy.entrypoints.openai.serving_models import OpenAIServingModels
from fastdeploy.metrics.work_metrics import work_process_metrics
from fastdeploy.utils import api_server_logger, get_host_ip
from fastdeploy.worker.output import LogprobsLists
@@ -49,8 +51,15 @@ class OpenAIServingChat:
OpenAI-style chat completions serving
"""

def __init__(self, engine_client, pid, ips):
def __init__(
self,
engine_client: EngineClient,
models: OpenAIServingModels,
pid: int,
ips: Union[List[str], str],
):
self.engine_client = engine_client
self.models = models
self.pid = pid
self.master_ip = ips
self.host_ip = get_host_ip()
@@ -76,7 +85,12 @@ async def create_chat_completion(self, request: ChatCompletionRequest):
err_msg = f"Only master node can accept completion request, please send request to master node: {self.pod_ips[0]}"
api_server_logger.error(err_msg)
return ErrorResponse(message=err_msg, code=400)

if request.model == "default":
request.model = self.models.model_name()
if not self.models.is_supported_model(request.model):
err_msg = f"Unsupported model: {request.model}, support {', '.join([x.name for x in self.models.model_paths])} or default"
api_server_logger.error(err_msg)
return ErrorResponse(message=err_msg, code=400)
if request.user is not None:
request_id = f"chatcmpl-{request.user}-{uuid.uuid4()}"
else:
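With the `default` alias handled here, a client does not need to know the served name for a single-model deployment; any other name must match `served_model_name` or the request is rejected with a 400. A client-side sketch using the openai SDK (base URL and API key are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8188/v1", api_key="EMPTY")  # placeholder endpoint/key

resp = client.chat.completions.create(
    model="default",  # resolved server-side to the configured served_model_name
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```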
19 changes: 17 additions & 2 deletions fastdeploy/entrypoints/openai/serving_completion.py
@@ -17,14 +17,15 @@
import asyncio
import time
import uuid
from typing import List, Optional
from typing import List, Optional, Union

import aiozmq
import msgpack
import numpy as np
from aiozmq import zmq

from fastdeploy.engine.request import RequestOutput
from fastdeploy.entrypoints.engine_client import EngineClient
from fastdeploy.entrypoints.openai.protocol import (
CompletionLogprobs,
CompletionRequest,
@@ -35,13 +36,21 @@
ErrorResponse,
UsageInfo,
)
from fastdeploy.entrypoints.openai.serving_models import OpenAIServingModels
from fastdeploy.utils import api_server_logger, get_host_ip
from fastdeploy.worker.output import LogprobsLists


class OpenAIServingCompletion:
def __init__(self, engine_client, pid, ips):
def __init__(
self,
engine_client: EngineClient,
models: OpenAIServingModels,
pid: int,
ips: Union[List[str], str],
):
self.engine_client = engine_client
self.models = models
self.pid = pid
self.master_ip = ips
self.host_ip = get_host_ip()
@@ -66,6 +75,12 @@ async def create_completion(self, request: CompletionRequest):
err_msg = f"Only master node can accept completion request, please send request to master node: {self.pod_ips[0]}"
api_server_logger.error(err_msg)
return ErrorResponse(message=err_msg, code=400)
if request.model == "default":
request.model = self.models.model_name()
if not self.models.is_supported_model(request.model):
err_msg = f"Unsupported model: {request.model}, support {', '.join([x.name for x in self.models.model_paths])} or default"
api_server_logger.error(err_msg)
return ErrorResponse(message=err_msg, code=400)
created_time = int(time.time())
if request.user is not None:
request_id = f"cmpl-{request.user}-{uuid.uuid4()}"
93 changes: 93 additions & 0 deletions fastdeploy/entrypoints/openai/serving_models.py
@@ -0,0 +1,93 @@
"""
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""

from dataclasses import dataclass
from typing import List, Union

from fastdeploy.entrypoints.openai.protocol import (
ErrorResponse,
ModelInfo,
ModelList,
ModelPermission,
)
from fastdeploy.utils import api_server_logger, get_host_ip


@dataclass
class ModelPath:
name: str
model_path: str


class OpenAIServingModels:
"""
OpenAI-style models serving
"""

def __init__(
self,
model_paths: list[ModelPath],
max_model_len: int,
ips: Union[List[str], str],
):
self.model_paths = model_paths
self.max_model_len = max_model_len
self.master_ip = ips
self.host_ip = get_host_ip()
if self.master_ip is not None:
if isinstance(self.master_ip, list):
self.master_ip = self.master_ip[0]
else:
self.master_ip = self.master_ip.split(",")[0]

def _check_master(self):
if self.master_ip is None:
return True
if self.host_ip == self.master_ip:
return True
return False

def is_supported_model(self, model_name) -> bool:
"""
Check whether the specified model is supported.
"""
if model_name == "default":
return True
return any(model.name == model_name for model in self.model_paths)

def model_name(self) -> str:
"""
Returns the current model name.
"""
return self.model_paths[0].name

async def list_models(self) -> ModelList:
"""
Show available models.
"""
if not self._check_master():
err_msg = (
f"Only master node can accept models request, please send request to master node: {self.pod_ips[0]}"
)
api_server_logger.error(err_msg)
return ErrorResponse(message=err_msg, code=400)
model_infos = [
ModelInfo(
id=model.name, max_model_len=self.max_model_len, root=model.model_path, permission=[ModelPermission()]
)
for model in self.model_paths
]
return ModelList(data=model_infos)
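Roughly how the new handler behaves on its own (names, paths, and lengths below are placeholders; in the server it is constructed inside `lifespan` as shown in api_server.py above):

```python
import asyncio

from fastdeploy.entrypoints.openai.serving_models import ModelPath, OpenAIServingModels

handler = OpenAIServingModels(
    model_paths=[ModelPath(name="my-served-model", model_path="/path/to/model")],
    max_model_len=8192,
    ips=None,  # single node: no master-IP check applies
)

assert handler.is_supported_model("default")          # "default" always resolves
assert handler.is_supported_model("my-served-model")  # exact served name matches

models = asyncio.run(handler.list_models())
print([m.id for m in models.data])  # -> ['my-served-model']
```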