
add custom chat template #3251


Merged
31 commits merged on Aug 18, 2025
Changes from all commits
Commits
31 commits
1a569e5
add custom chat_template
luukunn Jul 29, 2025
3b4326a
add custom chat_template
luukunn Jul 29, 2025
53d8beb
Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …
luukunn Jul 29, 2025
db34a06
Resolve merge conflicts
luukunn Aug 6, 2025
23095e5
add unittest
luukunn Aug 6, 2025
dc42a72
fix
luukunn Aug 6, 2025
d44cb51
add docs
luukunn Aug 6, 2025
0b4db9a
fix comment
luukunn Aug 6, 2025
8927f4b
add offline chat
luukunn Aug 6, 2025
4ad201c
fix unit test
luukunn Aug 6, 2025
f5f2c1f
fix unit test
luukunn Aug 7, 2025
b77da03
fix
luukunn Aug 11, 2025
238149e
Merge branch 'develop' into develop
luukunn Aug 11, 2025
ca72e35
Merge branch 'develop' into develop
luukunn Aug 11, 2025
bc9fb4b
fix pre commit
luukunn Aug 11, 2025
78f5804
Merge branch 'develop' of https://github.com/luukunn/FastDeploy into …
luukunn Aug 11, 2025
7b3f43f
fix unit test
luukunn Aug 11, 2025
227ddba
add unit test
luukunn Aug 11, 2025
8a124ad
add unit test
luukunn Aug 12, 2025
43e70c5
add unit test
luukunn Aug 12, 2025
c2189c6
Merge branch 'develop' into develop
luukunn Aug 12, 2025
1502081
fix pre_commit
luukunn Aug 12, 2025
573d7fa
Merge branch 'develop' into develop
luukunn Aug 12, 2025
23beb89
fix enable_thinking
luukunn Aug 18, 2025
f13d985
Merge branch 'develop' of https://github.com/luukunn/FastDeploy into …
luukunn Aug 18, 2025
3f75173
Merge branch 'develop' into develop
luukunn Aug 18, 2025
cb3eae6
fix pre commit
luukunn Aug 18, 2025
9f98538
fix pre commit
luukunn Aug 18, 2025
a9f3bc0
fix unit test
luukunn Aug 18, 2025
24c59e9
add requirements
luukunn Aug 18, 2025
38c73ff
Merge branch 'develop' into develop
luukunn Aug 18, 2025
3 changes: 3 additions & 0 deletions docs/online_serving/README.md
@@ -161,6 +161,9 @@ The following extra parameters are supported:
chat_template_kwargs: Optional[dict] = None
# Additional parameters passed to the chat template, used for customizing dialogue formats (default None).

chat_template: Optional[str] = None
Collaborator review comment: Please also add a description of this launch parameter to docs/zh/parameters.md and docs/parameters.md.

# Custom chat template will override the model's default chat template (default None).

reasoning_max_tokens: Optional[int] = None
# Maximum number of tokens to generate during reasoning (e.g., CoT, chain of thought) (default None means using global max_tokens).

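For illustration, a request that supplies its own template through this field might look like the sketch below; the host, port, model name, and template string are placeholders, not values taken from this PR:

```python
import requests

# Hypothetical FastDeploy OpenAI-compatible endpoint.
url = "http://localhost:8000/v1/chat/completions"

# A minimal Jinja-style template passed inline; it overrides the model's
# default chat template for this request only.
custom_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant: "
)

payload = {
    "model": "default",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template": custom_template,
}

resp = requests.post(url, json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```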
1 change: 1 addition & 0 deletions docs/parameters.md
@@ -46,6 +46,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```dynamic_load_weight``` | `int` | Whether to enable dynamic weight loading, default: 0 |
| ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
| ```enable_logprob``` | `bool` | Whether to return log probabilities of the output tokens. If true, the log probability of each output token is returned in the message content. If logprob is not used, this parameter can be omitted at startup |
| ```chat_template``` | `str` | Specifies the chat template used to assemble the prompt. It accepts either a template string or a file path. Defaults to None; if not specified, the model's default template is used |

## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

3 changes: 3 additions & 0 deletions docs/zh/online_serving/README.md
@@ -160,6 +160,9 @@ repetition_penalty: Optional[float] = None
chat_template_kwargs: Optional[dict] = None
# 传递给聊天模板(chat template)的额外参数,用于自定义对话格式(默认 None)。

chat_template: Optional[str] = None
# 自定义聊天模板,会覆盖模型默认的聊天模板,(默认 None)。

reasoning_max_tokens: Optional[int] = None
# 推理(如 CoT, 思维链)过程中生成的最大 token 数(默认 None 表示使用全局 max_tokens)。

1 change: 1 addition & 0 deletions docs/zh/parameters.md
@@ -44,6 +44,7 @@
| ```dynamic_load_weight``` | `int` | 是否动态加载权重,默认0 |
| ```enable_expert_parallel``` | `bool` | 是否启用专家并行 |
| ```enable_logprob``` | `bool` | 是否启用输出token返回logprob。如果未使用 logprob,则在启动时可以省略此参数。 |
| ```chat_template``` | `str` | 指定模型拼接使用的模板,支持字符串与文件路径,默认为None,如未指定,则使用模型默认模板 |

## 1. KVCache分配与```num_gpu_blocks_override```、```block_size```的关系?

10 changes: 10 additions & 0 deletions fastdeploy/engine/args_utils.py
@@ -94,6 +94,10 @@ class EngineArgs:
"""
specifies the reasoning parser to use for extracting reasoning content from the model output
"""
chat_template: str = None
"""
chat template or chat template file path
"""
tool_call_parser: str = None
"""
specifies the tool call parser to use for extracting tool call from the model output
@@ -442,6 +446,12 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
help="Flag specifies the reasoning parser to use for extracting "
"reasoning content from the model output",
)
model_group.add_argument(
"--chat-template",
type=str,
default=EngineArgs.chat_template,
help="chat template or chat template file path",
)
model_group.add_argument(
"--tool-call-parser",
type=str,
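As a rough sketch of how the new flag flows through the parser built by `EngineArgs.add_cli_args` (the import path of `FlexibleArgumentParser` and the argument values below are assumptions):

```python
from fastdeploy.engine.args_utils import EngineArgs
from fastdeploy.utils import FlexibleArgumentParser  # import path assumed

# Build the engine CLI and parse a hypothetical command line.
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args(
    ["--model", "/path/to/model", "--chat-template", "/path/to/template.jinja"]
)

# The value is either an inline template string or a template file path;
# it defaults to None, in which case the model's own template is used.
print(args.chat_template)
```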
5 changes: 5 additions & 0 deletions fastdeploy/engine/request.py
@@ -72,6 +72,7 @@ def __init__(
guided_json_object: Optional[bool] = None,
enable_thinking: Optional[bool] = True,
trace_carrier: dict = dict(),
chat_template: Optional[str] = None,
) -> None:
self.request_id = request_id
self.prompt = prompt
@@ -111,6 +112,8 @@ def __init__(
self.enable_thinking = enable_thinking
self.trace_carrier = trace_carrier

self.chat_template = chat_template

# token num
self.block_tables = []
self.output_token_ids = []
@@ -152,6 +155,7 @@ def from_dict(cls, d: dict):
guided_json_object=d.get("guided_json_object", None),
enable_thinking=d.get("enable_thinking", True),
trace_carrier=d.get("trace_carrier", {}),
chat_template=d.get("chat_template", None),
)

@property
@@ -191,6 +195,7 @@ def to_dict(self) -> dict:
"draft_token_ids": self.draft_token_ids,
"enable_thinking": self.enable_thinking,
"trace_carrier": self.trace_carrier,
"chat_template": self.chat_template,
}
add_params = [
"guided_json",
35 changes: 34 additions & 1 deletion fastdeploy/entrypoints/chat_utils.py
@@ -16,7 +16,8 @@

import uuid
from copy import deepcopy
from typing import List, Literal, Union
from pathlib import Path
from typing import List, Literal, Optional, Union
from urllib.parse import urlparse

import requests
@@ -159,5 +160,37 @@ def parse_chat_messages(messages):
return conversation


def load_chat_template(
chat_template: Union[Path, str],
is_literal: bool = False,
) -> Optional[str]:
if chat_template is None:
return None
if is_literal:
if isinstance(chat_template, Path):
raise TypeError("chat_template is expected to be read directly " "from its value")

return chat_template

try:
with open(chat_template) as f:
return f.read()
except OSError as e:
if isinstance(chat_template, Path):
raise
JINJA_CHARS = "{}\n"
if not any(c in chat_template for c in JINJA_CHARS):
msg = (
f"The supplied chat template ({chat_template}) "
f"looks like a file path, but it failed to be "
f"opened. Reason: {e}"
)
raise ValueError(msg) from e

# If opening the file fails, fall back to treating chat_template as a
# literal template string so escape sequences are interpreted correctly
return load_chat_template(chat_template, is_literal=True)


def random_tool_call_id() -> str:
return f"chatcmpl-tool-{str(uuid.uuid4().hex)}"
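A quick sketch of how `load_chat_template` behaves for its accepted inputs (the template file here is created on the fly just for the example):

```python
import tempfile

from fastdeploy.entrypoints.chat_utils import load_chat_template

inline = "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"

# A literal Jinja template string is returned unchanged: it cannot be opened
# as a file and it contains template characters, so no error is raised.
assert load_chat_template(inline) == inline

# A path to a template file returns the file's contents.
with tempfile.NamedTemporaryFile("w", suffix=".jinja", delete=False) as f:
    f.write(inline)
assert load_chat_template(f.name) == inline

# None passes through, leaving the model's default template in effect.
assert load_chat_template(None) is None
```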
8 changes: 8 additions & 0 deletions fastdeploy/entrypoints/llm.py
@@ -28,6 +28,7 @@
from fastdeploy.engine.args_utils import EngineArgs
from fastdeploy.engine.engine import LLMEngine
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.chat_utils import load_chat_template
from fastdeploy.entrypoints.openai.tool_parsers import ToolParserManager
from fastdeploy.plugins.model_register import load_model_register_plugins
from fastdeploy.utils import (
@@ -74,6 +75,7 @@ def __init__(
revision: Optional[str] = "master",
tokenizer: Optional[str] = None,
enable_logprob: Optional[bool] = False,
chat_template: Optional[str] = None,
**kwargs,
):
deprecated_kwargs_warning(**kwargs)
@@ -102,6 +104,7 @@
self.master_node_ip = self.llm_engine.cfg.master_ip
self._receive_output_thread = threading.Thread(target=self._receive_output, daemon=True)
self._receive_output_thread.start()
self.chat_template = load_chat_template(chat_template)

def _check_master(self):
"""
@@ -196,6 +199,7 @@ def chat(
sampling_params: Optional[Union[SamplingParams, list[SamplingParams]]] = None,
use_tqdm: bool = True,
chat_template_kwargs: Optional[dict[str, Any]] = None,
chat_template: Optional[str] = None,
):
"""
Args:
@@ -229,13 +233,17 @@
if sampling_params_len != 1 and len(messages) != sampling_params_len:
raise ValueError("messages and sampling_params must be the same length.")

if chat_template is None:
chat_template = self.chat_template

messages_len = len(messages)
for i in range(messages_len):
messages[i] = {"messages": messages[i]}
req_ids = self._add_request(
prompts=messages,
sampling_params=sampling_params,
chat_template_kwargs=chat_template_kwargs,
chat_template=chat_template,
)

topk_logprobs = sampling_params[0].logprobs if sampling_params_len > 1 else sampling_params.logprobs
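Offline usage could then look roughly like this; the model path, template file, and the shape of the returned outputs are placeholders and assumptions:

```python
from fastdeploy.entrypoints.llm import LLM

# chat_template accepts either an inline template string or a path to an
# existing .jinja file (both are placeholders here).
llm = LLM(model="/path/to/model", chat_template="/path/to/template.jinja")

messages = [[{"role": "user", "content": "Hello"}]]

# A template passed to chat() takes precedence over the one given to the
# constructor; omitting it falls back to self.chat_template.
outputs = llm.chat(
    messages,
    chat_template="{% for m in messages %}{{ m['content'] }}{% endfor %}",
)
print(outputs[0])
```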
4 changes: 3 additions & 1 deletion fastdeploy/entrypoints/openai/api_server.py
@@ -30,6 +30,7 @@

from fastdeploy.engine.args_utils import EngineArgs
from fastdeploy.engine.engine import LLMEngine
from fastdeploy.entrypoints.chat_utils import load_chat_template
from fastdeploy.entrypoints.engine_client import EngineClient
from fastdeploy.entrypoints.openai.protocol import (
ChatCompletionRequest,
@@ -75,6 +76,7 @@
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()
args.model = retrive_model_from_server(args.model, args.revision)
chat_template = load_chat_template(args.chat_template)
if args.tool_parser_plugin:
ToolParserManager.import_tool_parser(args.tool_parser_plugin)
llm_engine = None
@@ -139,7 +141,7 @@ async def lifespan(app: FastAPI):
args.tool_call_parser,
)
app.state.dynamic_load_weight = args.dynamic_load_weight
chat_handler = OpenAIServingChat(engine_client, pid, args.ips, args.max_waiting_time)
chat_handler = OpenAIServingChat(engine_client, pid, args.ips, args.max_waiting_time, chat_template)
completion_handler = OpenAIServingCompletion(engine_client, pid, args.ips, args.max_waiting_time)
engine_client.create_zmq_client(model=pid, mode=zmq.PUSH)
engine_client.pid = pid
1 change: 1 addition & 0 deletions fastdeploy/entrypoints/openai/protocol.py
@@ -524,6 +524,7 @@ class ChatCompletionRequest(BaseModel):

# doc: start-completion-extra-params
chat_template_kwargs: Optional[dict] = None
chat_template: Optional[str] = None
reasoning_max_tokens: Optional[int] = None
structural_tag: Optional[str] = None
guided_json: Optional[Union[str, dict, BaseModel]] = None
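On the client side, the same field can ride along through the OpenAI SDK's `extra_body` pass-through; the base URL, API key, and model name below are assumptions:

```python
from openai import OpenAI

# FastDeploy ignores the API key, but the SDK requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    # Fields outside the official OpenAI schema go through extra_body.
    extra_body={
        "chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}"
    },
)
print(response.choices[0].message.content)
```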
5 changes: 4 additions & 1 deletion fastdeploy/entrypoints/openai/serving_chat.py
@@ -49,12 +49,13 @@ class OpenAIServingChat:
OpenAI-style chat completions serving
"""

def __init__(self, engine_client, pid, ips, max_waiting_time):
def __init__(self, engine_client, pid, ips, max_waiting_time, chat_template):
self.engine_client = engine_client
self.pid = pid
self.master_ip = ips
self.max_waiting_time = max_waiting_time
self.host_ip = get_host_ip()
self.chat_template = chat_template
if self.master_ip is not None:
if isinstance(self.master_ip, list):
self.master_ip = self.master_ip[0]
@@ -86,6 +87,8 @@ async def create_chat_completion(self, request: ChatCompletionRequest):
text_after_process = None
try:
current_req_dict = request.to_dict_for_infer(request_id)
if "chat_template" not in current_req_dict:
current_req_dict["chat_template"] = self.chat_template
current_req_dict["arrival_time"] = time.time()
prompt_token_ids = self.engine_client.format_and_add_data(current_req_dict)
text_after_process = current_req_dict.get("text_after_process")
2 changes: 2 additions & 0 deletions fastdeploy/input/ernie_processor.py
@@ -87,6 +87,7 @@ def process_request(self, request, max_model_len=None, **kwargs):
bool: Whether preprocessing is successful
str: error message
"""
request.chat_template = kwargs.get("chat_template")
request = self._apply_default_parameters(request)
if request.get("eos_token_ids") is None or len(request.eos_token_ids) == 0:
request.eos_token_ids = self.eos_token_ids
@@ -342,6 +343,7 @@ def messages2ids(self, request_or_messages):
tokenize=False,
split_special_tokens=False,
add_special_tokens=False,
chat_template=request_or_messages.get("chat_template", None),
)
request_or_messages["text_after_process"] = spliced_message
req_id = None
1 change: 1 addition & 0 deletions fastdeploy/input/ernie_vl_processor.py
@@ -109,6 +109,7 @@ def set_value(req, key, value):

def process_request(self, request, max_model_len=None, **kwargs):
"""process the input data"""
request.chat_template = kwargs.get("chat_template")
task = request.to_dict()
task["enable_thinking"] = kwargs.get("enable_thinking", True)
self.process_request_dict(task, max_model_len)
2 changes: 2 additions & 0 deletions fastdeploy/input/mm_processor/process.py
@@ -494,10 +494,12 @@ def apply_chat_template(self, request):
"""
if self.tokenizer.chat_template is None:
raise ValueError("This model does not support chat_template.")

prompt_token_template = self.tokenizer.apply_chat_template(
request,
tokenize=False,
add_generation_prompt=request.get("add_generation_prompt", True),
chat_template=request.get("chat_template", None),
)
prompt_token_str = prompt_token_template.replace("<|image@placeholder|>", "").replace(
"<|video@placeholder|>", ""
2 changes: 2 additions & 0 deletions fastdeploy/input/text_processor.py
@@ -204,6 +204,7 @@ def process_request(self, request, max_model_len=None, **kwargs):
bool: Whether preprocessing is successful
str: error message
"""
request.chat_template = kwargs.get("chat_template")
request = self._apply_default_parameters(request)
if request.get("eos_token_ids") is None or len(request.eos_token_ids) == 0:
request.eos_token_ids = self.eos_token_ids
@@ -486,6 +487,7 @@ def messages2ids(self, request):
split_special_tokens=False,
add_special_tokens=False,
return_tensors="pd",
chat_template=request.get("chat_template", None),
)
request["text_after_process"] = spliced_message
req_id = None
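Underneath, the per-request template is simply forwarded to the tokenizer's `apply_chat_template`. The sketch below uses a Hugging Face-style tokenizer to show the interface; the tokenizer path and template are placeholders, and FastDeploy's own tokenizers are assumed to expose the same keyword:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/model")  # placeholder path

custom_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>{{ message['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>"
)

messages = [{"role": "user", "content": "Hello"}]

# chat_template=None falls back to the tokenizer's built-in template;
# passing a string overrides it, which is what the request-level field does.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, chat_template=custom_template
)
print(text)
```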
1 change: 1 addition & 0 deletions requirements_dcu.txt
@@ -35,3 +35,4 @@ opentelemetry-instrumentation-mysql
opentelemetry-distro 
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
partial_json_parser
1 change: 1 addition & 0 deletions requirements_iluvatar.txt
@@ -36,3 +36,4 @@ opentelemetry-instrumentation-mysql
opentelemetry-distro
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
partial_json_parser
1 change: 1 addition & 0 deletions requirements_metaxgpu.txt
@@ -37,3 +37,4 @@ opentelemetry-instrumentation-mysql
opentelemetry-distro 
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
partial_json_parser