Question: Relationship Between Evaluation Suite and Tool Logic (Unexpected Behavior in Eval Outcomes) #528
Replies: 1 comment 1 reply
-
Hi @SaiNimbalkar!
That is correct. To evaluate your tool implementation, you can use unit / integration tests, as you normally do for any function. The purpose of evals is evaluating whether LLMs can:

- select the expected tool, and
- call it with the expected arguments and values.

Edit: If the LLM does not call the expected tool, the eval suite will immediately consider the eval case a fail. If it calls the correct tool but not with the expected arguments and/or values, the outcome depends on the critics you have configured and their weights. You can also view a detailed eval report for more information.

Tool logic: use unit tests / integration tests.
LLM tool-calling behavior: use evals.
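For the tool-logic side, a minimal pytest sketch could look like the one below. The import path is hypothetical, and it assumes the @tool-decorated add function remains directly callable as a plain Python function:

```python
# test_add.py -- minimal sketch; `my_toolkit.tools` is a hypothetical
# import path, and the @tool-decorated function is assumed to remain
# directly callable as a plain Python function.
from my_toolkit.tools import add


def test_add_returns_sum():
    # Catches the `return a - b` bug: 3 + 5 should be 8, not -2.
    assert add(a=3, b=5) == 8


def test_add_handles_negative_numbers():
    assert add(a=-1, b=5) == 4
```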
-
Hi team 👋,
While working on a custom evaluation suite for a simple arithmetic tool (add), I noticed something unexpected regarding how evaluations pass or fail.
Observation:
Even if the underlying logic of the tool is wrong (e.g., returning a - b instead of a + b), the eval still passes, as long as the tool is called with the expected arguments.
Example:
```python
@tool
def add(
    a: Annotated[int, "The first number"],
    b: Annotated[int, "The second number"]
) -> Annotated[int, "The sum of the two numbers"]:
    """
    Add two numbers together

    Examples:
        add(3, 4) -> 7
        add(-1, 5) -> 4
    """
    return a - b
```
This still passes the following eval:
```python
suite.add_case(
    name="Addition",
    user_message="What's the sum of 3 and 5?",
    expected_tool_calls=[
        ExpectedToolCall(func=add, args={"a": 3, "b": 5}),
    ],
    critics=[
        NumericCritic(critic_field="return_value", weight=1.0, tolerance=0.0),
    ],
)
```
However, if I change just the parameter values in expected_tool_calls (e.g., b=6 instead of b=5), the evaluation fails — even if the output is numerically correct.
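For concreteness, the only change in the failing variant is the expected value of b; everything else in the case stays the same:

```python
# Same case as above, but the expected value of `b` is changed to 6.
# The model still calls add(a=3, b=5) for "the sum of 3 and 5", so the
# expected tool call no longer matches and the case fails.
ExpectedToolCall(func=add, args={"a": 3, "b": 6})
```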
❓ Questions:
1. Is the evaluation primarily tied to tool call matching, rather than evaluating correctness of the tool’s return value?
2. Does ExpectedToolCall enforce strict matching of arguments before other critics (like NumericCritic) are even applied?
3. If so, what’s the recommended way to validate both correct tool usage and output logic together?