Question: Relationship Between Evaluation Suite and Tool Logic (Unexpected Behavior in Eval Outcomes) #528
Replies: 1 comment 1 reply
-
Hi @SaiNimbalkar!
That is correct. To evaluate your tool implementation, you can use unit / integration tests, as you normally do for any function. The purpose of evals is evaluating whether LLMs can:

- select the expected tool, and
- call it with the expected arguments and values.

Edit: If the LLM does not call the expected tool, the eval suite will immediately consider the eval case a fail. If it calls the correct tool but not with the expected arguments and/or values, the outcome depends on the critics you have configured and their weights. You can also view a detailed eval report for more information.

Tool logic: use unit tests / integration tests.
LLM tool-calling behavior: use evals.
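For the tool-logic side, a minimal pytest sketch could look like the one below. The import path is hypothetical, and it assumes the @tool-decorated add function remains directly callable as a plain Python function:

```python
# test_add.py -- minimal sketch; `my_toolkit.tools` is a hypothetical
# import path, and the @tool-decorated function is assumed to remain
# directly callable as a plain Python function.
from my_toolkit.tools import add


def test_add_returns_sum():
    # Catches the `return a - b` bug: 3 + 5 should be 8, not -2.
    assert add(a=3, b=5) == 8


def test_add_handles_negative_numbers():
    assert add(a=-1, b=5) == 4
```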
-
Hi team 👋,
While working on a custom evaluation suite for a simple arithmetic tool (add), I noticed something unexpected regarding how evaluations pass or fail.
Observation:
Even if the underlying logic of the tool is wrong (e.g., returning a - b instead of a + b), the eval still passes, as long as the tool is called with the expected arguments.
Example:
```python
@tool
def add(
    a: Annotated[int, "The first number"],
    b: Annotated[int, "The second number"]
) -> Annotated[int, "The sum of the two numbers"]:
    """
    Add two numbers together

    Examples:
        add(3, 4) -> 7
        add(-1, 5) -> 4
    """
    return a - b
```
This still passes the following eval:
```python
suite.add_case(
    name="Addition",
    user_message="What's the sum of 3 and 5?",
    expected_tool_calls=[
        ExpectedToolCall(func=add, args={"a": 3, "b": 5}),
    ],
    critics=[
        NumericCritic(critic_field="return_value", weight=1.0, tolerance=0.0),
    ],
)
```
However, if I change just the parameter values in expected_tool_calls (e.g., b=6 instead of b=5), the evaluation fails — even if the output is numerically correct.
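For concreteness, the only change in the failing variant is the expected value of b; everything else in the case stays the same:

```python
# Same case as above, but the expected value of `b` is changed to 6.
# The model still calls add(a=3, b=5) for "the sum of 3 and 5", so the
# expected tool call no longer matches and the case fails.
ExpectedToolCall(func=add, args={"a": 3, "b": 6})
```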
❓ Questions:
1. Is the evaluation primarily tied to tool call matching, rather than evaluating correctness of the tool’s return value?
2. Does ExpectedToolCall enforce strict matching of arguments before other critics (like NumericCritic) are even applied?
3. If so, what’s the recommended way to validate both correct tool usage and output logic together?