Skip to content

Pydantic Typing #268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 20, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/demo/sequence_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ def generate_itinerary(destination, start_date, end_date):
judgment.assert_test(
project_name="travel_agent_demo",
examples=[example],
scorers=[ToolOrderScorer(threshold=0.5)],
scorers=[ToolOrderScorer()],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

It seems the threshold parameter was removed from the ToolOrderScorer instantiation. The ToolOrderScorer class (defined in src/judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py) has a default threshold of 1.0. Previously, it was explicitly set to 0.5 here.

This change means the scorer will now be stricter (requiring a perfect score of 1.0 by default, instead of 0.5).

Could you clarify if this change in scoring behavior for the test is intentional? If this test was designed to pass or measure behavior at a 0.5 threshold, changing it to 1.0 might alter the test's purpose or outcome.

model="gpt-4.1-mini",
function=generate_itinerary,
tracer=tracer,
Expand Down
13 changes: 7 additions & 6 deletions src/judgeval/data/example.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from pydantic import BaseModel, Field, field_validator
from enum import Enum
from datetime import datetime
from judgeval.data.tool import Tool
import time


Expand All @@ -31,7 +32,7 @@ class Example(BaseModel):
retrieval_context: Optional[List[str]] = None
additional_metadata: Optional[Dict[str, Any]] = None
tools_called: Optional[List[str]] = None
expected_tools: Optional[List[Dict[str, Any]]] = None
expected_tools: Optional[List[Tool]] = None
name: Optional[str] = None
example_id: str = Field(default_factory=lambda: str(uuid4()))
example_index: Optional[int] = None
Expand Down Expand Up @@ -82,17 +83,17 @@ def validate_expected_output(cls, v):
raise ValueError(f"All items in expected_output must be strings but got {v}")
return v

@field_validator('expected_tools', mode='before')
@field_validator('expected_tools')
@classmethod
def validate_expected_tools(cls, v):
if v is not None:
if not isinstance(v, list):
raise ValueError(f"Expected tools must be a list of dictionaries or None but got {v} of type {type(v)}")
raise ValueError(f"Expected tools must be a list of Tools or None but got {v} of type {type(v)}")

# Check that each item in the list is a dictionary
# Check that each item in the list is a Tool
for i, item in enumerate(v):
if not isinstance(item, dict):
raise ValueError(f"Expected tools must be a list of dictionaries, but item at index {i} is {item} of type {type(item)}")
if not isinstance(item, Tool):
raise ValueError(f"Expected tools must be a list of Tools, but item at index {i} is {item} of type {type(item)}")

return v

Expand Down
19 changes: 19 additions & 0 deletions src/judgeval/data/tool.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
from pydantic import BaseModel, field_validator
from typing import Dict, Any, Optional
import warnings

class Tool(BaseModel):
tool_name: str
parameters: Optional[Dict[str, Any]] = None

@field_validator('tool_name')
def validate_tool_name(cls, v):
if not v:
warnings.warn("Tool name is empty or None", UserWarning)
return v
Comment on lines +9 to +13
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The validate_tool_name validator currently issues a UserWarning if the tool_name is empty or None.

Is an empty or missing tool_name a permissible state for a Tool object? If a tool_name is essential for the Tool to be valid and usable, it might be more appropriate to raise a ValueError here. This would prevent the creation of Tool objects with invalid names and provide a clearer error signal to the developer.

What are your thoughts on making this validation stricter?

Suggested change
@field_validator('tool_name')
def validate_tool_name(cls, v):
if not v:
warnings.warn("Tool name is empty or None", UserWarning)
return v
@field_validator('tool_name')
def validate_tool_name(cls, v):
if not v:
raise ValueError("Tool name cannot be empty or None")
return v


@field_validator('parameters')
def validate_parameters(cls, v):
if v is not None and not isinstance(v, dict):
warnings.warn(f"Parameters should be a dictionary, got {type(v)}", UserWarning)
return v
Comment on lines +15 to +19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This validator checks if parameters (aliased as v here) is not a dictionary when it's not None, and issues a warning.

Since Pydantic field validators default to mode='after', this validator runs after Pydantic has attempted its own type validation/coercion for the parameters field (which is typed as Optional[Dict[str, Any]]).

If a non-dictionary value (that Pydantic cannot coerce to a dict) is provided for parameters, Pydantic itself would typically raise a ValidationError before this custom validator even runs. For example, Tool(tool_name="test", parameters="not-a-dict") would likely fail Pydantic's initial parsing.

Therefore, the condition v is not None and not isinstance(v, dict) might rarely (if ever) be true when this validator executes in 'after' mode, as Pydantic would have already ensured v is a dict or raised an error.

Could you clarify the intended behavior here?
If the goal is to provide a custom warning before Pydantic's error, mode='before' might be needed. Otherwise, this specific check for isinstance(v, dict) might be redundant with Pydantic's built-in type enforcement.

3 changes: 2 additions & 1 deletion src/judgeval/data/trace.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from pydantic import BaseModel
from typing import Optional, Dict, Any, List
from judgeval.evaluation_run import EvaluationRun
from judgeval.data.tool import Tool
import json
from datetime import datetime, timezone

Expand All @@ -17,7 +18,7 @@ class TraceSpan(BaseModel):
duration: Optional[float] = None
annotation: Optional[List[Dict[str, Any]]] = None
evaluation_runs: Optional[List[EvaluationRun]] = []
expected_tools: Optional[List[Dict[str, Any]]] = None
expected_tools: Optional[List[Tool]] = None
additional_metadata: Optional[Dict[str, Any]] = None

def model_dump(self, **kwargs):
Expand Down
6 changes: 3 additions & 3 deletions src/tests/data/test_example.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
from datetime import datetime
from pydantic import ValidationError
from judgeval.data import Example

from judgeval.data.tool import Tool

def test_basic_example_creation():
example = Example(
Expand All @@ -30,7 +30,7 @@ def test_full_example_creation():
retrieval_context=["retrieval1", "retrieval2"],
additional_metadata={"key": "value"},
tools_called=["tool1", "tool2"],
expected_tools=[{"tool_name": "expected_tool1"}, {"tool_name": "expected_tool2"}],
expected_tools=[Tool(tool_name="expected_tool1"), Tool(tool_name="expected_tool2")],
name="test example",
example_id="123",
timestamp="20240101_120000",
Expand All @@ -43,7 +43,7 @@ def test_full_example_creation():
assert example.retrieval_context == ["retrieval1", "retrieval2"]
assert example.additional_metadata == {"key": "value"}
assert example.tools_called == ["tool1", "tool2"]
assert example.expected_tools == [{"tool_name": "expected_tool1"}, {"tool_name": "expected_tool2"}]
assert example.expected_tools == [Tool(tool_name="expected_tool1"), Tool(tool_name="expected_tool2")]
assert example.name == "test example"
assert example.example_id == "123"
assert example.timestamp == "20240101_120000"
Expand Down