-
Notifications
You must be signed in to change notification settings - Fork 82
Sequence to Trace Conversion #254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @alanzhang25, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR focuses on a significant refactor within the judgeval
library, specifically transitioning from the Sequence
concept for evaluation runs to using Trace
objects. This involves renaming core evaluation functions and data structures, updating how evaluation data is handled and sent to the backend API, and removing deprecated Sequence
-specific logic from dataset operations and tests. Additionally, the tracer has been updated to support an offline mode.
Highlights
- Sequence Deprecation: The
Sequence
data structure and related functionality for defining evaluation runs have been removed or replaced. - Trace-based Evaluation: Evaluation runs now utilize
Trace
objects (specificallyTraceSpan
s) as the primary unit for structured evaluation, particularly when evaluating agent functions that produce traces. - API Endpoint Updates: The API endpoint for running sequence evaluations (
/evaluate_sequence/
) has been replaced with a new endpoint for trace evaluations (/evaluate_trace/
). The endpoint for retrieving sequences from traces has also been removed. - Dataset Simplification: Dataset operations and tests related to storing and appending
Sequence
objects have been removed, simplifying the dataset model to primarily handleExample
s. - Tracer Offline Mode: The
Tracer
now includes anoffline_mode
to control whether trace data is sent to the backend API during execution, which is utilized in the new trace evaluation flow.
Changelog
Click here to see the changelog
- src/demo/sequence_test.py
- Removed old sequence evaluation logic (lines 1-16).
- Removed another old sequence evaluation logic block (lines 1-46).
- Modified
search_tavily
to return a hardcoded string instead of calling the Tavily API (lines 21-24). - Commented out the OpenAI API call in
create_travel_plan
and returned a hardcoded string (lines 84-92). - Added
project_name="travel_agent_demo"
to thejudgment.assert_test
call (line 145).
- src/e2etests/test_all_scorers.py
- Removed import of
Sequence
(line 26). - Removed the
test_derailment_scorer
function which usedSequence
(lines 536-560).
- Removed import of
- src/e2etests/test_dataset_operations.py
- Removed import of
Sequence
(line 11). - Removed the
test_dataset_with_sequence
function (lines 41-59). - Removed the
test_append_sequence_dataset
function (lines 136-179).
- Removed import of
- src/judgeval/common/tracer.py
- Added
offline_mode
to the trace data payload inTraceSpan.save
(line 1057). - Added a check in
TraceSpan.save
to only callsave_trace
if not in offline mode (line 1062). - Added
offline_mode
parameter toTracer.__init__
(line 1317). - Stored
offline_mode
as an instance variable inTracer.__init__
(line 1358).
- Added
- src/judgeval/constants.py
- Renamed
JUDGMENT_SEQUENCE_EVAL_API_URL
toJUDGMENT_TRACE_EVAL_API_URL
(line 43). - Removed
JUDGMENT_DATASETS_APPEND_SEQUENCES_API_URL
(line 46). - Removed
JUDGMENT_RETRIEVE_SEQUENCE_FROM_TRACE_API_URL
(line 53).
- Renamed
- src/judgeval/data/init.py
- Removed import of
Sequence
(line 5). - Removed
Sequence
from the__all__
list (line 17).
- Removed import of
- src/judgeval/data/datasets/dataset.py
- Removed import of
Sequence
(line 10). - Removed
sequences
field fromEvalDataset
dataclass (line 16). - Removed
sequences
parameter and assignment fromEvalDataset.__init__
(lines 24, 31). - Removed
add_sequence
method (lines 227-228). - Removed
sequences
from__str__
representation (line 316).
- Removed import of
- src/judgeval/data/datasets/eval_dataset_client.py
- Removed import of
Sequence
(line 17). - Removed
JUDGMENT_DATASETS_APPEND_SEQUENCES_API_URL
constant import (line 10). - Removed
sequences
field andis_sequence
from the payload inpush
method (lines 62-63). - Removed
append_sequences
method (lines 155-210). - Removed assignment of
sequences
from the pulled payload inpull
method (line 265).
- Removed import of
- src/judgeval/data/example.py
- Removed
sequence_order
field fromExample
model (line 40).
- Removed
- src/judgeval/data/result.py
- Replaced import of
Sequence
withTraceSpan
(line 6). - Updated the type hint for
data_object
to includeTraceSpan
instead ofSequence
(line 27). - Updated the type hint for
data_object
ingenerate_scoring_result
to includeTraceSpan
instead ofSequence
(line 52).
- Replaced import of
- src/judgeval/data/sequence.py
- Deleted the entire file defining the
Sequence
class.
- Deleted the entire file defining the
- src/judgeval/data/trace.py
- Added
expected_tools
andadditional_metadata
fields toTraceSpan
(lines 16-17).
- Added
- src/judgeval/data/trace_run.py
- Replaced import of
Sequence
withTrace
(line 4). - Renamed
SequenceRun
class toTraceRun
(line 10). - Updated docstring to refer to
traces
instead ofsequences
(line 17). - Renamed
sequences
field totraces
(line 32).
- Replaced import of
- src/judgeval/judgment_client.py
- Replaced import of
Sequence
withTrace
(line 15). - Replaced import of
SequenceRun
withTraceRun
(line 28). - Renamed
run_sequence_eval
torun_trace_eval
in imports (line 26). - Renamed
run_sequence_evaluation
method torun_trace_evaluation
(line 108). - Updated parameter name from
sequences
totraces
inrun_trace_evaluation
(line 112). - Updated default
eval_run_name
inrun_trace_evaluation
(line 117). - Updated parameter name from
sequences
totraces
in validation checks withinrun_trace_evaluation
(lines 137, 140). - Renamed
sequence_run
variable totrace_run
inrun_trace_evaluation
(line 143). - Updated parameter name from
sequences
totraces
when creatingTraceRun
object (line 146). - Updated call to
run_trace_eval
and variable name fromsequence_run
totrace_run
(line 155). - Updated error message to refer to
TraceRun
(line 157). - Removed
append_sequence_dataset
method (lines 248-252). - Changed call in
assert_test
fromrun_sequence_evaluation
torun_trace_evaluation
(line 520).
- Replaced import of
- src/judgeval/run_evaluation.py
- Removed import of
Sequence
(line 16). - Removed
JUDGMENT_SEQUENCE_EVAL_API_URL
andJUDGMENT_RETRIEVE_SEQUENCE_FROM_TRACE_API_URL
constants (lines 28, 32). - Replaced import of
SequenceRun
withTraceRun
(line 40). - Renamed
execute_api_sequence_eval
function toexecute_api_trace_eval
(line 99). - Renamed parameter from
sequence_run
totrace_run
inexecute_api_trace_eval
(line 99). - Updated payload creation and API URL in
execute_api_trace_eval
(lines 106-114). - Updated type hint for
run
parameter inlog_evaluation_results
to includeTraceRun
instead ofSequenceRun
(line 283). - Removed
retrieve_sequence_from_trace
function (lines 330-372). - Renamed
run_sequence_eval
function torun_trace_eval
(line 371). - Renamed parameter from
sequence_run
totrace_run
inrun_trace_eval
(line 371). - Updated checks for
override
,log_results
, andappend
to usetrace_run
(lines 373, 381). - Renamed
new_sequences
list tonew_traces
(line 392). - Set
tracer.offline_mode = True
whenfunction
andtracer
are provided (line 393). - Modified logic to iterate over
tracer.traces
, createTrace
objects, setexpected_tools
on the root span, and append tonew_traces
(lines 399-403). - Assigned
new_traces
totrace_run.traces
(line 404). - Removed loop setting scorers on sequences (lines 453-454).
- Updated call to
execute_api_trace_eval
and variable name fromsequence_run
totrace_run
(line 410). - Updated error message to refer to
TraceRun
(line 417). - Updated call to
log_evaluation_results
and variable name fromsequence_run
totrace_run
(line 423).
- Removed import of
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with π and π on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
From sequence to trace,
A new evaluation space.
Code flows, spans bloom,
Dispelling the gloom,
For bugs, there's no hiding place.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. β©
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request successfully transitions the evaluation framework from using Sequence
objects to Trace
objects, particularly for evaluating the execution of a function. The changes involve renaming classes and methods, updating data models, and removing deprecated Sequence
-specific functionality across the codebase. The implementation appears consistent and well-executed.
Summary of Findings
- Test Coverage for Trace Evaluation: Several tests related to
Sequence
evaluation and datasets containingSequence
objects have been removed. While this is correct given the PR's goal, there appear to be gaps in test coverage for the newTrace
-based evaluation functionality, particularly for specific scorers likeDerailmentScorer
and for dataset operations involving traces.
Merge Readiness
The core logic changes for converting from Sequence to Trace evaluation seem correct and well-implemented. However, the absence of corresponding tests for the new Trace
-based functionality, especially for dataset operations and specific scorers, means there's reduced confidence in the completeness and correctness of the new features. I recommend addressing the test coverage gaps for Trace
evaluation and dataset operations before merging. I am unable to approve this pull request directly; please have other reviewers review and approve this code before merging.
6de92f3
to
eb78eb3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick questions - looks good to me! Gonna run some tests with your PR's
@@ -79,7 +79,7 @@ def validate_eval_name(cls, v, values): | |||
raise ValueError("Eval name is required when log_results is True. Please include the eval_run_name argument.") | |||
return v | |||
|
|||
@field_validator('examples', mode='before') | |||
@field_validator('examples') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can we remove the mode='before'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows for building the pydantic object(with the **trace) so that it doesnt validate before converting it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think if this is best practice
Can you Tool calling metric to the docs? Work off of Galen's branch since it's not merged in yet: |
3684de3
to
9f6480d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
π Summary
π― Purpose
π₯ Demo of Changes
π§ͺ Testing
β Checklist
π Linear Issue
βοΈ Additional Notes