Skip to content

Sequence to Trace Conversion #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 19, 2025
Merged

Sequence to Trace Conversion #254

merged 6 commits into from
May 19, 2025

Conversation

alanzhang25
Copy link
Collaborator

πŸ“ Summary

🎯 Purpose

πŸŽ₯ Demo of Changes

πŸ§ͺ Testing

βœ… Checklist

  • Self-review
  • Video demo of changes
  • Unit Tests and CI/CD tests are passing
  • Reviewers assigned

πŸ“Œ Linear Issue

✏️ Additional Notes

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @alanzhang25, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR focuses on a significant refactor within the judgeval library, specifically transitioning from the Sequence concept for evaluation runs to using Trace objects. This involves renaming core evaluation functions and data structures, updating how evaluation data is handled and sent to the backend API, and removing deprecated Sequence-specific logic from dataset operations and tests. Additionally, the tracer has been updated to support an offline mode.

Highlights

  • Sequence Deprecation: The Sequence data structure and related functionality for defining evaluation runs have been removed or replaced.
  • Trace-based Evaluation: Evaluation runs now utilize Trace objects (specifically TraceSpans) as the primary unit for structured evaluation, particularly when evaluating agent functions that produce traces.
  • API Endpoint Updates: The API endpoint for running sequence evaluations (/evaluate_sequence/) has been replaced with a new endpoint for trace evaluations (/evaluate_trace/). The endpoint for retrieving sequences from traces has also been removed.
  • Dataset Simplification: Dataset operations and tests related to storing and appending Sequence objects have been removed, simplifying the dataset model to primarily handle Examples.
  • Tracer Offline Mode: The Tracer now includes an offline_mode to control whether trace data is sent to the backend API during execution, which is utilized in the new trace evaluation flow.

Changelog

Click here to see the changelog
  • src/demo/sequence_test.py
    • Removed old sequence evaluation logic (lines 1-16).
    • Removed another old sequence evaluation logic block (lines 1-46).
    • Modified search_tavily to return a hardcoded string instead of calling the Tavily API (lines 21-24).
    • Commented out the OpenAI API call in create_travel_plan and returned a hardcoded string (lines 84-92).
    • Added project_name="travel_agent_demo" to the judgment.assert_test call (line 145).
  • src/e2etests/test_all_scorers.py
    • Removed import of Sequence (line 26).
    • Removed the test_derailment_scorer function which used Sequence (lines 536-560).
  • src/e2etests/test_dataset_operations.py
    • Removed import of Sequence (line 11).
    • Removed the test_dataset_with_sequence function (lines 41-59).
    • Removed the test_append_sequence_dataset function (lines 136-179).
  • src/judgeval/common/tracer.py
    • Added offline_mode to the trace data payload in TraceSpan.save (line 1057).
    • Added a check in TraceSpan.save to only call save_trace if not in offline mode (line 1062).
    • Added offline_mode parameter to Tracer.__init__ (line 1317).
    • Stored offline_mode as an instance variable in Tracer.__init__ (line 1358).
  • src/judgeval/constants.py
    • Renamed JUDGMENT_SEQUENCE_EVAL_API_URL to JUDGMENT_TRACE_EVAL_API_URL (line 43).
    • Removed JUDGMENT_DATASETS_APPEND_SEQUENCES_API_URL (line 46).
    • Removed JUDGMENT_RETRIEVE_SEQUENCE_FROM_TRACE_API_URL (line 53).
  • src/judgeval/data/init.py
    • Removed import of Sequence (line 5).
    • Removed Sequence from the __all__ list (line 17).
  • src/judgeval/data/datasets/dataset.py
    • Removed import of Sequence (line 10).
    • Removed sequences field from EvalDataset dataclass (line 16).
    • Removed sequences parameter and assignment from EvalDataset.__init__ (lines 24, 31).
    • Removed add_sequence method (lines 227-228).
    • Removed sequences from __str__ representation (line 316).
  • src/judgeval/data/datasets/eval_dataset_client.py
    • Removed import of Sequence (line 17).
    • Removed JUDGMENT_DATASETS_APPEND_SEQUENCES_API_URL constant import (line 10).
    • Removed sequences field and is_sequence from the payload in push method (lines 62-63).
    • Removed append_sequences method (lines 155-210).
    • Removed assignment of sequences from the pulled payload in pull method (line 265).
  • src/judgeval/data/example.py
    • Removed sequence_order field from Example model (line 40).
  • src/judgeval/data/result.py
    • Replaced import of Sequence with TraceSpan (line 6).
    • Updated the type hint for data_object to include TraceSpan instead of Sequence (line 27).
    • Updated the type hint for data_object in generate_scoring_result to include TraceSpan instead of Sequence (line 52).
  • src/judgeval/data/sequence.py
    • Deleted the entire file defining the Sequence class.
  • src/judgeval/data/trace.py
    • Added expected_tools and additional_metadata fields to TraceSpan (lines 16-17).
  • src/judgeval/data/trace_run.py
    • Replaced import of Sequence with Trace (line 4).
    • Renamed SequenceRun class to TraceRun (line 10).
    • Updated docstring to refer to traces instead of sequences (line 17).
    • Renamed sequences field to traces (line 32).
  • src/judgeval/judgment_client.py
    • Replaced import of Sequence with Trace (line 15).
    • Replaced import of SequenceRun with TraceRun (line 28).
    • Renamed run_sequence_eval to run_trace_eval in imports (line 26).
    • Renamed run_sequence_evaluation method to run_trace_evaluation (line 108).
    • Updated parameter name from sequences to traces in run_trace_evaluation (line 112).
    • Updated default eval_run_name in run_trace_evaluation (line 117).
    • Updated parameter name from sequences to traces in validation checks within run_trace_evaluation (lines 137, 140).
    • Renamed sequence_run variable to trace_run in run_trace_evaluation (line 143).
    • Updated parameter name from sequences to traces when creating TraceRun object (line 146).
    • Updated call to run_trace_eval and variable name from sequence_run to trace_run (line 155).
    • Updated error message to refer to TraceRun (line 157).
    • Removed append_sequence_dataset method (lines 248-252).
    • Changed call in assert_test from run_sequence_evaluation to run_trace_evaluation (line 520).
  • src/judgeval/run_evaluation.py
    • Removed import of Sequence (line 16).
    • Removed JUDGMENT_SEQUENCE_EVAL_API_URL and JUDGMENT_RETRIEVE_SEQUENCE_FROM_TRACE_API_URL constants (lines 28, 32).
    • Replaced import of SequenceRun with TraceRun (line 40).
    • Renamed execute_api_sequence_eval function to execute_api_trace_eval (line 99).
    • Renamed parameter from sequence_run to trace_run in execute_api_trace_eval (line 99).
    • Updated payload creation and API URL in execute_api_trace_eval (lines 106-114).
    • Updated type hint for run parameter in log_evaluation_results to include TraceRun instead of SequenceRun (line 283).
    • Removed retrieve_sequence_from_trace function (lines 330-372).
    • Renamed run_sequence_eval function to run_trace_eval (line 371).
    • Renamed parameter from sequence_run to trace_run in run_trace_eval (line 371).
    • Updated checks for override, log_results, and append to use trace_run (lines 373, 381).
    • Renamed new_sequences list to new_traces (line 392).
    • Set tracer.offline_mode = True when function and tracer are provided (line 393).
    • Modified logic to iterate over tracer.traces, create Trace objects, set expected_tools on the root span, and append to new_traces (lines 399-403).
    • Assigned new_traces to trace_run.traces (line 404).
    • Removed loop setting scorers on sequences (lines 453-454).
    • Updated call to execute_api_trace_eval and variable name from sequence_run to trace_run (line 410).
    • Updated error message to refer to TraceRun (line 417).
    • Updated call to log_evaluation_results and variable name from sequence_run to trace_run (line 423).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with πŸ‘ and πŸ‘Ž on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


From sequence to trace,
A new evaluation space.
Code flows, spans bloom,
Dispelling the gloom,
For bugs, there's no hiding place.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request successfully transitions the evaluation framework from using Sequence objects to Trace objects, particularly for evaluating the execution of a function. The changes involve renaming classes and methods, updating data models, and removing deprecated Sequence-specific functionality across the codebase. The implementation appears consistent and well-executed.

Summary of Findings

  • Test Coverage for Trace Evaluation: Several tests related to Sequence evaluation and datasets containing Sequence objects have been removed. While this is correct given the PR's goal, there appear to be gaps in test coverage for the new Trace-based evaluation functionality, particularly for specific scorers like DerailmentScorer and for dataset operations involving traces.

Merge Readiness

The core logic changes for converting from Sequence to Trace evaluation seem correct and well-implemented. However, the absence of corresponding tests for the new Trace-based functionality, especially for dataset operations and specific scorers, means there's reduced confidence in the completeness and correctness of the new features. I recommend addressing the test coverage gaps for Trace evaluation and dataset operations before merging. I am unable to approve this pull request directly; please have other reviewers review and approve this code before merging.

@alanzhang25 alanzhang25 force-pushed the az-sequence-to-trace branch from 6de92f3 to eb78eb3 Compare May 18, 2025 18:28
@alanzhang25 alanzhang25 marked this pull request as ready for review May 18, 2025 18:29
@JCamyre JCamyre self-requested a review May 19, 2025 02:57
@JCamyre JCamyre added the In Review Task is currently being reviewed label May 19, 2025
Copy link
Collaborator

@JCamyre JCamyre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick questions - looks good to me! Gonna run some tests with your PR's

@@ -79,7 +79,7 @@ def validate_eval_name(cls, v, values):
raise ValueError("Eval name is required when log_results is True. Please include the eval_run_name argument.")
return v

@field_validator('examples', mode='before')
@field_validator('examples')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we remove the mode='before'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows for building the pydantic object(with the **trace) so that it doesnt validate before converting it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸ‘

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think if this is best practice

@JCamyre
Copy link
Collaborator

JCamyre commented May 19, 2025

Can you Tool calling metric to the docs? Work off of Galen's branch since it's not merged in yet: doc_update-ethan
Thanks

@alanzhang25 alanzhang25 force-pushed the az-sequence-to-trace branch from 3684de3 to 9f6480d Compare May 19, 2025 18:30
Copy link
Collaborator

@JCamyre JCamyre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JCamyre JCamyre merged commit 186a33b into main May 19, 2025
3 of 4 checks passed
@JCamyre JCamyre deleted the az-sequence-to-trace branch May 19, 2025 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
In Review Task is currently being reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants