Staging -> Prod #303


Merged
merged 31 commits into from
Jun 6, 2025

Conversation

justinsheu (Contributor)

No description provided.

justinsheu and others added 30 commits May 22, 2025 13:51
* add auto package release

* add staging stuff

* add test trigger

* remove test trigger

* timeout + minor error handling

* gemini suggestions
* update litellm pyproject

* add e2etest fix
* Agent Names added

* Fixed agent names for deep tracing

* Minor fixes

* Dummy commit to trigger staging checks
* fix assert test e2e test

* Update src/judgeval/judgment_client.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* add evaluation link

* fix run eval

* add e2etest for yaml
* updated logos and socials badges

* Update README.md

* Delete package-lock.json

---------

Co-authored-by: Minh Pham <62208322+Mandolaro@users.noreply.github.com>
* updates and s3 e2e tests

* clear singleton before tests

* coverage, fix http request blocker

* test trigger

* if always block

* change working directory

* fix

* remove test trigger

* gemini suggestions
* Sequence to Trace Conversion

* Add agent names with a new decorator

* Sequence to Trace Conversion

* trace save

* comment out;

* Changes to yaml and agent file

* pydantic

* Added tool dependency metric

* Changed yaml

* Add support for multiple assert test calls

* Added parameter checking to ToolDependency metric

* Agent Names added

* Fixed tool dependency scorer with pydantic change

* Added internal scorer for parameter checking

* Support for action dependencies added

* Changed multi agent test to better showcase where param checking helps

* Added error check for providing no tools when using param checking

* Remove unused parameter

---------

Co-authored-by: Alan <alanzhang2021@gmail.com>
Co-authored-by: Minh Pham <62208322+Mandolaro@users.noreply.github.com>
Co-authored-by: JCamyre <jwcamry03@gmail.com>
* Refactor agent names

* Minor fix

* Minor Gemini fix

* Minor fixes to e2etest and TraceSpan datatype
* fix logging for assert test trace

* fix agent
* classifierscorer refactor, other e2es + uts additions

* removed e2e

* small updates

* fix uts

* remove print
* Raise warning for missing Org ID or API Key

* fix tests

* fix tests
* agent state into tracer

* model dump state before and after

* revert return

* add field mappings

* fix feedback
* feat: tracer http errors as separate part of error output

* Update src/judgeval/common/tracer.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* chore: cleanup commented code

---------

Co-authored-by: Joseph S Camyre <68767176+JCamyre@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update Docs link

* Add updated Discord link and add links to subsections.
* Example Check and User Prompt

* Test Fixes

* Required Params for Custom Scorers
* Trace Datasets

* comment fix

* python class fix

---------

Co-authored-by: Aaryan Divate <44125685+adivate2021@users.noreply.github.com>
Co-authored-by: JCamyre <jwcamry03@gmail.com>
gemini-code-assist bot (Contributor) left a comment

Hello @justinsheu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here with a summary of this pull request titled "Staging -> Prod". Based on the changes included, this PR appears to be a merge from a staging branch containing several functional updates rather than just a simple production sync. The core changes involve significant enhancements to the tracing functionality, including the ability to track the state of objects before and after function calls, and improved error reporting. Additionally, the dataset handling has been updated to support storing and retrieving traces alongside examples. A new validation step has been added to the evaluation process to ensure examples have the necessary data for the selected scorers, and the JudgmentClient now enforces the presence of API key and organization ID during initialization. Various test files have been updated to reflect these changes.

Highlights

  • Tracing Enhancements: Added the capability to record the state of tracked objects (state_before, state_after) within trace spans. Improved error capture in the tracer, including the ability to record HTTP-specific details like URL and status code for relevant exceptions.
  • Dataset Trace Support: Modified the EvalDataset structure and the EvalDatasetClient to allow datasets to contain and manage traces in addition to evaluation examples. This enables associating traces directly with the data used for evaluation.
  • Evaluation Parameter Validation: Implemented a check in the run_evaluation workflow that verifies if examples contain all parameters required by the configured scorers. If missing parameters are found, a warning is displayed, and the user is prompted to continue or exit.
  • JudgmentClient Initialization: Updated the JudgmentClient constructor to explicitly check for the presence of judgment_api_key and organization_id, raising a ValueError if either is missing.
  • Documentation and Test Updates: Updated README links and Table of Contents. Modified various end-to-end and unit tests to align with the changes in tracing, dataset handling, scorer requirements, and client initialization.
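To make the state-tracking highlight concrete, here is a minimal stdlib-only sketch of the mechanism; the decorator, span dictionary, and class names below are illustrative stand-ins, not judgeval's actual identify/observe API:

```python
import copy
from functools import wraps

SPANS = []  # stand-in for the trace_spans a TraceClient would record

def observe_with_state(fn):
    """Snapshot instance attributes before and after the call, mimicking
    the new state_before/state_after fields on TraceSpan."""
    @wraps(fn)
    def wrapper(self, *args, **kwargs):
        span = {"function": fn.__name__,
                "state_before": copy.deepcopy(vars(self))}
        result = fn(self, *args, **kwargs)
        span["state_after"] = copy.deepcopy(vars(self))
        SPANS.append(span)
        return result
    return wrapper

class CartAgent:
    def __init__(self):
        self.items = []

    @observe_with_state
    def add_item(self, name):
        self.items.append(name)

agent = CartAgent()
agent.add_item("book")
print(SPANS[0]["state_before"], SPANS[0]["state_after"])
# {'items': []} {'items': ['book']}
```

The real implementation additionally gates capture on the identify decorator's track_state flag and field_mappings, per the changelog below.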

Changelog

  • README.md
    • Updated Docs link from judgment.mintlify.app/getting_started to docs.judgmentlabs.ai/introduction (line 12).
    • Updated Discord invite link from FMxHkYTtFE to ZCnSXYug (line 16).
    • Restructured and expanded the Table of Contents (lines 31-52).
  • src/e2etests/test_all_scorers.py
    • Imported ExampleParams from judgeval.data.example (line 29).
    • Added required_params=[ExampleParams.INPUT, ExampleParams.ACTUAL_OUTPUT] to SentimentScorer initialization in test_sentiment_scorer (line 686).
  • src/e2etests/test_eval_operations.py
    • Replaced HallucinationScorer with AnswerRelevancyScorer in run_eval_helper (line 58).
    • Replaced FaithfulnessScorer and AnswerRelevancyScorer with a single AnswerRelevancyScorer in test_assert_test (lines 167-168), and updated the scorers list passed to assert_test (line 174).
  • src/e2etests/test_judgee_traces_update.py
    • Renamed the key entries to trace_spans in trace data dictionaries used in tests (lines 184, 275, 466).
    • Updated comment to reflect the change from entries to trace_spans (line 357).
    • Updated keys within local_trace_data from entries to trace_spans (lines 491, 492).
  • src/e2etests/test_tracer.py
    • Removed unused helper function print_trace_hierarchy (lines 594-603).
    • Updated comment about nested spans from entries to trace_spans (line 138).
  • src/judgeval/common/tracer.py
    • Moved import json to a different section (lines 8, 18).
    • Removed unused imports Type, TypeVar, Set, BaseModel, ScoringResult (lines 32-37, 45).
    • Imported ExcInfo from judgeval.common.utils (line 63).
    • Changed default value for project_name in TraceClient.__init__ from "default_project" to None (line 305).
    • Initialized self.project_name with a new UUID if the provided project_name is None (line 315).
    • Added record_state_before and record_state_after methods to TraceClient to record agent state (lines 505-525).
    • Changed the type hint for the error parameter in TraceClient.record_error from Any to Dict[str, Any] (line 560).
    • Renamed the key entries to trace_spans in the dictionary returned by TraceClient.save (line 599).
    • Updated the type hint for exc_info in _capture_exception_for_trace to use the new ExcInfo type alias (line 619).
    • Added logic in _capture_exception_for_trace to extract and record HTTP details (URL, status code) from exceptions if available (lines 639-647).
    • Changed default value for project_name in Tracer.__init__ from "default_project" to None (line 948).
    • Initialized self.project_name with a new UUID if the provided project_name is None (line 976).
    • Modified the identify decorator to accept track_state, track_attributes, and field_mappings parameters, storing them in self.class_identifiers (lines 1109-1143).
    • Added helper methods _capture_instance_state, _get_instance_state_if_tracked, and _conditionally_capture_and_record_state to handle state capture based on the identify decorator configuration (lines 1148-1193).
    • Integrated calls to _conditionally_capture_and_record_state before and after the execution of the wrapped function/coroutine in the observe decorator wrappers (lines 1273-1274, 1286-1287, 1306-1307, 1319-1320, 1371-1372, 1384-1385, 1406-1407, 1419-1420).
    • Removed the _handle_error helper function (lines 1372-1377).
    • Integrated error capture logic directly into the traced async and sync functions (traced_create_async, traced_response_create_async, traced_create_sync, traced_response_create_sync) using _capture_exception_for_trace (lines 1508, 1524, 1556, 1571).
    • Updated get_instance_prefixed_name to access the identifier from the class configuration dictionary stored by the identify decorator (lines 2113-2114, 2120).
  • src/judgeval/common/utils.py
    • Imported TracebackType from types (line 15).
    • Added TypeAlias definitions for ExcInfo and OptExcInfo (lines 787-788).
  • src/judgeval/data/datasets/dataset.py
    • Imported Optional from typing (line 8).
    • Imported Trace from judgeval.data (line 10).
    • Added traces: List[Trace] field to EvalDataset dataclass (line 16).
    • Updated __init__ to accept an optional traces list and initialize self.traces (lines 25, 30).
    • Removed debug print statement from __init__ (line 25).
    • Added add_trace method to append a Trace object to the traces list (lines 226-227).
    • Included traces in the __str__ representation (line 315).
  • src/judgeval/data/datasets/eval_dataset_client.py
    • Imported Trace from judgeval.data (line 16).
    • Included traces in the payload sent to the insert API endpoint (line 61).
    • Deserialized traces from the API response in the pull method (line 206).
  • src/judgeval/data/trace.py
    • Added state_before and state_after fields (Optional[Dict[str, Any]]) to TraceSpan (lines 36-37).
    • Included state_before and state_after in the dictionary returned by TraceSpan.model_dump (lines 56-57).
    • Renamed the field entries to trace_spans in the Trace model (line 120).
  • src/judgeval/judgment_client.py
    • Updated type hints for judgment_api_key and organization_id in __init__ to Optional[str] (line 66).
    • Added validation checks in __init__ to raise ValueError if judgment_api_key or organization_id are None (lines 68-73).
  • src/judgeval/run_evaluation.py
    • Imported json (line 4).
    • Added prompt_user flag and logic to check_examples to prompt the user if missing parameters are found (lines 366-385).
    • Updated warning message formatting in check_examples using rprint and json.dumps (lines 374-377).
    • Updated trace.entries[0] to trace.trace_spans[0] in run_trace_eval to access the root span (line 423).
    • Added a call to check_examples before starting the API evaluation workflow in _async_evaluation_workflow (line 910).
  • src/judgeval/scorers/judgeval_scorer.py
    • Imported ExampleParams from judgeval.data.example (line 15).
    • Added required_params: Optional[List[ExampleParams]] field to JudgevalScorer (line 42).
    • Added required_params parameter to JudgevalScorer.__init__ and assigned it to self.required_params (lines 55, 92).
  • src/judgeval/scorers/prompt_scorer.py
    • Imported ExampleParams from judgeval.data.example (line 33).
    • Added required_params: Optional[List[ExampleParams]] = None parameter to PromptScorer.__init__ (line 68).
    • Passed required_params to the parent JudgevalScorer constructor (line 90).
  • src/tests/common/test_tracer.py
    • Updated comment in test_trace_client_nested_spans from entries to trace_spans (line 138).
  • src/tests/notification/test_notification_integration.py
    • Modified mock_validate_api_key fixture to also set the JUDGMENT_ORG_ID environment variable (line 24).
    • Updated JudgmentClient initialization in test_notification_with_multiple_methods and test_notification_integration_direct to explicitly include organization_id="test_org_id" (lines 274, 385).
    • Replaced FaithfulnessScorer with AnswerRelevancyScorer in rule conditions and scorer lists in tests (lines 294, 303, 405, 414).
    • Commented out payload verification logic related to rules/notifications in test_notification_with_multiple_methods (lines 432-442).
  • src/tests/notification/test_notification_unit.py
    • Added a new mock_validate_api_key fixture that also sets JUDGMENT_ORG_ID (lines 21-29).
    • Added @patch('judgeval.judgment_client.validate_api_key') decorator to TestNotificationWithJudgmentClient (line 309).
    • Updated test_judgment_client_with_rules_and_notification to use the patched validate_api_key and explicitly include organization_id="test_org_id" in JudgmentClient initialization (lines 318, 340).

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several significant enhancements and refactorings across the codebase. Key changes include the introduction of state tracking within traces, the ability to specify required parameters for scorers, and dataset support for traces. The data model for traces has been updated by renaming entries to trace_spans, and this change has been consistently applied. Additionally, validation for API keys and organization IDs has been improved in the JudgmentClient.

The PR title Staging -> Prod and the empty description body provide limited context about the overall goals of these changes. It would be beneficial to have more descriptive PR titles and bodies in the future to aid reviewers.

Overall, the changes are well-implemented, but there are a couple of areas in the tests and core library code that could be improved.

Summary of Findings

  • Abrupt Exit in Library Code: The check_examples function in src/judgeval/run_evaluation.py uses sys.exit(0), which can terminate the entire Python process. This is generally discouraged in library code as it can be an unexpected side effect for users. Raising a custom exception would be more idiomatic.
  • Commented-Out Test Assertions: In src/tests/notification/test_notification_integration.py, several assertions related to payload verification are commented out with a TODO. These should be addressed to ensure full test coverage and prevent potential regressions.
  • Default Project Name Generation: The change to generate a UUID for project_name in src/judgeval/common/tracer.py if not provided ensures uniqueness, which is good. A minor consideration for developer experience could be if a more human-readable default (e.g., including a timestamp) might be beneficial for quick local tests, though UUIDs are functionally sound.
  • Improved Input Validation: The added ValueError checks for None API key or organization ID in src/judgeval/judgment_client.py significantly improve the robustness and user-friendliness of the client initialization.
  • Data Model Refactoring: The renaming of entries to trace_spans in the trace data model is a notable change and has been consistently applied across tests and an E2E test file. This improves clarity in the data model.
  • New Feature: State Tracking in Traces: The addition of record_state_before and record_state_after methods and the associated identify decorator enhancements in src/judgeval/common/tracer.py introduce valuable state tracking capabilities.
  • New Feature: Required Parameters for Scorers: The introduction of required_params in scorers allows for better validation of example data, which is a good enhancement for scorer robustness.
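The required_params validation described above amounts to checking each example for the fields its scorers need. A simplified sketch (the Example fields and helper name here are hypothetical stand-ins for judgeval's Example/ExampleParams types):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Example:
    # Hypothetical stand-in for judgeval's Example; real ExampleParams is an enum.
    input: Optional[str] = None
    actual_output: Optional[str] = None

def missing_required_params(example: Example, required_params: List[str]) -> List[str]:
    """Return the required fields that are unset on the example."""
    return [p for p in required_params if getattr(example, p) is None]

print(missing_required_params(Example(input="What is 2+2?"),
                              ["input", "actual_output"]))
# ['actual_output']
```

In the PR, check_examples performs this comparison for every example/scorer pair and warns (or prompts) when the list is non-empty.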

Merge Readiness

This pull request introduces valuable features and improvements. However, there are a couple of medium-severity issues that should be addressed before merging:

  1. The use of sys.exit(0) in src/judgeval/run_evaluation.py should be reconsidered in favor of raising an exception for better library behavior.
  2. The commented-out assertions in src/tests/notification/test_notification_integration.py need to be fixed or have a clear plan for resolution.

Once these points are addressed, the PR should be in good shape for merging. I am unable to approve pull requests, so please ensure these changes are reviewed and approved by another maintainer.

Comment on lines +380 to +385
if prompt_user:
    user_input = input("Do you want to continue? (y/n)")
    if user_input.lower() != "y":
        sys.exit(0)
    else:
        rprint("[green]Continuing...[/green]")
Severity: medium

The use of sys.exit(0) here might be too abrupt if run_evaluation (and by extension, check_examples) is intended to be used as a library function. Exiting the entire Python process can be an unexpected side effect for library users.

Would it be more appropriate to raise a custom exception (e.g., MissingExampleParametersError) that the caller can catch and handle? This would give more control to the user of the library.
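A sketch of that alternative; the exception name follows the suggestion above, and the helper is a hypothetical simplification, not check_examples' actual signature:

```python
class MissingExampleParametersError(Exception):
    """Raised instead of sys.exit(0), so library callers can catch and handle it."""
    def __init__(self, missing_params):
        self.missing_params = missing_params
        super().__init__(
            f"Examples are missing parameters required by the scorers: {missing_params}"
        )

def handle_missing_params(missing_params, user_input):
    # Stand-in for the sys.exit(0) branch: the caller now decides what to do.
    if user_input.lower() != "y":
        raise MissingExampleParametersError(missing_params)

try:
    handle_missing_params(["actual_output"], user_input="n")
except MissingExampleParametersError as exc:
    print(exc.missing_params)  # ['actual_output']
```

This keeps the interactive prompt behavior while letting non-interactive callers catch the error instead of having their process terminated.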

Comment on lines +432 to +442
# assert "rules" in payload
# rule_data = payload["rules"][0]
# assert "notification" in rule_data
# assert rule_data["notification"]["enabled"] is True
# assert len(rule_data["notification"]["communication_methods"]) == 4
# assert "slack" in rule_data["notification"]["communication_methods"]
# assert "email" in rule_data["notification"]["communication_methods"]
# assert "broadcast_slack" in rule_data["notification"]["communication_methods"]
# assert "broadcast_email" in rule_data["notification"]["communication_methods"]
# assert len(rule_data["notification"]["email_addresses"]) == 2
Severity: medium

These assertions for payload verification are commented out with a TODO. Could you please clarify the reason for this?

If these assertions are important for verifying the correctness of the notification payload, it would be ideal to have them active. If there's an issue preventing them from passing, it might be good to create a ticket to track fixing it. Leaving assertions commented out can lead to them being forgotten and potentially mask issues in the future.

@@ -307,7 +302,7 @@ def __init__(
     tracer: Optional["Tracer"],
     trace_id: Optional[str] = None,
     name: str = "default",
-    project_name: str = "default_project",
+    project_name: str = None,
Severity: medium

Changing the default project_name from a fixed string to None (and then to a UUID if None) is a good improvement for ensuring uniqueness when a project name isn't explicitly provided. This helps avoid unintentional overwrites or mixing of data if users forget to specify a project name.

Just a thought: while UUIDs ensure uniqueness, they are not very human-readable. For local development or quick tests where a user might not specify a project name, perhaps a more descriptive default like default_project_<timestamp> or similar could be considered if human-readability of default project names in the backend/UI is a concern. However, UUIDs are perfectly fine for ensuring data integrity.
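Both fallbacks under discussion can be sketched side by side (the function name is hypothetical; the PR itself uses the plain UUID form):

```python
import uuid
from datetime import datetime, timezone

def default_project_name(human_readable: bool = False) -> str:
    """A UUID guarantees uniqueness; a timestamped name is easier to scan in a UI
    but can collide if two tracers start within the same second."""
    if human_readable:
        return "default_project_" + datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return str(uuid.uuid4())

# Mirrors the fallback added in TraceClient.__init__ / Tracer.__init__:
project_name = None
project_name = project_name or default_project_name()
```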

Comment on lines +67 to +73
# Check if API key is None
if judgment_api_key is None:
    raise ValueError("JUDGMENT_API_KEY cannot be None. Please provide a valid API key or set the JUDGMENT_API_KEY environment variable.")

# Check if organization ID is None
if organization_id is None:
    raise ValueError("JUDGMENT_ORG_ID cannot be None. Please provide a valid organization ID or set the JUDGMENT_ORG_ID environment variable.")
Severity: medium

Adding these explicit checks for judgment_api_key and organization_id being None is a great improvement for robustness. Raising a ValueError provides clear feedback to the user if these critical parameters are missing.

This proactive validation will help prevent downstream errors that might be harder to debug.

adivate2021 (Contributor) left a comment

LGTM

adivate2021 merged commit 611bc0a into main Jun 6, 2025
4 of 5 checks passed
9 participants