Staging -> Prod #303
Conversation
* add auto package release * add staging stuff * add test trigger * remove test trigger * timeout + minor error handling * gemini suggestions
* update litellm pyproject * add e2etest fix
* Agent Names added * Fixed agent names for deep tracing * Minor fixes * Dummy commit to trigger staging checks
* fix assert test e2e test * Update src/judgeval/judgment_client.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* add evaluation link * fix run eval * add e2etest for yaml
* updated logos and socials badges * Update README.md * Delete package-lock.json --------- Co-authored-by: Minh Pham <62208322+Mandolaro@users.noreply.github.com>
* updates and s3 e2e tests * clear singleton before tests * coverage, fix http request blocker * test trigger * if always block * change working directory * fix * remove test trigger * gemini suggestions
* Sequence to Trace Conversion * Add agent names with a new decorator * Sequence to Trace Conversion * trace save * comment out; * Changes to yaml and agent file * pydantic * Added tool dependency metric * Changed yaml * Add support for multiple assert test calls * Added parameter checking to ToolDependency metric * Agent Names added * Fixed tool dependency scorer with pydantic change * Added internal scorer for parameter checking * Support for action dependencies added * Changed multi agent test to better show case where param checking helps * Added error check for providing no tools when using param checking * Remove unused parameter --------- Co-authored-by: Alan <alanzhang2021@gmail.com> Co-authored-by: Minh Pham <62208322+Mandolaro@users.noreply.github.com> Co-authored-by: JCamyre <jwcamry03@gmail.com>
* Refactor agent names * Minor fix * Minor Gemini fix * Minor fixes to e2etest and TraceSpan datatype
* fix logging for assert test trace * fix agent
* classifierscorer refactor, other e2es + uts additions * removed e2e * small updates * fix uts * remove print
* Raise warning for missing Org ID or API Key * fix tests * fix tests
* agent state into tracer * model dump state before and after * revert return * add field mappings * fix feedback
* feat: tracer http errors as separate part of error output * Update src/judgeval/common/tracer.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * chore: cleanup commented code --------- Co-authored-by: Joseph S Camyre <68767176+JCamyre@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update Docs link * Add updated Discord link and add links to subsections.
* Example Check and User Prompt * Test Fixes * Required Params for Custom Scorers
* Trace Datasets * comment fix * python class fix --------- Co-authored-by: Aaryan Divate <44125685+adivate2021@users.noreply.github.com>
Co-authored-by: JCamyre <jwcamry03@gmail.com>
Hello @justinsheu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here with a summary of this pull request titled "Staging -> Prod". Based on the changes included, this PR appears to be a merge from a staging branch containing several functional updates rather than just a simple production sync. The core changes involve significant enhancements to the tracing functionality, including the ability to track the state of objects before and after function calls, and improved error reporting. Additionally, the dataset handling has been updated to support storing and retrieving traces alongside examples. A new validation step has been added to the evaluation process to ensure examples have the necessary data for the selected scorers, and the JudgmentClient now enforces the presence of API key and organization ID during initialization. Various test files have been updated to reflect these changes.
Highlights
- Tracing Enhancements: Added the capability to record the state of tracked objects (`state_before`, `state_after`) within trace spans. Improved error capture in the tracer, including the ability to record HTTP-specific details like URL and status code for relevant exceptions.
- Dataset Trace Support: Modified the `EvalDataset` structure and the `EvalDatasetClient` to allow datasets to contain and manage traces in addition to evaluation examples. This enables associating traces directly with the data used for evaluation.
- Evaluation Parameter Validation: Implemented a check in the `run_evaluation` workflow that verifies if examples contain all parameters required by the configured scorers. If missing parameters are found, a warning is displayed, and the user is prompted to continue or exit.
- JudgmentClient Initialization: Updated the `JudgmentClient` constructor to explicitly check for the presence of `judgment_api_key` and `organization_id`, raising a `ValueError` if either is missing.
- Documentation and Test Updates: Updated README links and Table of Contents. Modified various end-to-end and unit tests to align with the changes in tracing, dataset handling, scorer requirements, and client initialization.
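To make the state-tracking idea in the highlights concrete, here is a minimal, self-contained sketch of capturing an object's state before and after a traced call. This is not judgeval's implementation; the names (`capture_state`, `observe`, `spans`) are illustrative stand-ins for the mechanism the summary describes.

```python
# Minimal sketch of "state_before"/"state_after" capture around a method call.
# NOT judgeval's actual code; names and shapes are illustrative only.
import copy

def capture_state(obj):
    """Snapshot an object's public attributes (a stand-in for model-dumping state)."""
    return {k: copy.deepcopy(v) for k, v in vars(obj).items() if not k.startswith("_")}

def observe(method):
    """Record state before and after a method call, like a trace span would."""
    def wrapper(self, *args, **kwargs):
        span = {"function": method.__name__, "state_before": capture_state(self)}
        result = method(self, *args, **kwargs)
        span["state_after"] = capture_state(self)
        wrapper.spans.append(span)
        return result
    wrapper.spans = []
    return wrapper

class Agent:
    def __init__(self):
        self.memory = []

    @observe
    def act(self, msg):
        self.memory.append(msg)
        return f"handled {msg}"

agent = Agent()
agent.act("hello")
# Agent.act.spans[0] now holds state_before (empty memory) and state_after.
```

The deep copy in `capture_state` matters: without it, `state_before` and `state_after` would alias the same mutable list and always show the final state.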
Changelog
- README.md
  - Updated Docs link from `judgment.mintlify.app/getting_started` to `docs.judgmentlabs.ai/introduction` (line 12).
  - Updated Discord invite link from `FMxHkYTtFE` to `ZCnSXYug` (line 16).
  - Restructured and expanded the Table of Contents (lines 31-52).
- src/e2etests/test_all_scorers.py
  - Imported `ExampleParams` from `judgeval.data.example` (line 29).
  - Added `required_params=[ExampleParams.INPUT, ExampleParams.ACTUAL_OUTPUT]` to `SentimentScorer` initialization in `test_sentiment_scorer` (line 686).
- src/e2etests/test_eval_operations.py
  - Replaced `HallucinationScorer` with `AnswerRelevancyScorer` in `run_eval_helper` (line 58).
  - Replaced `FaithfulnessScorer` and `AnswerRelevancyScorer` with a single `AnswerRelevancyScorer` in `test_assert_test` (lines 167-168), and updated the scorers list passed to `assert_test` (line 174).
- src/e2etests/test_judgee_traces_update.py
  - Renamed the key `entries` to `trace_spans` in trace data dictionaries used in tests (lines 184, 275, 466).
  - Updated comment to reflect the change from `entries` to `trace_spans` (line 357).
  - Updated keys within `local_trace_data` from `entries` to `trace_spans` (lines 491, 492).
- src/e2etests/test_tracer.py
  - Removed unused helper function `print_trace_hierarchy` (lines 594-603).
  - Updated comment about nested spans from `entries` to `trace_spans` (line 138).
- src/judgeval/common/tracer.py
  - Moved `import json` to a different section (lines 8, 18).
  - Removed unused imports `Type`, `TypeVar`, `Set`, `BaseModel`, `ScoringResult` (lines 32-37, 45).
  - Imported `ExcInfo` from `judgeval.common.utils` (line 63).
  - Changed default value for `project_name` in `TraceClient.__init__` from `"default_project"` to `None` (line 305).
  - Initialized `self.project_name` with a new UUID if the provided `project_name` is `None` (line 315).
  - Added `record_state_before` and `record_state_after` methods to `TraceClient` to record agent state (lines 505-525).
  - Changed the type hint for the `error` parameter in `TraceClient.record_error` from `Any` to `Dict[str, Any]` (line 560).
  - Renamed the key `entries` to `trace_spans` in the dictionary returned by `TraceClient.save` (line 599).
  - Updated the type hint for `exc_info` in `_capture_exception_for_trace` to use the new `ExcInfo` type alias (line 619).
  - Added logic in `_capture_exception_for_trace` to extract and record HTTP details (URL, status code) from exceptions if available (lines 639-647).
  - Changed default value for `project_name` in `Tracer.__init__` from `"default_project"` to `None` (line 948).
  - Initialized `self.project_name` with a new UUID if the provided `project_name` is `None` (line 976).
  - Modified the `identify` decorator to accept `track_state`, `track_attributes`, and `field_mappings` parameters, storing them in `self.class_identifiers` (lines 1109-1143).
  - Added helper methods `_capture_instance_state`, `_get_instance_state_if_tracked`, and `_conditionally_capture_and_record_state` to handle state capture based on the `identify` decorator configuration (lines 1148-1193).
  - Integrated calls to `_conditionally_capture_and_record_state` before and after the execution of the wrapped function/coroutine in the `observe` decorator wrappers (lines 1273-1274, 1286-1287, 1306-1307, 1319-1320, 1371-1372, 1384-1385, 1406-1407, 1419-1420).
  - Removed the `_handle_error` helper function (lines 1372-1377).
  - Integrated error capture logic directly into the traced async and sync functions (`traced_create_async`, `traced_response_create_async`, `traced_create_sync`, `traced_response_create_sync`) using `_capture_exception_for_trace` (lines 1508, 1524, 1556, 1571).
  - Updated `get_instance_prefixed_name` to access the identifier from the class configuration dictionary stored by the `identify` decorator (lines 2113-2114, 2120).
- src/judgeval/common/utils.py
  - Imported `TracebackType` from `types` (line 15).
  - Added `TypeAlias` definitions for `ExcInfo` and `OptExcInfo` (lines 787-788).
- src/judgeval/data/datasets/dataset.py
  - Imported `Optional` from `typing` (line 8).
  - Imported `Trace` from `judgeval.data` (line 10).
  - Added `traces: List[Trace]` field to the `EvalDataset` dataclass (line 16).
  - Updated `__init__` to accept an optional `traces` list and initialize `self.traces` (lines 25, 30).
  - Removed debug print statement from `__init__` (line 25).
  - Added `add_trace` method to append a `Trace` object to the `traces` list (lines 226-227).
  - Included `traces` in the `__str__` representation (line 315).
- src/judgeval/data/datasets/eval_dataset_client.py
  - Imported `Trace` from `judgeval.data` (line 16).
  - Included `traces` in the payload sent to the insert API endpoint (line 61).
  - Deserialized `traces` from the API response in the `pull` method (line 206).
- src/judgeval/data/trace.py
  - Added `state_before` and `state_after` fields (`Optional[Dict[str, Any]]`) to `TraceSpan` (lines 36-37).
  - Included `state_before` and `state_after` in the dictionary returned by `TraceSpan.model_dump` (lines 56-57).
  - Renamed the field `entries` to `trace_spans` in the `Trace` model (line 120).
- src/judgeval/judgment_client.py
  - Updated type hints for `judgment_api_key` and `organization_id` in `__init__` to `Optional[str]` (line 66).
  - Added validation checks in `__init__` to raise `ValueError` if `judgment_api_key` or `organization_id` are `None` (lines 68-73).
- src/judgeval/run_evaluation.py
  - Imported `json` (line 4).
  - Added `prompt_user` flag and logic to `check_examples` to prompt the user if missing parameters are found (lines 366-385).
  - Updated warning message formatting in `check_examples` using `rprint` and `json.dumps` (lines 374-377).
  - Updated `trace.entries[0]` to `trace.trace_spans[0]` in `run_trace_eval` to access the root span (line 423).
  - Added a call to `check_examples` before starting the API evaluation workflow in `_async_evaluation_workflow` (line 910).
- src/judgeval/scorers/judgeval_scorer.py
  - Imported `ExampleParams` from `judgeval.data.example` (line 15).
  - Added `required_params: Optional[List[ExampleParams]]` field to `JudgevalScorer` (line 42).
  - Added `required_params` parameter to `JudgevalScorer.__init__` and assigned it to `self.required_params` (lines 55, 92).
- src/judgeval/scorers/prompt_scorer.py
  - Imported `ExampleParams` from `judgeval.data.example` (line 33).
  - Added `required_params: Optional[List[ExampleParams]] = None` parameter to `PromptScorer.__init__` (line 68).
  - Passed `required_params` to the parent `JudgevalScorer` constructor (line 90).
- src/tests/common/test_tracer.py
  - Updated comment in `test_trace_client_nested_spans` from `entries` to `trace_spans` (line 138).
- src/tests/notification/test_notification_integration.py
  - Modified `mock_validate_api_key` fixture to also set the `JUDGMENT_ORG_ID` environment variable (line 24).
  - Updated `JudgmentClient` initialization in `test_notification_with_multiple_methods` and `test_notification_integration_direct` to explicitly include `organization_id="test_org_id"` (lines 274, 385).
  - Replaced `FaithfulnessScorer` with `AnswerRelevancyScorer` in rule conditions and scorer lists in tests (lines 294, 303, 405, 414).
  - Commented out payload verification logic related to rules/notifications in `test_notification_with_multiple_methods` (lines 432-442).
- src/tests/notification/test_notification_unit.py
  - Added a new `mock_validate_api_key` fixture that also sets `JUDGMENT_ORG_ID` (lines 21-29).
  - Added `@patch('judgeval.judgment_client.validate_api_key')` decorator to `TestNotificationWithJudgmentClient` (line 309).
  - Updated `test_judgment_client_with_rules_and_notification` to use the patched `validate_api_key` and explicitly include `organization_id="test_org_id"` in `JudgmentClient` initialization (lines 318, 340).
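The changelog mentions new `TypeAlias` definitions for `ExcInfo` and `OptExcInfo` in `src/judgeval/common/utils.py`. A plausible shape for these aliases, following the typeshed convention for `sys.exc_info()` (the actual definitions in judgeval may differ), is:

```python
# Plausible shape of the ExcInfo/OptExcInfo aliases described in the changelog.
# This follows the typeshed convention; judgeval's exact definitions may differ.
import sys
from types import TracebackType
from typing import Tuple, Type, Union

# On Python 3.10+ these can be annotated with typing.TypeAlias.
ExcInfo = Tuple[Type[BaseException], BaseException, TracebackType]
OptExcInfo = Union[ExcInfo, Tuple[None, None, None]]

def current_exc_info() -> "OptExcInfo":
    """Return the active exception triple, or (None, None, None) outside a handler."""
    return sys.exc_info()

try:
    raise ValueError("boom")
except ValueError:
    info = current_exc_info()
# info is (ValueError, the instance, a traceback) — the ExcInfo shape.
```

An alias like this lets functions such as `_capture_exception_for_trace` take the whole `sys.exc_info()` triple as one typed parameter instead of three loosely typed ones.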
Code Review
This pull request introduces several significant enhancements and refactorings across the codebase. Key changes include the introduction of state tracking within traces, the ability to specify required parameters for scorers, and dataset support for traces. The data model for traces has been updated by renaming `entries` to `trace_spans`, and this change has been consistently applied. Additionally, validation for API keys and organization IDs has been improved in the `JudgmentClient`.
The PR title "Staging -> Prod" and the empty description body provide limited context about the overall goals of these changes. It would be beneficial to have more descriptive PR titles and bodies in the future to aid reviewers.
Overall, the changes are well-implemented, but there are a couple of areas in the tests and core library code that could be improved.
Summary of Findings
- Abrupt Exit in Library Code: The `check_examples` function in `src/judgeval/run_evaluation.py` uses `sys.exit(0)`, which can terminate the entire Python process. This is generally discouraged in library code as it can be an unexpected side effect for users. Raising a custom exception would be more idiomatic.
- Commented-Out Test Assertions: In `src/tests/notification/test_notification_integration.py`, several assertions related to payload verification are commented out with a TODO. These should be addressed to ensure full test coverage and prevent potential regressions.
- Default Project Name Generation: The change to generate a UUID for `project_name` in `src/judgeval/common/tracer.py` if not provided ensures uniqueness, which is good. A minor consideration for developer experience could be whether a more human-readable default (e.g., including a timestamp) might be beneficial for quick local tests, though UUIDs are functionally sound.
- Improved Input Validation: The added `ValueError` checks for a `None` API key or organization ID in `src/judgeval/judgment_client.py` significantly improve the robustness and user-friendliness of client initialization.
- Data Model Refactoring: The renaming of `entries` to `trace_spans` in the trace data model is a notable change and has been consistently applied across tests and an E2E test file. This improves clarity in the data model.
- New Feature: State Tracking in Traces: The addition of `record_state_before` and `record_state_after` methods and the associated `identify` decorator enhancements in `src/judgeval/common/tracer.py` introduce valuable state tracking capabilities.
- New Feature: Required Parameters for Scorers: The introduction of `required_params` in scorers allows for better validation of example data, which is a good enhancement for scorer robustness.
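The required-params validation singled out in the findings can be sketched in a few lines. This is a simplified, self-contained model of the idea; judgeval's real `Example` and scorer classes are richer than this, and `missing_params` is an illustrative helper name, not an actual judgeval function.

```python
# Simplified model of the "required params for scorers" check; illustrative only.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class ExampleParams(Enum):
    INPUT = "input"
    ACTUAL_OUTPUT = "actual_output"

@dataclass
class Example:
    input: Optional[str] = None
    actual_output: Optional[str] = None

def missing_params(example: Example, required: List[ExampleParams]) -> List[str]:
    """Return the names of required fields the example does not provide."""
    return [p.value for p in required if getattr(example, p.value) is None]

ex = Example(input="What is 2+2?")
missing = missing_params(ex, [ExampleParams.INPUT, ExampleParams.ACTUAL_OUTPUT])
# missing == ["actual_output"] — the scorer can't run without an actual output.
```

Checking this up front, as the PR does before the API evaluation workflow, surfaces misconfigured examples with a clear message instead of a confusing downstream scorer failure.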
Merge Readiness
This pull request introduces valuable features and improvements. However, there are a couple of medium-severity issues that should be addressed before merging:
- The use of `sys.exit(0)` in `src/judgeval/run_evaluation.py` should be reconsidered in favor of raising an exception for better library behavior.
- The commented-out assertions in `src/tests/notification/test_notification_integration.py` need to be fixed or have a clear plan for resolution.
Once these points are addressed, the PR should be in good shape for merging. I am unable to approve pull requests, so please ensure these changes are reviewed and approved by another maintainer.
```python
if prompt_user:
    user_input = input("Do you want to continue? (y/n)")
    if user_input.lower() != "y":
        sys.exit(0)
    else:
        rprint("[green]Continuing...[/green]")
```
The use of `sys.exit(0)` here might be too abrupt if `run_evaluation` (and by extension, `check_examples`) is intended to be used as a library function. Exiting the entire Python process can be an unexpected side effect for library users.
Would it be more appropriate to raise a custom exception (e.g., `MissingExampleParametersError`) that the caller can catch and handle? This would give more control to the user of the library.
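One way the suggested alternative could look — note that `MissingExampleParametersError` is the reviewer's hypothetical name, not an existing judgeval class:

```python
# Hypothetical alternative to sys.exit(0), per the review suggestion above.
# MissingExampleParametersError is an illustrative name, not a judgeval class.
class MissingExampleParametersError(Exception):
    """Raised when examples lack fields required by the configured scorers."""

def check_examples(missing_by_example: dict) -> None:
    # Surface the problem to the caller instead of killing the process.
    if missing_by_example:
        raise MissingExampleParametersError(
            f"Examples are missing required scorer parameters: {missing_by_example}"
        )

try:
    check_examples({"example-1": ["actual_output"]})
except MissingExampleParametersError as e:
    caught = str(e)
```

The caller then decides whether to abort, skip the offending examples, or prompt the user, which is the flexibility a `sys.exit(0)` takes away.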
```python
# assert "rules" in payload
# rule_data = payload["rules"][0]
# assert "notification" in rule_data
# assert rule_data["notification"]["enabled"] is True
# assert len(rule_data["notification"]["communication_methods"]) == 4
# assert "slack" in rule_data["notification"]["communication_methods"]
# assert "email" in rule_data["notification"]["communication_methods"]
# assert "broadcast_slack" in rule_data["notification"]["communication_methods"]
# assert "broadcast_email" in rule_data["notification"]["communication_methods"]
# assert len(rule_data["notification"]["email_addresses"]) == 2
```
These assertions for payload verification are commented out with a TODO. Could you please clarify the reason for this?
If these assertions are important for verifying the correctness of the notification payload, it would be ideal to have them active. If there's an issue preventing them from passing, it might be good to create a ticket to track fixing it. Leaving assertions commented out can lead to them being forgotten and potentially mask issues in the future.
```diff
@@ -307,7 +302,7 @@ def __init__(
     tracer: Optional["Tracer"],
     trace_id: Optional[str] = None,
     name: str = "default",
-    project_name: str = "default_project",
+    project_name: str = None,
```
Changing the default `project_name` from a fixed string to `None` (and then to a UUID if `None`) is a good improvement for ensuring uniqueness when a project name isn't explicitly provided. This helps avoid unintentional overwrites or mixing of data if users forget to specify a project name.
Just a thought: while UUIDs ensure uniqueness, they are not very human-readable. For local development or quick tests where a user might not specify a project name, perhaps a more descriptive default like `default_project_<timestamp>` or similar could be considered if human-readability of default project names in the backend/UI is a concern. However, UUIDs are perfectly fine for ensuring data integrity.
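For comparison, both default-naming strategies discussed above can be sketched side by side. The UUID form matches what the PR does; the timestamp form is only the reviewer's idea, and `default_project_name` is an illustrative helper, not judgeval code:

```python
# Illustration of the two default-naming strategies discussed above.
# judgeval uses the UUID form; the timestamp form is hypothetical.
import uuid
from datetime import datetime, timezone

def default_project_name(human_readable: bool = False) -> str:
    if human_readable:
        # e.g. "default_project_20240501_143000" — easy to eyeball in a UI
        return f"default_project_{datetime.now(timezone.utc):%Y%m%d_%H%M%S}"
    # e.g. "3f2b8c1e-..." — guaranteed unique, but opaque
    return str(uuid.uuid4())

uuid_name = default_project_name()
friendly_name = default_project_name(human_readable=True)
```

Note the timestamp form trades uniqueness for readability: two runs in the same second would collide, which is exactly the failure mode the UUID default avoids.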
```python
# Check if API key is None
if judgment_api_key is None:
    raise ValueError("JUDGMENT_API_KEY cannot be None. Please provide a valid API key or set the JUDGMENT_API_KEY environment variable.")

# Check if organization ID is None
if organization_id is None:
    raise ValueError("JUDGMENT_ORG_ID cannot be None. Please provide a valid organization ID or set the JUDGMENT_ORG_ID environment variable.")
```
Adding these explicit checks for `judgment_api_key` and `organization_id` being `None` is a great improvement for robustness. Raising a `ValueError` provides clear feedback to the user if these critical parameters are missing.
This proactive validation will help prevent downstream errors that might be harder to debug.
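The error messages reference environment variables, which suggests the arguments fall back to `JUDGMENT_API_KEY`/`JUDGMENT_ORG_ID` before the `None` check. Exactly how `JudgmentClient` reads those variables is an assumption here; this standalone sketch just shows the common resolve-then-validate pattern the checks pair with:

```python
# Sketch of credential resolution with env-var fallback plus the None checks
# shown above. How JudgmentClient actually reads env vars is an assumption.
import os
from typing import Optional, Tuple

def resolve_credentials(
    judgment_api_key: Optional[str] = None,
    organization_id: Optional[str] = None,
) -> Tuple[str, str]:
    # Explicit arguments win; environment variables are the fallback.
    judgment_api_key = judgment_api_key or os.getenv("JUDGMENT_API_KEY")
    organization_id = organization_id or os.getenv("JUDGMENT_ORG_ID")
    if judgment_api_key is None:
        raise ValueError(
            "JUDGMENT_API_KEY cannot be None. Please provide a valid API key "
            "or set the JUDGMENT_API_KEY environment variable."
        )
    if organization_id is None:
        raise ValueError(
            "JUDGMENT_ORG_ID cannot be None. Please provide a valid organization "
            "ID or set the JUDGMENT_ORG_ID environment variable."
        )
    return judgment_api_key, organization_id

key, org = resolve_credentials("key-123", "org-456")
```

Failing fast here, at construction time, is what makes the later trace and evaluation calls debuggable: the error points at the missing credential rather than at an opaque HTTP 401 deep in a workflow.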
LGTM
No description provided.