Async experiment return #264
Conversation
Hello @Mandolaro, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary for this pull request authored by @Mandolaro.
The core intent of this PR is to address a user experience issue where running evaluations asynchronously (`async_execution=True`) currently returns `None`. This makes it difficult for users to wait for the evaluation to complete or integrate it into asynchronous workflows.
The solution implemented is to change the `run_evaluation` method to return an `asyncio.Task` object when `async_execution=True`. This task can then be awaited by the user, providing a similar experience to the synchronous execution where the results are returned directly, while still allowing for non-blocking operations in an async context.
The changes involve adding the necessary asynchronous logic to poll the backend for evaluation status and results, introducing new API endpoints for this purpose, and adding comprehensive tests and a demo to showcase the new functionality.
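To make the return-value change concrete, here is a minimal, hedged usage sketch. The import path, client construction, parameter names, and the `my_examples`/`my_scorers` placeholders are illustrative assumptions, not code taken from this PR.

```python
import asyncio

from judgeval import JudgmentClient  # assumed import path


async def main() -> None:
    client = JudgmentClient()
    # With async_execution=True, run_evaluation is expected to return an
    # asyncio.Task instead of blocking until results are ready.
    task = client.run_evaluation(
        examples=my_examples,            # hypothetical pre-built Example list
        scorers=my_scorers,              # hypothetical API-backed scorers
        project_name="async_demo",       # illustrative project name
        eval_run_name="async_demo_run",  # illustrative run name
        async_execution=True,
    )
    results = await task  # List[ScoringResult], matching the synchronous path
    print(f"Received {len(results)} scoring results")


asyncio.run(main())
```

Awaiting the task mirrors the synchronous call's return value while leaving the event loop free for other work in the meantime.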
Highlights
- Async Evaluation Return Value: The `run_evaluation` method now returns an `asyncio.Task` object when `async_execution=True`, allowing users to await the completion of the asynchronous evaluation and retrieve the results.
- Backend Polling Mechanism: New logic has been added to poll the Judgment API for the status and results of asynchronous evaluation runs until they are completed or fail.
- New API Endpoints: Constants for new backend API endpoints (`/get_evaluation_status/` and `/eval/fetch/`) have been added to support the polling mechanism.
- Async Demo and Tests: A new demo file (`async_evaluation_example.py`) and new end-to-end and unit tests have been added to demonstrate and verify the new asynchronous evaluation workflow and the awaitable task return.
- Local Scorer Validation: Added a `ValueError` check to prevent using local scorers with `async_execution=True`, as this combination is not supported (a minimal sketch follows this list).
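As referenced in the last highlight, a minimal sketch of what such a guard can look like is shown below. `LocalScorer` is only a hypothetical marker type here, since the exact way the library distinguishes local scorers from API scorers is not shown in this summary.

```python
from typing import Iterable


class LocalScorer:
    """Hypothetical marker type standing in for judgeval's local scorers."""


def validate_async_scorers(scorers: Iterable[object], async_execution: bool) -> None:
    # Raise early instead of only logging, mirroring the behavior change
    # described above for async_execution=True with local scorers.
    if async_execution and any(isinstance(s, LocalScorer) for s in scorers):
        raise ValueError(
            "Local scorers are not supported with async_execution=True; "
            "use API-backed scorers or run the evaluation synchronously."
        )
```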
Changelog
- src/demo/async_evaluation_example.py
  - Added a new demo script to illustrate using the async evaluation feature.
  - Includes examples for directly awaiting the evaluation task and running other work concurrently.
- src/e2etests/test_async_execution.py
  - Added a new file containing end-to-end tests for the async evaluation functionality.
  - Tests cover direct awaiting, multiple scorers, concurrent tasks, pulling results, and task cancellation.
- src/judgeval/constants.py
  - Added `JUDGMENT_GET_EVAL_STATUS_API_URL` constant for the new evaluation status endpoint.
  - Added `JUDGMENT_EVAL_FETCH_API_URL` constant for fetching completed evaluation results.
- src/judgeval/run_evaluation.py
  - Modified `run_eval` to return `Union[List[ScoringResult], asyncio.Task]`.
  - Implemented `async def get_evaluation_status` to check evaluation status via API.
  - Implemented `async def _poll_evaluation_until_complete` to poll the backend until evaluation is done and fetch results.
  - Updated the `async_execution=True` path in `run_eval` to create and return an `asyncio.Task` that runs the polling workflow (see the sketch after this changelog).
  - Added a `ValueError` if local scorers are used with `async_execution=True` (previously only logged an error).
  - Added `warning` to logger imports.
- src/tests/test_async_execution.py
  - Added a new file containing unit tests for the async execution behavior of `JudgmentClient.run_evaluation`.
  - Tests verify that a `Task` is returned, that it is awaitable, and that errors are handled.
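The `run_eval` change described above can be pictured roughly as follows. This is a hedged sketch only: it assumes `_poll_evaluation_until_complete` takes the same identifying arguments as the polling helper shown later in this review, and that `run_eval` is called while an event loop is already running.

```python
import asyncio


def run_eval_async_branch(evaluation_run) -> asyncio.Task:
    # Wrap the PR's polling coroutine in a Task so callers can await it,
    # cancel it, or schedule other work while the backend evaluates.
    coro = _poll_evaluation_until_complete(
        eval_name=evaluation_run.eval_name,
        project_name=evaluation_run.project_name,
        judgment_api_key=evaluation_run.judgment_api_key,
        organization_id=evaluation_run.organization_id,
    )
    # asyncio.create_task requires a running event loop, which is the
    # expected situation when async_execution=True is used from async code.
    return asyncio.create_task(coro)
```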
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Async runs now yield,
A task, a future field.
Await the score,
Then ask for more,
No longer left concealed.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces a significant improvement by allowing asynchronous evaluation runs to return an `asyncio.Task` object. This greatly enhances the user experience for async operations. The implementation includes a polling mechanism to check evaluation status and fetch results, along with new demo examples and comprehensive E2E and unit tests.
The core logic for async handling in `run_evaluation.py` is well-structured, particularly the polling and result validation. The new example files clearly demonstrate how to use this feature.
I have a couple of points for discussion, primarily around the polling logic in `run_evaluation.py`, detailed in the comments below.
Summary of Findings
- Potentially redundant or outdated polling function: The `wait_for_evaluation` function in `src/judgeval/run_evaluation.py` seems to duplicate logic from `_poll_evaluation_until_complete` and has less robust result processing. Its role and necessity should be clarified.
- Missing overall timeout in async polling: The primary polling function `_poll_evaluation_until_complete` in `src/judgeval/run_evaluation.py` lacks an overall timeout for the evaluation completion, which could lead to indefinite polling if a backend evaluation gets stuck (one possible mitigation is sketched below).
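One possible mitigation for the second finding (not part of this PR) is to bound the polling coroutine with `asyncio.wait_for`, so a stuck backend evaluation surfaces as an error instead of polling forever:

```python
import asyncio


async def poll_with_timeout(poll_coro, timeout_seconds: float = 3600.0):
    """Wrap a polling coroutine so it cannot run indefinitely."""
    try:
        return await asyncio.wait_for(poll_coro, timeout=timeout_seconds)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(
            f"Evaluation did not complete within {timeout_seconds} seconds"
        ) from exc
```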
Merge Readiness
The pull request is a valuable addition and is generally well-implemented. However, there are a couple of medium-severity concerns regarding the polling logic in `src/judgeval/run_evaluation.py` (potential redundancy/outdated code and lack of an overall timeout in the primary polling loop) that should be addressed or clarified before merging. Addressing these would improve the robustness and maintainability of the async evaluation feature. I am not authorized to approve pull requests, so please ensure further review and approval from authorized team members after addressing these points.
src/judgeval/run_evaluation.py
Outdated
```python
async def wait_for_evaluation(eval_name: str, project_name: str, judgment_api_key: str, organization_id: str, timeout_seconds: int = 3600, poll_interval_seconds: int = 5) -> List[ScoringResult]:
    """
    Wait for an asynchronous evaluation to complete by polling the status endpoint.

    Args:
        eval_name (str): Name of the evaluation run
        project_name (str): Name of the project
        judgment_api_key (str): API key for authentication
        organization_id (str): Organization ID for the evaluation
        timeout_seconds (int, optional): Maximum time to wait in seconds. Defaults to 3600 (1 hour).
        poll_interval_seconds (int, optional): Time between status checks in seconds. Defaults to 5.

    Returns:
        List[ScoringResult]: The evaluation results when complete

    Raises:
        TimeoutError: If the evaluation doesn't complete within the timeout period
        JudgmentAPIError: If there's an API error or the evaluation fails
    """
    start_time = time.time()

    while time.time() - start_time < timeout_seconds:
        status_response = await get_evaluation_status(
            eval_name=eval_name,
            project_name=project_name,
            judgment_api_key=judgment_api_key,
            organization_id=organization_id
        )

        status = status_response.get("status")

        if status == "completed":
            # Evaluation is complete, extract and convert results
            results_data = status_response.get("results", {})
            examples_data = results_data.get("examples", [])

            # Create ScoringResult objects from the raw data
            scoring_results = []
            for example_data in examples_data:
                scorer_data_list = []
                for raw_scorer_data in example_data.get("scorer_data", []):
                    scorer_data_list.append(ScorerData(**raw_scorer_data))

                # Create Example from example data (excluding scorer_data)
                example_dict = {k: v for k, v in example_data.items() if k != "scorer_data"}
                example = Example(**example_dict)

                # Create ScoringResult
                scoring_result = ScoringResult(
                    success=True,  # Assume success if we have results
                    scorers_data=scorer_data_list,
                    data_object=example
                )
                scoring_results.append(scoring_result)

            return scoring_results

        elif status == "failed":
            # Evaluation failed
            error_message = status_response.get("error", "Unknown error")
            error(f"Evaluation failed: {error_message}")
            raise JudgmentAPIError(f"Evaluation failed: {error_message}")

        # Status is either "pending" or "running", continue polling
        info(f"Evaluation status: {status}. Waiting for completion...")
        await asyncio.sleep(poll_interval_seconds)

    # If we get here, we've timed out
    error(f"Evaluation timed out after {timeout_seconds} seconds")
    raise TimeoutError(f"Evaluation timed out after {timeout_seconds} seconds")
```
The function `wait_for_evaluation` appears to duplicate some of the polling logic found in `_poll_evaluation_until_complete` (defined later in this file). It's not directly used by the main `run_eval` asynchronous path, which relies on `_poll_evaluation_until_complete`.
Additionally, the way `ScoringResult` objects are constructed here (e.g., `success=True` is assumed on line 573 if results are present) is less robust than the logic in `_poll_evaluation_until_complete` (which calculates success based on individual `scorer_data` on line 726).
Could you clarify the purpose of `wait_for_evaluation`? If it's intended for a different use case, ensuring its logic is consistent and robust would be beneficial. If it's superseded by `_poll_evaluation_until_complete`, perhaps it could be removed or refactored to reduce redundancy?
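For illustration, deriving success from the individual scorer results could look like the sketch below; the `success` attribute on `ScorerData` is assumed from the reviewer's description of `_poll_evaluation_until_complete`, not verified against the code.

```python
from typing import List


def derive_success(scorer_data_list: List["ScorerData"]) -> bool:
    # An example only counts as successful when every scorer reports success,
    # rather than assuming success whenever results are present.
    return bool(scorer_data_list) and all(sd.success for sd in scorer_data_list)
```

Inside the `status == "completed"` branch above, `success=derive_success(scorer_data_list)` would then replace the hard-coded `success=True`.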
LGTM
📝 Summary
Allow asynchronous evaluation runs to return an awaitable Task object, instead of only supporting result retrieval through the synchronous version.
🎯 Purpose
Our current async run evaluation returns nothing when async execution is enabled, which is bad UX. This PR returns an awaitable task object so users get the same behavior as the synchronous run evaluation.
Merge with this: https://github.com/JudgmentLabs/judgment/pull/342
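A hedged sketch of the concurrent-use pattern this enables is shown below; the `other_work` coroutine and the client/example/scorer arguments are illustrative placeholders, not code from the demo.

```python
import asyncio


async def other_work() -> str:
    await asyncio.sleep(1)  # stand-in for unrelated async work
    return "other work finished"


async def run_both(client, examples, scorers):
    eval_task = client.run_evaluation(
        examples=examples,
        scorers=scorers,
        async_execution=True,  # non-blocking: returns an awaitable asyncio.Task
    )
    # The evaluation polls the backend in the background while other
    # coroutines make progress on the same event loop.
    results, note = await asyncio.gather(eval_task, other_work())
    print(note, f"- received {len(results)} scoring results")
```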
🎥 Demo of Changes
https://judgmentlabs.slack.com/files/U08CHNDBYUU/F08SV2DF37V/video1479408660.mp4
🧪 Testing
Ran the E2E tests and added new tests.
✅ Checklist
📌 Linear Issue
https://linear.app/judgment/issue/JUD-921/add-async-evaluation-object
✏️ Additional Notes