Async experiment return #264


Merged · 13 commits merged into main from Async_Experiment_Return on May 21, 2025

Conversation

@Mandolaro (Collaborator) commented on May 20, 2025

📝 Summary

Allow run_evaluation to return an awaitable Task object when running async evaluations, instead of only returning results from the sync version.

🎯 Purpose

Our current async run evaluation returns nothing when async execution is enabled, which is poor UX. This PR introduces an awaitable task object so the user gets the same behavior as a sync run evaluation.

Merge with this: https://github.com/JudgmentLabs/judgment/pull/342

🎥 Demo of Changes

https://judgmentlabs.slack.com/files/U08CHNDBYUU/F08SV2DF37V/video1479408660.mp4

🧪 Testing

Ran the E2E tests and added a new test.

✅ Checklist

  • Self-review
  • Video demo of changes
  • Unit Tests and CI/CD tests are passing
  • Reviewers assigned

📌 Linear Issue

https://linear.app/judgment/issue/JUD-921/add-async-evaluation-object

✏️ Additional Notes

@gemini-code-assist bot (Contributor) left a comment:

Hello @Mandolaro, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary for this pull request authored by @Mandolaro.

The core intent of this PR is to address a user experience issue where running evaluations asynchronously (async_execution=True) currently returns None. This makes it difficult for users to wait for the evaluation to complete or integrate it into asynchronous workflows.

The solution implemented is to change the run_evaluation method to return an asyncio.Task object when async_execution=True. This task can then be awaited by the user, providing a similar experience to the synchronous execution where the results are returned directly, while still allowing for non-blocking operations in an async context.
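As a rough usage sketch of that behavior (the import paths, scorer class, and parameter names below are illustrative assumptions based on this summary, not taken verbatim from the PR):

```python
import asyncio

from judgeval import JudgmentClient              # assumed client entry point
from judgeval.data import Example                # assumed example type
from judgeval.scorers import FaithfulnessScorer  # illustrative scorer choice

async def main() -> None:
    client = JudgmentClient()
    example = Example(
        input="What is the capital of France?",
        actual_output="Paris",
    )

    # With async_execution=True, run_evaluation now returns an asyncio.Task
    # instead of None, so the results can be awaited like a sync run.
    task = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        async_execution=True,
    )
    results = await task
    print(results)

asyncio.run(main())
```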

The changes involve adding the necessary asynchronous logic to poll the backend for evaluation status and results, introducing new API endpoints for this purpose, and adding comprehensive tests and a demo to showcase the new functionality.

Highlights

  • Async Evaluation Return Value: The run_evaluation method now returns an asyncio.Task object when async_execution=True, allowing users to await the completion of the asynchronous evaluation and retrieve the results (see the sketch after this list).
  • Backend Polling Mechanism: New logic has been added to poll the Judgment API for the status and results of asynchronous evaluation runs until they are completed or fail.
  • New API Endpoints: Constants for new backend API endpoints (/get_evaluation_status/ and /eval/fetch/) have been added to support the polling mechanism.
  • Async Demo and Tests: A new demo file (async_evaluation_example.py) and new end-to-end and unit tests have been added to demonstrate and verify the new asynchronous evaluation workflow and the awaitable task return.
  • Local Scorer Validation: Added a ValueError check to prevent using local scorers with async_execution=True, as this combination is not supported.
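A sketch of the non-blocking overlap pattern these highlights describe, under the same illustrative assumptions as the snippet above (the client, example, and scorer objects are passed in; do_other_work is a placeholder):

```python
import asyncio

async def do_other_work() -> str:
    # Placeholder for unrelated async work that overlaps with the evaluation.
    await asyncio.sleep(1)
    return "other work done"

async def evaluate_without_blocking(client, example, scorer):
    # Kick off the evaluation; the returned asyncio.Task runs in the background.
    eval_task = client.run_evaluation(
        examples=[example],
        scorers=[scorer],
        async_execution=True,
    )
    # Overlap local work with the remote evaluation, then collect both results.
    other, results = await asyncio.gather(do_other_work(), eval_task)
    return other, results
```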

Changelog

  • src/demo/async_evaluation_example.py
    • Added a new demo script to illustrate using the async evaluation feature.
    • Includes examples for directly awaiting the evaluation task and running other work concurrently.
  • src/e2etests/test_async_execution.py
    • Added a new file containing end-to-end tests for the async evaluation functionality.
    • Tests cover direct awaiting, multiple scorers, concurrent tasks, pulling results, and task cancellation.
  • src/judgeval/constants.py
    • Added JUDGMENT_GET_EVAL_STATUS_API_URL constant for the new evaluation status endpoint.
    • Added JUDGMENT_EVAL_FETCH_API_URL constant for fetching completed evaluation results.
  • src/judgeval/run_evaluation.py
    • Modified run_eval to return Union[List[ScoringResult], asyncio.Task].
    • Implemented async def get_evaluation_status to check evaluation status via API (a rough sketch follows this changelog).
    • Implemented async def _poll_evaluation_until_complete to poll the backend until evaluation is done and fetch results.
    • Updated the async_execution=True path in run_eval to create and return an asyncio.Task that runs the polling workflow.
    • Added a ValueError if local scorers are used with async_execution=True (previously only logged an error).
    • Added warning to logger imports.
  • src/tests/test_async_execution.py
    • Added a new file containing unit tests for the async execution behavior of JudgmentClient.run_evaluation.
    • Tests verify that a Task is returned, that it is awaitable, and that errors are handled.
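For orientation, a rough sketch of what a status check against the new endpoint could look like. The HTTP client (httpx), header names, and query parameters here are assumptions for illustration; the PR's actual implementation may differ:

```python
import httpx  # illustrative choice of async HTTP client

from judgeval.constants import JUDGMENT_GET_EVAL_STATUS_API_URL

async def get_evaluation_status_sketch(
    eval_name: str,
    project_name: str,
    judgment_api_key: str,
    organization_id: str,
) -> dict:
    # Query the status endpoint and return its JSON payload; per this PR the
    # payload is expected to carry a "status" key and, on completion, "results".
    async with httpx.AsyncClient() as client:
        response = await client.get(
            JUDGMENT_GET_EVAL_STATUS_API_URL,
            params={"eval_name": eval_name, "project_name": project_name},
            headers={
                "Authorization": f"Bearer {judgment_api_key}",  # auth scheme is an assumption
                "X-Organization-Id": organization_id,           # header name is an assumption
            },
        )
        response.raise_for_status()
        return response.json()
```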


Async runs now yield,
A task, a future field.
Await the score,
Then ask for more,
No longer left concealed.


@gemini-code-assist bot (Contributor) left a review:

Code Review

This pull request introduces a significant improvement by allowing asynchronous evaluation runs to return an asyncio.Task object. This greatly enhances the user experience for async operations. The implementation includes a polling mechanism to check evaluation status and fetch results, along with new demo examples and comprehensive E2E and unit tests.

The core logic for async handling in run_evaluation.py is well-structured, particularly the polling and result validation. The new example files clearly demonstrate how to use this feature.

I have a couple of points for discussion, primarily around the polling logic in run_evaluation.py, detailed in the comments below.

Summary of Findings

  • Potentially redundant or outdated polling function: The wait_for_evaluation function in src/judgeval/run_evaluation.py seems to duplicate logic from _poll_evaluation_until_complete and has less robust result processing. Its role and necessity should be clarified.
  • Missing overall timeout in async polling: The primary polling function _poll_evaluation_until_complete in src/judgeval/run_evaluation.py lacks an overall timeout for the evaluation completion, which could lead to indefinite polling if a backend evaluation gets stuck.
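One possible shape for the missing overall timeout, as a sketch only (the wrapper name is illustrative; it simply bounds the existing polling coroutine with asyncio.wait_for):

```python
import asyncio

async def poll_with_overall_timeout(poll_coro, overall_timeout_seconds: float = 3600.0):
    # Bound the total time spent polling without touching the poll loop itself.
    try:
        return await asyncio.wait_for(poll_coro, timeout=overall_timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"Evaluation did not complete within {overall_timeout_seconds} seconds"
        )
```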

Merge Readiness

The pull request is a valuable addition and is generally well-implemented. However, there are a couple of medium-severity concerns regarding the polling logic in src/judgeval/run_evaluation.py (potential redundancy/outdated code and lack of an overall timeout in the primary polling loop) that should be addressed or clarified before merging. Addressing these would improve the robustness and maintainability of the async evaluation feature. I am not authorized to approve pull requests, so please ensure further review and approval from authorized team members after addressing these points.

Comment on lines 524 to 593
```python
async def wait_for_evaluation(eval_name: str, project_name: str, judgment_api_key: str, organization_id: str, timeout_seconds: int = 3600, poll_interval_seconds: int = 5) -> List[ScoringResult]:
    """
    Wait for an asynchronous evaluation to complete by polling the status endpoint.

    Args:
        eval_name (str): Name of the evaluation run
        project_name (str): Name of the project
        judgment_api_key (str): API key for authentication
        organization_id (str): Organization ID for the evaluation
        timeout_seconds (int, optional): Maximum time to wait in seconds. Defaults to 3600 (1 hour).
        poll_interval_seconds (int, optional): Time between status checks in seconds. Defaults to 5.

    Returns:
        List[ScoringResult]: The evaluation results when complete

    Raises:
        TimeoutError: If the evaluation doesn't complete within the timeout period
        JudgmentAPIError: If there's an API error or the evaluation fails
    """
    start_time = time.time()

    while time.time() - start_time < timeout_seconds:
        status_response = await get_evaluation_status(
            eval_name=eval_name,
            project_name=project_name,
            judgment_api_key=judgment_api_key,
            organization_id=organization_id
        )

        status = status_response.get("status")

        if status == "completed":
            # Evaluation is complete, extract and convert results
            results_data = status_response.get("results", {})
            examples_data = results_data.get("examples", [])

            # Create ScoringResult objects from the raw data
            scoring_results = []
            for example_data in examples_data:
                scorer_data_list = []
                for raw_scorer_data in example_data.get("scorer_data", []):
                    scorer_data_list.append(ScorerData(**raw_scorer_data))

                # Create Example from example data (excluding scorer_data)
                example_dict = {k: v for k, v in example_data.items() if k != "scorer_data"}
                example = Example(**example_dict)

                # Create ScoringResult
                scoring_result = ScoringResult(
                    success=True,  # Assume success if we have results
                    scorers_data=scorer_data_list,
                    data_object=example
                )
                scoring_results.append(scoring_result)

            return scoring_results

        elif status == "failed":
            # Evaluation failed
            error_message = status_response.get("error", "Unknown error")
            error(f"Evaluation failed: {error_message}")
            raise JudgmentAPIError(f"Evaluation failed: {error_message}")

        # Status is either "pending" or "running", continue polling
        info(f"Evaluation status: {status}. Waiting for completion...")
        await asyncio.sleep(poll_interval_seconds)

    # If we get here, we've timed out
    error(f"Evaluation timed out after {timeout_seconds} seconds")
    raise TimeoutError(f"Evaluation timed out after {timeout_seconds} seconds")
```
@gemini-code-assist bot (Contributor) commented:

Severity: medium

The function wait_for_evaluation appears to duplicate some of the polling logic found in _poll_evaluation_until_complete (defined later in this file). It's not directly used by the main run_eval asynchronous path, which relies on _poll_evaluation_until_complete.

Additionally, the way ScoringResult objects are constructed here (e.g., success=True is assumed on line 573 if results are present) is less robust than the logic in _poll_evaluation_until_complete (which calculates success based on individual scorer_data on line 726).

Could you clarify the purpose of wait_for_evaluation? If it's intended for a different use case, ensuring its logic is consistent and robust would be beneficial. If it's superseded by _poll_evaluation_until_complete, perhaps it could be removed or refactored to reduce redundancy?
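For reference, a minimal sketch of the more robust success derivation being suggested, assuming the judgeval.data import path and that ScorerData exposes a success flag as it appears to in _poll_evaluation_until_complete:

```python
from judgeval.data import Example, ScorerData, ScoringResult  # import path is an assumption

def build_scoring_result(example_data: dict) -> ScoringResult:
    # Sketch only: derive success from the individual scorer results instead of
    # hard-coding success=True, mirroring _poll_evaluation_until_complete.
    scorer_data_list = [ScorerData(**raw) for raw in example_data.get("scorer_data", [])]
    success = all(sd.success for sd in scorer_data_list) if scorer_data_list else False
    example = Example(**{k: v for k, v in example_data.items() if k != "scorer_data"})
    return ScoringResult(
        success=success,
        scorers_data=scorer_data_list,
        data_object=example,
    )
```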

@JCamyre (Collaborator) left a comment:

LGTM

@JCamyre enabled auto-merge (squash) May 21, 2025 19:31
@JCamyre merged commit c4aa02b into main May 21, 2025
2 of 3 checks passed
@JCamyre deleted the Async_Experiment_Return branch May 21, 2025 19:32