Async experiment return #264


Merged · 13 commits merged into main from Async_Experiment_Return on May 21, 2025

Conversation

@Mandolaro (Collaborator) commented on May 20, 2025

📝 Summary

Allow run_evaluation to return an awaitable Task object when running async evaluations, instead of only returning results from the sync version.

🎯 Purpose

Our current async run evaluation returns nothing when async execution is enabled, which is poor UX. This PR introduces an awaitable task object so the user gets the same behavior as a sync run evaluation.

Merge with this: https://github.com/JudgmentLabs/judgment/pull/342

🎥 Demo of Changes

https://judgmentlabs.slack.com/files/U08CHNDBYUU/F08SV2DF37V/video1479408660.mp4

🧪 Testing

Ran the E2E tests and added a new test.

✅ Checklist

  • Self-review
  • Video demo of changes
  • Unit Tests and CI/CD tests are passing
  • Reviewers assigned

📌 Linear Issue

https://linear.app/judgment/issue/JUD-921/add-async-evaluation-object

✏️ Additional Notes

@gemini-code-assist bot (Contributor) left a comment:

Hello @Mandolaro, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary for this pull request authored by @Mandolaro.

The core intent of this PR is to address a user experience issue where running evaluations asynchronously (async_execution=True) currently returns None. This makes it difficult for users to wait for the evaluation to complete or integrate it into asynchronous workflows.

The solution implemented is to change the run_evaluation method to return an asyncio.Task object when async_execution=True. This task can then be awaited by the user, providing a similar experience to the synchronous execution where the results are returned directly, while still allowing for non-blocking operations in an async context.
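As a rough usage sketch of that behavior (the import paths, scorer class, and parameter names below are illustrative assumptions based on this summary, not taken verbatim from the PR):

```python
import asyncio

from judgeval import JudgmentClient              # assumed client entry point
from judgeval.data import Example                # assumed example type
from judgeval.scorers import FaithfulnessScorer  # illustrative scorer choice

async def main() -> None:
    client = JudgmentClient()
    example = Example(
        input="What is the capital of France?",
        actual_output="Paris",
    )

    # With async_execution=True, run_evaluation now returns an asyncio.Task
    # instead of None, so the results can be awaited like a sync run.
    task = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        async_execution=True,
    )
    results = await task
    print(results)

asyncio.run(main())
```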

The changes involve adding the necessary asynchronous logic to poll the backend for evaluation status and results, introducing new API endpoints for this purpose, and adding comprehensive tests and a demo to showcase the new functionality.

Highlights

  • Async Evaluation Return Value: The run_evaluation method now returns an asyncio.Task object when async_execution=True, allowing users to await the completion of the asynchronous evaluation and retrieve the results (see the sketch after this list).
  • Backend Polling Mechanism: New logic has been added to poll the Judgment API for the status and results of asynchronous evaluation runs until they are completed or fail.
  • New API Endpoints: Constants for new backend API endpoints (/get_evaluation_status/ and /eval/fetch/) have been added to support the polling mechanism.
  • Async Demo and Tests: A new demo file (async_evaluation_example.py) and new end-to-end and unit tests have been added to demonstrate and verify the new asynchronous evaluation workflow and the awaitable task return.
  • Local Scorer Validation: Added a ValueError check to prevent using local scorers with async_execution=True, as this combination is not supported.
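A sketch of the non-blocking overlap pattern these highlights describe, under the same illustrative assumptions as the snippet above (the client, example, and scorer objects are passed in; do_other_work is a placeholder):

```python
import asyncio

async def do_other_work() -> str:
    # Placeholder for unrelated async work that overlaps with the evaluation.
    await asyncio.sleep(1)
    return "other work done"

async def evaluate_without_blocking(client, example, scorer):
    # Kick off the evaluation; the returned asyncio.Task runs in the background.
    eval_task = client.run_evaluation(
        examples=[example],
        scorers=[scorer],
        async_execution=True,
    )
    # Overlap local work with the remote evaluation, then collect both results.
    other, results = await asyncio.gather(do_other_work(), eval_task)
    return other, results
```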

Changelog

  • src/demo/async_evaluation_example.py
    • Added a new demo script to illustrate using the async evaluation feature.
    • Includes examples for directly awaiting the evaluation task and running other work concurrently.
  • src/e2etests/test_async_execution.py
    • Added a new file containing end-to-end tests for the async evaluation functionality.
    • Tests cover direct awaiting, multiple scorers, concurrent tasks, pulling results, and task cancellation.
  • src/judgeval/constants.py
    • Added JUDGMENT_GET_EVAL_STATUS_API_URL constant for the new evaluation status endpoint.
    • Added JUDGMENT_EVAL_FETCH_API_URL constant for fetching completed evaluation results.
  • src/judgeval/run_evaluation.py
    • Modified run_eval to return Union[List[ScoringResult], asyncio.Task].
    • Implemented async def get_evaluation_status to check evaluation status via API (a rough sketch follows this changelog).
    • Implemented async def _poll_evaluation_until_complete to poll the backend until evaluation is done and fetch results.
    • Updated the async_execution=True path in run_eval to create and return an asyncio.Task that runs the polling workflow.
    • Added a ValueError if local scorers are used with async_execution=True (previously only logged an error).
    • Added warning to logger imports.
  • src/tests/test_async_execution.py
    • Added a new file containing unit tests for the async execution behavior of JudgmentClient.run_evaluation.
    • Tests verify that a Task is returned, that it is awaitable, and that errors are handled.
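For orientation, a rough sketch of what a status check against the new endpoint could look like. The HTTP client (httpx), header names, and query parameters here are assumptions for illustration; the PR's actual implementation may differ:

```python
import httpx  # illustrative choice of async HTTP client

from judgeval.constants import JUDGMENT_GET_EVAL_STATUS_API_URL

async def get_evaluation_status_sketch(
    eval_name: str,
    project_name: str,
    judgment_api_key: str,
    organization_id: str,
) -> dict:
    # Query the status endpoint and return its JSON payload; per this PR the
    # payload is expected to carry a "status" key and, on completion, "results".
    async with httpx.AsyncClient() as client:
        response = await client.get(
            JUDGMENT_GET_EVAL_STATUS_API_URL,
            params={"eval_name": eval_name, "project_name": project_name},
            headers={
                "Authorization": f"Bearer {judgment_api_key}",  # auth scheme is an assumption
                "X-Organization-Id": organization_id,           # header name is an assumption
            },
        )
        response.raise_for_status()
        return response.json()
```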


Async runs now yield,
A task, a future field.
Await the score,
Then ask for more,
No longer left concealed.


@gemini-code-assist bot (Contributor) left a review:

Code Review

This pull request introduces a significant improvement by allowing asynchronous evaluation runs to return an asyncio.Task object. This greatly enhances the user experience for async operations. The implementation includes a polling mechanism to check evaluation status and fetch results, along with new demo examples and comprehensive E2E and unit tests.

The core logic for async handling in run_evaluation.py is well-structured, particularly the polling and result validation. The new example files clearly demonstrate how to use this feature.

I have a couple of points for discussion, primarily around the polling logic in run_evaluation.py, detailed in the comments below.

Summary of Findings

  • Potentially redundant or outdated polling function: The wait_for_evaluation function in src/judgeval/run_evaluation.py seems to duplicate logic from _poll_evaluation_until_complete and has less robust result processing. Its role and necessity should be clarified.
  • Missing overall timeout in async polling: The primary polling function _poll_evaluation_until_complete in src/judgeval/run_evaluation.py lacks an overall timeout for the evaluation completion, which could lead to indefinite polling if a backend evaluation gets stuck.
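One possible shape for the missing overall timeout, as a sketch only (the wrapper name is illustrative; it simply bounds the existing polling coroutine with asyncio.wait_for):

```python
import asyncio

async def poll_with_overall_timeout(poll_coro, overall_timeout_seconds: float = 3600.0):
    # Bound the total time spent polling without touching the poll loop itself.
    try:
        return await asyncio.wait_for(poll_coro, timeout=overall_timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"Evaluation did not complete within {overall_timeout_seconds} seconds"
        )
```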

Merge Readiness

The pull request is a valuable addition and is generally well-implemented. However, there are a couple of medium-severity concerns regarding the polling logic in src/judgeval/run_evaluation.py (potential redundancy/outdated code and lack of an overall timeout in the primary polling loop) that should be addressed or clarified before merging. Addressing these would improve the robustness and maintainability of the async evaluation feature. I am not authorized to approve pull requests, so please ensure further review and approval from authorized team members after addressing these points.

Comment on lines 524 to 593
```python
async def wait_for_evaluation(eval_name: str, project_name: str, judgment_api_key: str, organization_id: str, timeout_seconds: int = 3600, poll_interval_seconds: int = 5) -> List[ScoringResult]:
    """
    Wait for an asynchronous evaluation to complete by polling the status endpoint.

    Args:
        eval_name (str): Name of the evaluation run
        project_name (str): Name of the project
        judgment_api_key (str): API key for authentication
        organization_id (str): Organization ID for the evaluation
        timeout_seconds (int, optional): Maximum time to wait in seconds. Defaults to 3600 (1 hour).
        poll_interval_seconds (int, optional): Time between status checks in seconds. Defaults to 5.

    Returns:
        List[ScoringResult]: The evaluation results when complete

    Raises:
        TimeoutError: If the evaluation doesn't complete within the timeout period
        JudgmentAPIError: If there's an API error or the evaluation fails
    """
    start_time = time.time()

    while time.time() - start_time < timeout_seconds:
        status_response = await get_evaluation_status(
            eval_name=eval_name,
            project_name=project_name,
            judgment_api_key=judgment_api_key,
            organization_id=organization_id
        )

        status = status_response.get("status")

        if status == "completed":
            # Evaluation is complete, extract and convert results
            results_data = status_response.get("results", {})
            examples_data = results_data.get("examples", [])

            # Create ScoringResult objects from the raw data
            scoring_results = []
            for example_data in examples_data:
                scorer_data_list = []
                for raw_scorer_data in example_data.get("scorer_data", []):
                    scorer_data_list.append(ScorerData(**raw_scorer_data))

                # Create Example from example data (excluding scorer_data)
                example_dict = {k: v for k, v in example_data.items() if k != "scorer_data"}
                example = Example(**example_dict)

                # Create ScoringResult
                scoring_result = ScoringResult(
                    success=True,  # Assume success if we have results
                    scorers_data=scorer_data_list,
                    data_object=example
                )
                scoring_results.append(scoring_result)

            return scoring_results

        elif status == "failed":
            # Evaluation failed
            error_message = status_response.get("error", "Unknown error")
            error(f"Evaluation failed: {error_message}")
            raise JudgmentAPIError(f"Evaluation failed: {error_message}")

        # Status is either "pending" or "running", continue polling
        info(f"Evaluation status: {status}. Waiting for completion...")
        await asyncio.sleep(poll_interval_seconds)

    # If we get here, we've timed out
    error(f"Evaluation timed out after {timeout_seconds} seconds")
    raise TimeoutError(f"Evaluation timed out after {timeout_seconds} seconds")
```
@gemini-code-assist bot (Contributor) commented:

Severity: medium

The function wait_for_evaluation appears to duplicate some of the polling logic found in _poll_evaluation_until_complete (defined later in this file). It's not directly used by the main run_eval asynchronous path, which relies on _poll_evaluation_until_complete.

Additionally, the way ScoringResult objects are constructed here (e.g., success=True is assumed on line 573 if results are present) is less robust than the logic in _poll_evaluation_until_complete (which calculates success based on individual scorer_data on line 726).

Could you clarify the purpose of wait_for_evaluation? If it's intended for a different use case, ensuring its logic is consistent and robust would be beneficial. If it's superseded by _poll_evaluation_until_complete, perhaps it could be removed or refactored to reduce redundancy?
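For reference, a minimal sketch of the more robust success derivation being suggested, assuming the judgeval.data import path and that ScorerData exposes a success flag as it appears to in _poll_evaluation_until_complete:

```python
from judgeval.data import Example, ScorerData, ScoringResult  # import path is an assumption

def build_scoring_result(example_data: dict) -> ScoringResult:
    # Sketch only: derive success from the individual scorer results instead of
    # hard-coding success=True, mirroring _poll_evaluation_until_complete.
    scorer_data_list = [ScorerData(**raw) for raw in example_data.get("scorer_data", [])]
    success = all(sd.success for sd in scorer_data_list) if scorer_data_list else False
    example = Example(**{k: v for k, v in example_data.items() if k != "scorer_data"})
    return ScoringResult(
        success=success,
        scorers_data=scorer_data_list,
        data_object=example,
    )
```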

@JCamyre (Collaborator) left a comment:

LGTM

@JCamyre enabled auto-merge (squash) May 21, 2025 19:31
@JCamyre merged commit c4aa02b into main May 21, 2025
2 of 3 checks passed
@JCamyre deleted the Async_Experiment_Return branch May 21, 2025 19:32