
Refactor default judges #36


Merged
SecroLoL merged 22 commits into main from refactor_default_judges
Jan 13, 2025


SecroLoL
Contributor

@SecroLoL SecroLoL commented Jan 11, 2025

Default judges used to be created with the syntax

from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
...

We can now create them by directly importing

from judgeval.scorers import FaithfulnessScorer

scorer = FaithfulnessScorer(threshold=0.5)

This is mostly a stylistic change, but it also pays off as we add scorers that require their own arguments, such as the new JSONCorrectness scorer, which needs a schema field:

scorer = JSONCorrectnessScorer(schema=...)

We couldn't have supported this with the previous syntax, but now we can.
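
To illustrate why one class per scorer makes scorer-specific arguments possible, here is a minimal sketch. It is not judgeval's actual implementation: `BaseScorer`, `to_payload`, and the `score_type` strings are hypothetical stand-ins, and only the names `FaithfulnessScorer`, `JSONCorrectnessScorer`, `threshold`, and `schema` come from the description above.

```python
from typing import Any, Dict


class BaseScorer:
    """Hypothetical base class holding what every default scorer shares."""

    score_type = "base"  # overridden by each subclass (assumed naming)

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def to_payload(self) -> Dict[str, Any]:
        """Serialize the scorer; subclasses extend this with their own fields."""
        return {"score_type": self.score_type, "threshold": self.threshold}


class FaithfulnessScorer(BaseScorer):
    # No extra arguments: the subclass only pins the score type.
    score_type = "faithfulness"


class JSONCorrectnessScorer(BaseScorer):
    score_type = "json_correctness"

    def __init__(self, schema: Dict[str, Any], threshold: float = 0.5) -> None:
        super().__init__(threshold=threshold)
        # Scorer-specific argument that a single generic constructor
        # taking only (threshold, score_type) has no natural place for.
        self.schema = schema

    def to_payload(self) -> Dict[str, Any]:
        payload = super().to_payload()
        payload["schema"] = self.schema
        return payload
```

With this shape, `FaithfulnessScorer(threshold=0.5)` and `JSONCorrectnessScorer(schema=...)` each expose only the arguments that make sense for them, and a missing `schema` fails at construction time rather than at evaluation time.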

Major changes

  1. Implement each default judgment scorer as its own class
  2. Modify functions that run evaluations (standard evals and span-level evaluations) to accommodate this
  3. Update all existing unit tests to be compatible, and add unit tests for all default scorers
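
As a rough sketch of the second change, an evaluation function can accept scorer objects and read each scorer's own threshold instead of taking a score-type enum. Everything here is an assumption for illustration: `run_eval`, `ExactMatchScorer`, the `score` method, and the example field names are hypothetical, and the toy scorer runs locally rather than calling a judge.

```python
from typing import Any, Dict, List


class ExactMatchScorer:
    """Toy local scorer standing in for a default judge (hypothetical)."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def score(self, example: Dict[str, Any]) -> float:
        # 1.0 when the output matches the expectation exactly, else 0.0.
        matched = example["actual_output"] == example["expected_output"]
        return 1.0 if matched else 0.0


def run_eval(examples: List[Dict[str, Any]], scorers: List[Any]) -> List[Dict[str, Any]]:
    """Apply every scorer to every example and compare against its threshold."""
    results = []
    for example in examples:
        for scorer in scorers:
            value = scorer.score(example)
            results.append(
                {
                    "scorer": type(scorer).__name__,
                    "score": value,
                    "success": value >= scorer.threshold,
                }
            )
    return results
```

Because the runner relies only on `score` and `threshold`, a new scorer class with extra constructor arguments can be added without touching the evaluation loop.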

Collaborator

@JCamyre JCamyre left a comment


Take a look at my comments, this is sick

model=model,
metadata={},
log_results=log_results,
project_name="TestSpanLevel",
eval_run_name="TestSpanLevel",
project_name="TestSpanLevel1", # TODO this should be dynamic
Collaborator


I will handle this, I'll share my thoughts in Slack. In my multi-step eval PR, I added a project and trace name, which will tie in nicely to generate automatic eval run names (to improve UX while remaining clear).

Contributor Author


Nice B)

@SecroLoL SecroLoL merged commit b6944b3 into main Jan 13, 2025
@SecroLoL SecroLoL deleted the refactor_default_judges branch February 10, 2025 21:34

2 participants