
Commit 08de2ab

Committed by Judgment Release Bot
[Bump Minor Version] Release: Merge staging to main
2 parents b245399 + 398fe9e commit 08de2ab

File tree

16 files changed (+1760, −361 lines)


README.md

Lines changed: 6 additions & 46 deletions
````diff
@@ -5,7 +5,7 @@

 <br>
 <div style="font-size: 1.5em;">
-Enable self-learning agents with traces, evals, and environment data.
+Enable self-learning agents with environment data and evals.
 </div>

 ## [Docs](https://docs.judgmentlabs.ai/)[Judgment Cloud](https://app.judgmentlabs.ai/register)[Self-Host](https://docs.judgmentlabs.ai/documentation/self-hosting/get-started)[Landing Page](https://judgmentlabs.ai/)
@@ -22,11 +22,11 @@ We're hiring! Join us in our mission to enable self-learning agents by providing

 </div>

-Judgeval offers **open-source tooling** for tracing and evaluating autonomous, stateful agents. It **provides runtime data from agent-environment interactions** for continuous learning and self-improvement.
+Judgeval offers **open-source tooling** for evaluating autonomous, stateful agents. It **provides runtime data from agent-environment interactions** for continuous learning and self-improvement.

 ## 🎬 See Judgeval in Action

-**[Multi-Agent System](https://github.com/JudgmentLabs/judgment-cookbook/tree/main/cookbooks/agents/multi-agent) with complete observability:** (1) A multi-agent system spawns agents to research topics on the internet. (2) With just **3 lines of code**, Judgeval traces every input/output + environment response across all agent tool calls for debugging. (3) After completion, (4) export all interaction data to enable further environment-specific learning and optimization.
+**[Multi-Agent System](https://github.com/JudgmentLabs/judgment-cookbook/tree/main/cookbooks/agents/multi-agent) with complete observability:** (1) A multi-agent system spawns agents to research topics on the internet. (2) With just **3 lines of code**, Judgeval captures all environment responses across all agent tool calls for monitoring. (3) After completion, (4) export all interaction data to enable further environment-specific learning and optimization.

 <table style="width: 100%; max-width: 800px; table-layout: fixed;">
 <tr>
@@ -35,8 +35,8 @@ Judgeval offers **open-source tooling** for tracing and evaluating autonomous, s
 <br><strong>🤖 Agents Running</strong>
 </td>
 <td align="center" style="padding: 8px; width: 50%;">
-<img src="assets/trace.gif" alt="Trace Demo" style="width: 100%; max-width: 350px; height: auto;" />
-<br><strong>📊 Real-time Tracing</strong>
+<img src="assets/trace.gif" alt="Capturing Environment Data Demo" style="width: 100%; max-width: 350px; height: auto;" />
+<br><strong>📊 Capturing Environment Data </strong>
 </td>
 </tr>
 <tr>
@@ -77,54 +77,14 @@ export JUDGMENT_ORG_ID=...

 **If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**

-## 🏁 Quickstarts
-
-### 🛰️ Tracing
-
-Create a file named `agent.py` with the following code:
-
-```python
-from judgeval.tracer import Tracer, wrap
-from openai import OpenAI
-
-client = wrap(OpenAI()) # tracks all LLM calls
-judgment = Tracer(project_name="my_project")
-
-@judgment.observe(span_type="tool")
-def format_question(question: str) -> str:
-    # dummy tool
-    return f"Question : {question}"
-
-@judgment.observe(span_type="function")
-def run_agent(prompt: str) -> str:
-    task = format_question(prompt)
-    response = client.chat.completions.create(
-        model="gpt-4.1",
-        messages=[{"role": "user", "content": task}]
-    )
-    return response.choices[0].message.content
-
-run_agent("What is the capital of the United States?")
-```
-You'll see your trace exported to the Judgment Platform:
-
-<p align="center"><img src="assets/online_eval.png" alt="Judgment Platform Trace Example" width="1500" /></p>
-
-
-[Click here](https://docs.judgmentlabs.ai/documentation/tracing/introduction) for a more detailed explanation.
-
-
-<!-- Created by https://github.com/ekalinin/github-markdown-toc -->
-

 ## ✨ Features

 | | |
 |:---|:---:|
-| <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic). **Tracks inputs/outputs, agent tool calls, latency, cost, and custom metadata** at every step.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/agent_trace_example.png" alt="Tracing visualization" width="1200"/></p> |
 | <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/test.png" alt="Evaluation metrics" width="800"/></p> |
 | <h3>📡 Monitoring</h3>Get Slack alerts for agent failures in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/errors.png" alt="Monitoring Dashboard" width="1200"/></p> |
-| <h3>📊 Datasets</h3>Export traces and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
+| <h3>📊 Datasets</h3>Export environment interactions and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |

 ## 🏢 Self-Hosting

````
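For context on the "3 lines of code" claim kept in the new copy: the instrumentation it refers to is essentially the setup from the quickstart removed above. A minimal sketch, assuming the `Tracer`/`wrap` API shown in that removed snippet is unchanged by this release (the tool function below is a made-up placeholder, not from the repo):

```python
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())                       # 1) capture every LLM call made through this client
judgment = Tracer(project_name="my_project")  # 2) attach captured spans to a project

@judgment.observe(span_type="tool")           # 3) record this function's inputs/outputs as environment data
def search_web(query: str) -> str:            # hypothetical tool for illustration only
    ...
```

Runs of the decorated functions then show up as interaction data that can be exported to datasets, as described in the Features table above.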

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -31,6 +31,7 @@ dependencies = [
     "langchain-core",
     "click<8.2.0",
     "typer>=0.9.0",
+    "fireworks-ai>=0.19.18",
 ]

 [project.urls]
```

src/e2etests/test_eval_operations.py

Lines changed: 0 additions & 72 deletions
```diff
@@ -9,11 +9,9 @@
 from judgeval.scorers import (
     FaithfulnessScorer,
     AnswerRelevancyScorer,
-    ToolOrderScorer,
 )
 from judgeval.scorers.example_scorer import ExampleScorer
 from judgeval.dataset import Dataset
-from judgeval.tracer import Tracer
 from judgeval.constants import DEFAULT_TOGETHER_MODEL


@@ -173,73 +171,3 @@ async def a_score_example(self, example: CustomExample):
     assert res[3].scorers_data[0].score == 0

     dataset.delete()
-
-
-@pytest.mark.asyncio
-async def test_run_trace_eval(
-    client: JudgmentClient, project_name: str, random_name: str
-):
-    EVAL_RUN_NAME = random_name
-    tracer = Tracer(project_name=project_name)
-
-    @tracer.observe(span_type="tool")
-    def simple_function(text: str):
-        return "finished {text}"
-
-    example1 = Example(
-        input="input",
-        expected_tools=[
-            {"tool_name": "simple_function", "parameters": {"text": "input"}}
-        ],
-    )
-
-    example2 = Example(
-        input="input2",
-        expected_tools=[
-            {"tool_name": "simple_function", "parameters": {"text": "input2"}}
-        ],
-    )
-
-    scorer = ToolOrderScorer(threshold=0.5)
-    results = client.run_trace_evaluation(
-        examples=[example1, example2],
-        function=simple_function,
-        tracer=tracer,
-        scorers=[scorer],
-        project_name=project_name,
-        eval_run_name=EVAL_RUN_NAME,
-    )
-    assert results, (
-        f"No evaluation results found for {EVAL_RUN_NAME} in project {project_name}"
-    )
-    assert len(results) == 2, f"Expected 2 trace results but got {len(results)}"
-
-    assert results[0].success
-    assert results[1].success
-
-
-@pytest.mark.asyncio
-async def test_run_trace_eval_with_project_mismatch(
-    client: JudgmentClient, project_name: str, random_name: str
-):
-    EVAL_RUN_NAME = random_name
-
-    tracer = Tracer(project_name="mismatching-project")
-    scorer = ToolOrderScorer(threshold=0.5)
-    example = Example(input="hello")
-
-    @tracer.observe(span_type="tool")
-    def simple_function(text: str):
-        return f"Processed: {text.upper()}"
-
-    with pytest.raises(
-        ValueError, match="Project name mismatch between run_trace_eval and tracer."
-    ):
-        client.run_trace_evaluation(
-            examples=[example],
-            function=simple_function,
-            tracer=tracer,
-            scorers=[scorer],
-            project_name=project_name,
-            eval_run_name=EVAL_RUN_NAME,
-        )
```

src/e2etests/test_tracer.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -90,6 +90,7 @@ def validate_trace_token_counts(
         "TOGETHER_API_CALL",
         "GOOGLE_API_CALL",
         "GROQ_API_CALL",
+        "FIREWORKS_TRAINABLE_MODEL_CALL",
     }

     for span in trace_spans:
```
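The added name extends an allowlist that `validate_trace_token_counts` uses to decide which spans count as LLM provider calls. The real span model isn't shown in this diff; the sketch below uses a made-up `Span` shape purely to illustrate that kind of check:

```python
from dataclasses import dataclass

@dataclass
class Span:
    # Hypothetical, simplified span for illustration; not judgeval's actual span type.
    name: str
    prompt_tokens: int
    completion_tokens: int

# A few of the span names from the allowlist in the diff above (the full set is longer).
LLM_CALL_SPAN_NAMES = {
    "TOGETHER_API_CALL",
    "GOOGLE_API_CALL",
    "GROQ_API_CALL",
    "FIREWORKS_TRAINABLE_MODEL_CALL",  # newly covered by this commit
}

def total_llm_tokens(trace_spans: list[Span]) -> int:
    # Sum token usage only for spans that represent LLM provider calls.
    return sum(
        span.prompt_tokens + span.completion_tokens
        for span in trace_spans
        if span.name in LLM_CALL_SPAN_NAMES
    )
```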

src/judgeval/cli.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@ def upload_scorer(
     try:
         client = JudgmentClient()

-        result = client.save_custom_scorer(
+        result = client.upload_custom_scorer(
             scorer_file_path=scorer_file_path,
             requirements_file_path=requirements_file_path,
             unique_name=unique_name,
```
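The only change here is the client method name: `save_custom_scorer` becomes `upload_custom_scorer`. Going by the call site above, a hedged usage sketch might look like this (the import path, argument values, and return value are assumptions, not taken from the repo):

```python
from judgeval import JudgmentClient  # import path assumed, not shown in this diff

# Credentials are expected to come from the environment (e.g. JUDGMENT_ORG_ID, as in the README).
client = JudgmentClient()

# Argument names mirror the call site above; any additional parameters or the
# shape of the return value are not shown in this commit.
result = client.upload_custom_scorer(
    scorer_file_path="my_scorer.py",            # hypothetical path to the custom scorer file
    requirements_file_path="requirements.txt",  # hypothetical extra-dependencies file
    unique_name="my-custom-scorer",             # hypothetical identifier for the scorer
)
print(result)
```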

src/judgeval/common/api/constants.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ class EvaluationRunsBatchPayload(TypedDict):
 JUDGMENT_GET_EVAL_STATUS_API_URL = f"{ROOT_API}/get_evaluation_status/"

 # Custom Scorers API
-JUDGMENT_CUSTOM_SCORER_UPLOAD_API_URL = f"{ROOT_API}/build_sandbox_template/"
+JUDGMENT_CUSTOM_SCORER_UPLOAD_API_URL = f"{ROOT_API}/upload_scorer/"


 # Evaluation API Payloads
```
