
Commit 1d3eb9b

Merge branch 'az-all-user-db-endpoint' of https://github.com/JudgmentLabs/judgeval into az-all-user-db-endpoint
2 parents 031caa9 + cf5335e commit 1d3eb9b

File tree

13 files changed: +491 -212 lines

docs/images/basic_trace_example.png

81 KB

docs/monitoring/introduction.mdx

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
title: Performance Monitoring Workflows with Judgment
---

## Overview ##
`judgeval` contains a suite of monitoring tools that allow you to **measure the quality of your LLM applications in production** scenarios.

Using `judgeval` in production, you can:
- Measure the quality of your LLM agent systems in **real time** using Judgment's **10+ research-backed scoring metrics**.
- Check for regressions in **retrieval quality, hallucinations, and any other scoring metric you care about**.
- Measure token usage.
- Track the latency of different system components (web searching, LLM generation, etc.).

<Tip>
**Why evaluate your system in production?**

Production data **provides the highest signal** for improving your LLM system on use cases you care about.
Judgment Labs' infrastructure enables LLM teams to **capture quality signals from production use cases** and
provides [**actionable insights**](/monitoring/production_insights) for improving any component of your system.
</Tip>


## Standard Setup ##
A typical setup of `judgeval` on production systems involves:
- Tracing your application using `judgeval`'s [tracing module](/monitoring/tracing).
- Embedding evaluation runs into your traces using the `async_evaluate()` function (see the sketch below).
- Tracking your LLM agent's performance in real time using the [Judgment platform](/judgment/introduction).


For a full example of how to set up `judgeval` in a production system, see our [OpenAI Travel Agent example](/monitoring/tracing#example-openai-travel-agent).
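
Before diving into that full example, here is a minimal sketch of how these three pieces fit together, using the `Tracer`, `@observe`, and `async_evaluate()` APIs documented in the tracing guide; the span, project name, and retrieval step below are placeholders:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="retriever")
def fetch_context(query: str) -> str:
    # Placeholder retrieval step; swap in your knowledge base lookup.
    return "judgeval supports tracing and production evaluations."

def main():
    with judgment.trace("production_run", project_name="my_project") as trace:
        context = fetch_context("What does judgeval monitor?")
        answer = f"According to our docs: {context}"  # stand-in for an LLM generation

        # Embed an evaluation run directly inside the trace.
        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="What does judgeval monitor?",
            actual_output=answer,
            retrieval_context=[context],
            model="gpt-4o-mini",
        )
        trace.save()  # export the run to the Judgment platform
        return answer
```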

docs/monitoring/tracing.mdx

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
---
title: Tracing
---

## Overview ##

`judgeval`'s tracing module allows you to view your LLM application's execution from **end to end**.

Using tracing, you can:
- Gain observability into **every layer of your agentic system**, from database queries to tool calling and text generation.
- Measure the performance of **each system component in any way** you want to measure it. For instance:
    - Catch regressions in **retrieval quality, factuality, answer relevance**, and 10+ other [**research-backed metrics**](/evaluation/scorers/introduction).
    - Quantify the **quality of each tool call** your agent makes.
    - Track the latency of each system component.
    - Count the token usage of each LLM generation.
- Export your workflow runs to the Judgment platform for **real-time analysis** or as a dataset for [**offline experimentation**](/evaluation/introduction).


## Tracing Your Workflow ##

Setting up tracing with `judgeval` takes three simple steps:

### 1. Initialize a tracer with your API key

```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var
```

<Note>
The Judgment tracer is a singleton object that should be shared across your application.
</Note>


### 2. Wrap your workflow components

`judgeval` provides three wrapping mechanisms for your workflow components:

#### `wrap()` ####
The `wrap()` function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls (see the example below), such as:
- Latency
- Token usage
- Prompt/Completion
- Model name

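For example, wrapping an OpenAI client is a one-line change. A minimal sketch, where the model name and prompt are placeholders:

```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

judgment = Tracer()              # loads from JUDGMENT_API_KEY env var
openai_client = wrap(OpenAI())   # wrapped client records the metadata listed above

# Calls look exactly the same as with an unwrapped client.
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```
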
#### `@observe` ####
The `@observe` decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
- Latency
- Input/Output
- Span type (e.g. `retriever`, `tool`, `LLM call`, etc.)

Here's an example of using the `@observe` decorator on a function:
```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    print("Hello world!")
```

<Note>
The `@observe` decorator is used on top of helper functions that you write, but is not designed to be used
on your "main" function. For more information, see the `context manager` section below.
</Note>

#### `context manager` ####

In your main function (i.e. the one that executes the primary workflow logic), you can use the `with judgment.trace()` context manager to trace the entire workflow.

The context manager can **save/print the state of the trace at any point in the workflow**.
This is useful for debugging, or for exporting the state of your workflow to run an evaluation from!

<Tip>
The `with judgment.trace()` context manager detects any `@observe`-decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
</Tip>

#### Putting it all Together
Here's a complete example of using the `with judgment.trace()` context manager with the other tracing mechanisms:
```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

openai_client = wrap(OpenAI())
judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="LLM call")
def my_llm_call():
    message = my_tool()
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}]
    )
    return res.choices[0].message.content

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        res = my_llm_call()
        trace.save()
        trace.print()
        return res
```

The printed trace appears as follows on the terminal:
```
→ main_workflow (trace: main_workflow)
    → my_llm_call (trace: my_llm_call)
        Input: {'args': [], 'kwargs': {}}
        → my_tool (trace: my_tool)
            Input: {'args': [], 'kwargs': {}}
            Output: Hello world!
        ← my_tool (0.000s)
        Output: Hello! How can I assist you today?
    ← my_llm_call (0.789s)
```

And the trace will appear on the Judgment platform as follows:

![Basic trace example](/images/basic_trace_example.png "Basic Trace Example")

### 3. Running Production Evaluations

Optionally, you can run asynchronous evaluations directly inside your traces.

This enables you to run evaluations on your **production data in real time**, which can be useful for:
- **Guardrailing your production system** against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
- Exporting production data for **offline experimentation** (e.g. for A/B testing your workflow versions on relevant use cases).
- Getting **actionable insights** on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).

To execute an asynchronous evaluation, use the `trace.async_evaluate()` method. Here's an example:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        retrieved_info = ...  # from knowledge base
        res = ...  # your main workflow logic

        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="",
            actual_output=res,
            retrieval_context=[retrieved_info],
            model="gpt-4o-mini",
        )
        return res
```

## Example: OpenAI Travel Agent

In this video, we walk through all of the topics covered in this guide by tracing a simple OpenAI travel agent.

<iframe
    width="560"
    height="315"
    src="https://www.youtube.com/embed/L76V4lXIolc"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen
></iframe>

src/demo/cookbooks/custom_scorers/competitor_mentions.py

Lines changed: 7 additions & 24 deletions
@@ -1,4 +1,7 @@
 """
+** WARNING **: This script relies on creating a ClassifierScorer on the Judgment platform.
+If you haven't created one yet, you can do so by following the instructions in this YouTube guide: https://www.youtube.com/watch?v=LNEX-RkeQQI
+
 This script implements a custom scorer to evaluate customer support responses.
 
 It checks if a support response mentions competitors (like Adidas, Reebok, etc.) in a positive way.
@@ -8,50 +11,30 @@
 
 
 from judgeval import JudgmentClient
-from judgeval.scorers import ClassifierScorer
 from judgeval.data import Example
 
 
-competitor_mentions_scorer = ClassifierScorer(
-    "Competitor Mentions",
-    slug="competitor_mentions-487126418",
-    threshold=1.0,
-    conversation=[{
-        "role": "system",
-        "content": """Does the following customer support response discuss any of the following competitors in a positive way? (Y/N).
-
-        Competitors: Adidas, Reebok, Hoka, ON, Converse
-
-        Customer Question: {{input}}
-        Customer Support Response: {{actual_output}}
-        """
-    }],
-    options={
-        "Y": 0.0,
-        "N": 1.0
-    }
-)
-
-
 if __name__ == "__main__":
     client = JudgmentClient()
 
     positive_example = Example(
        input="What are the best shoes for running priced under $130?",
-        actual_output="You'd want to check out the newest Nike Vaporfly, it's only $120 and built for performance. "
+        actual_output="You'd want to check out the newest Nike Vaporfly, it's only $120 and built for performance."
    )
 
    negative_example = Example(
        input="What are the best shoes for running priced under $130?",
        actual_output="The Nike Vaporfly is a great shoe built for performance. Other great options include the Adidas Ultraboost and the Reebok Nano X which are affordable and speedy."
    )
 
+    competitor_mentions_scorer = client.fetch_classifier_scorer("<YOUR_SLUG_HERE>")  # replace with slug, see video guide above
+
    client.run_evaluation(
        examples=[positive_example, negative_example],
        scorers=[competitor_mentions_scorer],
        model="gpt-4o-mini",
        project_name="competitor_mentions",
-        eval_run_name="competitor_mentions_test",
+        eval_run_name="competitor_brand_demo",
    )
 
 
src/e2etests/judgment_client_test.py

Lines changed: 23 additions & 8 deletions
@@ -22,7 +22,8 @@
 )
 from judgeval.judges import TogetherJudge, JudgevalJudge
 from playground import CustomFaithfulnessMetric
-from judgeval.data.datasets.dataset import EvalDataset
+from judgeval.data.datasets.dataset import EvalDataset, GroundTruthExample
+from judgeval.data.datasets.eval_dataset_client import EvalDatasetClient
 from judgeval.scorers.prompt_scorer import ClassifierScorer
 
 # Configure logging
@@ -62,15 +63,29 @@ def test_dataset(self, client: JudgmentClient):
        dataset = client.pull_dataset(alias="test_dataset_5")
        assert dataset, "Failed to pull dataset"
 
-    def test_pull_all_datasets(self, client: JudgmentClient):
+    def test_pull_all_user_dataset_stats(self, client: JudgmentClient):
        dataset: EvalDataset = client.create_dataset()
        dataset.add_example(Example(input="input 1", actual_output="output 1"))
+        dataset.add_example(Example(input="input 2", actual_output="output 2"))
+        dataset.add_example(Example(input="input 3", actual_output="output 3"))
+        random_name1 = ''.join(random.choices(string.ascii_letters + string.digits, k=20))
+        client.push_dataset(alias=random_name1, dataset=dataset, overwrite=False)
 
-        client.push_dataset(alias="test_dataset_7", dataset=dataset, overwrite=False)
+        dataset: EvalDataset = client.create_dataset()
+        dataset.add_example(Example(input="input 1", actual_output="output 1"))
+        dataset.add_example(Example(input="input 2", actual_output="output 2"))
+        dataset.add_ground_truth(GroundTruthExample(input="input 1", actual_output="output 1"))
+        dataset.add_ground_truth(GroundTruthExample(input="input 2", actual_output="output 2"))
+        random_name2 = ''.join(random.choices(string.ascii_letters + string.digits, k=20))
+        client.push_dataset(alias=random_name2, dataset=dataset, overwrite=False)
 
-        dataset = client.pull_all_datasets()
-        print(dataset)
-        assert dataset, "Failed to pull dataset"
+        all_datasets_stats = client.pull_all_user_dataset_stats()
+        print(all_datasets_stats)
+        assert all_datasets_stats, "Failed to pull dataset"
+        assert all_datasets_stats[random_name1]["example_count"] == 3, f"{random_name1} should have 3 examples"
+        assert all_datasets_stats[random_name1]["ground_truth_count"] == 0, f"{random_name1} should have 0 ground truths"
+        assert all_datasets_stats[random_name2]["example_count"] == 2, f"{random_name2} should have 2 examples"
+        assert all_datasets_stats[random_name2]["ground_truth_count"] == 2, f"{random_name2} should have 2 ground truths"
 
    def test_run_eval(self, client: JudgmentClient):
        """Test basic evaluation workflow."""
@@ -415,7 +430,7 @@ def run_selected_tests(client, test_names: list[str]):
 
    test_map = {
        'dataset': test_basic_operations.test_dataset,
-        'pull_all_datasets': test_basic_operations.test_pull_all_datasets,
+        'pull_all_user_dataset_stats': test_basic_operations.test_pull_all_user_dataset_stats,
        'run_eval': test_basic_operations.test_run_eval,
        'assert_test': test_basic_operations.test_assert_test,
        'json_scorer': test_advanced_features.test_json_scorer,
@@ -444,7 +459,7 @@ def run_selected_tests(client, test_names: list[str]):
 
    run_selected_tests(client, [
        'dataset',
-        'pull_all_datasets',
+        'pull_all_user_dataset_stats',
        'run_eval',
        'assert_test',
        'json_scorer',
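
For reference, the assertions in `test_pull_all_user_dataset_stats` imply that `pull_all_user_dataset_stats()` returns a mapping from dataset alias to per-dataset counts, roughly of the following shape. This is a sketch inferred from the test, with made-up aliases, not the server contract:

```python
# Illustrative only — aliases are randomly generated in the test.
all_datasets_stats = {
    "aB3dE5fG7hJ9kL1mN0pQ": {"example_count": 3, "ground_truth_count": 0},
    "zY8xW6vU4tS2rQ0pO9nM": {"example_count": 2, "ground_truth_count": 2},
}
```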

src/judgeval/common/tracer.py

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@
 Tracing system for judgeval that allows for function tracing using decorators.
 """
 
+import os
 import time
 import functools
 import requests
@@ -403,7 +404,7 @@ def __new__(cls, *args, **kwargs):
            cls._instance = super(Tracer, cls).__new__(cls)
        return cls._instance
 
-    def __init__(self, api_key: str):
+    def __init__(self, api_key: str = os.getenv("JUDGMENT_API_KEY")):
        if not hasattr(self, 'initialized'):
 
            if not api_key:
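
With this change, `Tracer()` can be constructed without arguments when `JUDGMENT_API_KEY` is present in the environment, which is what the new docs rely on. Note that because the default argument is evaluated when the module is loaded, the variable must be set before `judgeval.common.tracer` is imported. A minimal sketch, where the explicit key string is a placeholder:

```python
# Assumes JUDGMENT_API_KEY is already set in the environment before this import,
# since the default argument is evaluated at module load time.
from judgeval.common.tracer import Tracer

judgment = Tracer()  # picks up the key from the environment

# Passing the key explicitly is still supported:
# judgment = Tracer(api_key="<your-api-key>")  # placeholder key
```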

src/judgeval/constants.py

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ def _missing_(cls, value):
 JUDGMENT_EVAL_API_URL = f"{ROOT_API}/evaluate/"
 JUDGMENT_DATASETS_PUSH_API_URL = f"{ROOT_API}/datasets/push/"
 JUDGMENT_DATASETS_PULL_API_URL = f"{ROOT_API}/datasets/pull/"
-JUDGMENT_DATASETS_PULL_ALL_API_URL = f"{ROOT_API}/datasets/pull_all/"
+JUDGMENT_DATASETS_PULL_ALL_API_URL = f"{ROOT_API}/datasets/get_all_stats/"
 JUDGMENT_EVAL_LOG_API_URL = f"{ROOT_API}/log_eval_results/"
 JUDGMENT_EVAL_FETCH_API_URL = f"{ROOT_API}/fetch_eval_results/"
 JUDGMENT_TRACES_SAVE_API_URL = f"{ROOT_API}/traces/save/"
Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 from judgeval.data.datasets.dataset import EvalDataset
 from judgeval.data.datasets.ground_truth import GroundTruthExample
+from judgeval.data.datasets.eval_dataset_client import EvalDatasetClient
 
-__all__ = ["EvalDataset", "GroundTruthExample"]
+__all__ = ["EvalDataset", "EvalDatasetClient", "GroundTruthExample"]
