
Commit 781a42d

Merge pull request #67 from JudgmentLabs/add_monitoring_docs
Add Dev Docs for Monitoring Prod Workflows
2 parents 06960cc + 55237bc commit 781a42d

File tree

3 files changed: +210 -0 lines changed


docs/images/basic_trace_example.png

81 KB

docs/monitoring/introduction.mdx

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
title: Performance Monitoring Workflows with Judgment
---

## Overview ##
`judgeval` contains a suite of monitoring tools that allow you to **measure the quality of your LLM applications in production** scenarios.

Using `judgeval` in production, you can:
- Measure the quality of your LLM agent systems in **real time** using Judgment's **10+ research-backed scoring metrics**.
- Check for regressions in **retrieval quality, hallucinations, and any other scoring metric you care about**.
- Measure token usage of your LLM calls.
- Track the latency of different system components (web searching, LLM generation, etc.).

<Tip>
**Why evaluate your system in production?**

Production data **provides the highest signal** for improving your LLM system on use cases you care about.
Judgment Labs' infrastructure enables LLM teams to **capture quality signals from production use cases** and
provides [**actionable insights**](/monitoring/production_insights) for improving any component of your system.
</Tip>

## Standard Setup ##
A typical setup of `judgeval` on production systems involves:
- Tracing your application using `judgeval`'s [tracing module](/monitoring/tracing).
- Embedding evaluation runs into your traces using the `async_evaluate()` function, as sketched below.
- Tracking your LLM agent's performance in real time using the [Judgment platform](/judgment/introduction).

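For instance, a minimal sketch of such a setup might look like the following (the tool function and example strings here are illustrative; each piece is covered in detail in the [tracing module](/monitoring/tracing) docs):

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def search_kb(query: str) -> str:
    # Illustrative stand-in for a real knowledge base lookup
    return "Paris is the capital of France."

def main():
    with judgment.trace("qa_workflow", project_name="my_project") as trace:
        retrieved_info = search_kb("capital of France")
        res = "The capital of France is Paris."  # stand-in for your LLM output
        trace.async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="What is the capital of France?",
            actual_output=res,
            retrieval_context=[retrieved_info],
            model="gpt-4o-mini",
        )
        trace.save()
    return res
```
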
For a full example of how to set up `judgeval` in a production system, see our [OpenAI Travel Agent example](/monitoring/tracing#example-openai-travel-agent).

docs/monitoring/tracing.mdx

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
---
title: Tracing
---

## Overview ##

`judgeval`'s tracing module allows you to view your LLM application's execution from **end to end**.

Using tracing, you can:
- Gain observability into **every layer of your agentic system**, from database queries to tool calling and text generation.
- Measure the performance of **each system component in any way** you want. For instance:
    - Catch regressions in **retrieval quality, factuality, answer relevance**, and 10+ other [**research-backed metrics**](/evaluation/scorers/introduction).
    - Quantify the **quality of each tool call** your agent makes.
    - Track the latency of each system component.
    - Count the token usage of each LLM generation.
- Export your workflow runs to the Judgment platform for **real-time analysis** or as a dataset for [**offline experimentation**](/evaluation/introduction).

## Tracing Your Workflow ##

Setting up tracing with `judgeval` takes three simple steps:

### 1. Initialize a tracer with your API key

```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var
```

<Note>
The Judgment tracer is a singleton object that should be shared across your application.
</Note>

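One common way to honor the singleton pattern (the module name here is hypothetical) is to construct the tracer in a single module and import it everywhere else:

```python
# tracer_setup.py  (hypothetical module) -- construct the tracer exactly once
from judgeval.common.tracer import Tracer

judgment = Tracer()

# In any other module, reuse the same instance:
#   from tracer_setup import judgment
```
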
### 2. Wrap your workflow components

`judgeval` provides three wrapping mechanisms for your workflow components:

#### `wrap()` ####
The `wrap()` function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls, such as:
- Latency
- Token usage
- Prompt/completion
- Model name

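For example, wrapping an OpenAI client takes a single call (the same pattern appears in the complete example below):

```python
from judgeval.common.tracer import wrap
from openai import OpenAI

# All completions made through this client are now captured in your traces.
openai_client = wrap(OpenAI())
```
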
#### `@observe` ####
The `@observe` decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
- Latency
- Input/output
- Span type (e.g. `retriever`, `tool`, `LLM call`, etc.)

Here's an example of using the `@observe` decorator on a function:
```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    print("Hello world!")
```

<Note>
The `@observe` decorator is used on top of helper functions that you write, but is not designed to be used
on your "main" function. For more information, see the `context manager` section below.
</Note>

#### `context manager` ####

In your main function (e.g. the one that executes the primary workflow logic), you can use the `with judgment.trace()` context manager to trace the entire workflow.

The context manager can **save/print the state of the trace at any point in the workflow**.
This is useful for debugging or exporting any state of your workflow to run an evaluation from!

<Tip>
The `with judgment.trace()` context manager detects any `@observe`-decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
</Tip>

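Here's a minimal sketch of the context manager on its own (the workflow body is elided; the trace and project names are illustrative):

```python
from judgeval.common.tracer import Tracer

judgment = Tracer()

def main():
    with judgment.trace("my_workflow", project_name="my_project") as trace:
        ...  # your workflow logic, e.g. @observe-decorated tools and LLM calls
        trace.print()  # print the current state of the trace to the terminal
        trace.save()   # export the trace to the Judgment platform
```
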
#### Putting it all Together
Here's a complete example of using the `with judgment.trace()` context manager with the other tracing mechanisms:
```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

openai_client = wrap(OpenAI())
judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="LLM call")
def my_llm_call():
    message = my_tool()
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}]
    )
    return res.choices[0].message.content

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        res = my_llm_call()
        trace.save()
        trace.print()
    return res
```

The printed trace appears as follows on the terminal:
```
→ main_workflow (trace: main_workflow)
  → my_llm_call (trace: my_llm_call)
    Input: {'args': [], 'kwargs': {}}
    → my_tool (trace: my_tool)
      Input: {'args': [], 'kwargs': {}}
      Output: Hello world!
    ← my_tool (0.000s)
    Output: Hello! How can I assist you today?
  ← my_llm_call (0.789s)
```

And the trace will appear on the Judgment platform as follows:

![Basic trace example](/images/basic_trace_example.png "Basic Trace Example")

### 3. Running Production Evaluations

Optionally, you can run asynchronous evaluations directly inside your traces.

This enables you to run evaluations on your **production data in real time**, which can be useful for:
- **Guardrailing your production system** against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
- Exporting production data for **offline experimentation** (e.g. for A/B testing your workflow versions on relevant use cases).
- Getting **actionable insights** on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).

To execute an asynchronous evaluation, you can use the `trace.async_evaluate()` method. Here's an example:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        retrieved_info = ...  # from knowledge base
        res = ...  # your main workflow logic

        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="",
            actual_output=res,
            retrieval_context=[retrieved_info],
            model="gpt-4o-mini",
        )
    return res
```

## Example: OpenAI Travel Agent

In this video, we'll walk through all of the topics covered in this guide by tracing over a simple OpenAI travel agent.

<iframe
  width="560"
  height="315"
  src="https://www.youtube.com/embed/L76V4lXIolc"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
  referrerpolicy="strict-origin-when-cross-origin"
  allowfullscreen
></iframe>
