Add Dev Docs for Monitoring Prod Workflows #67

Merged · 3 commits · Feb 13, 2025
Binary file added docs/images/basic_trace_example.png
30 changes: 30 additions & 0 deletions docs/monitoring/introduction.mdx
@@ -0,0 +1,30 @@
---
title: Performance Monitoring Workflows with Judgment
---

## Overview ##
`judgeval` contains a suite of monitoring tools that allow you to **measure the quality of your LLM applications in production** scenarios.

Using `judgeval` in production, you can:
- Measure the quality of your LLM agent systems in **real time** using Judgment's **10+ research-backed scoring metrics**.
- Check for regressions in **retrieval quality, hallucinations, and any other scoring metric you care about**.
- Measure token usage.
- Track latency of different system components (web searching, LLM generation, etc.).

<Tip>
**Why evaluate your system in production?**

Production data **provides the highest signal** for improving your LLM system on use cases you care about.
Judgment Labs' infrastructure enables LLM teams to **capture quality signals from production use cases** and
provides [**actionable insights**](/monitoring/production_insights) for improving any component of your system.
</Tip>


## Standard Setup ##
A typical setup of `judgeval` on production systems involves the following steps (sketched together below):
- Tracing your application using `judgeval`'s [tracing module](/monitoring/tracing).
- Embedding evaluation runs into your traces using the `async_evaluate()` function.
- Tracking your LLM agent's performance in real time using the [Judgment platform](/judgment/introduction).
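
Putting those steps together, a minimal sketch might look like the following (adapted from the examples in the tracing docs linked above; the `retrieve_docs` helper and the prompt layout are illustrative, not part of `judgeval`):

```python
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import FaithfulnessScorer
from openai import OpenAI

judgment = Tracer()             # 1. initialize the tracer (reads JUDGMENT_API_KEY)
openai_client = wrap(OpenAI())  # capture metadata around every LLM call

@judgment.observe(span_type="retriever")
def retrieve_docs(question: str) -> str:
    return "..."  # query your knowledge base here (illustrative helper)

def main(question: str):
    with judgment.trace("main_workflow", project_name="my_project") as trace:
        context = retrieve_docs(question)
        res = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
        )
        answer = res.choices[0].message.content

        # 2. embed an evaluation run into the trace
        trace.async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input=question,
            actual_output=answer,
            retrieval_context=[context],
            model="gpt-4o-mini",
        )

        # 3. export the trace to the Judgment platform
        trace.save()
        return answer
```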


For a full example of how to set up `judgeval` in a production system, see our [OpenAI Travel Agent example](/monitoring/tracing#example-openai-travel-agent).
180 changes: 180 additions & 0 deletions docs/monitoring/tracing.mdx
@@ -0,0 +1,180 @@
---
title: Tracing
---

## Overview ##

`judgeval`'s tracing module allows you to view your LLM application's execution from **end-to-end**.

Using tracing, you can:
- Gain observability into **every layer of your agentic system**, from database queries to tool calling and text generation.
- Measure the performance of **each system component in any way you want**. For instance:
    - Catch regressions in **retrieval quality, factuality, answer relevance**, and 10+ other [**research-backed metrics**](/evaluation/scorers/introduction).
    - Quantify the **quality of each tool call** your agent makes.
    - Track the latency of each system component.
    - Count the token usage of each LLM generation.
- Export your workflow runs to the Judgment platform for **real-time analysis** or as a dataset for [**offline experimentation**](/evaluation/introduction).


## Tracing Your Workflow ##

Setting up tracing with `judgeval` takes three simple steps:

### 1. Initialize a tracer with your API key

```python
from judgeval.common.tracer import Tracer

judgment = Tracer() # loads from JUDGMENT_API_KEY env var
```

<Note>
The Judgment tracer is a singleton object that should be shared across your application.
</Note>
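
Because the tracer is a singleton, a common pattern (shown here as an illustrative sketch; the module names are hypothetical, not required by `judgeval`) is to construct it once in a shared module:

```python
# telemetry.py -- hypothetical shared module that owns the tracer
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var
```

Any other module can then import the same instance rather than constructing its own tracer:

```python
# tools.py -- reuses the shared tracer
from telemetry import judgment

@judgment.observe(span_type="tool")
def search_web(query: str) -> str:
    return "..."  # illustrative tool body
```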


### 2. Wrap your workflow components

`judgeval` provides three wrapping mechanisms for your workflow components:

#### `wrap()` ####
The `wrap()` function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls (see the example below), such as:
- Latency
- Token usage
- Prompt/Completion
- Model name
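
For example, mirroring the usage shown later in this guide, wrapping an OpenAI client is a one-line change; calls made through the wrapped client are then recorded on the active trace:

```python
from judgeval.common.tracer import wrap
from openai import OpenAI

openai_client = wrap(OpenAI())  # use it exactly as you would the unwrapped client

res = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello world!"}],
)
```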

#### `@observe` ####
The `@observe` decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
- Latency
- Input/Output
- Span type (e.g. `retriever`, `tool`, `LLM call`, etc.)

Here's an example of using the `@observe` decorator on a function:
```python
from judgeval.common.tracer import Tracer

judgment = Tracer() # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    print("Hello world!")

```

<Note>
The `@observe` decorator is used on top of helper functions that you write, but is not designed to be used
on your "main" function. For more information, see the `context manager` section below.
</Note>

#### `context manager` ####

In your main function (e.g. the one that executes the primary workflow logic), you can use the `with judgment.trace()` context manager to trace the entire workflow.

The context manager can **save/print the state of the trace at any point in the workflow**.
This is useful for debugging, or for exporting a snapshot of your workflow state to run an evaluation on!

<Tip>
The `with judgment.trace()` context manager detects any `@observe` decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
</Tip>


#### Putting it all Together ####
Here's a complete example of using the `with judgment.trace()` context manager with the other tracing mechanisms:
```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

openai_client = wrap(OpenAI())
judgment = Tracer() # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="LLM call")
def my_llm_call():
    message = my_tool()
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}]
    )
    return res.choices[0].message.content

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        res = my_llm_call()
        trace.save()
        trace.print()
        return res
```

The printed trace appears as follows on the terminal:
```
→ main_workflow (trace: main_workflow)
  → my_llm_call (trace: my_llm_call)
    Input: {'args': [], 'kwargs': {}}
    → my_tool (trace: my_tool)
      Input: {'args': [], 'kwargs': {}}
      Output: Hello world!
    ← my_tool (0.000s)
    Output: Hello! How can I assist you today?
  ← my_llm_call (0.789s)
```

And the trace will appear on the Judgment platform as follows:

![Basic trace example](/images/basic_trace_example.png "Basic Trace Example")

### 3. Running Production Evaluations

Optionally, you can run asynchronous evaluations directly inside your traces.

This enables you to run evaluations on your **production data in real-time**, which can be useful for:
- **Guardrailing your production system** against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
- Exporting production data for **offline experimentation** (e.g. for A/B testing your workflow versions on relevant use cases).
- Getting **actionable insights** on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).

To execute an asynchronous evaluation, use the `trace.async_evaluate()` method. Here's an example:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        retrieved_info = ...  # from knowledge base
        res = ...  # your main workflow logic

        # async_evaluate runs on the currently active trace
        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="",
            actual_output=res,
            retrieval_context=[retrieved_info],
            model="gpt-4o-mini",
        )
        return res
```

## Example: OpenAI Travel Agent

In this video, we'll walk through all of the topics covered in this guide by tracing over a simple OpenAI travel agent.

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/L76V4lXIolc"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
referrerpolicy="strict-origin-when-cross-origin"
allowfullscreen
></iframe>