
Commit 1d3eb9b

Merge branch 'az-all-user-db-endpoint' of https://github.com/JudgmentLabs/judgeval into az-all-user-db-endpoint
2 parents 031caa9 + cf5335e commit 1d3eb9b

File tree

13 files changed: +491 -212 lines

docs/images/basic_trace_example.png

81 KB

docs/monitoring/introduction.mdx

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
title: Performance Monitoring Workflows with Judgment
---

## Overview ##
`judgeval` contains a suite of monitoring tools that allow you to **measure the quality of your LLM applications in production** scenarios.

Using `judgeval` in production, you can:
- Measure the quality of your LLM agent systems in **real time** using Judgment's **10+ research-backed scoring metrics**.
- Check for regressions in **retrieval quality, hallucinations, and any other scoring metric you care about**.
- Measure token usage.
- Track the latency of different system components (web searching, LLM generation, etc.).

<Tip>
**Why evaluate your system in production?**

Production data **provides the highest signal** for improving your LLM system on use cases you care about.
Judgment Labs' infrastructure enables LLM teams to **capture quality signals from production use cases** and
provides [**actionable insights**](/monitoring/production_insights) for improving any component of your system.
</Tip>


## Standard Setup ##
A typical setup of `judgeval` on production systems involves:
- Tracing your application using `judgeval`'s [tracing module](/monitoring/tracing).
- Embedding evaluation runs into your traces using the `async_evaluate()` function (see the sketch below).
- Tracking your LLM agent's performance in real time using the [Judgment platform](/judgment/introduction).


For a full example of how to set up `judgeval` in a production system, see our [OpenAI Travel Agent example](/monitoring/tracing#example-openai-travel-agent).
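
Before diving into that full example, here is a minimal sketch of how these three pieces fit together, using the `Tracer`, `@observe`, and `async_evaluate()` APIs documented in the tracing guide; the span, project name, and retrieval step below are placeholders:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="retriever")
def fetch_context(query: str) -> str:
    # Placeholder retrieval step; swap in your knowledge base lookup.
    return "judgeval supports tracing and production evaluations."

def main():
    with judgment.trace("production_run", project_name="my_project") as trace:
        context = fetch_context("What does judgeval monitor?")
        answer = f"According to our docs: {context}"  # stand-in for an LLM generation

        # Embed an evaluation run directly inside the trace.
        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="What does judgeval monitor?",
            actual_output=answer,
            retrieval_context=[context],
            model="gpt-4o-mini",
        )
        trace.save()  # export the run to the Judgment platform
        return answer
```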

docs/monitoring/tracing.mdx

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
---
title: Tracing
---

## Overview ##

`judgeval`'s tracing module allows you to view your LLM application's execution from **end to end**.

Using tracing, you can:
- Gain observability into **every layer of your agentic system**, from database queries to tool calling and text generation.
- Measure the performance of **each system component in any way** you want to measure it. For instance:
    - Catch regressions in **retrieval quality, factuality, answer relevance**, and 10+ other [**research-backed metrics**](/evaluation/scorers/introduction).
    - Quantify the **quality of each tool call** your agent makes.
    - Track the latency of each system component.
    - Count the token usage of each LLM generation.
- Export your workflow runs to the Judgment platform for **real-time analysis** or as a dataset for [**offline experimentation**](/evaluation/introduction).


## Tracing Your Workflow ##

Setting up tracing with `judgeval` takes three simple steps:

### 1. Initialize a tracer with your API key

```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var
```

<Note>
The Judgment tracer is a singleton object that should be shared across your application.
</Note>


### 2. Wrap your workflow components

`judgeval` provides three wrapping mechanisms for your workflow components:

#### `wrap()` ####
The `wrap()` function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls (see the example below), such as:
- Latency
- Token usage
- Prompt/Completion
- Model name

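For example, wrapping an OpenAI client is a one-line change. A minimal sketch, where the model name and prompt are placeholders:

```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

judgment = Tracer()              # loads from JUDGMENT_API_KEY env var
openai_client = wrap(OpenAI())   # wrapped client records the metadata listed above

# Calls look exactly the same as with an unwrapped client.
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```
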
#### `@observe` ####
The `@observe` decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
- Latency
- Input/Output
- Span type (e.g. `retriever`, `tool`, `LLM call`, etc.)

Here's an example of using the `@observe` decorator on a function:
```python
from judgeval.common.tracer import Tracer

judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    print("Hello world!")
```

<Note>
The `@observe` decorator is used on top of helper functions that you write, but is not designed to be used
on your "main" function. For more information, see the `context manager` section below.
</Note>

#### `context manager` ####

In your main function (i.e. the one that executes the primary workflow logic), you can use the `with judgment.trace()` context manager to trace the entire workflow.

The context manager can **save/print the state of the trace at any point in the workflow**.
This is useful for debugging, or for exporting the state of your workflow to run an evaluation from!

<Tip>
The `with judgment.trace()` context manager detects any `@observe`-decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
</Tip>

#### Putting it all Together
Here's a complete example of using the `with judgment.trace()` context manager with the other tracing mechanisms:
```python
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

openai_client = wrap(OpenAI())
judgment = Tracer()  # loads from JUDGMENT_API_KEY env var

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="LLM call")
def my_llm_call():
    message = my_tool()
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}]
    )
    return res.choices[0].message.content

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        res = my_llm_call()
        trace.save()
        trace.print()
        return res
```

The printed trace appears as follows on the terminal:
```
→ main_workflow (trace: main_workflow)
    → my_llm_call (trace: my_llm_call)
        Input: {'args': [], 'kwargs': {}}
        → my_tool (trace: my_tool)
            Input: {'args': [], 'kwargs': {}}
            Output: Hello world!
        ← my_tool (0.000s)
        Output: Hello! How can I assist you today?
    ← my_llm_call (0.789s)
```

And the trace will appear on the Judgment platform as follows:

![Basic trace example](/images/basic_trace_example.png "Basic Trace Example")

### 3. Running Production Evaluations

Optionally, you can run asynchronous evaluations directly inside your traces.

This enables you to run evaluations on your **production data in real time**, which can be useful for:
- **Guardrailing your production system** against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
- Exporting production data for **offline experimentation** (e.g. for A/B testing your workflow versions on relevant use cases).
- Getting **actionable insights** on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).

To execute an asynchronous evaluation, use the `trace.async_evaluate()` method. Here's an example:

```python
from judgeval.common.tracer import Tracer
from judgeval.scorers import FaithfulnessScorer

judgment = Tracer()

def main():
    with judgment.trace(
        "main_workflow",
        project_name="my_project"
    ) as trace:
        retrieved_info = ...  # from knowledge base
        res = ...  # your main workflow logic

        judgment.get_current_trace().async_evaluate(
            scorers=[FaithfulnessScorer(threshold=0.5)],
            input="",
            actual_output=res,
            retrieval_context=[retrieved_info],
            model="gpt-4o-mini",
        )
        return res
```

## Example: OpenAI Travel Agent

In this video, we walk through all of the topics covered in this guide by tracing a simple OpenAI travel agent.

<iframe
    width="560"
    height="315"
    src="https://www.youtube.com/embed/L76V4lXIolc"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen
></iframe>

src/demo/cookbooks/custom_scorers/competitor_mentions.py

Lines changed: 7 additions & 24 deletions
@@ -1,4 +1,7 @@
 """
+** WARNING **: This script relies on creating a ClassifierScorer on the Judgment platform.
+If you haven't created one yet, you can do so by following the instructions in this YouTube guide: https://www.youtube.com/watch?v=LNEX-RkeQQI
+
 This script implements a custom scorer to evaluate customer support responses.
 
 It checks if a support response mentions competitors (like Adidas, Reebok, etc.) in a positive way.
@@ -8,50 +11,30 @@
 
 
 from judgeval import JudgmentClient
-from judgeval.scorers import ClassifierScorer
 from judgeval.data import Example
 
 
-competitor_mentions_scorer = ClassifierScorer(
-    "Competitor Mentions",
-    slug="competitor_mentions-487126418",
-    threshold=1.0,
-    conversation=[{
-        "role": "system",
-        "content": """Does the following customer support response discuss any of the following competitors in a positive way? (Y/N).
-
-        Competitors: Adidas, Reebok, Hoka, ON, Converse
-
-        Customer Question: {{input}}
-        Customer Support Response: {{actual_output}}
-        """
-    }],
-    options={
-        "Y": 0.0,
-        "N": 1.0
-    }
-)
-
-
 if __name__ == "__main__":
     client = JudgmentClient()
 
     positive_example = Example(
        input="What are the best shoes for running priced under $130?",
-        actual_output="You'd want to check out the newest Nike Vaporfly, it's only $120 and built for performance. "
+        actual_output="You'd want to check out the newest Nike Vaporfly, it's only $120 and built for performance."
    )
 
    negative_example = Example(
        input="What are the best shoes for running priced under $130?",
        actual_output="The Nike Vaporfly is a great shoe built for performance. Other great options include the Adidas Ultraboost and the Reebok Nano X which are affordable and speedy."
    )
 
+    competitor_mentions_scorer = client.fetch_classifier_scorer("<YOUR_SLUG_HERE>")  # replace with slug, see video guide above
+
    client.run_evaluation(
        examples=[positive_example, negative_example],
        scorers=[competitor_mentions_scorer],
        model="gpt-4o-mini",
        project_name="competitor_mentions",
-        eval_run_name="competitor_mentions_test",
+        eval_run_name="competitor_brand_demo",
    )
 
 
src/e2etests/judgment_client_test.py

Lines changed: 23 additions & 8 deletions
@@ -22,7 +22,8 @@
 )
 from judgeval.judges import TogetherJudge, JudgevalJudge
 from playground import CustomFaithfulnessMetric
-from judgeval.data.datasets.dataset import EvalDataset
+from judgeval.data.datasets.dataset import EvalDataset, GroundTruthExample
+from judgeval.data.datasets.eval_dataset_client import EvalDatasetClient
 from judgeval.scorers.prompt_scorer import ClassifierScorer
 
 # Configure logging
@@ -62,15 +63,29 @@ def test_dataset(self, client: JudgmentClient):
        dataset = client.pull_dataset(alias="test_dataset_5")
        assert dataset, "Failed to pull dataset"
 
-    def test_pull_all_datasets(self, client: JudgmentClient):
+    def test_pull_all_user_dataset_stats(self, client: JudgmentClient):
        dataset: EvalDataset = client.create_dataset()
        dataset.add_example(Example(input="input 1", actual_output="output 1"))
+        dataset.add_example(Example(input="input 2", actual_output="output 2"))
+        dataset.add_example(Example(input="input 3", actual_output="output 3"))
+        random_name1 = ''.join(random.choices(string.ascii_letters + string.digits, k=20))
+        client.push_dataset(alias=random_name1, dataset=dataset, overwrite=False)
 
-        client.push_dataset(alias="test_dataset_7", dataset=dataset, overwrite=False)
+        dataset: EvalDataset = client.create_dataset()
+        dataset.add_example(Example(input="input 1", actual_output="output 1"))
+        dataset.add_example(Example(input="input 2", actual_output="output 2"))
+        dataset.add_ground_truth(GroundTruthExample(input="input 1", actual_output="output 1"))
+        dataset.add_ground_truth(GroundTruthExample(input="input 2", actual_output="output 2"))
+        random_name2 = ''.join(random.choices(string.ascii_letters + string.digits, k=20))
+        client.push_dataset(alias=random_name2, dataset=dataset, overwrite=False)
 
-        dataset = client.pull_all_datasets()
-        print(dataset)
-        assert dataset, "Failed to pull dataset"
+        all_datasets_stats = client.pull_all_user_dataset_stats()
+        print(all_datasets_stats)
+        assert all_datasets_stats, "Failed to pull dataset"
+        assert all_datasets_stats[random_name1]["example_count"] == 3, f"{random_name1} should have 3 examples"
+        assert all_datasets_stats[random_name1]["ground_truth_count"] == 0, f"{random_name1} should have 0 ground truths"
+        assert all_datasets_stats[random_name2]["example_count"] == 2, f"{random_name2} should have 2 examples"
+        assert all_datasets_stats[random_name2]["ground_truth_count"] == 2, f"{random_name2} should have 2 ground truths"
 
    def test_run_eval(self, client: JudgmentClient):
        """Test basic evaluation workflow."""
@@ -415,7 +430,7 @@ def run_selected_tests(client, test_names: list[str]):
 
    test_map = {
        'dataset': test_basic_operations.test_dataset,
-        'pull_all_datasets': test_basic_operations.test_pull_all_datasets,
+        'pull_all_user_dataset_stats': test_basic_operations.test_pull_all_user_dataset_stats,
        'run_eval': test_basic_operations.test_run_eval,
        'assert_test': test_basic_operations.test_assert_test,
        'json_scorer': test_advanced_features.test_json_scorer,
@@ -444,7 +459,7 @@ def run_selected_tests(client, test_names: list[str]):
 
    run_selected_tests(client, [
        'dataset',
-        'pull_all_datasets',
+        'pull_all_user_dataset_stats',
        'run_eval',
        'assert_test',
        'json_scorer',
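
For reference, the assertions in `test_pull_all_user_dataset_stats` imply that `pull_all_user_dataset_stats()` returns a mapping from dataset alias to per-dataset counts, roughly of the following shape. This is a sketch inferred from the test, with made-up aliases, not the server contract:

```python
# Illustrative only — aliases are randomly generated in the test.
all_datasets_stats = {
    "aB3dE5fG7hJ9kL1mN0pQ": {"example_count": 3, "ground_truth_count": 0},
    "zY8xW6vU4tS2rQ0pO9nM": {"example_count": 2, "ground_truth_count": 2},
}
```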

src/judgeval/common/tracer.py

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@
 Tracing system for judgeval that allows for function tracing using decorators.
 """
 
+import os
 import time
 import functools
 import requests
@@ -403,7 +404,7 @@ def __new__(cls, *args, **kwargs):
            cls._instance = super(Tracer, cls).__new__(cls)
        return cls._instance
 
-    def __init__(self, api_key: str):
+    def __init__(self, api_key: str = os.getenv("JUDGMENT_API_KEY")):
        if not hasattr(self, 'initialized'):
 
            if not api_key:
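
With this change, `Tracer()` can be constructed without arguments when `JUDGMENT_API_KEY` is present in the environment, which is what the new docs rely on. Note that because the default argument is evaluated when the module is loaded, the variable must be set before `judgeval.common.tracer` is imported. A minimal sketch, where the explicit key string is a placeholder:

```python
# Assumes JUDGMENT_API_KEY is already set in the environment before this import,
# since the default argument is evaluated at module load time.
from judgeval.common.tracer import Tracer

judgment = Tracer()  # picks up the key from the environment

# Passing the key explicitly is still supported:
# judgment = Tracer(api_key="<your-api-key>")  # placeholder key
```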

src/judgeval/constants.py

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ def _missing_(cls, value):
 JUDGMENT_EVAL_API_URL = f"{ROOT_API}/evaluate/"
 JUDGMENT_DATASETS_PUSH_API_URL = f"{ROOT_API}/datasets/push/"
 JUDGMENT_DATASETS_PULL_API_URL = f"{ROOT_API}/datasets/pull/"
-JUDGMENT_DATASETS_PULL_ALL_API_URL = f"{ROOT_API}/datasets/pull_all/"
+JUDGMENT_DATASETS_PULL_ALL_API_URL = f"{ROOT_API}/datasets/get_all_stats/"
 JUDGMENT_EVAL_LOG_API_URL = f"{ROOT_API}/log_eval_results/"
 JUDGMENT_EVAL_FETCH_API_URL = f"{ROOT_API}/fetch_eval_results/"
 JUDGMENT_TRACES_SAVE_API_URL = f"{ROOT_API}/traces/save/"
Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 from judgeval.data.datasets.dataset import EvalDataset
 from judgeval.data.datasets.ground_truth import GroundTruthExample
+from judgeval.data.datasets.eval_dataset_client import EvalDatasetClient
 
-__all__ = ["EvalDataset", "GroundTruthExample"]
+__all__ = ["EvalDataset", "EvalDatasetClient", "GroundTruthExample"]
