
Add developer docs #35





Merged

merged 35 commits on Jan 15, 2025
Changes from all commits
Commits
35 commits
742d362
Add initial commit for dev docs
SecroLoL Jan 7, 2025
7caf850
Add docs for scorers and basic eval runs
SecroLoL Jan 7, 2025
370a675
Add docs section for datasets
SecroLoL Jan 7, 2025
ad95dbe
Add CustomScorer docs
SecroLoL Jan 7, 2025
e8ef2f3
Add skeletons for other docs pages
SecroLoL Jan 7, 2025
ca8d110
Move datasets.mdx to git status
SecroLoL Jan 8, 2025
94a3e7f
Move datasets docs page to evaluation/
SecroLoL Jan 8, 2025
c72d599
Add introduction docs page for evals
SecroLoL Jan 8, 2025
d952534
Add docs for Example
SecroLoL Jan 8, 2025
35696de
Skeleton for datasets docs
SecroLoL Jan 8, 2025
22c37dc
Finish datasets docs page
SecroLoL Jan 8, 2025
a193c7c
Add scorer intro docs page
SecroLoL Jan 8, 2025
03f6152
Add AnswerRelevancy docs
SecroLoL Jan 8, 2025
c4b5a57
Fix some typos in AnswerRelevancy docs
SecroLoL Jan 8, 2025
d3c114e
Add docs for contextual precision
SecroLoL Jan 8, 2025
3fd1d36
add docs page for ContextualRecall
SecroLoL Jan 8, 2025
f0c80f5
Add docs for ContextualRelevancy
SecroLoL Jan 8, 2025
3f28ba9
Add Faithfulness docs page
SecroLoL Jan 8, 2025
f011177
Add docs for Hallucination scorer
SecroLoL Jan 8, 2025
5b07ca4
add docs for Summarization scorer
SecroLoL Jan 9, 2025
a32bd87
Quick fix: change scorer type in docs example for SummarizationScorer
SecroLoL Jan 9, 2025
8a9479c
add tool correctness scorer docs
SecroLoL Jan 9, 2025
46245c4
JSON correctness docs
SecroLoL Jan 9, 2025
c94a496
Resolve key auth on docs, add todo for platform docs
SecroLoL Jan 9, 2025
acdabbf
Add platform docs for getting-started page
SecroLoL Jan 10, 2025
eb92e19
Add docs page for tracing
SecroLoL Jan 10, 2025
9688c76
Wrap up the 'getting started' docs
SecroLoL Jan 10, 2025
5a4a6a0
Add evals diagram to eval intro doc
SecroLoL Jan 10, 2025
4626eed
Update scorer docs pages with the correct sample implementations
SecroLoL Jan 10, 2025
b8207a7
Add custom scorer docs
SecroLoL Jan 10, 2025
a3377d6
Rename docs -> judgeval_docs
SecroLoL Jan 14, 2025
86b1272
Add docs repo
SecroLoL Jan 14, 2025
21e1321
Add new mintlify docs
SecroLoL Jan 14, 2025
81f44b0
Update docs
SecroLoL Jan 14, 2025
e3a4aea
Update docs
SecroLoL Jan 15, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -17,6 +17,8 @@ wheels/
.installed.cfg
*.egg-info/

# APIs
google-cloud-sdk/
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
1 change: 1 addition & 0 deletions docs
Submodule docs added at 00ab9e
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
127 changes: 127 additions & 0 deletions judgeval_docs/dev_docs/evaluation/data_datasets.mdx
@@ -0,0 +1,127 @@
# Datasets

## Quick Summary
In most scenarios, you will have multiple `Example`s that you want to evaluate together.
In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s and/or `GroundTruthExample`s that lets you run evaluations at scale.

`Tip`:

A `GroundTruthExample` is a specific type of `Example` that does not require the `actual_output` field. This is useful for creating datasets that can be dynamically updated at evaluation time by running your workflow on the `GroundTruthExample`s to create `Example`s, as sketched below.
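
For instance, at evaluation time you might turn each `GroundTruthExample` into a full `Example` by running your workflow on its input. A minimal sketch, where `my_llm_app` is a hypothetical stand-in for your own system and we assume `GroundTruthExample` exposes its `input` field as an attribute:

```
from judgeval import Example, GroundTruthExample

import my_llm_app  # hypothetical stand-in for your own LLM workflow

ground_truths = [GroundTruthExample(input="..."), GroundTruthExample(input="...")]

# Produce full Examples by generating an actual_output for each ground truth
examples = [
    Example(input=gt.input, actual_output=my_llm_app.run(gt.input))
    for gt in ground_truths
]
```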

## Creating a Dataset

Creating an `EvalDataset` is as simple as supplying a list of `Example`s and/or `GroundTruthExample`s.

```
from judgeval import (
    EvalDataset,
    Example,
    GroundTruthExample
)

examples = [Example(input="...", actual_output="..."), Example(input="...", actual_output="..."), ...]
ground_truth_examples = [GroundTruthExample(input="..."), GroundTruthExample(input="..."), ...]

dataset = EvalDataset(examples=examples, ground_truth_examples=ground_truth_examples)
```

You can also add `Example`s and `GroundTruthExample`s to an existing `EvalDataset` using the `add_example` and `add_ground_truth` methods.

```
...

dataset.add_example(Example(...))
dataset.add_ground_truth(GroundTruthExample(...))
```

## Saving/Loading Datasets

`judgeval` supports saving and loading datasets in the following formats:
- JSON
- CSV

### From Judgment
You can easily save/load an `EvalDataset` to/from Judgment's cloud.

```
# Saving
...
from judgeval import JudgmentClient

client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
```

```
# Loading
from judgeval import JudgmentClient

client = JudgmentClient()
dataset = client.pull_dataset(alias="my_dataset")
```

### From JSON

You can save/load an `EvalDataset` with a JSON file. Your JSON file should have the following structure:
```
{
"examples": [{"input": "...", "actual_output": "..."}, ...],
"ground_truths": [{"input": "..."}, ...]
}
```

Here's an example of how to use `judgeval` to save/load from JSON.

```
from judgeval import EvalDataset

# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("json", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_json("/path/to/your/json/file.json")

```

### From CSV

You can save/load an `EvalDataset` with a `.csv` file. Your CSV should contain rows that can be mapped to `Example`s via column names.
TODO: this section needs to be updated because the CSV format is not yet finalized.


Here's an example of how to use `judgeval` to save/load from CSV.

```
from judgeval import EvalDataset

# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("csv", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_csv("/path/to/your/csv/file.csv")
```

## Evaluate On Your Dataset

You can use the `JudgmentClient` to evaluate the `Example`s and `GroundTruthExample`s in your dataset using scorers.

```
...

dataset = client.pull_dataset(alias="my_dataset")
res = client.evaluate_dataset(
    dataset=dataset,
    scorers=[JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)],
    model="gpt-4o",
)
```

## Conclusion

Congratulations! You've now learned how to create, save, and evaluate an `EvalDataset` in `judgeval`.

You can also view and manage your datasets via Judgment's platform. Check out TODO: add link here
166 changes: 166 additions & 0 deletions judgeval_docs/dev_docs/evaluation/data_examples.mdx
@@ -0,0 +1,166 @@
# Examples

## Quick Summary
An `Example` is the basic unit of data in `judgeval`; it is what evaluation scorers operate on when assessing your LLM system. An `Example` is composed of seven fields:
- `input`
- `actual_output`
- [Optional] `expected_output`
- [Optional] `retrieval_context`
- [Optional] `context`
- [Optional] `tools_called`
- [Optional] `expected_tools`

Here's a sample of how to create an `Example`:

```
from judgeval.data import Example

example = Example(
input="Who founded Microsoft?",
actual_output="Bill Gates and Paul Allen.",
expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
context=["Bill Gates and Paul Allen are the founders of Microsoft."],
tools_called=["Google Search"],
expected_tools=["Google Search", "Perplexity"],
)
```

`Note`:
The `input` and `actual_output` fields are required for all examples. However, you don't always need to use them in your evaluations. For example, if you're evaluating whether a chatbot's response is friendly, you don't need to use `input`.

The other fields are optional and may be useful depending on the kind of evaluation you're running. For example, if you want to check for hallucinations in a RAG system, you'd be interested in the `retrieval_context` field for the Faithfulness scorer.
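
For instance, a minimal `Example` needs only the two required fields:

```
from judgeval.data import Example

example = Example(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
)
```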

## Example Fields

### Input
The `input` field represents a sample interaction between a user and your LLM system. It should contain the direct input to your prompt template(s) and **should not** contain the prompt template itself.

`Tip`:

You should treat prompt templates as hyperparameters that you optimize based on the scorers you're running. Evaluation is inherently tied to optimization, so you should try to isolate your system's independent variables (e.g. prompt template, model choice, RAG retriever) from your evaluation.
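
A minimal sketch of this separation, reusing the `medical_chatbot` sample app from the snippets below:

```
import medical_chatbot  # sample app used throughout these docs

from judgeval.data import Example

question = "Is sparkling water healthy?"

# Don't do this -- the prompt template belongs to your system, not the Example:
# input="You are a helpful medical assistant. Answer this question: Is sparkling water healthy?"

# Do this -- pass only the raw user input; the template is applied inside medical_chatbot
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
)
```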

### Actual Output

The `actual_output` field represents what the LLM system outputs based on the `input`. This is typically produced by running your LLM system at evaluation time, or taken from previously saved answers.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question)
)
```

### Expected Output

The `expected_output` field is `Optional[str]` and represents the ideal output of your LLM system. One of the nice parts of `judgeval`'s scorers is that they use LLMs which have flexible evaluation criteria. You don't need to worry about your `expected_output` perfectly matching your `actual_output`.

To learn more about how `judgeval`'s scorers work, please see the [scorers docs](./scorers/introduction.mdx).

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy."
)
```

### Context

The `context` field is `Optional[List[str]]` and represents information that is supplied to the LLM system as ground truth. For instance, context could be a list of facts that the LLM system is aware of. However, `context` should not be confused with `retrieval_context`.

`Tip`:

In RAG systems, contextual information is retrieved from a vector database and is represented in `judgeval` by `retrieval_context`, not `context`. **If you're building a RAG system, you'll want to use `retrieval_context`.**

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."]
)
```

### Retrieval Context

The `retrieval_context` field is `Optional[List[str]]` and represents the context that is retrieved from a vector database. This is often the context that is used to generate the `actual_output` in a RAG system.

Some common cases for using `retrieval_context` are:
- Checking for hallucinations in a RAG system
- Evaluating the quality of a retriever model (comparing retrieved info to `context`)

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."]
)
```
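
For example, to check for hallucinations you could score this example with the Faithfulness scorer, which uses `retrieval_context`. A minimal sketch using the `JudgmentClient` API shown on the introduction page:

```
...

from judgeval.judgment_client import JudgmentClient
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

client = JudgmentClient()
faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)
print(results)
```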

`Tip`:

`context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable.

### Tools Called

The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."],
    tools_called=["Perplexity", "GoogleSearch"]
)
```

### Expected Tools

The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."],
    tools_called=["Perplexity", "GoogleSearch"],
    expected_tools=["Perplexity", "DBQuery"]
)
```
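
As a rough illustration of what a tool-use check conceptually compares — plain Python for illustration, not a `judgeval` API:

```
tools_called = ["Perplexity", "GoogleSearch"]
expected_tools = ["Perplexity", "DBQuery"]

# Tools the agent should have used but didn't, and tools it used unexpectedly
missing_tools = set(expected_tools) - set(tools_called)      # {"DBQuery"}
unexpected_tools = set(tools_called) - set(expected_tools)   # {"GoogleSearch"}

print(f"Missing: {missing_tools}, Unexpected: {unexpected_tools}")
```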

## Conclusion

Congratulations! You've learned how to create an `Example` and can begin using them to execute evaluations or create datasets.

TODO: add links here ^^
91 changes: 91 additions & 0 deletions judgeval_docs/dev_docs/evaluation/introduction.mdx
@@ -0,0 +1,91 @@
# Introduction

## Quick Summary

Evaluation is the process of scoring an LLM system's outputs with metrics; an evaluation is composed of:
- An evaluation dataset
- Metrics we are interested in tracking

The ideal fit of evaluation into an application workflow looks like this:

![Evaluation workflow diagram](judgeval/docs/dev_docs/imgs/evaluation_diagram.png)

## Metrics

`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.

```
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

scorer = JudgmentScorer(score_type=APIScorer.FAITHFULNESS)
```
You can use scorers to evaluate your LLM system's outputs by using `Example`s.

`Note`:

We're always working on adding new `Scorer`s, so if you have a metric you'd like to add, please let us know!

## Examples

In `judgeval`, an `Example` is a unit of data that allows you to use evaluation scorers on your LLM system.

```
from judgeval.data import Example

example = Example(
input="Who founded Microsoft?",
actual_output="Bill Gates and Paul Allen.",
retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)
```

In this example, `input` represents a user query to a RAG-based LLM application, `actual_output` is your chatbot's response, and `retrieval_context` is the content retrieved from your vector database. Creating an `Example` allows you to evaluate it using `judgeval`'s default scorers:

```
from judgeval.judgment_client import JudgmentClient
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

client = JudgmentClient()

faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)
print(results)
```

## Datasets

An evaluation dataset (`EvalDataset`) is a collection of `Example`s. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.

```
from judgeval.data import Example, EvalDataset

example1 = Example(input="...", actual_output="...", retrieval_context=["..."])
example2 = Example(input="...", actual_output="...", retrieval_context=["..."])

dataset = EvalDataset(examples=[example1, example2])
```

`EvalDataset`s can be saved to disk and loaded back in, or uploaded to the Judgment platform.
For more information on how to use `EvalDataset`s, please see the [EvalDataset docs](./data_datasets.mdx).
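
For example, here's a quick sketch of both options, using the `save_as` and `push_dataset` methods covered there:

```
from judgeval import JudgmentClient

# Save to disk as JSON
dataset.save_as("json", "/path/to/save/dir", "save_name")

# Or upload to the Judgment platform
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
```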

Then, you can run evaluations on the dataset:

```
...

client = JudgmentClient()
scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
results = client.evaluate_dataset(
    dataset=dataset,
    scorers=[scorer],
    model="QWEN",
)
```

Congratulations! You've learned the basics of building and running evaluations with `judgeval`.
For a deep dive into all the metrics you can run using `judgeval` scorers, click here. TODO add link