
Add developer docs #35





Merged

merged 35 commits on Jan 15, 2025
Changes from all commits
Commits
35 commits
742d362
Add initial commit for dev docs
SecroLoL Jan 7, 2025
7caf850
Add docs for scorers and basic eval runs
SecroLoL Jan 7, 2025
370a675
Add docs section for datasets
SecroLoL Jan 7, 2025
ad95dbe
Add CustomScorer docs
SecroLoL Jan 7, 2025
e8ef2f3
Add skeletons for other docs pages
SecroLoL Jan 7, 2025
ca8d110
Move datasets.mdx to git status
SecroLoL Jan 8, 2025
94a3e7f
Move datasets docs page to evaluation/
SecroLoL Jan 8, 2025
c72d599
Add introduction docs page for evals
SecroLoL Jan 8, 2025
d952534
Add docs for Example
SecroLoL Jan 8, 2025
35696de
Skeleton for datasets docs
SecroLoL Jan 8, 2025
22c37dc
Finish datasets docs page
SecroLoL Jan 8, 2025
a193c7c
Add scorer intro docs page
SecroLoL Jan 8, 2025
03f6152
Add AnswerRelevancy docs
SecroLoL Jan 8, 2025
c4b5a57
Fix some typos in AnswerRelevancy docs
SecroLoL Jan 8, 2025
d3c114e
Add docs for contextual precision
SecroLoL Jan 8, 2025
3fd1d36
add docs page for ContextualRecall
SecroLoL Jan 8, 2025
f0c80f5
Add docs for ContextualRelevancy
SecroLoL Jan 8, 2025
3f28ba9
Add Faithfulness docs page
SecroLoL Jan 8, 2025
f011177
Add docs for Hallucination scorer
SecroLoL Jan 8, 2025
5b07ca4
add docs for Summarization scorer
SecroLoL Jan 9, 2025
a32bd87
Quick fix: change scorer type in docs example for SummarizationScorer
SecroLoL Jan 9, 2025
8a9479c
add tool correctness scorer docs
SecroLoL Jan 9, 2025
46245c4
JSON correctness docs
SecroLoL Jan 9, 2025
c94a496
Resolve key auth on docs, add todo for platform docs
SecroLoL Jan 9, 2025
acdabbf
Add platform docs for getting-started page
SecroLoL Jan 10, 2025
eb92e19
Add docs page for tracing
SecroLoL Jan 10, 2025
9688c76
Wrap up the 'getting started' docs
SecroLoL Jan 10, 2025
5a4a6a0
Add evals diagram to eval intro doc
SecroLoL Jan 10, 2025
4626eed
Update scorer docs pages with the correct sample implementations
SecroLoL Jan 10, 2025
b8207a7
Add custom scorer docs
SecroLoL Jan 10, 2025
a3377d6
Rename docs -> judgeval_docs
SecroLoL Jan 14, 2025
86b1272
Add docs repo
SecroLoL Jan 14, 2025
21e1321
Add new mintlify docs
SecroLoL Jan 14, 2025
81f44b0
Update docs
SecroLoL Jan 14, 2025
e3a4aea
Update docs
SecroLoL Jan 15, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -17,6 +17,8 @@ wheels/
.installed.cfg
*.egg-info/

# APIs
google-cloud-sdk/
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
1 change: 1 addition & 0 deletions docs
Submodule docs added at 00ab9e
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
127 changes: 127 additions & 0 deletions judgeval_docs/dev_docs/evaluation/data_datasets.mdx
@@ -0,0 +1,127 @@
# Datasets

## Quick Summary
In most scenarios, you will have multiple `Example`s that you want to evaluate together.
In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s and/or `GroundTruthExample`s that lets you run evaluations at scale.

`Tip`:

A `GroundTruthExample` is a specific type of `Example` that does not require the `actual_output` field. This is useful for creating datasets that can be dynamically updated at evaluation time by running your workflow on the `GroundTruthExample`s to create `Example`s, as sketched below.
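
For instance, at evaluation time you might turn each `GroundTruthExample` into a full `Example` by running your workflow on its input. A minimal sketch, where `my_llm_app` is a hypothetical stand-in for your own system and we assume `GroundTruthExample` exposes its `input` field as an attribute:

```
from judgeval import Example, GroundTruthExample

import my_llm_app  # hypothetical stand-in for your own LLM workflow

ground_truths = [GroundTruthExample(input="..."), GroundTruthExample(input="...")]

# Produce full Examples by generating an actual_output for each ground truth
examples = [
    Example(input=gt.input, actual_output=my_llm_app.run(gt.input))
    for gt in ground_truths
]
```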

## Creating a Dataset

Creating an `EvalDataset` is as simple as supplying a list of `Example`s and/or `GroundTruthExample`s.

```
from judgeval import (
    EvalDataset,
    Example,
    GroundTruthExample
)

examples = [Example(input="...", actual_output="..."), Example(input="...", actual_output="..."), ...]
ground_truth_examples = [GroundTruthExample(input="..."), GroundTruthExample(input="..."), ...]

dataset = EvalDataset(examples=examples, ground_truth_examples=ground_truth_examples)
```

You can also add `Example`s and `GroundTruthExample`s to an existing `EvalDataset` using the `add_example` and `add_ground_truth` methods.

```
...

dataset.add_example(Example(...))
dataset.add_ground_truth(GroundTruthExample(...))
```

## Saving/Loading Datasets

`judgeval` supports saving and loading datasets in the following formats:
- JSON
- CSV

### From Judgment
You can easily save/load an `EvalDataset` to/from Judgment's cloud.

```
# Saving
...
from judgeval import JudgmentClient

client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
```

```
# Loading
from judgeval import JudgmentClient

client = JudgmentClient()
dataset = client.pull_dataset(alias="my_dataset")
```

### From JSON

You can save/load an `EvalDataset` with a JSON file. Your JSON file should have the following structure:
```
{
"examples": [{"input": "...", "actual_output": "..."}, ...],
"ground_truths": [{"input": "..."}, ...]
}
```

Here's an example of how to use `judgeval` to save/load from JSON.

```
from judgeval import EvalDataset

# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("json", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_json("/path/to/your/json/file.json")

```

### From CSV

You can save/load an `EvalDataset` with a `.csv` file. Your CSV should contain rows that can be mapped to `Example`s via column names.
TODO: this section needs to be updated because the CSV format is not yet finalized.


Here's an example of how to use `judgeval` to save/load from CSV.

```
from judgeval import EvalDataset

# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("csv", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_csv("/path/to/your/csv/file.csv")
```

## Evaluate On Your Dataset

You can use the `JudgmentClient` to evaluate the `Example`s and `GroundTruthExample`s in your dataset using scorers.

```
...

dataset = client.pull_dataset(alias="my_dataset")
res = client.evaluate_dataset(
    dataset=dataset,
    scorers=[JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)],
    model="gpt-4o",
)
```

## Conclusion

Congratulations! You've now learned how to create, save, and evaluate an `EvalDataset` in `judgeval`.

You can also view and manage your datasets via Judgment's platform. Check out TODO: add link here
166 changes: 166 additions & 0 deletions judgeval_docs/dev_docs/evaluation/data_examples.mdx
@@ -0,0 +1,166 @@
# Examples

## Quick Summary
An `Example` is the basic unit of data in `judgeval`; it is what evaluation scorers operate on when assessing your LLM system. An `Example` is composed of seven fields:
- `input`
- `actual_output`
- [Optional] `expected_output`
- [Optional] `retrieval_context`
- [Optional] `context`
- [Optional] `tools_called`
- [Optional] `expected_tools`

Here's a sample of how to create an `Example`:

```
from judgeval.data import Example

example = Example(
input="Who founded Microsoft?",
actual_output="Bill Gates and Paul Allen.",
expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
context=["Bill Gates and Paul Allen are the founders of Microsoft."],
tools_called=["Google Search"],
expected_tools=["Google Search", "Perplexity"],
)
```

`Note`:
The `input` and `actual_output` fields are required for all examples. However, you don't always need to use them in your evaluations. For example, if you're evaluating whether a chatbot's response is friendly, you don't need to use `input`.

The other fields are optional and may be useful depending on the kind of evaluation you're running. For example, if you want to check for hallucinations in a RAG system, you'd be interested in the `retrieval_context` field for the Faithfulness scorer.
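
For instance, a minimal `Example` needs only the two required fields:

```
from judgeval.data import Example

example = Example(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
)
```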

## Example Fields

### Input
The `input` field represents a sample interaction between a user and your LLM system. It should contain the direct input to your prompt template(s) and **should not** contain the prompt template itself.

`Tip`:

You should treat prompt templates as hyperparameters that you optimize based on the scorers you're running. Evaluation is inherently tied to optimization, so you should try to isolate your system's independent variables (e.g. prompt template, model choice, RAG retriever) from your evaluation.
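
A minimal sketch of this separation, reusing the `medical_chatbot` sample app from the snippets below:

```
import medical_chatbot  # sample app used throughout these docs

from judgeval.data import Example

question = "Is sparkling water healthy?"

# Don't do this -- the prompt template belongs to your system, not the Example:
# input="You are a helpful medical assistant. Answer this question: Is sparkling water healthy?"

# Do this -- pass only the raw user input; the template is applied inside medical_chatbot
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
)
```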

### Actual Output

The `actual_output` field represents what the LLM system outputs based on the `input`. This is typically produced by running your LLM system at evaluation time, or taken from previously saved answers.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question)
)
```

### Expected Output

The `expected_output` field is `Optional[str]` and represents the ideal output of your LLM system. One of the nice parts of `judgeval`'s scorers is that they use LLMs which have flexible evaluation criteria. You don't need to worry about your `expected_output` perfectly matching your `actual_output`.

To learn more about how `judgeval`'s scorers work, please see the [scorers docs](./scorers/introduction.mdx).

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy."
)
```

### Context

The `context` field is `Optional[List[str]]` and represents information that is supplied to the LLM system as ground truth. For instance, context could be a list of facts that the LLM system is aware of. However, `context` should not be confused with `retrieval_context`.

`Tip`:

In RAG systems, contextual information is retrieved from a vector database and is represented in `judgeval` by `retrieval_context`, not `context`. **If you're building a RAG system, you'll want to use `retrieval_context`.**

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."]
)
```

### Retrieval Context

The `retrieval_context` field is `Optional[List[str]]` and represents the context that is retrieved from a vector database. This is often the context that is used to generate the `actual_output` in a RAG system.

Some common cases for using `retrieval_context` are:
- Checking for hallucinations in a RAG system
- Evaluating the quality of a retriever model (comparing retrieved info to `context`)

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."]
)
```
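
For example, to check for hallucinations you could score this example with the Faithfulness scorer, which uses `retrieval_context`. A minimal sketch using the `JudgmentClient` API shown on the introduction page:

```
...

from judgeval.judgment_client import JudgmentClient
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

client = JudgmentClient()
faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)
print(results)
```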

`Tip`:

`context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable.

### Tools Called

The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."],
    tools_called=["Perplexity", "GoogleSearch"]
)
```

### Expected Tools

The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.

```
# Sample app implementation
import medical_chatbot

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."],
    tools_called=["Perplexity", "GoogleSearch"],
    expected_tools=["Perplexity", "DBQuery"]
)
```
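
As a rough illustration of what a tool-use check conceptually compares — plain Python for illustration, not a `judgeval` API:

```
tools_called = ["Perplexity", "GoogleSearch"]
expected_tools = ["Perplexity", "DBQuery"]

# Tools the agent should have used but didn't, and tools it used unexpectedly
missing_tools = set(expected_tools) - set(tools_called)      # {"DBQuery"}
unexpected_tools = set(tools_called) - set(expected_tools)   # {"GoogleSearch"}

print(f"Missing: {missing_tools}, Unexpected: {unexpected_tools}")
```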

## Conclusion

Congratulations! You've learned how to create an `Example` and can begin using them to execute evaluations or create datasets.

TODO: add links here ^^
91 changes: 91 additions & 0 deletions judgeval_docs/dev_docs/evaluation/introduction.mdx
@@ -0,0 +1,91 @@
# Introduction

## Quick Summary

Evaluation is the process of scoring an LLM system's outputs with metrics; an evaluation is composed of:
- An evaluation dataset
- Metrics we are interested in tracking

The ideal fit of evaluation into an application workflow looks like this:

![Evaluation workflow diagram](judgeval/docs/dev_docs/imgs/evaluation_diagram.png)

## Metrics

`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.

```
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

scorer = JudgmentScorer(score_type=APIScorer.FAITHFULNESS)
```
You can use scorers to evaluate your LLM system's outputs by using `Example`s.

`Note`:

We're always working on adding new `Scorer`s, so if you have a metric you'd like to add, please let us know!

## Examples

In `judgeval`, an `Example` is a unit of data that allows you to use evaluation scorers on your LLM system.

```
from judgeval.data import Example

example = Example(
input="Who founded Microsoft?",
actual_output="Bill Gates and Paul Allen.",
retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)
```

In this example, `input` represents a user query to a RAG-based LLM application, `actual_output` is your chatbot's response, and `retrieval_context` is the content retrieved from your vector database. Creating an `Example` allows you to evaluate it using `judgeval`'s default scorers:

```
from judgeval.judgment_client import JudgmentClient
from judgeval.scorers import JudgmentScorer
from judgeval.constants import APIScorer

client = JudgmentClient()

faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)
print(results)
```

## Datasets

An evaluation dataset (`EvalDataset`) is a collection of `Example`s. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.

```
from judgeval.data import Example, EvalDataset

example1 = Example(input="...", actual_output="...", retrieval_context=["..."])
example2 = Example(input="...", actual_output="...", retrieval_context=["..."])

dataset = EvalDataset(examples=[example1, example2])
```

`EvalDataset`s can be saved to disk and loaded back in, or uploaded to the Judgment platform.
For more information on how to use `EvalDataset`s, please see the [EvalDataset docs](./data_datasets.mdx).
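
For example, here's a quick sketch of both options, using the `save_as` and `push_dataset` methods covered there:

```
from judgeval import JudgmentClient

# Save to disk as JSON
dataset.save_as("json", "/path/to/save/dir", "save_name")

# Or upload to the Judgment platform
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
```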

Then, you can run evaluations on the dataset:

```
...

client = JudgmentClient()
scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
results = client.evaluate_dataset(
    dataset=dataset,
    scorers=[scorer],
    model="QWEN",
)
```

Congratulations! You've learned the basics of building and running evaluations with `judgeval`.
For a deep dive into all the metrics you can run using `judgeval` scorers, click here. TODO add link