diff --git a/.gitignore b/.gitignore index ba2f2e83..8436975e 100644 --- a/.gitignore +++ b/.gitignore @@ -17,6 +17,8 @@ wheels/ .installed.cfg *.egg-info/ +# APIs +google-cloud-sdk/ # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. diff --git a/docs b/docs new file mode 160000 index 00000000..00ab9e46 --- /dev/null +++ b/docs @@ -0,0 +1 @@ +Subproject commit 00ab9e46926eff3a327b4646a0bd71a1f2ed0650 diff --git a/docs/create_dataset.ipynb b/judgeval_docs/create_dataset.ipynb similarity index 100% rename from docs/create_dataset.ipynb rename to judgeval_docs/create_dataset.ipynb diff --git a/docs/create_scorer.ipynb b/judgeval_docs/create_scorer.ipynb similarity index 100% rename from docs/create_scorer.ipynb rename to judgeval_docs/create_scorer.ipynb diff --git a/docs/demo.ipynb b/judgeval_docs/demo.ipynb similarity index 100% rename from docs/demo.ipynb rename to judgeval_docs/demo.ipynb diff --git a/docs/demo_files/debug.txt b/judgeval_docs/demo_files/debug.txt similarity index 100% rename from docs/demo_files/debug.txt rename to judgeval_docs/demo_files/debug.txt diff --git a/docs/demo_files/tanaka.txt b/judgeval_docs/demo_files/tanaka.txt similarity index 100% rename from docs/demo_files/tanaka.txt rename to judgeval_docs/demo_files/tanaka.txt diff --git a/docs/demo_files/tanaka_beneficiary.txt b/judgeval_docs/demo_files/tanaka_beneficiary.txt similarity index 100% rename from docs/demo_files/tanaka_beneficiary.txt rename to judgeval_docs/demo_files/tanaka_beneficiary.txt diff --git a/docs/demo_files/tanaka_recommender.txt b/judgeval_docs/demo_files/tanaka_recommender.txt similarity index 100% rename from docs/demo_files/tanaka_recommender.txt rename to judgeval_docs/demo_files/tanaka_recommender.txt diff --git a/judgeval_docs/dev_docs/evaluation/data_datasets.mdx b/judgeval_docs/dev_docs/evaluation/data_datasets.mdx new file mode 100644 index 00000000..79554ed4 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/data_datasets.mdx @@ -0,0 +1,127 @@ +# Datasets + +## Quick Summary +In most scenarios, you will have multiple `Example`s that you want to evaluate together. +In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s and/or `GroundTruthExample`s that you can scale evaluations across. + +`Tip`: + +A `GroundTruthExample` is a specific type of `Example` that do not require the `actual_output` field. This is useful for creating datasets that can be dynamically updated at evaluation time by running your workflow on the `GroundTruthExample`s to create `Example`s. + +## Creating a Dataset + +Creating an `EvalDataset` is as simple as supplying a list of `Example`s and/or `GroundTruthExample`s. + +``` +from judgeval import ( + EvalDataset, + Example, + GroundTruthExample +) + +examples = [Example(input="...", actual_output="..."), Example(input="...", actual_output="..."), ...] +ground_truth_examples = [GroundTruthExample(input="..."), GroundTruthExample(input="..."), ...] + +dataset = EvalDataset(examples=examples, ground_truth_examples=ground_truth_examples) +``` + +You can also add `Example`s and `GroundTruthExample`s to an existing `EvalDataset` using the `add_example` and `add_ground_truth_example` methods. + +``` +... 
+ +dataset.add_example(Example(...)) +dataset.add_ground_truth(GroundTruthExample(...)) +``` + +## Saving/Loading Datasets + +`judgeval` supports saving and loading datasets in the following formats: +- JSON +- CSV + +### From Judgment +You easily can save/load an `EvalDataset` from Judgment's cloud. + +``` +# Saving +... +from judgeval import JudgmentClient + +client = JudgmentClient() +client.push_dataset(alias="my_dataset", dataset=dataset) +``` + +``` +# Loading +from judgeval import JudgmentClient + +client = JudgmentClient() +dataset = client.pull_dataset(alias="my_dataset") +``` + +### From JSON + +You can save/load an `EvalDataset` with a JSON file. Your JSON file should have the following structure: +``` +{ + "examples": [{"input": "...", "actual_output": "..."}, ...], + "ground_truths": [{"input": "..."}, ...] +} +``` + +Here's an example of how use `judgeval` to save/load from JSON. + +``` +from judgeval import EvalDataset + +# saving +dataset = EvalDataset(...) # filled with examples +dataset.save_as("json", "/path/to/save/dir", "save_name") + +# loading +new_dataset = EvalDataset() +new_dataset.add_from_json("/path/to/your/json/file.json") + +``` + +### From CSV + +You can save/load an `EvalDataset` with a `.csv` file. Your CSV should contain rows that can be mapped to `Example`s via column names. +TODO: this section needs to be updated because the CSV format is not yet finalized. + + +Here's an example of how use `judgeval` to save/load from CSV. + +``` +from judgeval import EvalDataset + +# saving +dataset = EvalDataset(...) # filled with examples +dataset.save_as("csv", "/path/to/save/dir", "save_name") + +# loading +new_dataset = EvalDataset() +new_dataset.add_from_csv("/path/to/your/csv/file.csv") +``` + +## Evaluate On Your Dataset + +You can use the `JudgmentClient` to evaluate the `Example`s and `GroundTruthExample`s in your dataset using scorers. + +``` +... + +dataset = client.pull_dataset(alias="my_dataset") +res = client.evaluate_dataset( + dataset=dataset, + scorers=[JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)], + model="gpt-4o", +) +``` + +## Conclusion + +Congratulations! You've now learned how to create, save, and evaluate an `EvalDataset` in `judgeval`. + +You can also view and manage your datasets via Judgment's platform. Check out TODO: add link here diff --git a/judgeval_docs/dev_docs/evaluation/data_examples.mdx b/judgeval_docs/dev_docs/evaluation/data_examples.mdx new file mode 100644 index 00000000..d6b16a04 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/data_examples.mdx @@ -0,0 +1,166 @@ +# Examples + +## Quick Summary +An Example is a basic unit of data in `judgeval` that allows you to use evaluation scorers on your LLM system. 
An `Example` is composed of seven fields: +- `input` +- `actual_output` +- [Optional] `expected_output` +- [Optional] `retrieval_context` +- [Optional] `context` +- [Optional] `tools_called` +- [Optional] `expected_tools` + +Here's a sample of creating an `Example` + +``` +from judgeval.data import Example + +example = Example( + input="Who founded Microsoft?", + actual_output="Bill Gates and Paul Allen.", + expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.", + retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."], + context=["Bill Gates and Paul Allen are the founders of Microsoft."], + tools_called=["Google Search"], + expected_tools=["Google Search", "Perplexity"], +) +``` + +INFO: +The `input` and `actual_output` fields are required for all examples. However, you don't always need to use them in your evaluations. For example, if you're evaluating whether a chatbot's response is friendly, you don't need to use `input`. + +The other fields are optional and may be useful depending on the kind of evaluation you're running. For example, if you want to check for hallucinations in a RAG system, you'd be interested in the `retrieval_context` field for the Faithfulness scorer. + +## Example Fields + +### Input +The `input` field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and SHOULD NOT CONTAIN your prompt template itself. + +`Tip`: + +You should treat prompt templates as hyperparameters that you optimize for based on the scorer you're executing. Evaluation is inherently tied with optimization, so you should try to isolate your system's independent variables (e.g. prompt template, model choice, RAG retriever) from your evaluation. + +### Actual Output + +The `actual_output` field represents what the LLM system outputs based on the `input`. This is often the actual output of your LLM system created either at evaluation time or with saved answers. + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question) +) +``` + +### Expected Output + +The `expected_output` field is `Optional[str]` and represents the ideal output of your LLM system. One of the nice parts of `judgeval`'s scorers is that they use LLMs which have flexible evaluation criteria. You don't need to worry about your `expected_output` perfectly matching your `actual_output`. + +To learn more about how `judgeval`'s scorers work, please see the [scorers docs](./scorers/introduction.mdx). + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question), + expected_output="Sparkling water is neither healthy nor unhealthy." +) +``` + +### Context + +The `context` field is `Optional[List[str]]` and represents information that is supplied to the LLM system as ground truth. For instance, context could be a list of facts that the LLM system is aware of. However, `context` should not be confused with `retrieval_context`. + +`Tip`: + +In RAG systems, contextual information is retrieved from a vector database and is represented in `judgeval` by `retrieval_context`, not `context`. 
**If you're building a RAG system, you'll want to use `retrieval_context`.** + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question), + expected_output="Sparkling water is neither healthy nor unhealthy.", + context=["Sparkling water is a type of water that is carbonated."] +) +``` + +### Retrieval Context + +The `retrieval_context` field is `Optional[List[str]]` and represents the context that is retrieved from a vector database. This is often the context that is used to generate the `actual_output` in a RAG system. + +Some common cases for using `retrieval_context` are: +- Checking for hallucinations in a RAG system +- Evaluating the quality of a retriever model (comparing retrieved info to `context`) + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question), + expected_output="Sparkling water is neither healthy nor unhealthy.", + context=["Sparkling water is a type of water that is carbonated."], + retrieval_context=["Sparkling water is carbonated and has no calories."] +) +``` + +`Tip`: + +`context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable. + +### Tools Called + +The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them. + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question), + expected_output="Sparkling water is neither healthy nor unhealthy.", + context=["Sparkling water is a type of water that is carbonated."], + retrieval_context=["Sparkling water is carbonated and has no calories."], + tools_called=["Perplexity", "GoogleSearch"] +) +``` + +### Expected Tools + +The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them. + +``` +# Sample app implementation +import medical_chatbot + +question = "Is sparkling water healthy?" +example = Example( + input=question, + actual_output=medical_chatbot.chat(question), + expected_output="Sparkling water is neither healthy nor unhealthy.", + context=["Sparkling water is a type of water that is carbonated."], + retrieval_context=["Sparkling water is carbonated and has no calories."], + tools_called=["Perplexity", "GoogleSearch"], + expected_tools=["Perplexity", "DBQuery"] +) +``` + +## Conclusion + +Congratulations! You've learned how to create an `Example` and can begin using them to execute evaluations or create datasets. 
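+
+As a quick recap, here is how an `Example` plugs into an evaluation run — a minimal sketch that reuses the `JudgmentClient` and `JudgmentScorer` pattern shown in the other evaluation docs (field values are illustrative):
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+
+# Build an Example from your system's output (fields as described above)
+example = Example(
+    input="Is sparkling water healthy?",
+    actual_output="Sparkling water is carbonated but has no added sugar.",
+    retrieval_context=["Sparkling water is carbonated and has no calories."],
+)
+
+# Score it with a default scorer; the threshold marks the pass/fail cutoff
+scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```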
+
+TODO: add links here ^^
diff --git a/judgeval_docs/dev_docs/evaluation/introduction.mdx b/judgeval_docs/dev_docs/evaluation/introduction.mdx
new file mode 100644
index 00000000..788203f2
--- /dev/null
+++ b/judgeval_docs/dev_docs/evaluation/introduction.mdx
@@ -0,0 +1,91 @@
+# Introduction
+
+## Quick Summary
+
+Evaluation is the process of scoring an LLM system's outputs with metrics; an evaluation is composed of:
+- An evaluation dataset
+- Metrics we are interested in tracking
+
+The ideal fit of evaluation into an application workflow looks like this:
+
+![Alt text](judgeval/docs/dev_docs/imgs/evaluation_diagram.png "Optional title")
+
+## Metrics
+
+`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.
+
+```
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+scorer = JudgmentScorer(score_type=APIScorer.FAITHFULNESS)
+```
+You can use scorers to evaluate your LLM system's outputs by using `Example`s.
+
+!! We're always working on adding new `Scorer`s, so if you have a metric you'd like to add, please let us know!
+
+## Examples
+
+In `judgeval`, an Example is a unit of data that allows you to use evaluation scorers on your LLM system.
+
+```
+from judgeval.data import Example
+
+example = Example(
+    input="Who founded Microsoft?",
+    actual_output="Bill Gates and Paul Allen.",
+    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
+)
+```
+
+In this example, `input` represents a user query to a RAG-based LLM application, `actual_output` is the output of your chatbot, and `retrieval_context` is the retrieved context. Creating an Example allows you to evaluate using `judgeval`'s default scorers:
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+
+faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[faithfulness_scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+## Datasets
+
+An Evaluation Dataset is a collection of Examples. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.
+
+```
+from judgeval.data import Example, EvalDataset
+
+example1 = Example(input="...", actual_output="...", retrieval_context=["..."])
+example2 = Example(input="...", actual_output="...", retrieval_context=["..."])
+
+dataset = EvalDataset(examples=[example1, example2])
+```
+
+`EvalDataset`s can be saved to disk and loaded back in, or uploaded to the Judgment platform.
+For more information on how to use `EvalDataset`s, please see the [EvalDataset docs](./data_datasets.mdx).
+
+Then, you can run evaluations on the dataset:
+
+```
+...
+
+client = JudgmentClient()
+scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+results = client.evaluate_dataset(
+    dataset=dataset,
+    scorers=[scorer],
+    model="QWEN",
+)
+```
+
+Congratulations! You've learned the basics of building and running evaluations with `judgeval`.
+For a deep dive into all the metrics you can run using `judgeval` scorers, click here.
TODO add link diff --git a/judgeval_docs/dev_docs/evaluation/scorers/answer_relevancy.mdx b/judgeval_docs/dev_docs/evaluation/scorers/answer_relevancy.mdx new file mode 100644 index 00000000..33b82aa7 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/answer_relevancy.mdx @@ -0,0 +1,59 @@ +# Answer Relevancy + +The answer relevancy scorer is a default LLM judge scorer that measures how relevant the LLM system's `actual_output` is to the `input`. +In practice, this scorer helps determine whether your RAG pipeline's generator produces relevant answers to the user's query. + +`Tip`: + +There are many factors to consider when evaluating the quality of your RAG pipeline. `judgeval` offers a suite of default scorers to construct a comprehensive +evaluation of each RAG component. Check out our guide on RAG system evaluation for a deeper dive! TODO add link here. + + +## Required Fields + +To run the answer relevancy scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` + +## Scorer Breakdown + +`AnswerRelevancy` scores are calculated by extracting statements made in the `actual_output` and classifying how many are relevant to the `input`. + +The score is calculated as: + +$$ +\text{relevancy\_score} = \frac{\text{relevant\_statements}}{\text{total\_statements}} +$$ + +TODO add latex rendering + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.ANSWER_RELEVANCY) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` + +`Note:` + +The `AnswerRelevancy` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. +This allows you to double-check the accuracy of the evaluation and understand how the score was calculated. + diff --git a/judgeval_docs/dev_docs/evaluation/scorers/classifier_scorer.mdx b/judgeval_docs/dev_docs/evaluation/scorers/classifier_scorer.mdx new file mode 100644 index 00000000..e69de29b diff --git a/judgeval_docs/dev_docs/evaluation/scorers/contextual_precision.mdx b/judgeval_docs/dev_docs/evaluation/scorers/contextual_precision.mdx new file mode 100644 index 00000000..13aa502e --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/contextual_precision.mdx @@ -0,0 +1,68 @@ +# Contextual Precision + +The contextual precision scorer is a default LLM judge scorer that measures whether contexts in your `retrieval_context` are properly ranked by importance relative to the `input`. +In practice, this scorer helps determine whether your RAG pipeline's retriever is effectively ordering the retrieved contexts. + +`Tip`: + +There are many factors to consider when evaluating the quality of your RAG pipeline. `judgeval` offers a suite of default scorers to construct a comprehensive +evaluation of each RAG component. Check out our guide on RAG system evaluation for a deeper dive! TODO add link here. 
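+
+As a concrete illustration of evaluating multiple RAG components at once, several default scorers can be grouped into a single evaluation run — a minimal sketch reusing `run_evaluation` and the scorer constants named on these pages:
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+example = Example(
+    input="What's your return policy for a pair of socks?",
+    actual_output="We offer a 30-day return policy for all items, including socks!",
+    expected_output="All customers are eligible for a 30-day return policy, no questions asked.",
+    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."],
+)
+
+# Evaluate the retriever and generator together in one run
+rag_scorers = [
+    JudgmentScorer(threshold=0.8, score_type=APIScorer.CONTEXTUAL_PRECISION),
+    JudgmentScorer(threshold=0.8, score_type=APIScorer.CONTEXTUAL_RECALL),
+    JudgmentScorer(threshold=0.8, score_type=APIScorer.FAITHFULNESS),
+]
+results = client.run_evaluation(examples=[example], scorers=rag_scorers, model="gpt-4o")
+print(results)
+```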
+ +## Required Fields + +To run the contextual precision scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` +- `expected_output` +- `retrieval_context` + +## Scorer Breakdown + +`ContextualPrecision` scores are calculated by first determining which contexts in `retrieval_context` are relevant to the `input` based on the information in `expected_output`. +Then, we compute the weighted cumulative precision (WCP) of the retrieved contexts. We use WCP because it: +- Emphasizes on top results: WCP places a strong emphasis on the relevance of top-ranked results. This emphasis is important because LLMs tend to give more attention to earlier nodes in the `retrieval_context`. +Therefore, improper rankings can induce hallucinations in the `actual_output`. +- Rewards Effective Rankings: WCP captures the comparative relevance of different contexts (highly relevant vs. somewhat relevant). This is preferable to other approaches such as standard precision, which weights all retrieved contexts as equally relevant. + +The score is calculated as: + +$$ +\text{Contextual Precision} = \frac{1}{\text{Number of Relevant Nodes}} \sum_{k=1}^n \left( \frac{\text{Number of Relevant Nodes Up to Position } k}{k} \times r_k \right) +$$ + +TODO: add source paper for this approach + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() + +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", + # Replace this with the ideal output from your RAG generator model + expected_output="All customers are eligible for a 30-day return policy, no questions asked.", + # Replace this with the contexts retrieved by your RAG retriever + retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."] +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.CONTEXTUAL_PRECISION) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` + +`Note:` + +The `ContextualPrecision` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. +This allows you to double-check the accuracy of the evaluation and understand how the score was calculated. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/contextual_recall.mdx b/judgeval_docs/dev_docs/evaluation/scorers/contextual_recall.mdx new file mode 100644 index 00000000..e5528b4b --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/contextual_recall.mdx @@ -0,0 +1,64 @@ +# Contextual Recall + +The contextual recall scorer is a default LLM judge scorer that measures whether the `retrieval_context` aligns with the `expected_output`. +In practice, this scorer helps determine whether your RAG pipeline's retriever is effectively retrieving relevant contexts. + +`Tip`: + +There are many factors to consider when evaluating the quality of your RAG pipeline. `judgeval` offers a suite of default scorers to construct a comprehensive +evaluation of each RAG component. Check out our guide on RAG system evaluation for a deeper dive! TODO add link here. 
+ +## Required Fields + +To run the contextual recall scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` +- `expected_output` +- `retrieval_context` + +## Scorer Breakdown + +`ContextualRecall` scores are calculated by first determining all statements made in `expected_output`, then classifying which statements are backed up by the `retrieval_context`. +Note that this scorer uses the `expected_output` rather than `actual_output` because we're interested in whether the retriever is performing well. + +The score is calculated as: + +$$ +\text{Contextual Recall} = \frac{\text{Number of Relevant Statements in Retrieval Context}}{\text{Number of Relevant Statements in Expected Output}} +$$ + +TODO: add source paper for this approach + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", + # Replace this with the ideal output from your RAG generator model + expected_output="All customers are eligible for a 30-day return policy, no questions asked.", + # Replace this with the contexts retrieved by your RAG retriever + retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."] +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.CONTEXTUAL_RECALL) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` + +`Note:` + +The `ContextualRecall` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. +This allows you to double-check the accuracy of the evaluation and understand how the score was calculated. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/contextual_relevancy.mdx b/judgeval_docs/dev_docs/evaluation/scorers/contextual_relevancy.mdx new file mode 100644 index 00000000..8e4fda3e --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/contextual_relevancy.mdx @@ -0,0 +1,55 @@ +# Contextual Relevancy + +The contextual relevancy scorer is a default LLM judge scorer that measures how relevant the contexts in `retrieval_context` are for an `input`. +In practice, this scorer helps determine whether your RAG pipeline's retriever effectively retrieves relevant contexts for a query. + +## Required Fields + +To run the contextual relevancy scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` +- `retrieval_context` + +## Scorer Breakdown + +`ContextualRelevancy` scores are calculated by first extracting all statements in `retrieval_context` and then classifying which ones are relevant to the `input`. 
+ +The score is then calculated as: + +$$ +\text{Contextual Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}} +$$ + +TODO: add source paper for this approach + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", + # Replace this with the contexts retrieved by your RAG retriever + retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."] +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.CONTEXTUAL_RELEVANCY) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` + +`Note:` + +The `ContextualRelevancy` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. +This allows you to double-check the accuracy of the evaluation and understand how the score was calculated. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/custom_scorers.mdx b/judgeval_docs/dev_docs/evaluation/scorers/custom_scorers.mdx new file mode 100644 index 00000000..1d01d069 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/custom_scorers.mdx @@ -0,0 +1,149 @@ +# Custom Scorers + +If none of `judgeval`'s built-in scorers fit your evaluation criteria, you can easily build your own custom metric to be run through a `CustomScorer`. +`CustomScorer`s are automatically integrated within `judgeval`'s infrastructure, so you can: +- Run your own scorer with the same syntax as any other `judgeval` scorer. +- Use `judgeval`'s batched evaluation infrastructure to execute **scalable evaluation runs**. +- Have your scorer's results be viewed and analyzed in the Judgment platform. + +`Tip:` + +Be creative with your custom scorers! You can measure anything you want using a `CustomScorer`, +including using evaluations that aren't LLM judge-based such as ROUGE or embedding similarity. + + +## Guidelines for Implementing Custom Scorers + +To implement your own custom scorer, you must: + +1. Inherit from the `CustomScorer` class + +This will help `judgeval` integrate your scorer into evaluation runs. + +``` +from judgeval.scorers import CustomScorer + +class SampleScorer(CustomScorer): + ... +``` + +2. Implement the `__init__()` method + +`CustomScorer`s have some required attributes that must be determined in the `__init__()` method. +For instance, you must set a `threshold` to determine what constitutes success/failure for a scorer. + +There are additional optional attributes that can be set here for even more flexibility: +- `score_type (str)`: The name of your scorer. This will be displayed in the Judgment platform. +- `include_reason (bool)`: Whether your scorer includes a reason for the score in the results. Only for LLM judge-based scorers. +- `async_mode (bool)`: Whether your scorer should be run asynchronously during evaluations. +- `strict_mode (bool)`: Whether your scorer fails if the score is not perfect (1.0). +- `verbose_mode (bool)`: Whether your scorer produces verbose logs. 
+ +``` +class SampleScorer(CustomScorer): + def __init__( + self, + threshold=0.5, + score_type="Sample Scorer", + include_reason=True, + async_mode=True, + strict_mode=False, + verbose_mode=True + ): + super().__init__(score_type=score_type, threshold=threshold) + self.threshold = 1 if strict_mode else threshold + # Optional attributes + self.include_reason = include_reason + self.async_mode = async_mode + self.strict_mode = strict_mode + self.verbose_mode = verbose_mode +``` + +3. Implement the `score_example()` and `a_score_example()` methods + +The `score_example()` and `a_score_example()` methods take an `Example` object and execute your scorer to produce a score (float) between 0 and 1. +Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers). + +The only requirement for `score_example()` and `a_score_example()` is that they: +- Take an `Example` as an argument (you can add other arguments too) +- Set the self.score attribute +- Set the self.success attribute + +You can optionally set the self.reason attribute, depending on your preference. + +`Note:` + +`a_score_example()` is simply the async version of `score_example()`, so the implementation should largely be identical. + +These methods are the core of your scorer, and you can implement them in any way you want. **Be creative!** +Check out this list of examples our users have implemented if you need inspiration: TODO add link here + +`Tip:` + +If you want to handle errors gracefully, you can use a `try` block and in the `except` block, set the `self.error` attribute to the error message. +This will allow `judgeval` to catch the error but still execute the rest of an evaluation run, assuming you have multiple examples to evaluate. + +Here's a sample implementation that integrates everything we've covered: + +``` +class SampleScorer(CustomScorer): + ... + + def score_example(self, example, ...): + try: + self.score = run_scorer_logic(example) + if self.include_reason: + self.reason = justify_score(example, self.score) + if self.verbose_mode: + self.verbose_logs = make_logs(example, self.reason, self.score) + self.success = self.score >= self.threshold + except Exception as e: + self.error = str(e) + self.success = False + + async def a_score_example(self, example, ...): + try: + self.score = await a_run_scorer_logic(example) # async version + if self.include_reason: + self.reason = justify_score(example, self.score) + if self.verbose_mode: + self.verbose_logs = make_logs(example, self.reason, self.score) + self.success = self.score >= self.threshold + except Exception as e: + self.error = str(e) + self.success = False +``` + +4. Implement the `success_check()` method + +When executing an evaluation run, `judgeval` will check if your scorer has passed the `success_check()` method. + +You can implement this method in any way you want, but it should return a `bool`. Here's a perfectly valid implementation: + +``` +class SampleScorer(CustomScorer): + ... + + def success_check(self, example): + if self.error is not None: + return False + return self.score >= self.threshold # or you can do self.success if set +``` + +5. Give your scorer a name + +This is so that when displaying your scorer's results in the Judgment platform, you can easily sort by and find your scorer. + +``` +class SampleScorer(CustomScorer): + ... + + @property + def __name__(self): + return "Sample Scorer" +``` + +That's it! Congratulations, you've made your first custom scorer! 
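+
+Once implemented, the scorer can be passed to an evaluation run like any built-in scorer — a minimal sketch that reuses `run_evaluation` as shown in the scorer introduction docs (`SampleScorer` refers to the class built above; the example fields are placeholders):
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.data import Example
+
+client = JudgmentClient()
+example = Example(input="...", actual_output="...")
+
+# SampleScorer is the custom scorer implemented in the steps above
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[SampleScorer(threshold=0.7)],
+    model="gpt-4o",
+)
+print(results)
+```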
Now that your scorer is implemented, you can run it on your own datasets +just like any other `judgeval` scorer. Your scorer is fully integrated with `judgeval`'s infrastructure so you can view it on the Judgment platform too. + +For more examples, check out some of the custom scorers our users have implemented: TODO add link here. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/faithfulness.mdx b/judgeval_docs/dev_docs/evaluation/scorers/faithfulness.mdx new file mode 100644 index 00000000..e52c28fe --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/faithfulness.mdx @@ -0,0 +1,62 @@ +# Faithfulness + +The `Faithfulness` scorer is a default LLM judge scorer that measures how factually aligned the `actual_output` is to the `retrieval_context`. +In practice, this scorer helps determine the degree to which your RAG pipeline's generator is hallucinating. + +TODO plug our evaluation foundation models here + +`Note:` + +The `Faithfulness` scorer is similar to but not identical to the `Hallucination` scorer. +`Faithfulness` is concerned with contradictions between the `actual_output` and `retrieval_context`, while `Hallucination` is concerned with `context`. +If you're building an app with a RAG pipeline, you should try the `Faithfulness` scorer first. + +## Required Fields + +To run the `Faithfulness` scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` +- `retrieval_context` + +## Scorer Breakdown + +`Faithfulness` scores are calculated by first extracting all statements in `actual_output` and then classifying which ones are contradicted by the `retrieval_context`. +A claim is considered faithful if it does not contradict any information in `retrieval_context`. + +The score is calculated as: + +$$ +\text{Faithfulness} = \frac{\text{Number of Faithful Statements}}{\text{Total Number of Statements}} +$$ + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", + # Replace this with the contexts retrieved by your RAG retriever + retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."] +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.FAITHFULNESS) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` + +`Note:` + +The `Faithfulness` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. +This allows you to double-check the accuracy of the evaluation and understand how the score was calculated. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/hallucination.mdx b/judgeval_docs/dev_docs/evaluation/scorers/hallucination.mdx new file mode 100644 index 00000000..cf4c451e --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/hallucination.mdx @@ -0,0 +1,51 @@ +# Hallucination + +The `Hallucination` scorer is a default LLM judge scorer that measures how much the `actual_output` contains information that contradicts the `context`. + +`Note:` + +If you're building an app with a RAG pipeline, you should try the `Faithfulness` scorer instead. 
+The `Hallucination` scorer is concerned with `context`, the ideal retrieved context, while `Faithfulness` is concerned with `retrieval_context`, the actual retrieved context. + +## Required Fields + +To run the `Hallucination` scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` +- `context` + +## Scorer Breakdown + +`Hallucination` scores are calculated by determining for each document in `context`, whether there are any contradictions to `actual_output`. +The score is then calculated as: + +$$ +\text{Hallucination} = \frac{\text{Number of Contradicted Documents}}{\text{Total Number of Documents}} +$$ + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="What's your return policy for a pair of socks?", + # Replace this with your LLM system's output + actual_output="We offer a 30-day return policy for all items, including socks!", + # Replace this with the contexts passed to your LLM as ground truth + context=["**RETURN POLICY** all products returnable with no cost for 30-days after purchase (receipt required)."] +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.HALLUCINATION) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` \ No newline at end of file diff --git a/judgeval_docs/dev_docs/evaluation/scorers/introduction.mdx b/judgeval_docs/dev_docs/evaluation/scorers/introduction.mdx new file mode 100644 index 00000000..da1ef8e8 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/introduction.mdx @@ -0,0 +1,90 @@ +# Introduction + +## Quick Summary + +`Scorer`s act as measurement tools for evaluating LLM systems based on specific criteria. `judgeval` comes with a set of 10+ built-in scorers that you can easily start with, including: +- Answer Relevancy +- Contextual Precision +- Contextual Recall +- Contextual Relevancy +- Faithfulness +- Hallucination +- Summarization +- Tool Correctness +- JSON Correctness +- Custom Scorers +- Classifier Scorers + +`Tip`: + +We're always building new scorers to add to `judgeval`. If you have a specific scorer in mind, please let us know at `contact@judment.ai`! + +`Scorer`s execute on `Example`s, `GroundTruthExample`s, and `EvalDataset`s and produce a score between 0 and 1. +As such, you can set a `threshold` to determine whether an evaluation was successful or not. +A default scorer will only succeed if the score is greater than or equal to the `threshold`. + +## Categories of Scorers +`judgeval` supports three categories of scorers. +- Default Scorers: built-in scorers that are ready to use +- Custom Scorers: Versatile and powerful scorers that you can tailor to your own evaluation needs +- Classifier Scorers: A special custom scorer that can evaluate your LLM system using a natural language criteria + +In this section, we'll cover each kind of scorer and how to use them. + +### Default Scorers +Most of the ready-to-use scorers in `judgeval` are LLM judges, meaning they use LLMs to evaluate your LLM system. This is intentional since LLM evaluations are flexible, scalable, and strongly correlate with human evaluation. + +`judgeval`'s default scorers have been crafted by our research team based on leading work in the LLM evaluation community. 
+You can access our scorer implementations via `judgeval` which run the scorers on Judgment's infrastructure. +Our implementations are described on their respective documentation pages. Judgment implementations of default scorers are backed by leading industry/academic research and are preferable to other implementations because: +- They are meticulously prompt-engineered to maximize evaluation quality and consistency +- Provide a chain of thought for evaluation scores, so you can double-check the evaluation quality +- Can be run using any LLM, including Judgment's **state-of-the-art LLM judges** developed in collaboration with **Stanford's AI Lab**. + +### Custom Scorers + +If you find that none of the default scorers meet your evaluation needs, setting up a custom scorer is easy with `judgeval`. +You can create a custom scorer by inheritng from the `CustomScorer` class and implementing three methods: +- `score_example()`: produces a score for a single `Example`. +- `a_score_example()`: async version of `score_example()`. You may use the same implementation logic as `score_example()`. +- `success_check()`: determines whether an evaluation was successful. + +Custom scorers can be as simple or complex as you want, and do not need to use LLMs. For sample implementations, check out the `CustomScorer` documentation page. TODO add link here + + +### Classifier Scorers + +Classifier scorers are a special type of custom scorer that can evaluate your LLM system using a natural language criteria. + +TODO update this section when SDK is updated + +## Running Scorers + +All scorers in `judgeval` can be run uniformly through the `JudgmentClient`. All scorers are set to run in async mode by default in order to support parallelized evaluations for large datasets. + +``` +... + +client = JudgmentClient() +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o-mini", +) +``` + +If you want to execute a `CustomScorer` without running it through the `JudgmentClient`, you can score locally. +Simply use the `score_example()` or `a_score_example()` method directly: + +``` +... + +example = Example(input="...", actual_output="...") + +scorer = CustomScorer() # Your scorer here +score = scorer.score_example(example) +``` + +`Tip`: + +To learn about how a certain default scorer works, check out its documentation page for a deep dive into how scores are calculated and what fields are required. diff --git a/judgeval_docs/dev_docs/evaluation/scorers/json_correctness.mdx b/judgeval_docs/dev_docs/evaluation/scorers/json_correctness.mdx new file mode 100644 index 00000000..62da1776 --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/json_correctness.mdx @@ -0,0 +1,52 @@ +# JSON Correctness + +The `JSONCorrectness` scorer is a default scorer that checks whether your LLM's `actual_output` matches your JSON schema. + +## Required Fields + +To run the `JSONCorrectness` scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` + +## Scorer Breakdown + +`JSONCorrectness` scores are calculated with a binary score representing whether the `actual_output` matches the JSON schema. + +To define a JSON schema, you can define a `pydantic` `BaseModel` and pass it to the `JSONCorrectness` scorer. 
+ +``` +from pydantic import BaseModel + +class SampleSchema(BaseModel): + field1: str + field2: int +``` + +$$ +\text{JSONCorrectness} = \begin{cases} +1 & \text{if } \text{actual\_output} \text{ matches } \text{schema} \\ +0 & \text{otherwise} +\end{cases} +$$ + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer +client = JudgmentClient() +example = Example( + input="Create a JSON object with the keys 'field1' (str) and 'field2' (int). Fill them with random values.", + # Replace this with your LLM system's output + actual_output="{'field1': 'value1', 'field2': 1}", +) +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.JSON_CORRECTNESS) # TODO update this +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` diff --git a/judgeval_docs/dev_docs/evaluation/scorers/summarization.mdx b/judgeval_docs/dev_docs/evaluation/scorers/summarization.mdx new file mode 100644 index 00000000..faa60adf --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/summarization.mdx @@ -0,0 +1,59 @@ +# Summarization + +The `Summarization` scorer is a default LLM judge scorer that measures whether your LLM can accurately summarize text. +In this case, the `actual_output` is the summary, and the `input` is the text to summarize. + +## Required Fields + +To run the `Summarization` scorer, you must include the following fields in your `Example`: +- `input` +- `actual_output` + +## Scorer Breakdown + +`Summarization` scores are calculated by determining: +1. Whether the summary contains contradictory information from the original text. +2. Whether the summary contains all of the important information from the original text. + +To do so, we compute two subscores respectively: + +$$ +\text{contradiction\_score} = \frac{\text{Number of Contradictory Statements in Summary}}{\text{Total Number of Statements in Summary}} +$$ + +For the information score, we generate a list of important questions from the original text and check the fraction of the questions that are answered by information inthe summary. + +$$ +\text{information\_score} = \frac{\text{Number of Important Questions Answered in Summary}}{\text{Total Number of Important Questions}} +$$ + +The final score is the minimum of the two subscores. + +$$ +\text{Summarization Score} = \min(\text{contradiction\_score}, \text{information\_score}) +$$ + +## Sample Implementation + +``` +from judgeval.judgment_client import JudgmentClient +from judgeval.data import Example +from judgeval.scorers import JudgmentScorer +from judgeval.constants import APIScorer + +client = JudgmentClient() +example = Example( + input="...", + # Replace this with your LLM system's summary + actual_output="...", +) +# supply your own threshold +scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.SUMMARIZATION) + +results = client.run_evaluation( + examples=[example], + scorers=[scorer], + model="gpt-4o", +) +print(results) +``` diff --git a/judgeval_docs/dev_docs/evaluation/scorers/tool_correctness.mdx b/judgeval_docs/dev_docs/evaluation/scorers/tool_correctness.mdx new file mode 100644 index 00000000..2e1f96de --- /dev/null +++ b/judgeval_docs/dev_docs/evaluation/scorers/tool_correctness.mdx @@ -0,0 +1,48 @@ +# Tool Correctness + +The `ToolCorrectness` scorer is a default scorer that checks whether your LLM correctly calls and uses tools. 
+In practice, this allows you to assess the quality of an LLM agent's tool choice and applied use of tools.
+
+## Required Fields
+
+To run the `ToolCorrectness` scorer, you must include the following fields in your `Example`:
+- `input`
+- `actual_output`
+- `tools_called`
+- `expected_tools`
+
+## Scorer Breakdown
+
+The tool correctness score is calculated as the fraction of the total number of tools called that are correct.
+
+$$
+\text{Tool Correctness} = \frac{\text{Number of Correct Tools Called}}{\text{Total Number of Tools Called}}
+$$
+
+TODO add more docs here regarding tool ordering, exact match, or even correct tool use aside from calling tools.
+
+## Sample Implementation
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+example = Example(
+    input="...",
+    actual_output="...",
+    tools_called=["GoogleSearch", "Perplexity"],
+    expected_tools=["DBQuery", "GoogleSearch"],
+)
+# supply your own threshold
+scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.TOOL_CORRECTNESS)
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
diff --git a/judgeval_docs/dev_docs/getting_started.mdx b/judgeval_docs/dev_docs/getting_started.mdx
new file mode 100644
index 00000000..b4be59f4
--- /dev/null
+++ b/judgeval_docs/dev_docs/getting_started.mdx
@@ -0,0 +1,267 @@
+# Quick Introduction
+
+Judgeval is an evaluation framework for LLM systems. Judgeval is designed for AI teams to build and iterate on their LLM systems (applications) and was built to:
+- Easily unit test LLM systems.
+- Supply a quality control layer for multi-step LLM applications, especially for **agentic systems**.
+- Plug-and-evaluate LLM systems with 10+ research-backed metrics including hallucination detection, RAG retrieval quality, and more.
+- Construct custom evaluation pipelines for your LLM systems.
+- Monitor LLM systems in production using state-of-the-art real-time evaluation foundation models.
+
+
+Additionally, Judgeval integrates natively with Judgment Labs, allowing you to evaluate, regression test, and monitor LLM applications in the cloud.
+
+Judgeval was built by a team of LLM researchers from Stanford, Datadog, and Together AI.
+
+# Installation
+
+`pip install judgeval`
+
+Judgeval runs evaluations on your local machine. However, you may find it easier to directly run evaluations using Judgment Labs' infrastructure, an all-in-one platform for LLM system evaluation.
+
+# Making a Judgment Key
+
+Getting a Judgment key allows you to run evaluations on Judgment Labs' infrastructure, accessing our state-of-the-art judge models and platform to manage your evaluations/datasets.
+
+To receive a key, please email us at `contact@judgmentlabs.ai`.
+
+
+`Tip:`
+
+Running evaluations on Judgment Labs' infrastructure is recommended for large-scale evaluations. Contact us if you're dealing with sensitive data that has to reside in your private VPCs.
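+
+Once you have a key, you will typically expose it to `judgeval` through an environment variable before constructing the client. The variable name below is an assumption — your onboarding instructions are authoritative:
+
+```
+import os
+
+# Assumed variable name; confirm against your Judgment onboarding instructions
+os.environ["JUDGMENT_API_KEY"] = "your-api-key"
+
+from judgeval.judgment_client import JudgmentClient
+
+client = JudgmentClient()  # picks up the key from the environment (assumed behavior)
+```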
+
+# Create your first evaluation
+
+
+```
+from judgeval.data import Example
+from judgeval.scorers import JudgmentScorer
+from judgeval.judgment_client import JudgmentClient
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+Congratulations! Your evaluation should have passed. Let's break down what happened.
+
+- The variable `input` mimics a user input and `actual_output` is a placeholder for what your LLM system returns based on the input.
+- The variable `retrieval_context` represents the retrieved context from your knowledge base and `JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)` is a scorer that checks if the output is hallucinated relative to the retrieved context.
+- Scorers give values between 0 and 1, and we set the threshold for this scorer to 0.5 in the context of a unit test. If you are interested in measuring rather than testing, you can ignore this threshold and reference the `score` field alone.
+- We chose `gpt-4o` as our judge model for faithfulness. Judgment Labs offers ANY judge model for your evaluation needs.
+
+# Create Your First Scorer
+`judgeval` offers three kinds of scorers for your evaluation needs: ready-made, prompt scorers, and custom scorers.
+
+## Ready-made Scorers
+Judgment Labs provides default implementations of 10+ research-backed metrics covering evaluation needs ranging from hallucination detection to RAG retrieval quality. To create a ready-made scorer, just import it directly from `judgeval.scorers`:
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+example = Example(
+    input="...",
+    actual_output="...",
+    retrieval_context=["..."],
+)
+scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+## Prompt Scorers
+`judgeval` allows you to create custom scorers using natural language. These can range from simple judges to powerful evaluators for your LLM systems.
+
+```
+TODO
+```
+
+## Custom Scorers
+If you find that none of the ready-made scorers or prompt scorers fit your needs, you can create your own custom scorer. These can be as simple or complex as you need them to be and do not have to use an LLM judge model. Here's an example of computing BLEU scores:
+
+```
+import sacrebleu
+from judgeval.data import Example
+from judgeval.scorers import CustomScorer
+
+class BLEUScorer(CustomScorer):
+    def __init__(self, threshold: float = 0.5):
+        super().__init__(score_type="BLEU", threshold=threshold)
+
+    def score_example(self, example: Example) -> float:
+        reference = example.expected_output
+        candidate = example.actual_output
+
+        score = sacrebleu.sentence_bleu(candidate, [reference]).score
+        self.score = score
+        return score
+
+    # Async implementation of score_example(). If you have no async logic, you can
+    # just use the synchronous implementation.
+    async def a_score_example(self, example: Example) -> float:
+        return self.score_example(example)
+
+    def success_check(self) -> bool:
+        return self.score >= self.threshold
+
+    @property
+    def __name__(self):
+        return "BLEU"
+
+# example usage
+example = Example(input="...", actual_output="...", expected_output="...")
+scorer = BLEUScorer()
+results = scorer.score_example(example)
+print(results)
+```
+
+## Running Multiple Scorers Simultaneously
+
+If you're interested in measuring multiple metrics at once, you can group scorers together when running evaluations, regardless of the type of scorer.
+
+```
+from judgeval.judgment_client import JudgmentClient
+from judgeval.scorers import JudgmentScorer
+from judgeval.constants import APIScorer
+
+client = JudgmentClient()
+
+faithfulness_scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+summarization_scorer = JudgmentScorer(threshold=0.8, score_type=APIScorer.SUMMARIZATION)
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[faithfulness_scorer, summarization_scorer],
+    model="gpt-4o",
+)
+```
+
+# Create Your First Dataset
+In most cases, you will not be running evaluations on a single example; instead, you will be scoring your LLM system on a dataset. `judgeval` allows you to create datasets, save them, and run evaluations on them. An `EvalDataset` is a collection of `Example`s and/or `GroundTruthExample`s.
+
+`Note: A GroundTruthExample is an Example that has no actual_output field since it will be generated at test time.`
+
+```
+from judgeval.data import Example, GroundTruthExample, EvalDataset
+
+example1 = Example(input="...", actual_output="...")
+example2 = Example(input="...", actual_output="...")
+
+dataset = EvalDataset(examples=[example1, example2])
+```
+
+Then, you can run evaluations on the dataset:
+
+```
+...
+
+client = JudgmentClient()
+scorer = JudgmentScorer(threshold=0.5, score_type=APIScorer.FAITHFULNESS)
+results = client.evaluate_dataset(
+    dataset=dataset,
+    scorers=[scorer],
+    model="QWEN",
+)
+```
+
+
+# Using Judgment Labs Platform
+
+When scaling your evaluations, Judgment's platform allows you to manage your evaluations, datasets, and scorers in a single place.
+To get started, create a Judgment account by emailing us at `contact@judgmentlabs.ai`. We'll get you set up with a login and you'll be able to:
+- Run evaluations directly on Judgment's platform
+- Track and inspect evaluations with an intuitive UI
+- Compare your evaluations across iterations of your LLM system, optimizing your models, prompts, etc.
+- Manage your datasets and scorers
+- Monitor your LLM systems in production
+
+`Note:` Click here to learn more about Judgment Labs' platform: TODO add link to `judgment/` docs.
+
+## Running Evaluations on Judgment
+
+Work in progress!
+
+## Managing Datasets
+
+Work in progress!
+
+## Creating ClassifierScorers
+
+ClassifierScorers are powerful evaluators that can be created in minutes via Judgment's platform or SDK using natural language criteria.
+
+`Tip`:
+
+For more information on what a ClassifierScorer is, click here: TODO add link to `classifier_scorers/` docs.
+
+1. Navigate to the `Scorers` tab in the Judgment platform. You'll find it via the sidebar on the left.
+2. Click the "Create Scorer" button in the top right corner.
+
+![Alt text](judgeval/docs/dev_docs/imgs/create_scorer.png "Optional title")
+
+3.
Here, you can create a custom scorer by using a criteria in natural language, supplying custom arguments from the `Example` class. +Then, you supply a set of choices the scorer can select from when evaluating an example. Finally, you can test your scorer on samples in our playground. + +4. Once you're finished, you can save the scorer and use it in your evaluation runs just like any other scorer in `judgeval`. + +### Example + +Here's an example of building a `ClassifierScorer` that checks if the LLM's tone is too aggressive. +This might be useful when building a customer support chatbot. + +![Alt text](judgeval/docs/dev_docs/imgs/create_aggressive_scorer.png "Optional title") + +## Optimizing System Performance + +Evaluations are prerequisite for optimizing your LLM systems. Measuring the quality of your LLM workflows +allows you to compare build iterations and ultimately find the optimal set of prompts, models, RAG architectures, etc. that +make your LLM perform best. + +A typical experimental setup might look like this: + +1. Create a new `Project` in the Judgment platform by either running an evaluation from the SDK or via the platform UI. +This will help you keep track of all evaluations for different iterations of your LLM system. + +`Tip`: +A `Project` keeps track of `Evaluation Run`s in your project. Each `Evaluation Run` contains a set of `Scorer`s that have been run on a set of `Example`s. + +2. You can create separate `Evaluation Run`s for different iterations of your LLM system, allowing you to independently test each component of your LLM system. + +`Tip`: +You can try different models (e.g. `gpt-4o`, `claude-3-5-sonnet`, etc.) and prompt templates in each `Evaluation Run` to find the +optimal setup for your LLM system. + + +## Monitoring LLM Systems in Production + +Beyond experimenting and measuring historical performance, `judgeval` supports monitoring your LLM systems in production. +Using our `tracing` module, you can track your LLM system outputs from end to end, allowing you to visualize the flow of your LLM system. +Additionally, you can enable evaluations to run in real-time using Judgment's state-of-the-art judge models. + +TODO add picture of tracing, or an embedded gif + +Some of the benefits of monitoring your LLM systems in production with `judgeval` include: +- Detecting hallucinations and other quality issues before they reach your customers +- Automatically creating datasets from your real-world production cases for future improvement/optimization +- Tracking and alerting on quality metrics (e.g. latency, cost, etc.) + +For more information on monitoring, click here. TODO: add link to tracing docs diff --git a/judgeval_docs/dev_docs/judgment/introduction.mdx b/judgeval_docs/dev_docs/judgment/introduction.mdx new file mode 100644 index 00000000..e69de29b diff --git a/docs/prompt_scorer.ipynb b/judgeval_docs/prompt_scorer.ipynb similarity index 100% rename from docs/prompt_scorer.ipynb rename to judgeval_docs/prompt_scorer.ipynb diff --git a/docs/quickstart.ipynb b/judgeval_docs/quickstart.ipynb similarity index 100% rename from docs/quickstart.ipynb rename to judgeval_docs/quickstart.ipynb