chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 #7216
base: main
Pull Request Overview
This PR adds evaluation capabilities for the Compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.
- Introduces a complete evaluation framework with test cases for the MongoDB Compass assistant
- Implements custom scoring functions for factuality and source link matching
- Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
File | Description
---|---
packages/compass-assistant/test/assistant.eval.ts | Main evaluation framework setup with Braintrust integration and scoring functions
packages/compass-assistant/test/fuzzylinkmatch.ts | Utility for fuzzy URL matching copied from the chatbot project
packages/compass-assistant/test/binaryndcgatk.ts | Binary NDCG@K scoring implementation for evaluating source link relevance
packages/compass-assistant/test/eval-cases/*.ts | Test case definitions for various MongoDB topics
packages/compass-assistant/test/eval-cases/index.ts | Central export for all evaluation test cases
packages/compass-assistant/package.json | Adds dependencies for the autoevals and braintrust packages
For those less used to working on compass:
You need a newish version of Node (probably 22) and npm (11-ish). See nvm if you don't have them yet. Clone this repo, switch to this branch (chat-playground), and run:

```sh
npm run bootstrap
```

which will do `npm install`
followed by a compile (probably not strictly needed, but should make VS Code happier).

You'll need a Braintrust API key (here somewhere, mongodb-ai-education organisation), then set it with:
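Presumably something like the following; Braintrust tooling conventionally reads the key from the `BRAINTRUST_API_KEY` environment variable, though your exact setup step may differ:

```sh
# Assumption: the Braintrust SDK and CLI pick the key up from this env var.
export BRAINTRUST_API_KEY=<your-api-key>
```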
This key is used for Braintrust itself, but also by the Braintrust proxy, so that we can use other LLMs (gpt-4.1 at this point) to score these results. The proxy functionality is only used by Factuality at the moment.
Then, in `packages/compass-assistant`, you can run the following:
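Most likely something along these lines (a guess: the Braintrust CLI run directly against the eval file added in this PR, though the package may wrap it in an npm script instead):

```sh
npx braintrust eval test/assistant.eval.ts
```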
Your results should then end up here as a new entry. They should stream in while it runs. With the temperature being set to zero as it is, the Braintrust proxy might even cache some things for us.
The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
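For a sense of how those pieces fit together, a Braintrust eval file has roughly the shape below. This is a minimal sketch, not the actual assistant.eval.ts from this PR: the project slug, case data, and the `callAssistant` helper are all stand-ins.

```ts
import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';

// Stand-in for however the real eval file invokes the Compass assistant.
async function callAssistant(prompt: string): Promise<string> {
  return `(assistant response to: ${prompt})`;
}

Eval('compass-assistant', {
  // Each case pairs a user prompt with the expected answer text.
  data: () => [
    {
      input: 'When should I embed documents instead of referencing them?',
      expected:
        'Embed data that is read together and bounded in size; ' +
        'reference data that grows without bound or is shared.',
    },
  ],
  // The task produces the output that the scorers grade.
  task: async (input) => callAssistant(input),
  // Factuality has an LLM (via the Braintrust proxy) compare output and
  // expected; binaryNdcgAtK would sit alongside it to grade source links.
  scores: [Factuality],
});
```

binaryNdcgAtK itself is a binary variant of NDCG@K: it checks whether the expected links appear among the top K sources the assistant returns, with hits near the front of the list scoring higher.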
To add cases, add a file in `packages/compass-assistant/test/eval-cases` (see the others for inspiration) and then import/register the file in the index.ts in that folder.
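As a purely hypothetical illustration (the field names here are guesses; the files already in eval-cases define the real shape):

```ts
// packages/compass-assistant/test/eval-cases/schema-design.ts (hypothetical)
export const schemaDesignCases = [
  {
    input: 'How should I model a one-to-many relationship?',
    expected:
      'Embed the many side when it is small and always read with the ' +
      'parent; use references when it is large or shared.',
    // Links the assistant is expected to cite, graded by binaryNdcgAtK.
    expectedSources: [
      'https://www.mongodb.com/docs/manual/applications/data-models-relationships/',
    ],
  },
];
```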
We'll probably add automation around this over time, and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!

PS: I haven't linked this up with CI yet. I don't think we want to fail anything if the scores drop yet anyway, and for now our experiments will probably all land in the same pool; if this runs on every PR, that will probably get cluttered pretty quickly. Still iterating on that.