
chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 #7216


Status: Open (4 commits, base: main)


@lerouxb (Contributor) commented Aug 20, 2025

For those less used to working on Compass:

You need a newish version of Node (probably 22) and npm (11-ish). See nvm if you don't have Node yet. Clone this repo, switch to this branch (chat-playground), and run npm run bootstrap, which does an npm install followed by a compile (probably not strictly needed, but it should make VS Code happier).

You'll need a Braintrust API key from the mongodb-ai-education organisation. Set it with:

export BRAINTRUST_API_KEY=blah

This key is used for Braintrust itself and also by the Braintrust proxy, which lets us use other LLMs (gpt-4.1 at this point) to score the results. Only the Factuality scorer uses the proxy at the moment.

Then in packages/compass-assistant you can run the following:

npx braintrust eval test/assistant.eval.ts --verbose

Your results should then show up in Braintrust as a new entry, streaming in while the eval runs. Because the temperature is set to zero, the Braintrust proxy might even cache some responses for us.

The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
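To give a feel for what the link scorer measures, here is a minimal sketch of a binary NDCG@K computation. It is not the code from this PR or from the chatbot project: the signature, the `Matcher` type, the exact-match default and the choice to normalise against `min(k, expected.length)` ideal hits are all assumptions made for illustration.

```typescript
// Hypothetical matcher type: decides whether a returned link counts as a hit
// for an expected link. The PR uses fuzzy URL matching; this sketch defaults
// to exact string equality.
type Matcher = (expected: string, actual: string) => boolean;

const exactMatch: Matcher = (expected, actual) => expected === actual;

// Binary NDCG@K: each of the first k returned links has relevance 1 if it
// matches a not-yet-matched expected link, otherwise 0. Hits are discounted
// by 1 / log2(rank + 2), so hits nearer the top are worth more.
function binaryNdcgAtK(
  expected: string[],
  retrieved: string[],
  k: number,
  matches: Matcher = exactMatch
): number {
  if (k <= 0 || expected.length === 0) return 0;

  const used = new Set<number>(); // expected links that were already matched
  let dcg = 0;
  retrieved.slice(0, k).forEach((link, rank) => {
    const hit = expected.findIndex((e, i) => !used.has(i) && matches(e, link));
    if (hit !== -1) {
      used.add(hit);
      dcg += 1 / Math.log2(rank + 2);
    }
  });

  // Ideal ordering (assumption): every expected link appears at the top.
  let idcg = 0;
  const idealHits = Math.min(k, expected.length);
  for (let rank = 0; rank < idealHits; rank++) {
    idcg += 1 / Math.log2(rank + 2);
  }
  return dcg / idcg;
}
```

A perfect ranking scores 1, no matching links scores 0, and a single expected link returned in second place lands in between because of the rank discount.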

To add cases, add a file in packages/compass-assistant/test/eval-cases (see others for inspiration) and then import/register the file in the index.ts in that folder. We'll probably add automation around this over time and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!
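As a rough illustration of that workflow, a case file and its registration might look something like the sketch below. The `EvalCase` type, its field names, the file names in the comments and the example.com URL are all hypothetical; check the existing files in eval-cases for the real shape.

```typescript
// Hypothetical case shape, for illustration only.
type EvalCase = {
  name: string;
  input: string; // the user prompt sent to the assistant
  expected: string; // reference answer for the text scorer
  expectedSources: string[]; // links the answer is expected to cite
};

// Would live in its own file under test/eval-cases/ (file name hypothetical).
const aggregationCases: EvalCase[] = [
  {
    name: "aggregation-match-early",
    input: "Why should $match come early in an aggregation pipeline?",
    expected:
      "Placing $match early reduces the number of documents that later stages have to process.",
    expectedSources: ["https://example.com/docs/aggregation"], // placeholder URL
  },
];

// Would live in test/eval-cases/index.ts: collect every case file into one list.
const evalCases: EvalCase[] = [...aggregationCases];
```

Keeping each topic in its own array means registering a new file is a one-line change in the index.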

PS: I haven't hooked this up to CI yet. I don't think we want to fail anything when the scores drop yet anyway, and for now our experiments will probably all land in the same pool, so running this on every PR would get cluttered pretty quickly. Still iterating on that.

@lerouxb changed the title from "WIP: compass assistant eval cases" to "chore(compass-assistant): compass assistant eval cases COMPASS-9609" on Aug 20, 2025
@lerouxb marked this pull request as ready for review on August 20, 2025 15:55
Copilot AI review requested due to automatic review settings on August 20, 2025 15:55
@lerouxb requested a review from a team as a code owner on August 20, 2025 15:55

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds evaluation capabilities for the Compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.

  • Introduces a complete evaluation framework with test cases for the MongoDB Compass assistant
  • Implements custom scoring functions for factuality and source link matching
  • Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/compass-assistant/test/assistant.eval.ts | Main evaluation framework setup with Braintrust integration and scoring functions |
| packages/compass-assistant/test/fuzzylinkmatch.ts | Utility for fuzzy URL matching copied from chatbot project |
| packages/compass-assistant/test/binaryndcgatk.ts | Binary NDCG@K scoring implementation for evaluating source link relevance |
| packages/compass-assistant/test/eval-cases/*.ts | Test case definitions for various MongoDB topics |
| packages/compass-assistant/test/eval-cases/index.ts | Central export for all evaluation test cases |
| packages/compass-assistant/package.json | Adds dependencies for autoevals and braintrust packages |


@lerouxb changed the title from "chore(compass-assistant): compass assistant eval cases COMPASS-9609" to "chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609" on Aug 20, 2025