chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 #7216
base: main
Pull Request Overview
This PR adds evaluation capabilities for the Compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.
- Introduces a complete evaluation framework with test cases for the MongoDB Compass assistant
- Implements custom scoring functions for factuality and source link matching
- Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
File | Description
---|---
packages/compass-assistant/test/assistant.eval.ts | Main evaluation framework setup with Braintrust integration and scoring functions
packages/compass-assistant/test/fuzzylinkmatch.ts | Utility for fuzzy URL matching copied from the chatbot project
packages/compass-assistant/test/binaryndcgatk.ts | Binary NDCG@K scoring implementation for evaluating source link relevance
packages/compass-assistant/test/eval-cases/*.ts | Test case definitions for various MongoDB topics
packages/compass-assistant/test/eval-cases/index.ts | Central export for all evaluation test cases
packages/compass-assistant/package.json | Adds dependencies for the autoevals and braintrust packages
For those less used to working on compass:
You need a newish version of Node (probably 22) and npm (11-ish). See nvm if you don't have them yet. Clone this repo, switch to this branch (chat-playground), and run:

```sh
npm run bootstrap
```

which will do `npm install`
followed by a compile (probably not strictly needed, but should make VS Code happier).

You'll need a Braintrust API key (here somewhere, mongodb-ai-education organisation), then set it with:
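Presumably something like the following; Braintrust tooling conventionally reads the key from the `BRAINTRUST_API_KEY` environment variable, though your exact setup step may differ:

```sh
# Assumption: the Braintrust SDK and CLI pick the key up from this env var.
export BRAINTRUST_API_KEY=<your-api-key>
```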
This key is used for Braintrust itself, but also by the Braintrust proxy, so that we can use other LLMs (gpt-4.1 at this point) to score these results. The proxy functionality is only used by Factuality at the moment.
Then, in `packages/compass-assistant`, you can run the following:
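Most likely something along these lines (a guess: the Braintrust CLI run directly against the eval file added in this PR, though the package may wrap it in an npm script instead):

```sh
npx braintrust eval test/assistant.eval.ts
```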
Your results should then end up here as a new entry. They should stream in while it runs. With the temperature being set to zero as it is, the Braintrust proxy might even cache some things for us.
The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
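For a sense of how those pieces fit together, a Braintrust eval file has roughly the shape below. This is a minimal sketch, not the actual assistant.eval.ts from this PR: the project slug, case data, and the `callAssistant` helper are all stand-ins.

```ts
import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';

// Stand-in for however the real eval file invokes the Compass assistant.
async function callAssistant(prompt: string): Promise<string> {
  return `(assistant response to: ${prompt})`;
}

Eval('compass-assistant', {
  // Each case pairs a user prompt with the expected answer text.
  data: () => [
    {
      input: 'When should I embed documents instead of referencing them?',
      expected:
        'Embed data that is read together and bounded in size; ' +
        'reference data that grows without bound or is shared.',
    },
  ],
  // The task produces the output that the scorers grade.
  task: async (input) => callAssistant(input),
  // Factuality has an LLM (via the Braintrust proxy) compare output and
  // expected; binaryNdcgAtK would sit alongside it to grade source links.
  scores: [Factuality],
});
```

binaryNdcgAtK itself is a binary variant of NDCG@K: it checks whether the expected links appear among the top K sources the assistant returns, with hits near the front of the list scoring higher.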
To add cases, add a file in `packages/compass-assistant/test/eval-cases` (see the others for inspiration) and then import/register the file in the index.ts in that folder.
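As a purely hypothetical illustration (the field names here are guesses; the files already in eval-cases define the real shape):

```ts
// packages/compass-assistant/test/eval-cases/schema-design.ts (hypothetical)
export const schemaDesignCases = [
  {
    input: 'How should I model a one-to-many relationship?',
    expected:
      'Embed the many side when it is small and always read with the ' +
      'parent; use references when it is large or shared.',
    // Links the assistant is expected to cite, graded by binaryNdcgAtK.
    expectedSources: [
      'https://www.mongodb.com/docs/manual/applications/data-models-relationships/',
    ],
  },
];
```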
We'll probably add automation around this over time, and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!

PS: I haven't linked this up with CI yet. I don't think we want to fail anything if the scores drop yet anyway, and for now our experiments will probably all land in the same pool; if this runs on every PR, that will probably get cluttered pretty quickly. Still iterating on that.