
feat(compass-collection): Process schema into format for LLM submission for Mock Data Generator – CLOUDP-337090 #7205


Open
wants to merge 40 commits into
base: main
from

Conversation

jcobis
Collaborator

@jcobis jcobis commented Aug 15, 2025

Description

Converts the raw MongoDB schema analysis object into a flat, LLM-friendly format for the Mock Data Generator feature.

  • Represents nested documents with dot notation (e.g., user.profile.name)
  • Represents arrays with bracket notation (e.g., users[], matrix[][])
  • Preserves the sample values of each field

Notation examples:

  • Nested documents: user.profile.name (dot notation)
  • Array: users[] (bracket notation)
  • Nested arrays: matrix[][] (multiple brackets)
  • Fields of documents nested in arrays: users[].name (brackets + dots)
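
The notation rules above can be illustrated with a minimal sketch. It walks a plain sample value rather than the mongodb-schema analysis object, so `flattenPaths` and the `Plain` type are hypothetical names used only for illustration. This is not the PR's `processSchema` implementation.

```typescript
// Hypothetical helper: lists the flat field paths of a plain sample document.
// It only illustrates the notation; it is not the PR's processSchema.
type Plain = string | number | boolean | null | Plain[] | { [key: string]: Plain };

function flattenPaths(value: Plain, path = ''): string[] {
  if (Array.isArray(value)) {
    // Arrays append "[]" to the current path; only the first element is inspected.
    const arrayPath = `${path}[]`;
    return value.length > 0 ? flattenPaths(value[0], arrayPath) : [arrayPath];
  }
  if (value !== null && typeof value === 'object') {
    // Documents join the parent path and the field name with a dot.
    const result: string[] = [];
    for (const key of Object.keys(value)) {
      result.push(...flattenPaths(value[key], path === '' ? key : `${path}.${key}`));
    }
    return result;
  }
  // Primitives terminate the path.
  return [path];
}

const paths = flattenPaths({
  user: { profile: { name: 'Ada' } },
  tags: ['a', 'b'],
  matrix: [[1, 2], [3, 4]],
  users: [{ name: 'Bob' }],
});
console.log(paths);
// → ['user.profile.name', 'tags[]', 'matrix[][]', 'users[].name']
```

Each output key is a single flat string, which is what lets the result be expressed as a non-nested object.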

Checklist

  • New tests and/or benchmarks are included
  • Documentation is changed or added
  • If this change updates the UI, screenshots/videos are added and a design review is requested

Motivation and Context

The existing mongodb-schema structure is overly verbose for our feature. Its nested structures are difficult for LLMs to parse and do not meet the requirements of LLM structured outputs (e.g., no optional fields).

Types of changes

  • Backport Needed
  • Patch (non-breaking change which fixes an issue)
  • Minor (non-breaking change which adds functionality)
  • Major (fix or feature that would cause existing functionality to change)

@jcobis jcobis changed the title from "Process schema" to "feat(compass-collection): Process schema into format for LLM submission for Mock Data Generator – CLOUDP-337090" Aug 19, 2025
@github-actions github-actions bot added the feat label Aug 19, 2025
@jcobis jcobis added the "no release notes" label (Fix or feature not for release notes) Aug 19, 2025
@jcobis jcobis marked this pull request as ready for review August 19, 2025 17:10
@Copilot Copilot AI review requested due to automatic review settings August 19, 2025 17:10
@jcobis jcobis requested a review from a team as a code owner August 19, 2025 17:10
Contributor

@Copilot Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


const result = processSchema(schema);

expect(result).to.deep.equal({

@kpamaran kpamaran Aug 19, 2025

Ah, this is a good example of how dot notation, combined with the constraints we've placed, simplifies parsing and shrinks the surface area the LLM has to write to.

{
name: 'Array',
bsonType: 'Array',
path: ['cube'],

@kpamaran kpamaran Aug 19, 2025

To confirm my understanding, the path stays as ['cube'] because the named field/path captures arrays-within-arrays at all levels (and until there's a document)?

Collaborator Author

Yes, from my understanding
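
As a rough illustration of this exchange, the sketch below assumes that every level of a nested array carries the same `path` and that nesting depth shows up only in the number of `[]` suffixes. The `Sketch*` types and `flatKey` are simplified, hypothetical stand-ins; the real mongodb-schema types carry more fields.

```typescript
// Simplified stand-ins for schema types (assumption: each array level repeats the same path).
interface SketchPrimitiveType { name: string; path: string[] }
interface SketchArrayType { name: 'Array'; path: string[]; types: SketchType[] }
type SketchType = SketchPrimitiveType | SketchArrayType;

function isSketchArray(type: SketchType): type is SketchArrayType {
  return type.name === 'Array' && 'types' in type;
}

// Builds the flat key by following the first element type through nested arrays.
function flatKey(type: SketchType): string {
  let key = type.path.join('.');
  let current: SketchType | undefined = type;
  while (current !== undefined && isSketchArray(current)) {
    key += '[]'; // one suffix per array level
    current = current.types[0];
  }
  return key;
}

const cube: SketchType = {
  name: 'Array', path: ['cube'],
  types: [{ name: 'Array', path: ['cube'],
    types: [{ name: 'Array', path: ['cube'],
      types: [{ name: 'Number', path: ['cube'] }] }] }],
};
console.log(flatKey(cube)); // → 'cube[][][]'
```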


const result = processSchema(schema);

expect(result).to.deep.equal({

@kpamaran kpamaran Aug 19, 2025

good example here and below 👍🏼

fieldProbability?: number,
arraySampleValues?: unknown[]
): void {
if (type.name === 'Array' || type.bsonType === 'Array') {

@kpamaran kpamaran Aug 19, 2025

This should use a type-predicate validator like isArraySchemaType in compass-schema/src/components/field.tsx.

A type guard gives the guarded branch enough type information to avoid casts like type as ArraySchemaType.

See https://www.typescriptlang.org/docs/handbook/2/narrowing.html#using-type-predicates
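
A minimal sketch of the suggestion, using simplified stand-in types rather than the real mongodb-schema definitions. The guard below is written for illustration and is not the `isArraySchemaType` from compass-schema; all names ending in `Sketch` are assumptions.

```typescript
// Simplified stand-ins; the real SchemaType union has more members and fields.
interface BaseTypeSketch { name: string; bsonType: string }
interface ArrayTypeSketch extends BaseTypeSketch { bsonType: 'Array'; lengths: number[] }
type SchemaTypeSketch = BaseTypeSketch | ArrayTypeSketch;

// Type predicate: a true result narrows `type` to ArrayTypeSketch in the caller's branch.
function isArrayTypeSketch(type: SchemaTypeSketch): type is ArrayTypeSketch {
  return type.bsonType === 'Array';
}

function describeType(type: SchemaTypeSketch): string {
  if (isArrayTypeSketch(type)) {
    // No `type as ArrayTypeSketch` cast is needed: `lengths` is visible here.
    return `array with ${type.lengths.length} sampled lengths`;
  }
  return `non-array type ${type.name}`;
}

console.log(describeType({ name: 'Array', bsonType: 'Array', lengths: [2, 3] }));
// → 'array with 2 sampled lengths'
console.log(describeType({ name: 'String', bsonType: 'String' }));
// → 'non-array type String'
```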


const arrayPath = `${currentPath}[]`;
const sampleValues =
arraySampleValues || getSampleValues(arrayType).slice(0, 3); // Limit full-context array sample values to 3


nit: instead of the magic 3, use a constant
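
One possible shape for this, as a sketch only: `MAX_ARRAY_SAMPLE_VALUES` and `limitSamples` are hypothetical names, and the PR may choose different ones.

```typescript
// Hypothetical constant name; it replaces the inline literal 3.
const MAX_ARRAY_SAMPLE_VALUES = 3;

// Keeps at most `limit` sample values, defaulting to the named constant.
function limitSamples<T>(samples: T[], limit: number = MAX_ARRAY_SAMPLE_VALUES): T[] {
  return samples.slice(0, limit);
}

console.log(limitSamples(['a', 'b', 'c', 'd', 'e'])); // → ['a', 'b', 'c']
```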

@kpamaran kpamaran left a comment

had some minor feedback; lgtm overall

@@ -30,9 +29,16 @@ export type SchemaAnalysisErrorState = {
error: SchemaAnalysisError;
};

export interface FieldInfo {
type: string; // MongoDB type (eg. String, Double, Array, Document)

Collaborator

could type be a string union type?
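
A sketch of what that could look like. The four names are taken from the code comment on `type`; the full set of MongoDB type names is not listed in this thread, so the list below is an illustrative subset, and `MongoTypeName`, `FieldInfoSketch`, and `isMongoTypeName` are hypothetical.

```typescript
// Illustrative subset only; the real union would need every type name the analysis can emit.
const MONGO_TYPE_NAMES = ['String', 'Double', 'Array', 'Document'] as const;
type MongoTypeName = typeof MONGO_TYPE_NAMES[number];

interface FieldInfoSketch {
  type: MongoTypeName; // was: string
}

// Runtime check for values that arrive as plain strings.
function isMongoTypeName(value: string): value is MongoTypeName {
  return (MONGO_TYPE_NAMES as readonly string[]).indexOf(value) !== -1;
}

const field: FieldInfoSketch = { type: 'Double' };
// const bad: FieldInfoSketch = { type: 'double' }; // compile error: not in the union
console.log(field.type, isMongoTypeName('double'));
```

Deriving the union from a `const` array keeps the compile-time type and the runtime check in sync.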

@@ -66,7 +66,8 @@
"react": "^17.0.2",
"react-redux": "^8.1.3",
"redux": "^4.2.1",
-"redux-thunk": "^2.4.2"
+"redux-thunk": "^2.4.2",
+"bson": "^6.10.1"


since bson is only used for its types, it should be a dev dependency so it doesn't get bundled to prod

Collaborator Author

That's what I thought at first too. However, the dependency checker was still complaining even with using import type.

Collaborator Author

I tried adding 'bson' to the ignores in packages/compass-collection/.depcheckrc. But not sure

Collaborator Author


function isPrimitiveSchemaType(type: SchemaType): type is PrimitiveSchemaType {
return (
!isConstantSchemaType(type) &&

@kpamaran kpamaran Aug 20, 2025

[nit] null and undefined classify as primitives too

Collaborator Author

@kpamaran kpamaran Aug 20, 2025

weird, let's go with consistency then. I disagree with that source's typing but it may be working around some type issue

Labels
feat, no release notes (Fix or feature not for release notes)

3 participants