Skip to content

draft: augment yaml definitions with test schema #1659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: v2.x
Choose a base branch
from

Conversation

nirinchev
Copy link

This is very much a POC, but I figured I'd open up a PR to get early feedback on the direction we're heading in.

For context, the devtools team wants to use the yaml definitions from the PHP driver to build an autocomplete experience for the aggregation pipelines in mongosh/compass. We've built a tool to convert the yaml to typescript definitions and it seems to be working well for the base case, but to polish it, we'd need to enhance the yaml definitions to allow us to generate more precise type definitions.

Here's a list of the high-level changes we're proposing (will be updated over time):

  • Update schema.json to remove duplicate $comment fields (since most json parsers are not happy about it).
  • Update all yaml definitions to add schema information for the test cases.
    • The schema shape matches SimplifiedSchema from the mongodb-schema package. This is a little bit verbose since it supports multiple types per field, even though the docs rarely have heterogeneous schemas.
    • Schema is generated automatically by crawling the docs website at the link pointed to by the test and then using the mongodb-schema package to infer a probabilistic schema from the example collection.
  • Removed a few tests that no longer seem to point to anything in the docs - e.g. firstN:Null and Missing Values.

Copy link
Member

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good idea to have the schema of the source collection.

@@ -1,6 +1,6 @@
# $schema: ../schema.json
name: $accumulator
link: 'https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to keep quotes around strings? This style changes make the diff hard to read. Also, I use quotes by default around every string to prevent unexpected parsing rule from Yaml.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, looks like the quotes usage is inconsistent in these files - I'd be inclined to define a prettier rule and reformat them in a separate PR to make this one less noisy if that'd make sense to you. The reason for these getting removed is that the way we're adding the schema information is parsing the yaml, fetching the schema from the docs, then writing it, which results in the file getting reformatted based on the tool conventions (we use https://www.npmjs.com/package/js-yaml here).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, at least prettier supports YAML out of the box, so that feels like a great way to remove any formatting preference ambiguity

@@ -25,7 +25,7 @@ arguments:
-
name: p
type:
- resolvesToArray # of resolvesToNumber
Copy link
Member

@GromNaN GromNaN Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was keeping this comments as a note for later, as we would like to extend type definition for generic lists and hashes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense - same as above, those got removed due to the yaml parser not preserving comments, but will try and find a way to avoid that.

pipeline:
-
$sample:
size: 100
-
$group:
_id: ~
_id: null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted that both ~ and null can be used according to the spec.

Comment on lines 48 to 49
onNull:
bytes: !!binary AAAAAAAAAAAAAAAAAABAMA==
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new notation seems less legible to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, will look at tweaking the generator to ensure it preserves the old notation.

@@ -24,99 +24,122 @@ arguments:
A positive integral expression that is either a constant or depends on the _id value for $group.
tests:
-
name: 'Null and Missing Values'
link: 'https://www.mongodb.com/docs/manual/reference/operator/aggregation/firstN/#null-and-missing-values'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed a few tests that no longer seem to point to anything in the docs - e.g. "firstN:Null and Missing Values".

This example is in the "Behavior" part of the docs.

https://www.mongodb.com/docs/manual/reference/operator/aggregation/firstN/#null-and-missing-values

"type": "string",
"enum": [
"array",
"object",
"single",
"search"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this value is not used.

Comment on lines 323 to 345
"enum": [
"Array",
"Binary",
"Boolean",
"Code",
"CodeWScope",
"Date",
"Decimal128",
"Double",
"Int32",
"Int64",
"MaxKey",
"MinKey",
"Null",
"ObjectId",
"BSONRegExp",
"String",
"BSONSymbol",
"Timestamp",
"Undefined",
"Document",
"Number"
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to converge on the same type names for operator return types and argument types. I'm fine with changing the names in expressions.php to be consistent with other tools.

"Binary",
"Boolean",
"Code",
"CodeWScope",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the CodeWScope type specific to mongosh?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the deprecated code_w_s bson type. Realistically, I don't expect any of the more advanced types to be found in the docs, but I generated the schema from the type definitions and opted not to manually edit it to make it easier to maintain if it ever changes. If we wanted to clean it up, we can definitely remove the deprecated types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GromNaN: For additional context, PHPC conditionally encodes Code or CodeWScope based on whether the MongoDB\BSON\Javascript object was constructed with a non-null $scope parameter (see: php_phongo_javascript_init).

@@ -11,7 +11,7 @@ arguments:
name: specification
type:
- expression
variadic: object
variadic: object # XXX: This should somehow allow explicit typings for { fieldName: 1|0 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is "expression", because any expression is accepted to compute a value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. For IDE integrations/user support it's still valuable to be able to say "while this can be an arbitrary document, frequently it will have keys that match document keys and values which are either 0 or 1"

Copy link
Member

@GromNaN GromNaN May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need an additional type for the projection. Maybe an enum:

0: Suppress
1: Include

And this parameter accepts an union of ProjectEnum|expression

- resolvesToArray
- expression
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was conflicting with the docs, which say that it can be any expression. It was particularly problematic in the $group example since the actual input field doesn't need to be an array, it only becomes one in the context of a $group, which is hard to statically enforce. I've relaxed the requirement in the interest of reducing the false negatives, at the cost of potentially accepting more false positives.

Comment on lines 21 to 23
$avg:
- $quizzes
$avg: $quizzes
labAvg:
$avg:
- $labs
$avg: $labs
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem to have been errors, right? At least in the docs, those are field expressions rather than arrays.

@@ -9,7 +9,7 @@ description: |
arguments:
- name: value
type:
- any
- expression
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this was any previously, but happy to revert if expression is incorrect here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants