Skip to content

[CDRIVER-6017] BSON Validation Refactor #2026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

vector-of-bool
Copy link
Contributor

This changeset is a large refactor of the bson_validate implementation.

Existing validation test cases were rewritten to use a uniform code generation mechanism that notes the purpose of each case. Many more validation scenarios were added.

The BSON_VALIDATE_UTF8 flags have been removed from the default flags used by the CRUD API. Because of a bug, UTF-8 validation wasn't being applied to the CRUD API, so removing the flag is necessary since it would otherwise be a breaking change.

The new validation is written as a straightforward recursive algorithm, and doesn't use the error-prone bson_visit API.

The following changes are of note:

  1. The error code BSON_VALIDATE_CORRUPT is now set on corruption. Previously, this was set to BSON_VALIDATE_NONE, which had a value of zero. This was very confusing, as it was also set to zero in case of success.
  2. Doc comments are added to better explain the behavior of the validation flags.
  3. The "error offset" parameter is now adjusted properly in the case of document nesting.
  4. Error message strings have been updated and made more explanatory.
  5. BSON data that has over 1000 levels of nesting is now rejected, unconditionally.

Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor comments. I found the test generator easy to read and modify. And I like the move away from bson_iter_visit_all. LGTM

return false;
}

// Validate an element's key string according to the valiation rules
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Validate an element's key string according to the valiation rules
// Validate an element's key string according to the validation rules

// Check for special keys
if (self->params->check_special_dollar_keys) {
// dollar-keys are checked during the startup of _validate_doc. If we get here, there's a problem.
require_with_error (key[0] != '$', iter->off, BSON_VALIDATE_DOLLAR_KEYS, "Disallowed element key: \"%s\"", key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require_with_error (key[0] != '$', iter->off, BSON_VALIDATE_DOLLAR_KEYS, "Disallowed element key: \"%s\"", key);
require_with_error (key[0] != '$', iter->off, BSON_VALIDATE_DOLLAR_KEYS, "Disallowed '$' in element key: \"%s\"", key);

Suggest including offending character in error.

}

if (!self->params->allow_dot_in_keys) {
require_with_error (!strstr (key, "."), iter->off, BSON_VALIDATE_DOT_KEYS, "Disallowed element key: \"%s\"", key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require_with_error (!strstr (key, "."), iter->off, BSON_VALIDATE_DOT_KEYS, "Disallowed element key: \"%s\"", key);
require_with_error (!strstr (key, "."), iter->off, BSON_VALIDATE_DOT_KEYS, "Disallowed '.' in element key: \"%s\"", key);

vector-of-bool and others added 24 commits June 4, 2025 18:09
The new `bson_validate` implementation does not
make use of the error-prone `bson_visit` APIs. Instead, it is written
as a simple recursive validator. The new validator respects requests
for UTF-8 validation properly.
The existing test cases used BSON files, and didn't have
any commentary on what they were actually testing. New test cases are
generated from a Python shorthand and contain the tested bytes inline,
with a distinct test case for each actual validation scenario.
These include small changes in spelling, param assertions, and small cleanups

Co-authored-by: Kevin Albertson <[email protected]>
Co-authored-by: Ezra Chung <[email protected]>
- Add `bson/validate-private.h` to contain the declaration
  of the validation impl.
- Add BSON_VALIDATE_MAX_NESTING_DEPTH to represent the validation limit.
- Change the nesting limit to 100, to match the behavior of Server
@vector-of-bool vector-of-bool force-pushed the CDRIVER-6017-validate-fixes branch from 7a48211 to f0f60b5 Compare June 5, 2025 20:45
@vector-of-bool
Copy link
Contributor Author

Accidental rebase plz ignore.

I believe I've address all outstanding change requests, but may have missed some as the GitHub UI has been very uncooperative this week.

@vector-of-bool vector-of-bool requested a review from eramongodb June 5, 2025 20:46
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor feedback remaining; otherwise, LGTM.

BSON_ASSERT_PARAM (bson);

// The depth limit of 100 is chosen to match the limit enforced by MongoDB server.
// Refer: https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-Nested-Depth-for-BSON-Documen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Refer: https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-Nested-Depth-for-BSON-Documen
// Refer: https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-Nested-Depth-for-BSON-Documents

End of URL is clipped.

* @return true If the given document has no validation errors
* @return false Otherwise
*/
extern bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extern bool
bool

extern should (now) be unnecessary.

* The depth limit of 100 is chosen to match the limit enforced by MongoDB server.
* Refer: https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-Nested-Depth-for-BSON-Document
*/
BSON_VALIDATION_MAX_NESTING_DEPTH = 100,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
BSON_VALIDATION_MAX_NESTING_DEPTH = 100,
BSON_VALIDATE_MAX_NESTING_DEPTH = 100,

All other validation-related constants use a prefix of BSON_VALIDATE_*.

Do we really need to expose this in the public API? Can it be moved to <bson/validate-private.h> instead? Test code is able to include private headers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in the public header because I think it might be useful to users to know the limit. Because it's public, I intentionally chose the differing name to prevent a user from thinking that this is meaningfull:

bson_validate(d, BSON_VALIDATE_UTF8 | BSON_VALIDATE_MAX_NESTING_DEPTH, NULL);

It may actually be better to have a function bson_get_max_nesting_depth(), in case it is changed in the future and may vary depending on what library they load at runtime?

Copy link
Contributor

@eramongodb eramongodb Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation.

It might be useful to users to know the limit.

I'm not sure about that, but in that case, perhaps BSON_MAX_NESTING_DEPTH via bson/bson.h (alongside BSON_MAX_SIZE) might be better, since it is not a limit that is restricted to validation only? (Apparently we have #define BCON_STACK_MAX 100 in bson/bcon.h, but that doesn't seem to be meant to be used outside of the component.)

It may actually be better to have a function bson_get_max_nesting_depth()

I do not expect the limit to change anytime soon, so even if we expose it in public API, I do not mind leaving it as a compile-time constant.

@vector-of-bool
Copy link
Contributor Author

Note for future readers:

The validation limit was set to 1000, as it is possible that users want to use documents deeper than are accepted by the server, and it is also possible that the server can accept documents with deeper nesting levels. The purpose of the limit is mostly to prevent deep recursion, not to prevent invalid data.

@vector-of-bool vector-of-bool merged commit d184525 into mongodb:master Jun 6, 2025
37 of 39 checks passed
vector-of-bool added a commit to vector-of-bool/mongo-c-driver that referenced this pull request Jun 6, 2025
* New BSON validation routine rewrite

The new `bson_validate` implementation does not
make use of the error-prone `bson_visit` APIs. Instead, it is written
as a simple recursive validator. The new validator respects requests
for UTF-8 validation properly.

* Stop validating at 1000 depth, preventing stack overflow
* Replace most BSON validation tests with generated ones

The existing test cases used BSON files, and didn't have
any commentary on what they were actually testing. New test cases are
generated from a Python shorthand and contain the tested bytes inline,
with a distinct test case for each actual validation scenario.

* Disable UTF-8 validation by default on CRUD APIs
* Document and tweak the value of BSON_VALIDATE_CORRUPT
* Add test cases related to the overlong null encoding
* Tweak JS scope validation to permit more obj keys
* Add a NEWS entry for validation changes.
* Allow `-private.h` headers to not include the prelude header

---------

Co-authored-by: Kevin Albertson <[email protected]>
Co-authored-by: Ezra Chung <[email protected]>
vector-of-bool added a commit to vector-of-bool/mongo-c-driver that referenced this pull request Jun 6, 2025
* New BSON validation routine rewrite

The new `bson_validate` implementation does not
make use of the error-prone `bson_visit` APIs. Instead, it is written
as a simple recursive validator. The new validator respects requests
for UTF-8 validation properly.

* Stop validating at 1000 depth, preventing stack overflow
* Replace most BSON validation tests with generated ones

The existing test cases used BSON files, and didn't have
any commentary on what they were actually testing. New test cases are
generated from a Python shorthand and contain the tested bytes inline,
with a distinct test case for each actual validation scenario.

* Disable UTF-8 validation by default on CRUD APIs
* Document and tweak the value of BSON_VALIDATE_CORRUPT
* Add test cases related to the overlong null encoding
* Tweak JS scope validation to permit more obj keys
* Add a NEWS entry for validation changes.
* Allow `-private.h` headers to not include the prelude header

---------

Co-authored-by: Kevin Albertson <[email protected]>
Co-authored-by: Ezra Chung <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants