
Fix Blake2b hash #5089

Open
wants to merge 5 commits into base: main

Conversation

terryquigleysas
Contributor

Description

  • Bug fix
  • Blake2b is deterministic. Passing the parameters incorrectly results in the wrong hash being produced.
  • Old behavior vs. new behavior: this may be considered a "Breaking Change" for 3.0.0, as the hashes produced will now be different - correct, but different from before.
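For illustration, Python's hashlib exposes BLAKE2b's dedicated salt parameter; this sketch (illustrative values, not the plugin's actual Java code) shows why misplacing a parameter yields a different, though still deterministic, hash:

```python
import hashlib

value = b"some-masked-field-value"
salt = b"0123456789abcdef"  # BLAKE2b accepts a salt of up to 16 bytes

# Correct usage: the salt goes into its dedicated parameter slot.
correct = hashlib.blake2b(value, digest_size=32, salt=salt).hexdigest()

# Hypothetical misuse: folding the salt into the message instead.
wrong = hashlib.blake2b(salt + value, digest_size=32).hexdigest()

# Both calls are deterministic, but they disagree -- so fixing the
# parameter passing changes every previously produced hash.
print(correct != wrong)  # True
print(correct == hashlib.blake2b(value, digest_size=32, salt=salt).hexdigest())  # True
```

Both digests are stable across runs; the breaking change is that every stored or compared hash value moves from one deterministic result to the other.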

Issues Resolved

Resolves #4274

Testing

Updated existing tests.
Ran Bulk Integration Test action against the branch.
Local testing.

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

cwperks previously approved these changes Feb 5, 2025
Signed-off-by: Terry Quigley <[email protected]>
@nibix
Collaborator

nibix commented Feb 6, 2025

Note: In a mixed cluster state, this will yield inconsistent results: Some nodes will hash value A to hash X, while other nodes will hash value A to hash Y. This will especially affect aggregations. Are we okay with this (genuine question)? In any case, this should be documented.

@terryquigleysas
Contributor Author

> Note: In a mixed cluster state, this will yield inconsistent results: Some nodes will hash value A to hash X, while other nodes will hash value A to hash Y. This will especially affect aggregations. Are we okay with this (genuine question)? In any case, this should be documented.

@nibix Thank you for your comment. That is not something I was aware of.

I still strongly lean towards putting this change in.

  • It is a clear bug with incorrect behavior that should be fixed
  • Now that version 3.x is imminent, this would be a good time to make the change
  • I totally agree that any potential behavior differences should be documented

@nibix
Collaborator

nibix commented Feb 10, 2025

@terryquigleysas One way to work around that issue would be to gate the new behavior behind a config option that can be changed at runtime (in config.yml, for example). That way, the behavior can be switched from old to new nearly instantaneously, massively reducing the chance of inconsistent aggregations.

@terryquigleysas
Contributor Author

> @terryquigleysas One way to work around that issue would be to gate the new behavior by a config option that can be changed at runtime (In config.yml for example). That way, the behavior can be changed from old to new nearly instantaneously, thus reducing the chance of inconsistent aggregations massively.

@nibix What would you suggest naming the property?
Would the default for 3.x be the old behavior or the fixed behavior?

@nibix
Collaborator

nibix commented Feb 10, 2025

Just to avoid any misunderstanding: I am in no position to give authoritative rulings on such changes. I can only give my opinion and my recommendations. Any unclear issues need to be clarified in a community-driven process.

To reiterate the issue:

  • Any upgrade of an existing OpenSearch cluster with high availability requirements to a new version is done using the "rolling upgrade" technique. In this technique, one or a few nodes are removed from the cluster, upgraded and then added to the cluster again. This is repeated until all the nodes are on the new version.
  • Thus, there will be a phase where the cluster consists of nodes of two different versions. This is called a "mixed cluster" state. For larger clusters, such a rolling upgrade can take a significant time, possibly a day or more.
  • Index contents are usually spread around over several different nodes in shards. The code we are looking at in this PR operates on the shard level. Thus, if shard 1 of an index is on OpenSearch version A, and shard 2 of the same index is on OpenSearch version B, two different versions of OpenSearch will process the data and especially the field masking functionality. One random node will then have the responsibility to combine the sub-responses of the individual nodes.
  • Thus, if the field masking logic is changed, there can be cases where search and aggregation results contain the combination of the old logic with the new logic.

A dynamically changeable config flag would solve this the following way:

  • Initially, the config flag retains the old behavior.
  • The cluster is upgraded to the new version.
  • After the upgrade is complete, an admin can change the config flag to the new behavior.

If the config flag were initially set to the new behavior, the issue could not actually be avoided.

Having said this, this approach does have the downside that it requires manual intervention by an admin after the rolling upgrade completes. There is also no easy way to automate that.

Another alternative solution might be to use the cluster state to check whether the cluster is in a mixed state or not. The cluster state API provides methods for that. However, to be honest, I am not sure how easily these APIs can be accessed from the very low-level code we are talking about.

@cwperks
Member

cwperks commented Feb 10, 2025

FYI There is a class called ClusterInfoHolder that listens to changes in cluster state and can be interrogated to find the min node version in a cluster.

@cwperks
Member

cwperks commented Feb 18, 2025

@terryquigleysas @nibix

How should we proceed here? As far as I see, there are 2 choices:

  • Document that aggregations could be inaccurate in a mixed cluster
  • Implement logic to check if the min node in a cluster is below 3_0_0 and intentionally do aggregations w/ the old logic and then switch over to new logic once min node version in the cluster is >= 3_0_0
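The second option could be sketched as follows (Python pseudocode with hypothetical names; the actual plugin code is Java and would consult ClusterInfoHolder for the minimum node version):

```python
# Hypothetical sketch of version-gated hashing: keep the legacy algorithm
# until every node in the cluster is on 3.0.0 or later.
LEGACY_CUTOVER = (3, 0, 0)

def pick_hash_impl(min_node_version, legacy_hash, fixed_hash):
    """Return the hash function that all nodes in the cluster can agree on."""
    return fixed_hash if min_node_version >= LEGACY_CUTOVER else legacy_hash

# Stand-ins for the two hashing behaviors.
legacy = lambda v: "legacy:" + v
fixed = lambda v: "fixed:" + v

# Mixed cluster (oldest node is 2.19.0): stay on the old behavior.
print(pick_hash_impl((2, 19, 0), legacy, fixed)("A"))  # legacy:A

# Fully upgraded cluster: switch to the corrected hash.
print(pick_hash_impl((3, 0, 0), legacy, fixed)("A"))   # fixed:A
```

The catch, discussed below, is that the minimum node version would have to be consulted on every hashing call in very low-level code, and the switchover happens at an uncontrolled point in time.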

@nibix
Collaborator

nibix commented Feb 19, 2025

@cwperks

> How should we proceed here? As far as I see, there are 2 choices:
>
> * Document that aggregations could be inaccurate in a mixed cluster

This has the downside that it makes certain use cases impossible in mixed cluster states. If there are, for example, alerting solutions on such data, they might produce false positives during this phase.

What are the exact high availability and compatibility promises OpenSearch makes? I guess we need to know these in order to decide whether this is viable.

> * Implement logic to check if the min node in a cluster is below 3_0_0 and intentionally do aggregations w/ the old logic and then switch over to new logic once min node version in the cluster is >= 3_0_0

One downside here is also that the change happens at a somewhat uncontrolled point in time. If there are use cases which depend on specific hashes, it might also be challenging to react to the changed hashes at the right point in time.

I think there are a couple of further options:

  • Introduce a configuration option to control the behavior, but do not initially change the behavior. Communicate to users that there will be an upcoming change and that they should change that option proactively in order to avoid uncontrolled incidents in their applications caused by the change. In a later version, then change the behavior to the correct Blake2b hash.

  • Just keep it the way it is and give users an additional option to get a "correct" Blake2b hash. If I understand Blake2b correctly (please correct me if I am wrong!), the salt and personalization parameters are just concatenated together into an IV. Thus, the current use of the parameters does not reduce the strength of the hashing; it just produces results which are inconsistent with a correct application of the parameters. The option would be to document that, by default, a non-standard hashing is used. Additionally, users should be given the choice to explicitly specify the Blake2b hash in the role configuration.
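The point about the parameters can be checked with Python's hashlib, which exposes both BLAKE2b slots (illustrative values, not the plugin's code): the same bytes produce different digests depending on which slot they land in, but either digest is a full-strength BLAKE2b output.

```python
import hashlib

value = b"example-field-value"
params = b"16-byte-param-ok"  # 16 bytes, the maximum for both parameters

# The same bytes passed as salt vs. as personalization land in different
# words of the BLAKE2b parameter block, so the digests differ.
as_salt = hashlib.blake2b(value, digest_size=32, salt=params).hexdigest()
as_person = hashlib.blake2b(value, digest_size=32, person=params).hexdigest()
print(as_salt != as_person)  # True

# Either way the output is a 256-bit BLAKE2b digest; using the "wrong"
# slot changes the result, not the security of the hash.
print(len(as_salt), len(as_person))  # 64 64
```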

@terryquigleysas
Contributor Author

terryquigleysas commented Feb 19, 2025

@nibix @cwperks Thank you for the comments and suggestions. I have been on vacation for a few days and am just catching up on this.

I have looked at checking for the min cluster version. Unfortunately, I don't think it is feasible, for several reasons. As mentioned above, making these settings available to the relevant code doesn't look trivial. There would likely be an unwanted performance hit for the checks, and even then it would still produce erratic results in various scenarios.

I think we could, however, use an existing option to support setting the default masking algorithm to revert to the legacy behavior (see https://opensearch.org/docs/latest/security/access-control/field-masking/#advanced-use-an-alternative-hash-algorithm). For example:

plugins.security.masked_fields.algorithm.default: BLAKE2B_LEGACY_DEFAULT

This means that:

  • The new code hashes correctly by default, fixing the bug
  • If a user is concerned about inconsistent results in a mixed cluster, BLAKE2B_LEGACY_DEFAULT can be set on the 3.x nodes
  • If a user wishes to retain the old hashes, for whatever reason, BLAKE2B_LEGACY_DEFAULT can also be set
  • This ensures that the hashes produced are consistent and deterministic
  • This would need to be documented

What do you think?

Successfully merging this pull request may close these issues.

[BUG] Blake2b hashing for Masked Fields does not apply salt correctly