
Fix Blake2b hash #5089

Open
wants to merge 5 commits into base: main

Conversation

terryquigleysas
Contributor

Description

  • Bug fix
  • Blake2b is deterministic. Passing the parameters incorrectly results in the wrong hash being produced.
  • Old behavior vs. new behavior: this may be considered a "Breaking Change" for 3.0.0, as the hashes produced will now be different - correct, but different from before.
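For illustration, Python's hashlib exposes BLAKE2b's dedicated salt parameter; this sketch (illustrative values, not the plugin's actual Java code) shows why misplacing a parameter yields a different, though still deterministic, hash:

```python
import hashlib

value = b"some-masked-field-value"
salt = b"0123456789abcdef"  # BLAKE2b accepts a salt of up to 16 bytes

# Correct usage: the salt goes into its dedicated parameter slot.
correct = hashlib.blake2b(value, digest_size=32, salt=salt).hexdigest()

# Hypothetical misuse: folding the salt into the message instead.
wrong = hashlib.blake2b(salt + value, digest_size=32).hexdigest()

# Both calls are deterministic, but they disagree -- so fixing the
# parameter passing changes every previously produced hash.
print(correct != wrong)  # True
print(correct == hashlib.blake2b(value, digest_size=32, salt=salt).hexdigest())  # True
```

Both digests are stable across runs; the breaking change is that every stored or compared hash value moves from one deterministic result to the other.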

Issues Resolved

Resolves #4274

Testing

Updated existing tests.
Ran Bulk Integration Test action against the branch.
Local testing.

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

cwperks previously approved these changes Feb 5, 2025
Signed-off-by: Terry Quigley <[email protected]>
@nibix
Collaborator

nibix commented Feb 6, 2025

Note: In a mixed cluster state, this will yield inconsistent results: Some nodes will hash value A to hash X, while other nodes will hash value A to hash Y. This will especially affect aggregations. Are we okay with this (genuine question)? In any case, this should be documented.

@terryquigleysas
Contributor Author

> Note: In a mixed cluster state, this will yield inconsistent results: Some nodes will hash value A to hash X, while other nodes will hash value A to hash Y. This will especially affect aggregations. Are we okay with this (genuine question)? In any case, this should be documented.

@nibix Thank you for your comment. That is not something I was aware of.

I still strongly lean towards putting this change in.

  • It is a clear bug with incorrect behavior that should be fixed
  • Now that version 3.x is imminent, this would be a good time to make the change
  • I totally agree that any potential behavior differences should be documented

@nibix
Collaborator

nibix commented Feb 10, 2025

@terryquigleysas One way to work around that issue would be to gate the new behavior behind a config option that can be changed at runtime (in config.yml, for example). That way, the behavior can be switched from old to new nearly instantaneously, massively reducing the chance of inconsistent aggregations.

@terryquigleysas
Contributor Author

> @terryquigleysas One way to work around that issue would be to gate the new behavior by a config option that can be changed at runtime (In config.yml for example). That way, the behavior can be changed from old to new nearly instantaneously, thus reducing the chance of inconsistent aggregations massively.

@nibix What would you suggest naming the property?
Would the default for 3.x be the old behavior or the fixed behavior?

@nibix
Collaborator

nibix commented Feb 10, 2025

Just to avoid any misunderstanding: I am in no position to give authoritative rulings on such changes. I can only give my opinion and my recommendations. Any unclear issues need to be clarified in a community-driven process.

To reiterate the issue:

  • Any upgrade of an existing OpenSearch cluster with high availability requirements to a new version is done using the "rolling upgrade" technique. In this technique, one or a few nodes are removed from the cluster, upgraded and then added to the cluster again. This is repeated until all the nodes are on the new version.
  • Thus, there will be a phase where the cluster consists of nodes of two different versions. This is called a "mixed cluster" state. For larger clusters, such a rolling upgrade can take a significant time, possibly a day or more.
  • Index contents are usually spread around over several different nodes in shards. The code we are looking at in this PR operates on the shard level. Thus, if shard 1 of an index is on OpenSearch version A, and shard 2 of the same index is on OpenSearch version B, two different versions of OpenSearch will process the data and especially the field masking functionality. One random node will then have the responsibility to combine the sub-responses of the individual nodes.
  • Thus, if the field masking logic is changed, there can be cases where search and aggregation results contain the combination of the old logic with the new logic.

A dynamically changeable config flag would solve this the following way:

  • Initially, the config flag retains the old behavior.
  • The cluster is upgraded to the new version.
  • After the upgrade is complete, an admin can change the config flag to the new behavior.

If the config flag were initially set to the new behavior, the issue could not actually be avoided.

Having said this, this approach does have the downside that it requires manual intervention by an admin after the rolling upgrade completes. There is also no easy way to automate that.

Another alternative solution might be to use the cluster state to check whether the cluster is in a mixed state or not. The cluster state API provides methods for that. However, to be honest, I am not sure how easily these APIs can be accessed from the very low-level code we are talking about.

@cwperks
Member

cwperks commented Feb 10, 2025

FYI There is a class called ClusterInfoHolder that listens to changes in cluster state and can be interrogated to find the min node version in a cluster.

@cwperks
Member

cwperks commented Feb 18, 2025

@terryquigleysas @nibix

How should we proceed here? As far as I see, there are 2 choices:

  • Document that aggregations could be inaccurate in a mixed cluster
  • Implement logic to check if the min node in a cluster is below 3_0_0 and intentionally do aggregations w/ the old logic and then switch over to new logic once min node version in the cluster is >= 3_0_0
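The second option could be sketched as follows (Python pseudocode with hypothetical names; the actual plugin code is Java and would consult ClusterInfoHolder for the minimum node version):

```python
# Hypothetical sketch of version-gated hashing: keep the legacy algorithm
# until every node in the cluster is on 3.0.0 or later.
LEGACY_CUTOVER = (3, 0, 0)

def pick_hash_impl(min_node_version, legacy_hash, fixed_hash):
    """Return the hash function that all nodes in the cluster can agree on."""
    return fixed_hash if min_node_version >= LEGACY_CUTOVER else legacy_hash

# Stand-ins for the two hashing behaviors.
legacy = lambda v: "legacy:" + v
fixed = lambda v: "fixed:" + v

# Mixed cluster (oldest node is 2.19.0): stay on the old behavior.
print(pick_hash_impl((2, 19, 0), legacy, fixed)("A"))  # legacy:A

# Fully upgraded cluster: switch to the corrected hash.
print(pick_hash_impl((3, 0, 0), legacy, fixed)("A"))   # fixed:A
```

The catch, discussed below, is that the minimum node version would have to be consulted on every hashing call in very low-level code, and the switchover happens at an uncontrolled point in time.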

@nibix
Collaborator

nibix commented Feb 19, 2025

@cwperks

> How should we proceed here? As far as I see, there are 2 choices:
>
> * Document that aggregations could be inaccurate in a mixed cluster

This has the downside that it makes certain use cases impossible in mixed cluster states. If there are, for example, alerting solutions on such data, they might produce false positives during this phase.

What are the exact high availability and compatibility promises OpenSearch makes? I guess we need to know these in order to decide whether this is viable.

> * Implement logic to check if the min node in a cluster is below 3_0_0 and intentionally do aggregations w/ the old logic and then switch over to new logic once min node version in the cluster is >= 3_0_0

One downside here is also that the change happens at a somewhat uncontrolled point in time. If there are use cases which depend on specific hashes, it might also be challenging to react to the changed hashes at the right point in time.

I think there are a couple of further options:

  • Introduce a configuration option to control the behavior, but do not initially change the behavior. Communicate to users that there will be an upcoming change and that they should change that option proactively in order to avoid uncontrolled incidents in their applications caused by the change. In a later version, then change the behavior to the correct Blake2b hash.

  • Just keep it the way it is and give users an additional option to get a "correct" Blake2b hash. If I understand Blake2b correctly (please correct me if I am wrong!), the salt and personalization parameters are just concatenated together into an IV. Thus, the current use of the parameters does not reduce the strength of the hashing; it just produces results which are inconsistent with a correct application of the parameters. The option would be to document that, by default, a non-standard hashing is used. Additionally, users should be given the choice to explicitly specify the Blake2b hash in the role configuration.
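The point about the parameters can be checked with Python's hashlib, which exposes both BLAKE2b slots (illustrative values, not the plugin's code): the same bytes produce different digests depending on which slot they land in, but either digest is a full-strength BLAKE2b output.

```python
import hashlib

value = b"example-field-value"
params = b"16-byte-param-ok"  # 16 bytes, the maximum for both parameters

# The same bytes passed as salt vs. as personalization land in different
# words of the BLAKE2b parameter block, so the digests differ.
as_salt = hashlib.blake2b(value, digest_size=32, salt=params).hexdigest()
as_person = hashlib.blake2b(value, digest_size=32, person=params).hexdigest()
print(as_salt != as_person)  # True

# Either way the output is a 256-bit BLAKE2b digest; using the "wrong"
# slot changes the result, not the security of the hash.
print(len(as_salt), len(as_person))  # 64 64
```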

@terryquigleysas
Contributor Author

terryquigleysas commented Feb 19, 2025

@nibix @cwperks Thank you for the comments and suggestions. I have been on vacation for a few days and am just catching up on this.

I have looked at checking for the min cluster version. Unfortunately, I don't think it is feasible, for several reasons. As mentioned above, making these settings available to the relevant code doesn't look trivial. There would likely be an unwanted performance hit for the checks, and even then it would still produce erratic results in various scenarios.

I think we could, however, use an existing option to support setting the default masking algorithm to revert to the legacy behavior (see https://opensearch.org/docs/latest/security/access-control/field-masking/#advanced-use-an-alternative-hash-algorithm). For example:

plugins.security.masked_fields.algorithm.default: BLAKE2B_LEGACY_DEFAULT

This means that:

  • The new code hashes correctly by default, fixing the bug
  • If a user is concerned about inconsistent results in a mixed cluster, BLAKE2B_LEGACY_DEFAULT can be set on the 3.x nodes
  • If a user wishes to retain the old hashes, for whatever reason, BLAKE2B_LEGACY_DEFAULT can also be set
  • This ensures that the hashes produced are consistent and deterministic
  • This would need to be documented

What do you think?

Successfully merging this pull request may close these issues.

[BUG] Blake2b hashing for Masked Fields does not apply salt correctly