Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: skip duplicated puts when calculating put return time #19383

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joshuazh-x
Copy link
Contributor

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

issue #19303

Skip duplicated persisted put requests when adjusting put return time.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshuazh-x
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

Hi @joshuazh-x. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

codecov bot commented Feb 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.90%. Comparing base (9de211d) to head (2a6cad2).
Report is 61 commits behind head on main.

Additional details and impacted files

see 34 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19383      +/-   ##
==========================================
- Coverage   68.98%   68.90%   -0.09%     
==========================================
  Files         420      420              
  Lines       35739    35753      +14     
==========================================
- Hits        24656    24635      -21     
- Misses       9660     9695      +35     
  Partials     1423     1423              

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9de211d...2a6cad2. Read the comment docs.

@@ -214,6 +216,12 @@ func putReturnTime(allOperations []porcupine.Operation, reports []report.ClientR
continue
}
kv := keyValue{Key: op.Put.Key, Value: op.Put.Value}
if persistedPutCount[kv] > 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that we don't check uniqueness of put request here, however on line 92 we check if request was sent uniquely by client. This should prevent patching this, or is this a case that put is unique from client perspective but was persisted twice?

Copy link
Contributor Author

@joshuazh-x joshuazh-x Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 92 is called after putReturnTime where duplicated requests in wal entries will be hit. Maybe we can make it simple by skipping persisted dup requests in PutReturnTime.

Copy link
Member

@serathius serathius Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to keep calculations of maps like clientPutCount and putReturnTime independent. Instead of trying to check all conditions everywhere, they are checked once at the top. Have you confirmed that there is no duplicate in client requests?

Copy link
Contributor Author

@joshuazh-x joshuazh-x Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dups in client requests and they match those in wal. But skipping them won't help because the incorrect return time does not come from the dup requests. The calculation logic in putReturnTime must apply to non-dup requests, otherwise it will propagate the twisted return time to other non-dup requests.

How about dedup the persisted requests at the top?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow, can you add a test case that exemplifies this?

@serathius
Copy link
Member

Investigation makes sense, however I think we should fix it in different place.
Please also consider preparing tests.

@joshuazh-x
Copy link
Contributor Author

Let me check why there is one client request but 2 persisted records.

@@ -214,6 +214,9 @@ func putReturnTime(allOperations []porcupine.Operation, reports []report.ClientR
continue
}
kv := keyValue{Key: op.Put.Key, Value: op.Put.Value}
if count := persistedPutCount[kv]; count > 1 {
continue
Copy link
Contributor Author

@joshuazh-x joshuazh-x Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment LN218 to fail the new test.

@@ -318,6 +318,25 @@ func TestPatchHistory(t *testing.T) {
{Return: infinite + 99, Output: model.MaybeEtcdResponse{Persisted: true}},
},
},
{
Copy link
Contributor Author

@joshuazh-x joshuazh-x Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test simulates retried txn during a leader transition and ensures that other txn correctly adjust their return times in response to duplicated persisted operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix, the second operation (key2, value2)'s return time will be adjusted to 149 which is less than its start time 200.

@joshuazh-x
Copy link
Contributor Author

cc @serathius

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants