Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimistic mode #123

Merged
merged 4 commits into from
Dec 10, 2024
Merged

Conversation

ramondeklein
Copy link
Contributor

@ramondeklein ramondeklein commented Dec 5, 2024

This PR adds the --optimistic flag that reduces the amount of health-checks significantly. It works like this:

  1. Only an initial health-check is done when starting sidekick to determine which nodes are active.
  2. When a node reports to be healthy, then the health-check for that node is stopped.
  3. When a node reports either 502 (bad gateway), 503 (service unavailable) or 504 (gateway timeout) then the node is taken out of the pool immediately and it will start health-checks again.
  4. When the node is healthy again, it's added back to the pool and health checks are stopped.

There are two advantages to this approach:

  • Less health checks needed that can become expensive in a cluster with a lot of nodes.
  • Node is taken out of the pool (for that sidekick node) immediately after reporting failure.

The major disadvantages are:

  • It only takes out the node upon 502, 503 and 504. Note that an unreachable node also reports 502 by the reverse-proxy logic.
  • It relies on clients doing retries when a request was forwarded to a bad node. Clients should already do this anyway.

I would like to hear your comments...

main.go Outdated Show resolved Hide resolved
@ramondeklein ramondeklein requested a review from poornas December 6, 2024 16:28
main.go Outdated Show resolved Hide resolved
http-tracer.go Outdated Show resolved Hide resolved
@klauspost
Copy link
Contributor

Reviewing tomorrow

Copy link
Contributor

@klauspost klauspost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor stuff.

Agree this is a weird location to have it, but not a major pain point.

main.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
@harshavardhana harshavardhana merged commit c2078b8 into minio:master Dec 10, 2024
4 checks passed
@ramondeklein
Copy link
Contributor Author

ramondeklein commented Dec 10, 2024

@harshavardhana Sorry, I was still working on the changes proposed by @klauspost. They were already committed to this branch, but I was in the process of testing them.

@klauspost
Copy link
Contributor

@ramondeklein What's the worry then :D

@ramondeklein
Copy link
Contributor Author

Everything checks out okay. I made the following changes:

  • Moved the statistics to the setOnline / setOffline methods to prevent repetition.
  • If a node is down and statistics are shown, then it also shows the current downtime and also cumulative downtime shows the current downtime.

Code looks a bit cleaner this way. I did test both optimistic and regular modes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants