Skip to content

fix(cluster_family): Cancel slot migration from incoming node on OOM #5000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mkaruza
Copy link
Contributor

@mkaruza mkaruza commented Apr 25, 2025

If applying command on incoming node will result in OOM (we overflow
max_memory_limit) we are closing migration and switch state to FATAL.

@mkaruza mkaruza requested a review from BorysTheDev April 25, 2025 11:16
@mkaruza mkaruza marked this pull request as draft April 25, 2025 11:16
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 9c88c11 to 6a5d621 Compare May 6, 2025 12:34
@mkaruza mkaruza requested a review from BorysTheDev May 6, 2025 12:35
@mkaruza mkaruza marked this pull request as ready for review May 6, 2025 12:35
@mkaruza mkaruza changed the title fix(cluster_family): Cancel slot migration in case of OOM errors on t… fix(cluster_family): Cancel slot migration from incoming node on OOM May 6, 2025
Comment on lines 985 to 987
if (migration->GetState() == MigrationState::C_FATAL) {
migration->Stop();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like incorrect place for this logic

Copy link
Contributor Author

@mkaruza mkaruza May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FLOW fails with C_FATAL we'll call it, where would you put migration->Stop to handle stopping of migration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move it to reportError or SetState, where we set the fatal status

@@ -70,6 +76,20 @@ class ClusterShardMigration {
break;
}

auto oom_check = [&]() -> bool {
auto used_mem = used_mem_current.load(memory_order_relaxed);
if ((used_mem + tx_data->command.cmd_len) > max_memory_limit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is used_mem RSS? maybe we need max_memory_limit - 100MB for example, or how we can guarantee 100% success of this condition?

Copy link
Contributor Author

@mkaruza mkaruza May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is checks whatever is set with maxmemory. I'll modify so it uses 90% of max memory as upper bound. That should be safe i think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss it with @adiholden or @romange

Comment on lines +57 to +60
void ReportFatalError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(state_mu_, error_mu_) {
errors_count_.fetch_add(1, std::memory_order_relaxed);
util::fb2::LockGuard lk_state(state_mu_);
util::fb2::LockGuard lk_error(error_mu_);
state_ = MigrationState::C_FATAL;
last_error_ = std::move(err);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this code. Let's think how to make it better. Maybe we can use state_mu_ for error too or merge this method with reportError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but i changed ReportError to check GetState() != C_FATAL and i wanted to update state & set error message while having both locks.

@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from e9049f0 to 16a27f6 Compare May 7, 2025 11:32
@mkaruza mkaruza requested a review from BorysTheDev May 7, 2025 11:33
@BorysTheDev BorysTheDev requested review from adiholden and kostasrim May 7, 2025 11:50
mkaruza added 2 commits May 7, 2025 20:40
If applying command on incoming node will result in OOM (we overflow
max_memory_limit) we are closing migration and switch state to FATAL.

Signed-off-by: mkaruza <[email protected]>
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 16a27f6 to f080a52 Compare May 7, 2025 18:54
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 7e7ebf4 to 40851f9 Compare May 7, 2025 19:41
@mkaruza mkaruza requested a review from BorysTheDev May 8, 2025 13:30
@@ -925,7 +926,7 @@ void ClusterFamily::InitMigration(CmdArgList args, SinkReplyBuilder* builder) {

if (!migration) {
VLOG(1) << "Unrecognized incoming migration from " << source_id;
return builder->SendSimpleString(OutgoingMigration::kUnknownMigration);
return builder->SendSimpleString(kUnknownMigration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be error not string, like below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of code that was handling kUnknownMigration is using both SendError and SendSimpleString. This PR didn't change that but it moved this constant into different location. kUnknownMigration will be matched only if SendSimpleString is used.

}

auto migration = GetIncomingMigration(source_id);
if (!migration)
return builder->SendError(kIdNotFound);

if (!migration->Join(attempt)) {
return builder->SendError("Join timeout happened");
if (migration->GetState() == MigrationState::C_FATAL) {
return builder->SendError(absl::StrCat("-", kIncomingMigrationOOM));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need "-", maybe we can make it as part of the kIncomingMigrationOOM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If - is not provided in error string response would be -ERR INCOMING_MIGRATION_OOM which will not be matched. If we use - as part of constant we would need to remove - prefix when matching error response.

Comment on lines +224 to +231
// Break outgoing migration if INIT from incoming node responded with OOM. Usually this will
// happen on second iteration after first failed with OOM. Sending second INIT is required to
// cleanup slots on incoming slot migration node.
if (CheckRespSimpleError(kIncomingMigrationOOM)) {
ChangeState(MigrationState::C_FATAL);
break;
}

Copy link
Contributor

@BorysTheDev BorysTheDev May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can move it in next if if (!CheckRespIsSimpleReply("OK")) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckRespIsSimpleReply expects that response is STRING (+) while we are returning now ERROR (-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants