Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add guard to setMode for NTEs #598

Open
wants to merge 6 commits into
base: palantir-cassandra-2.2.18
Choose a base branch
from

Conversation

d-guo
Copy link
Contributor

@d-guo d-guo commented Dec 9, 2024

When a node enters non-transient error, it should never regain a normal operation mode (hence non-transient). This PR adds a guard for this and additionally short-circuits the bootstrap when the node enters NTE. This addition is necessary because the bootstrap NTE can occur from different threads than the one calling bootstrapStream.get().

@rhuffy
Copy link
Contributor

rhuffy commented Dec 9, 2024

please fix the formatting to match the Cassandra style. You should be able to use the auto formatter.

logger.warn("Attempted to change mode from NTE to non-NTE", SafeArg.of("attemptedSetMode", m), SafeArg.of("msg", msg));
else
logger.debug("Attempted to change mode from NTE to non-NTE", SafeArg.of("attemptedSetMode", m), SafeArg.of("msg", msg));
return;
Copy link
Contributor

@inespot inespot Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify: log with warn no matter the value of log, I think this is always worth a warning log

break;
}
Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
}
Copy link
Contributor

@inespot inespot Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to do less aggressive polling while keeping the reactivity by doing something like

while (true) {
    try {
        bootstrapStream.get(timeBetweenCheck, TimeUnit.SECONDS);
        logger.info("Bootstrap streaming completed for tokens {}", tokens);
        break;
    } catch (TimeoutException e) {
        if (hasNonTransientError(NonTransientError.BOOTSTRAP_ERROR)) {
            logger.info("Aborting waiting for bootstrap streaming to complete because detected a bootstrap error.",
                    SafeArg.of("nonTransientErrors", getNonTransientErrors()));
            break;
        }
    } catch (Exception e) {
        logger.error("Error while waiting for bootstrap stream to complete", e);
        break;
    }
}

If we stop waiting for the streams, is there a way to also notify peers/shut down stream early? Or will the following disableNode handle this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, why is this less aggressive polling? if possible, i'd like to avoid using exceptions for control flow and i think i don't recognize the benefit here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify peers/shut down stream early

i'll test this tomorrow to see if disableNode kills the streams as we would hope. with operator, the node will be deleted after entering NTE anyway though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disableNode will not kill the streams. i just realized that in the original instance that this happened, the streams continued even after the NTE blip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants