This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Update sink handling to retry more exceptions #8

Merged
merged 6 commits into databricks:v0.36.1 from MakeAllRetriesRetriable on May 7, 2024

Conversation

andrewma2 (Collaborator) commented Apr 26, 2024

Updating sink retry logic. For now, make it accept anything as retriable (see the sketch below), but we can eventually tune it for the specific issues that pop up upstream.

Added unit tests, which cover two issues that we want to retry:

A 104 (connection reset) error on a request that can't be properly cast for Azure (ELK)

A 400 Bad Request due to token expiration (ELK)

Also double-checked with a load test.
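
As a rough illustration of the "accept anything" direction described above: a minimal sketch, assuming a RetryLogic-style trait shaped like the one visible in the review diffs below. Trait and struct names here are illustrative, not the sink's actual code.

```rust
// Minimal sketch only: a retry policy that treats every error as retriable,
// matching the initial direction of this PR (later narrowed in review).
pub trait RetryLogic {
    type Error;

    fn is_retriable_error(&self, error: &Self::Error) -> bool;
}

pub struct RetryAllLogic;

impl RetryLogic for RetryAllLogic {
    type Error = std::io::Error;

    // Accept anything: connection resets (errno 104), 400s from expired
    // tokens, and everything else are retried instead of being dropped.
    fn is_retriable_error(&self, _error: &Self::Error) -> bool {
        true
    }
}
```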

andrewma2 force-pushed the MakeAllRetriesRetriable branch from f40532d to bd58e68 on April 26, 2024 20:23
andrewma2 changed the title from "Make all retries retriable" to "Update sink handling to make all exceptions retriable" on Apr 26, 2024
@@ -311,8 +311,11 @@ impl RetryLogic for S3RetryLogic {
type Error = SdkError<PutObjectError, HttpResponse>;
type Response = S3Response;

fn is_retriable_error(&self, error: &Self::Error) -> bool {
is_retriable_error(error)
fn is_retriable_error(&self, _error: &Self::Error) -> bool {
Collaborator

Is there any limit to how many retries we can do? Worried that we will have infinite retries, which will make Vector unstable.

Collaborator Author

Yeah, that's a good point. The limit is configurable, but by default it is basically infinite; we should update our config to set it before we add this.
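
For context, a minimal sketch of the kind of cap being discussed: bounding attempts so blanket retries can't loop forever. Type and field names are illustrative assumptions, not Vector's actual request settings.

```rust
// Illustrative only: a retry budget that stops after a configured number of
// attempts instead of retrying indefinitely.
pub struct RetryBudget {
    remaining_attempts: usize,
}

impl RetryBudget {
    pub fn new(max_attempts: usize) -> Self {
        Self { remaining_attempts: max_attempts }
    }

    // Returns true while attempts remain; once exhausted, the request should
    // be dropped rather than retried again.
    pub fn should_retry(&mut self) -> bool {
        if self.remaining_attempts == 0 {
            return false;
        }
        self.remaining_attempts -= 1;
        true
    }
}
```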

Collaborator Author

Merged a change to set the limit.

// For now, retry request in all cases

// error.status().is_server_error()
// || StatusCode::TOO_MANY_REQUESTS.as_u16() == Into::<u16>::into(error.status())
Collaborator

Can we start by opening up just the conditions we see? I'm sure not all 400s will be retriable. See error codes here: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html

Collaborator

Sorry, I meant to add this comment to the S3 sink.

Collaborator Author

I think in the AWS case, then, the only issue we see is a 400 Bad Request with the ExpiredToken error code. In that case, we'd update the logic to handle just that part. cc @anil-db

Collaborator Author

Updated to just handle the ExpiredToken issue we see
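
A minimal sketch of the narrowed check, assuming the AWS Rust SDK's ProvideErrorMetadata trait; the helper name is hypothetical, and the real check lives inside the S3 sink's is_retriable_error.

```rust
use aws_smithy_types::error::metadata::ProvideErrorMetadata;

// Illustrative helper: retry only when the service error code is ExpiredToken,
// rather than treating every 400 Bad Request as retriable.
fn is_expired_token<E: ProvideErrorMetadata>(error: &E) -> bool {
    error.code() == Some("ExpiredToken")
}
```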

andrewma2 changed the title from "Update sink handling to make all exceptions retriable" to "Update sink handling to retry more exceptions" on May 2, 2024
andrewma2 (Collaborator Author) commented May 2, 2024

[Screenshot attached, 2024-05-02 2:43 PM]

FYI, also added a new metric to track errors (and retry status).
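
A minimal, self-contained sketch of the retry-tracking event visible in the diff below (CheckRetryEvent); field names follow the diff, but the actual implementation presumably goes through Vector's internal-event and metrics machinery rather than a bare struct like this.

```rust
use tracing::debug;

// Illustrative stand-in for the CheckRetryEvent seen in the diff: it records
// which error code was observed and whether the sink decided to retry.
#[derive(Debug)]
pub struct CheckRetryEvent<'a> {
    pub status_code: &'a str,
    pub retry: bool,
}

impl CheckRetryEvent<'_> {
    pub fn emit(self) {
        // Log the error code together with the retry decision, so both are
        // visible when tuning is_retriable_error later.
        debug!(
            message = "Sink retry check.",
            status_code = self.status_code,
            retry = self.retry,
        );
        // The real event would also increment a counter metric labeled by
        // status_code and retry.
    }
}
```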

andrewma2 requested a review from flaviofcruz on May 2, 2024 21:46
@@ -193,12 +193,14 @@ where
);
Some(self.build_retry())
} else {
// For now, retry request in all cases
Collaborator

This is nice.
It took me some time to understand. Can you add a bit more explanation referencing lines 177/189,
i.e. what is happening at lines 177/189 and therefore when we would end up here?
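
To illustrate the flow being asked about, a rough, hypothetical sketch; the enum and function names are invented, and the real code is the sink retry policy around the lines referenced above.

```rust
// Hypothetical sketch: earlier branches handle outcomes the sink's RetryLogic
// explicitly recognizes; anything unclassified falls into the final else,
// which this PR changes to retry rather than drop.
enum Outcome {
    KnownRetriable,
    Unclassified,
}

enum Decision {
    Retry,
    Drop,
}

fn decide(outcome: Outcome) -> Decision {
    match outcome {
        Outcome::KnownRetriable => Decision::Retry,
        // Previously this would have been Decision::Drop; for now, retry
        // the request in all cases.
        Outcome::Unclassified => Decision::Retry,
    }
}
```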

anil-db (Collaborator) left a comment

Thanks. LGTM.

@@ -312,7 +314,12 @@ impl RetryLogic for S3RetryLogic {
type Response = S3Response;

fn is_retriable_error(&self, error: &Self::Error) -> bool {
is_retriable_error(error)
let retry = is_retriable_error(error);
emit!(CheckRetryEvent {
Collaborator

Can we log the error along with our decision to retry or not as well? That way we know we got the error, its type, etc., and what we decided. It would help in improving is_retriable_error.

error!(
message = "Unexpected error type; dropping the request.",
// message = "Unexpected error type; dropping the request.",
message = "Unexpected error type encountered.",
Collaborator

Append "...retrying" to the message, just to be clear for someone who only sees this log.

status_code: error.error_code().unwrap_or(""),
retry: true,
});
true
}
Collaborator

Same here: add a log with the error and our decision to retry.

andrewma2 merged commit 150a890 into databricks:v0.36.1 on May 7, 2024
39 of 48 checks passed
3 participants