-
Notifications
You must be signed in to change notification settings - Fork 4
Update sink handling to retry more exceptions #8
Update sink handling to retry more exceptions #8
Conversation
f40532d
to
bd58e68
Compare
src/sinks/s3_common/config.rs
Outdated
@@ -311,8 +311,11 @@ impl RetryLogic for S3RetryLogic { | |||
type Error = SdkError<PutObjectError, HttpResponse>; | |||
type Response = S3Response; | |||
|
|||
fn is_retriable_error(&self, error: &Self::Error) -> bool { | |||
is_retriable_error(error) | |||
fn is_retriable_error(&self, _error: &Self::Error) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any limit to how many retries we can do? Worried that we will have infinite retries which will make vector unstable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good point, the limit is configurable but by default it is basically infinite, should update our config to set that before we add this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged a change to set limit
// For now, retry request in all cases | ||
|
||
// error.status().is_server_error() | ||
// || StatusCode::TOO_MANY_REQUESTS.as_u16() == Into::<u16>::into(error.status()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we start with opening up just the conditions we see? I am sure not all 400 will be retriable? See error codes here: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry meant to add this comment to the s3 sink
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think then in the AWS case the only issue we see is a 400 Bad Request with ExpiredToken error code. In that case, we'd only update the logic to handle just that part cc @anil-db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to just handle the ExpiredToken issue we see
src/sinks/util/retries.rs
Outdated
@@ -193,12 +193,14 @@ where | |||
); | |||
Some(self.build_retry()) | |||
} else { | |||
// For now, retry request in all cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is nice.
it took me sometime to understand. can you add a bit more explanation referencing line 177/189.
basically what is happening at line 177/189 and so when we would end up here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. LGTM.
@@ -312,7 +314,12 @@ impl RetryLogic for S3RetryLogic { | |||
type Response = S3Response; | |||
|
|||
fn is_retriable_error(&self, error: &Self::Error) -> bool { | |||
is_retriable_error(error) | |||
let retry = is_retriable_error(error); | |||
emit!(CheckRetryEvent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we log the error with out decision to retry or not as well. so we know we got the error and its type etc. and our decision about. would help in improving is_retriable_error
src/sinks/util/retries.rs
Outdated
error!( | ||
message = "Unexpected error type; dropping the request.", | ||
// message = "Unexpected error type; dropping the request.", | ||
message = "Unexpected error type encountered.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append ..retrying
in message just to be clear for someone who just see this log.
status_code: error.error_code().unwrap_or(""), | ||
retry: true, | ||
}); | ||
true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for adding log here with error and our decision to retry.
Updating sink retry logic. For now, make it accept anything, but can eventually tune it for the specific issues that pop up to upstream
Added unit tests, which test two issues that we want to retry
104 (connection reset) request that can't be properly cast for Azure (ELK)
400 Bad Request due to token expiration (ELK)
Also double-checked with load test