Don't log "repeated timestamp but different value" message as error/warning #1478

weeco opened this issue Jun 27, 2019 · 11 comments

weeco commented Jun 27, 2019

The problem:

When Cortex receives multiple values for the same metric series at the same timestamp, the sample is rejected with a 400 error. That is the desired behaviour; however, every occurrence is also logged at the error level in the distributor and at the warn level in the ingester.

Warn and error logs should only indicate problems. Under certain circumstances you cannot fix the root cause (i.e. fix the reported metrics), so this can quickly become a common scenario. However, I'd still like to use the number of warn/error log messages as a metric that indicates problems needing attention.

Proposal:

The log message shows the affected metric series, which certainly helps in tracking down the issue. I therefore suggest logging these events at the info level rather than error or warn.

@gouthamve added the hacktoberfest, help wanted, and type/chore labels on Sep 18, 2019
@jayapriya90

Hi @weeco, I'd like to work on this issue.

@gouthamve

Hey @jayapriya90, please go ahead!

So I don't have a good solution to the problem tbh, but the flow is like this.

The distributor calls the Push API on the ingester, which returns an error, and the distributor then logs every error that is returned.

The error returned by the ingester: https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L422

We handle that error in https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L373-L390 and https://github.com/cortexproject/cortex/blob/master/pkg/distributor/http_server.go#L38-L48.

Though as the error is literally a string, I am not sure of a good way to handle it.
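
For illustration only, not a proposed fix: with a plain string error, about the only thing the distributor could do today is brittle substring matching, which breaks as soon as the ingester rewords its message (the helper below is hypothetical).

```go
package example

import (
    "strings"

    "github.com/go-kit/kit/log"
    "github.com/go-kit/kit/log/level"
)

// logPushError sketches the string-matching approach; it is not Cortex code.
func logPushError(logger log.Logger, err error) {
    if strings.Contains(err.Error(), "repeated timestamp but different value") {
        // Fragile: any change to the ingester's wording silently defeats this.
        level.Info(logger).Log("msg", "push error", "err", err)
        return
    }
    level.Error(logger).Log("msg", "push error", "err", err)
}
```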

@bboreham

We could put errors like this in the WriteResponse instead of in the error, and extend the distributor to understand such responses and send back an error to the caller without logging it.
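
A rough Go sketch of that idea, assuming a hypothetical DiscardedSeries field on the ingester's WriteResponse (no such field exists today): the distributor could turn soft rejections into a 400 for the caller without logging them itself.

```go
package example

import (
    "context"
    "net/http"

    "github.com/weaveworks/common/httpgrpc"
)

// pushResponse stands in for a hypothetically extended WriteResponse.
type pushResponse struct {
    DiscardedSeries []string // series the ingester refused, e.g. repeated timestamp with a different value
}

func handlePush(ctx context.Context, push func(context.Context) (*pushResponse, error)) error {
    resp, err := push(ctx)
    if err != nil {
        // Real transport or server errors keep today's behaviour.
        return err
    }
    if len(resp.DiscardedSeries) > 0 {
        // Report the rejection to the caller without an error-level log.
        return httpgrpc.Errorf(http.StatusBadRequest, "samples rejected for series: %v", resp.DiscardedSeries)
    }
    return nil
}
```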

bboreham commented Aug 20, 2020

There is some work in weaveworks/common#195 to rate-limit "high-volume errors"; however, as it stands it would not affect 400 errors. (It is not directly relevant anyway, as it covers HTTP and these are gRPC errors.)

@bboreham

To update some of the earlier analysis, since the code may have moved:

The error logged by the distributor comes from here:

level.Error(logger).Log("msg", "push error", "err", err)

and looks like this:

level=error ts=2020-09-24T04:06:45.73264618Z caller=push.go:47 org_id=redacted trace_id=64b5e086737cda17 msg="push error" err="rpc error: code = Code(400) desc = user=redacted: sample with repeated timestamp but different value; last value: 284390.4, incoming value: 52708147.3 for series {__name__=\"envoy_http_downstream_cx_length_ms_sum\",redacted}"

That error has been decoded by gRPC, so we can have as much detail as we wish, although it seems to me that by this point in the code we know it's an error sent by the ingester (as opposed to, say, a network timeout, which we would fail on without logging a few lines above).
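
As a minimal sketch (not the actual Cortex code), push.go could therefore look at the decoded code and demote this class of rejection before logging; the helper name below is made up, and the 400 check assumes httpgrpc-style errors that carry the HTTP status as the gRPC code.

```go
package example

import (
    "github.com/go-kit/kit/log"
    "github.com/go-kit/kit/log/level"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// logDistributorPushError is hypothetical; it demotes 400-style rejections.
func logDistributorPushError(logger log.Logger, err error) {
    // "code = Code(400)" in the log line above is the HTTP status carried
    // as the gRPC status code.
    if st, ok := status.FromError(err); ok && st.Code() == codes.Code(400) {
        level.Info(logger).Log("msg", "push rejected by ingester", "err", err)
        return
    }
    level.Error(logger).Log("msg", "push error", "err", err)
}
```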

The error logged by the ingester comes from here:

if grpcUtils.IsCanceled(err) {
    entry.WithField(errorKey, err).Debugln(gRPC)
} else {
    entry.WithField(errorKey, err).Warnln(gRPC)
}

and looks like this:

level=warn ts=2020-09-24T04:11:41.138368379Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=297.688µs err="rpc error: code = Code(400) desc = user=redacted: sample with repeated timestamp but different value; last value: 195435, incoming value: 956673 for series {__name__=\"envoy_http_downstream_rq_time_count\", redacted}" msg="gRPC\n"

It seems we could distinguish InvalidArgument here the same way we do Canceled.
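
A sketch of what that could look like, modelled on the shape of grpcUtils.IsCanceled; the helper below does not exist yet, and the extra 400 check assumes httpgrpc-style errors that encode the HTTP status as the gRPC code.

```go
package example

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// IsInvalidArgument reports whether err represents bad client data rather
// than a server-side failure.
func IsInvalidArgument(err error) bool {
    if s, ok := status.FromError(err); ok {
        return s.Code() == codes.InvalidArgument || s.Code() == codes.Code(400)
    }
    return false
}
```

The middleware snippet above could then grow an extra branch that logs such errors at the debug or info level.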

@gouthamve added the good first issue label on Feb 25, 2021
@dmares01

I'm assuming this hasn't been fixed since the issue is still open, but just to clarify: do you still need someone to take care of this? If so, I'd like to dip my toes in and see if I can help out.

@pracucci

This issue is still open. What's your proposal to fix it?

@dmares01

I do not have a proposal yet. My team just started using Cortex this past week and this was something we noticed. I figured that if someone else wasn't already working on it, I would try to find a solution to it.

bboreham commented Jul 9, 2021

@dmares01 by all means take a look. We don't think anyone is actively looking at this issue.

@MadhavJivrajani

@bboreham 👋🏼

I was taking a look at this issue and trying to understand what needs to be done. I think I understand what you said above but had a few foggy areas that I was hoping you could shed some light on:

The logged ingester warning from:

if grpcUtils.IsCanceled(err) {
    entry.WithField(errorKey, err).Debugln(gRPC)
} else {
    entry.WithField(errorKey, err).Warnln(gRPC)
}

can be handled by implementing IsInvalidArgument, similar to grpcUtils.IsCanceled, and logging at the Info level when it returns true. But this would mean that all InvalidArgument statuses would then be logged at the info level, which I don't think is desired?

The logged distributor error:

level.Error(logger).Log("msg", "push error", "err", err)

Similar question as above: if we distinguish the error by some general form, we would also be excluding other errors of that form from warn/error logging.

I see that the error is constructed by

func makeMetricValidationError(errorType string, labels labels.Labels, err error) error {
    return &validationError{
        errorType: errorType,
        err:       err,
        code:      http.StatusBadRequest,
        labels:    labels,
    }
}

One possible solution could be to somehow extract the errorType, for example

newValueForTimestamp = "new-value-for-timestamp"

but this would mean exporting types, and I'm not sure if that's the best way to go about it. wdyt?
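
As a sketch of that direction only, a small exported interface could expose the errorType without comparing message strings; none of these names exist in the codebase today.

```go
package example

import "errors"

const newValueForTimestamp = "new-value-for-timestamp"

// TypedError is a hypothetical interface that validationError could satisfy
// by exposing its errorType field.
type TypedError interface {
    error
    ErrorType() string
}

// IsNewValueForTimestamp reports whether err is the duplicate-timestamp
// validation failure.
func IsNewValueForTimestamp(err error) bool {
    var te TypedError
    return errors.As(err, &te) && te.ErrorType() == newValueForTimestamp
}
```

Note that errors.As can only see the concrete type before the error crosses the gRPC boundary, which may be part of why this feels awkward.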

I might've completely misunderstood the approach you mentioned as well; please let me know if that's the case :)

bboreham commented Feb 7, 2022

@MadhavJivrajani sounds good so far.
However Cortex depends on weaveworks/common; we can't export a string the other way round.

Maybe Cortex could pass in a filtering function (via an extension to weaveworks/common)?
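
Purely as a sketch of that idea, with made-up names (this is not the real weaveworks/common middleware API): the logging middleware could accept an optional predicate from the embedding application.

```go
package example

import "github.com/sirupsen/logrus"

// ErrorFilter lets the embedding application (Cortex) mark certain gRPC
// errors as expected so they are not logged as warnings.
type ErrorFilter func(err error) bool

// grpcLogger stands in for the middleware's logging type.
type grpcLogger struct {
    isExpected ErrorFilter // supplied by Cortex, may be nil
}

func (l grpcLogger) logResult(entry *logrus.Entry, err error) {
    const errorKey, gRPC = "err", "gRPC"
    switch {
    case err == nil:
        entry.Debugln(gRPC)
    case l.isExpected != nil && l.isExpected(err):
        // Cortex could pass in something like the IsInvalidArgument sketch above.
        entry.WithField(errorKey, err).Infoln(gRPC)
    default:
        entry.WithField(errorKey, err).Warnln(gRPC)
    }
}
```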
