Don't log "repeated timestamp but different value" message as error/warning #1478
Comments
Hi @weeco, I'd like to work on this issue. |
Hey @jayapriya90, please go ahead! I don't have a good solution to the problem tbh, but the flow is like this: the distributor calls the Push API on the ingester, which returns an error, and the distributor then logs all the errors returned. The error returned by the ingester: https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L422. Where we handle that error: https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L373-L390 and https://github.com/cortexproject/cortex/blob/master/pkg/distributor/http_server.go#L38-L48. Though as the error is literally a string, I am not sure of a good way to handle it. |
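To make that flow concrete, here is a minimal sketch of both halves, assuming the ingester encodes its rejection with httpgrpc.Errorf from weaveworks/common (that helper is real; the handler shapes and the exact error text are illustrative only):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/weaveworks/common/httpgrpc"
)

// ingesterPush stands in for the ingester's Push handler: a duplicate
// sample is rejected with a 400, which travels back over gRPC as an error.
func ingesterPush(duplicate bool) error {
	if duplicate {
		return httpgrpc.Errorf(http.StatusBadRequest,
			"sample with repeated timestamp but different value")
	}
	return nil
}

func main() {
	// The distributor currently logs every error returned by the ingester,
	// so an expected 400 shows up at the error level.
	if err := ingesterPush(true); err != nil {
		fmt.Printf("level=error msg=%q err=%q\n", "push failed", err)
	}
}
```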
We could put errors like this in |
There is some work in weaveworks/common#195 to rate-limit "high volume errors". |
To update some of the earlier analysis since code may have moved: The error logged by the distributor comes from here: Line 47 in 8587ea6
and looks like this:
That error has been decoded by gRPC, so we can have as much detail as we wish, although it seems to me that by this point in the code we know it's an error sent by the ingester (as opposed to, say, a network timeout, which we would fail without logging a few lines above). The error logged by the ingester comes from here: cortex/vendor/github.com/weaveworks/common/middleware/grpc_logging.go Lines 35 to 39 in 8587ea6
and looks like this:
Seems we could distinguish |
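The kind of distinction hinted at here could look like the following on the distributor side. This is a sketch, assuming the error still carries its httpgrpc status at logging time (httpgrpc.HTTPResponseFromError is an existing helper in weaveworks/common; logPushError is a hypothetical name):

```go
package distributor

import (
	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
	"github.com/weaveworks/common/httpgrpc"
)

// logPushError downgrades 4xx rejections, which are the client's fault
// and expected in normal operation, to info; everything else stays at error.
func logPushError(logger log.Logger, err error) {
	if resp, ok := httpgrpc.HTTPResponseFromError(err); ok && resp.Code/100 == 4 {
		level.Info(logger).Log("msg", "push request rejected", "err", err)
		return
	}
	level.Error(logger).Log("msg", "push request failed", "err", err)
}
```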
I'm assuming this hasn't been fixed since this is still open, but just to clarify: do you still need someone to take care of this? If so, I would like to dip my toes in and see if I can help out. |
This issue is still open. What's your proposal to fix it? |
I do not have a proposal yet. My team just started using Cortex this past week, and this was something we noticed. I figured that if someone else wasn't already working on it, I would try to find a solution. |
@dmares01 by all means take a look. We don't think anyone is actively looking at this issue. |
@bboreham 👋🏼 I was taking a look at this issue and trying to understand what needs to be done. I think I understand what you said above but had a few foggy areas that I was hoping you could shed some light on: The logged ingester warning from: cortex/vendor/github.com/weaveworks/common/middleware/grpc_logging.go Lines 35 to 39 in 8587ea6
can be handled by implementing The logged distributor error: Line 47 in 8587ea6
Similar question as the above for this as well: if we distinguish the error in some form, we would still be excluding other errors of that form. I see that the error is constructed from Lines 34 to 41 in 6408117
One possible solution could be if we can maybe extract the Line 18 in 6408117
but this would mean exporting types, and I'm not sure if that's the best way to go about it. Wdyt? I might've completely misunderstood the approach you mentioned as well; please lmk if that's the case :) |
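For reference, the "export the type" idea could be sketched like this: a dedicated, exported error type that callers detect with errors.As instead of matching on the message string. ValidationError and IsValidation are hypothetical names, not existing Cortex API:

```go
package validation

import (
	"errors"
	"fmt"
)

// ValidationError marks errors caused by bad client input, such as a
// sample with a repeated timestamp but a different value (hypothetical type).
type ValidationError struct {
	Reason string
}

func (e *ValidationError) Error() string {
	return fmt.Sprintf("validation failed: %s", e.Reason)
}

// IsValidation reports whether err, or anything it wraps, is a
// ValidationError, so logging code can pick a lower level without
// string matching.
func IsValidation(err error) bool {
	var ve *ValidationError
	return errors.As(err, &ve)
}
```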
@MadhavJivrajani sounds good so far. Maybe Cortex could pass in a filtering function (via an extension to weaveworks/common)? |
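One possible shape for that extension, sketched under the assumption that the gRPC logging middleware in weaveworks/common grows a caller-supplied predicate (WithErrorFilter and the struct below are hypothetical, not existing API):

```go
package middleware

// errorFilter reports whether an error is worth a warn-level log line.
type errorFilter func(error) bool

// GRPCServerLog is a simplified stand-in for the logging middleware struct.
type GRPCServerLog struct {
	shouldWarn errorFilter
}

// WithErrorFilter lets the embedding application (here, Cortex) inject its
// own classification: return false for expected client noise such as
// "repeated timestamp but different value" to log it at a lower level.
func WithErrorFilter(f errorFilter) func(*GRPCServerLog) {
	return func(l *GRPCServerLog) { l.shouldWarn = f }
}
```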
The problem:
When Cortex receives multiple values for the same metric series at the same timestamp, the write is rejected with a 400 error. This is desired behaviour; however, every occurrence is also logged at the error level (in the distributor) and at the warn level (in the ingester).
Warn and error logs should only indicate problems. Under certain circumstances you cannot fix the root cause (i.e. fix the reported metrics), so this can quickly become a common scenario. However, I'd still like to use the occurrences of warn/error-level log messages as a metric that indicates problems needing attention.
Proposal:
The log message shows the affected metric series, which certainly helps in tracking down the issue. Therefore I suggest logging these events at the info level rather than error or warn.
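For context, the rejected condition itself is easy to reproduce: two samples for one series with the same timestamp but different values. The check below is an illustrative reconstruction of that rule, not Cortex's actual ingester code:

```go
package main

import (
	"errors"
	"fmt"
)

type sample struct {
	timestampMs int64
	value       float64
}

var errDuplicateTimestamp = errors.New("repeated timestamp but different value")

// appendSample rejects a sample whose timestamp equals the previous one
// but whose value differs: the 400 case this issue is about.
func appendSample(series []sample, s sample) ([]sample, error) {
	if n := len(series); n > 0 {
		last := series[n-1]
		if last.timestampMs == s.timestampMs && last.value != s.value {
			return series, errDuplicateTimestamp
		}
	}
	return append(series, s), nil
}

func main() {
	series := []sample{{timestampMs: 1000, value: 1}}
	if _, err := appendSample(series, sample{timestampMs: 1000, value: 2}); err != nil {
		fmt.Println("rejected:", err) // expected client error, not a server problem
	}
}
```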