-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(seer grouping): Add debug logging for old excess frames check #83082
chore(seer grouping): Add debug logging for old excess frames check #83082
Conversation
"event_id": event.event_id, | ||
"project_id": event.project_id, | ||
"platform": event.platform, | ||
"has_too_many_frames_result": has_too_many_contributing_frames( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking that this isn't expensive to run here? I assume this log happens only on group creation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It runs on every event, but no, it's not expensive - all it does is check a few different pre-computed values (and collect metrics, if record_metrics
is true, but in this case it's false, so that's a moot point).
sentry/src/sentry/seer/similarity/utils.py
Lines 323 to 382 in 520f8d9
def has_too_many_contributing_frames( | |
event: Event | GroupEvent, | |
variants: dict[str, BaseVariant], | |
referrer: ReferrerOptions, | |
record_metrics: bool = True, | |
) -> bool: | |
platform = event.platform | |
shared_tags = {"referrer": referrer.value, "platform": platform} | |
contributing_variant, contributing_component = get_contributing_variant_and_component(variants) | |
# Ideally we're calling this function after we already know the event both has a stacktrace and | |
# is using it for grouping (in which case none of the below conditions should apply), but still | |
# worth checking that we have enough information to answer the question just in case | |
if ( | |
# Fingerprint, checksum, fallback variants | |
not isinstance(contributing_variant, ComponentVariant) | |
# Security violations, log-message-based grouping | |
or contributing_variant.variant_name == "default" | |
# Any ComponentVariant will have this, but this reassures mypy | |
or not contributing_component | |
# Exception-message-based grouping | |
or not hasattr(contributing_component, "frame_counts") | |
): | |
# We don't bother to collect a metric on this outcome, because we shouldn't have called the | |
# function in the first place | |
return False | |
# Certain platforms were backfilled before we added this filter, so to keep new events matching | |
# with the existing data, we turn off the filter for them (instead their stacktraces will be | |
# truncated) | |
if platform in EVENT_PLATFORMS_BYPASSING_FRAME_COUNT_CHECK: | |
if record_metrics: | |
metrics.incr( | |
"grouping.similarity.frame_count_filter", | |
sample_rate=options.get("seer.similarity.metrics_sample_rate"), | |
tags={**shared_tags, "outcome": "bypass"}, | |
) | |
return False | |
stacktrace_type = "in_app" if contributing_variant.variant_name == "app" else "system" | |
key = f"{stacktrace_type}_contributing_frames" | |
shared_tags["stacktrace_type"] = stacktrace_type | |
if contributing_component.frame_counts[key] > MAX_FRAME_COUNT: | |
if record_metrics: | |
metrics.incr( | |
"grouping.similarity.frame_count_filter", | |
sample_rate=options.get("seer.similarity.metrics_sample_rate"), | |
tags={**shared_tags, "outcome": "block"}, | |
) | |
return True | |
if record_metrics: | |
metrics.incr( | |
"grouping.similarity.frame_count_filter", | |
sample_rate=options.get("seer.similarity.metrics_sample_rate"), | |
tags={**shared_tags, "outcome": "pass"}, | |
) | |
return False |
If you still think it's an issue, I can move the call to where the log is, which would drastically cut down the number of times that it runs. I originally had it that way, but in the end, especially since it's hopefully only going to be in place for a day or two while I debug this, it just seemed easier to pass one metadata blob rather than all the stuff needed to compute said blob's values. LMK if you want me to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that as long as it won't cause a performance issue in prod I'm fine with it as is
…ehavior log (#83150) This is a follow-up to #83082, which added a temporary log to help debug how we're still landing in the old excess-frame-check logic. Some logs have come through and a) the stacktrace is what you'd expect, so we can stop sending it, and b) the new check is giving a false negative, so now we need to see the data on which it's basing its decision (the frame counts). Interestingly, so far all of the events in question come from the same project. I spot-checked one event and it looked normal... We'll see what this new log data turns up.
…83082) As of #82414, we should no longer be using `TooManyOnlySystemFramesException`, but the fact that we're still getting `did_call_seer` metrics tagged with `"over-threshold-frames"` says that we still are, somehow. This is currently a blocker on merging #82434, which removes `TooManyOnlySystemFramesException` and the code associated with it, so it'd be nice to figure out why it's happening so we can fix it. Try as I might, I can't replicate it, nor can I find the flaw in the logic, so this adds a temporary log to try to help figure out how in the heck it's happening. It also adds a `record_metrics` parameter to the `has_too_many_contributing_frames` helper, so that I can call it for purposes of the logging without messing up the metrics it collects. Once I figure it out what's happening, I'll back out all of these changes.
…ehavior log (#83150) This is a follow-up to #83082, which added a temporary log to help debug how we're still landing in the old excess-frame-check logic. Some logs have come through and a) the stacktrace is what you'd expect, so we can stop sending it, and b) the new check is giving a false negative, so now we need to see the data on which it's basing its decision (the frame counts). Interestingly, so far all of the events in question come from the same project. I spot-checked one event and it looked normal... We'll see what this new log data turns up.
As of #82414, we should no longer be using
TooManyOnlySystemFramesException
, but the fact that we're still gettingdid_call_seer
metrics tagged with"over-threshold-frames"
says that we still are, somehow. This is currently a blocker on merging #82434, which removesTooManyOnlySystemFramesException
and the code associated with it, so it'd be nice to figure out why it's happening so we can fix it.Try as I might, I can't replicate it, nor can I find the flaw in the logic, so this adds a temporary log to try to help figure out how in the heck it's happening. It also adds a
record_metrics
parameter to thehas_too_many_contributing_frames
helper, so that I can call it for purposes of the logging without messing up the metrics it collects. Once I figure it out what's happening, I'll back out all of these changes.