Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve Slack rate limiting logic when updating alert groups #5287

Open
wants to merge 24 commits into
base: dev
Choose a base branch
from

Conversation

joeyorlando
Copy link
Contributor

@joeyorlando joeyorlando commented Nov 21, 2024

What this PR does

https://www.loom.com/share/1ac33822301444748133ffe72638ddc4

The two asks in the original GH issue were:

  1. Make the error message clearer. We can identify if it's delivering or updating and being rate-limited. This is possible because Slack sets limits per API method. Also, this limit is a per-slack channel while we are posting messages & applying ratelimit per on-call integration, which confuses customers.
  2. Debounce update alert group message in Slack

Both of these have been addressed in this PR

Which issue(s) this PR closes

Closes https://github.com/grafana/oncall-private/issues/2947

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • Added the relevant release notes label (see labels prefixed w/ release:). These labels dictate how your PR will
    show up in the autogenerated release notes.

return (
self.rate_limited_in_slack_at is not None
and self.rate_limited_in_slack_at + SLACK_RATE_LIMIT_TIMEOUT > timezone.now()
)

def start_send_rate_limit_message_task(self, delay=SLACK_RATE_LIMIT_DELAY):
def start_send_rate_limit_message_task(self, error_message_verb: str, delay: int) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is currently only invoked in two spots, both of which have been updated (additionally, both of these spots that were invoking this function were already passing in a delay, hence why I removed the default of SLACK_RATE_LIMIT_DELAY)

@@ -61,8 +66,10 @@ class SlackMessage(models.Model):
related_name="slack_messages",
)

# ID of a latest celery task to update the message
active_update_task_id = models.CharField(max_length=100, null=True, default=None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this field (conveniently) already existed, but it doesn't look like it was being used? (code search) in this PR I'm going to start reusing this field

Comment on lines 236 to 282
def update_alert_groups_message(self) -> None:
"""
Schedule an update task for the associated alert group's Slack message, respecting the debounce interval.

This method ensures that updates to the Slack message related to an alert group are not performed
too frequently, adhering to the `ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS` debounce interval.
It schedules a background task to update the message after the appropriate countdown.

The method performs the following steps:
- Checks if there's already an active update task (`active_update_task_id` is set). If so, exits to prevent
duplicate scheduling.
- Calculates the time since the last update (`last_updated` field) and determines the remaining time needed
to respect the debounce interval.
- Schedules the `update_alert_group_slack_message` task with the calculated countdown.
- Stores the task ID in `active_update_task_id` to prevent multiple tasks from being scheduled.
"""

if not self.alert_group:
logger.warning(
f"skipping update_alert_groups_message as SlackMessage {self.pk} has no alert_group associated with it"
)
return
elif self.active_update_task_id:
logger.info(
f"skipping update_alert_groups_message as SlackMessage {self.pk} has an active update task {self.active_update_task_id}"
)
return

now = timezone.now()

# we previously weren't updating the last_updated field for messages, so there will be cases
# where the last_updated field is None
last_updated = self.last_updated or now

time_since_last_update = (now - last_updated).total_seconds()
remaining_time = self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS - int(time_since_last_update)
countdown = max(remaining_time, 10)

logger.info(
f"updating message for alert_group {self.alert_group.pk} in {countdown} seconds "
f"(debounce interval: {self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS})"
)

task_id = celery_uuid()
update_alert_group_slack_message.apply_async((self.pk,), countdown=countdown, task_id=task_id)
self.active_update_task_id = task_id
self.save(update_fields=["active_update_task_id"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change

Comment on lines 32 to 79
def update_alert_group_slack_message(slack_message_pk: int) -> None:
"""
Background task to update the Slack message for an alert group.

This function is intended to be executed as a Celery task. It performs the following:
- Compares the current task ID with the `active_update_task_id` stored in the `SlackMessage`.
- If they do not match, it means a newer task has been scheduled, so the current task exits to prevent outdated updates.
- Uses the `AlertGroupSlackService` to perform the actual update of the Slack message.
- Upon successful completion, clears the `active_update_task_id` to allow future updates.

Args:
slack_message_pk (int): The primary key of the `SlackMessage` instance to update.
"""

from apps.slack.models import SlackMessage

current_task_id = update_alert_group_slack_message.request.id

return (
f"update_incident_slack_message rescheduled because of current task_id ({current_task_id})"
f" for alert_group {alert_group_pk} doesn't exist in cache"
logger.info(
f"update_alert_group_slack_message for slack message {slack_message_pk} started with task_id {current_task_id}"
)

try:
slack_message = SlackMessage.objects.get(pk=slack_message_pk)
except SlackMessage.DoesNotExist:
logger.warning(f"SlackMessage {slack_message_pk} doesn't exist")
return

if not current_task_id == slack_message.active_update_task_id:
logger.warning(
f"update_alert_group_slack_message skipped, because current_task_id ({current_task_id}) "
f"does not equal to active_update_task_id ({slack_message.active_update_task_id}) "
)
if not current_task_id == cached_task_id:
return (
f"update_incident_slack_message skipped, because of current task_id ({current_task_id})"
f" doesn't equal to cached task_id ({cached_task_id}) for alert_group {alert_group_pk}"
return

alert_group = slack_message.alert_group
if not alert_group:
logger.warning(
f"skipping update_alert_group_slack_message as SlackMessage {slack_message_pk} "
"doesn't have an alert group associated with it"
)
return

AlertGroupSlackService(slack_message.slack_team_identity).update_alert_group_slack_message(alert_group)

slack_message.active_update_task_id = None
slack_message.last_updated = timezone.now()
slack_message.save(update_fields=["active_update_task_id", "last_updated"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change



@shared_dedicated_queue_retry_task(autoretry_for=(Exception,), retry_backoff=True)
def update_incident_slack_message(slack_team_identity_pk: int, alert_group_pk: int) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has been renamed to update_alert_group_slack_message.. leaving this around to allow the celery workers to process any remaining update_incident_slack_message tasks (before removing this in a subsequent PR)

@joeyorlando joeyorlando added the pr:no public docs Added to a PR that does not require public documentation updates label Nov 21, 2024
@@ -253,3 +263,54 @@ def send_slack_notification(
slack_user_identity.send_link_to_slack_message(slack_message)
except (SlackAPITokenError, SlackAPIMethodNotSupportedForChannelTypeError):
pass

def update_alert_groups_message(self, bypass_debounce: bool) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change

Comment on lines +124 to +127
# NOTE: very important. We need to debounce the update_alert_groups_message call here. This is because
# we may possibly receive a flood of incoming alerts. We do not want to trigger a Slack message update
# for each of these, and hence we should instead debounce them
alert_group_slack_message.update_alert_groups_message(bypass_debounce=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most important line of the PR

engine/apps/slack/scenarios/distribute_alerts.py Outdated Show resolved Hide resolved
github-merge-queue bot pushed a commit that referenced this pull request Nov 29, 2024
# What this PR does

Related to #5287

Few random "clean-ups", type improvements, etc.

Additionally, fixes a change made in #5292; we should wait to read from
`slack_message.channel.slack_id`, until we've performed the
data-migration mentioned in that PR (in the mean-time we should continue
to use `slack_message._channel_id`).

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
@@ -34,41 +30,12 @@ def __init__(
else:
self._slack_client = SlackClient(slack_team_identity)

def update_alert_group_slack_message(self, alert_group: "AlertGroup") -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decided to move all of this logic to the new update_alert_group_slack_message Celery task

Comment on lines +370 to +372
f"{error_message_verb} Alert Groups in Slack, for integration {integration.verbal_name}, is "
f"temporarily rate-limited (due to a Slack rate-limit). Meanwhile, you can still find new Alert Groups "
f'in the <{integration.new_incidents_web_link}|"Alert Groups" web page>'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another main change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:no public docs Added to a PR that does not require public documentation updates release:patch PR will be added to "Other Changes" section of release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants