[DOC-353] Alerts in Dagster+ #23963

Merged 1 commit on Aug 28, 2024
321 changes: 319 additions & 2 deletions docs/docs-beta/docs/dagster-plus/deployment/alerts.md
@@ -1,5 +1,322 @@
---
title: "Dagster+ alerts"
title: Setting up alerts on Dagster+
sidebar_position: 30
sidebar_label: "Dagster+ Alerts"
---
[comment]: <> (This file is automatically generated by `dagster-plus/deployment/alerts/generate_alerts_doc.py`)

# Dagster+ alerts
Dagster+ allows you to configure alerts to automatically fire in response to a range of events. These alerts can be sent to a variety of different services, depending on your organization's needs.

These alerts can be configured in the Dagster+ UI, or using the `dagster-cloud` CLI tool.

<details>
<summary>Prerequisites</summary>
- **Organization**, **Admin**, or **Editor** permissions on Dagster+
</details>

## Configuring a notification service

To start, you'll need to configure a service to send alerts. Dagster+ currently supports sending alerts through email, Microsoft Teams, PagerDuty, and Slack.

<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
No additional configuration is required to send emails from Dagster+.

All alert emails will be sent by `"[email protected]"` or `"no-reply@<subdomain>.dagster.cloud"`. Alerts can be configured to be sent to any number of emails.
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
Create an incoming webhook by following the [Microsoft Teams documentation](https://learn.microsoft.com/en-us/microsoftteams/platform/webhooks-and-connectors/how-to/add-incoming-webhook?tabs=newteams%2Cdotnet).

This will provide you with a **webhook URL** which will be required when configuring alerts in the UI (after selecting "Microsoft Teams" as your Notification Service) or using the CLI (in the `notification_service` configuration).

</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
:::note
You will need sufficient permissions in PagerDuty to add or edit services.
:::

In PagerDuty, you can either:

- [Create a new service](https://support.pagerduty.com/main/docs/services-and-integrations#create-a-service), and add Dagster+ as an integration, or
- [Edit an existing service](https://support.pagerduty.com/main/docs/services-and-integrations#edit-service-settings) to include Dagster+ as an integration

When configuring the integration, choose **Dagster+** as the integration type, and choose an integration name in the format `dagster-plus-{your_service_name}`.

After adding your new integration, you will be taken to a screen containing an **Integration Key**. This value will be required when configuring alerts in the UI (after selecting "PagerDuty" as your Notification Service) or using the CLI (in the `notification_service` configuration).

</TabItem>
<TabItem value='slack' label='Slack'>
:::note
You will need sufficient permissions in Slack to add apps to your workspace.
:::
Navigate to **Deployment > Alerts** in the Dagster+ UI and click **Connect to Slack**. From there, you can complete the installation process.

When setting up an alert, you can choose a Slack channel to send those alerts to. Make sure to invite the `@Dagster+` bot to any channel that you'd like to receive an alert in.

</TabItem>
</Tabs>

## Alerting when a run fails
You can set up alerts to notify you when a run fails.

By default, these alerts will target all runs in the deployment, but they can be scoped to runs with a specific tag.
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Run alert** from the dropdown.

5. Select **Job failure**.

If desired, add **tags** in the format `{key}:{value}` to filter the runs that will be considered.
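
As a rough illustration of the `{key}:{value}` tag format, the following Python sketch parses such strings and checks them against a run's tags. The helper names `parse_tag_filters` and `run_matches` are hypothetical and not part of Dagster+. The matching rule (a run is considered only if it carries every listed tag) is an assumption made for this example, not documented behavior.

```python
from typing import Dict, List


def parse_tag_filters(filters: List[str]) -> Dict[str, str]:
    """Parse strings of the form 'key:value' into a dict (hypothetical helper)."""
    parsed: Dict[str, str] = {}
    for item in filters:
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected 'key:value', got {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def run_matches(run_tags: Dict[str, str], filters: Dict[str, str]) -> bool:
    """Assumption for illustration: a run matches when it has every filter tag."""
    return all(run_tags.get(key) == value for key, value in filters.items())


filters = parse_tag_filters(["level:critical", "team:data"])
print(run_matches({"level": "critical", "team": "data", "env": "prod"}, filters))
print(run_matches({"level": "low", "team": "data"}, filters))
```

The first call prints `True` because both filter tags are present on the run; the second prints `False` because the `level` value differs.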

</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when a run is taking too long to complete
You can set up alerts to notify you whenever a run takes more than some threshold amount of time.

By default, these alerts will target all runs in the deployment, but they can be scoped to runs with a specific tag.
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Run alert** from the dropdown.

5. Select **Job running over** and how many hours to alert after.

If desired, add **tags** in the format `{key}:{value}` to filter the runs that will be considered.

</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/job-running-over-one-hour-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/job-running-over-one-hour-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/job-running-over-one-hour-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/job-running-over-one-hour-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when an asset fails to materialize
You can set up alerts to notify you when an asset materialization attempt fails.

By default, these alerts will target all assets in the deployment, but they can be scoped to a specific asset or group of assets.
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Asset alert** from the dropdown.

5. Select **Failure** under the **Materializations** heading.

If desired, select a **target** from the dropdown menu to scope this alert to a specific asset or group.

</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when an asset check fails
You can set up alerts to notify you when an asset check fails.

By default, these alerts will target all assets in the deployment, but they can be scoped to checks on a specific asset or group of assets.
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Asset alert** from the dropdown.

5. Select **Failed (ERROR)** under the **Asset Checks** heading.

If desired, select a **target** from the dropdown menu to scope this alert to a specific asset or group.

</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/asset-check-failed-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/asset-check-failed-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/asset-check-failed-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/asset-check-failed-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when a schedule or sensor tick fails
You can set up alerts to fire when any schedule or sensor tick across your entire deployment fails.

Alerts are sent only when a schedule/sensor transitions from **success** to **failure**, so only the initial failure will trigger the alert.
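
The transition rule can be illustrated with a minimal Python sketch. The status strings and the `ticks_that_alert` helper are hypothetical names chosen for illustration and are not part of the Dagster+ API. The sketch assumes an alert fires on a tick that fails immediately after a successful one. It also assumes, for this example only, that a failure with no prior tick counts as a transition.

```python
from typing import List


def ticks_that_alert(statuses: List[str]) -> List[int]:
    """Return the indices of ticks that would trigger an alert.

    Illustrative assumption: only a success -> failure transition alerts,
    and the state before the first tick is treated as success.
    """
    alerting: List[int] = []
    previous = "success"
    for index, status in enumerate(statuses):
        if status == "failure" and previous == "success":
            alerting.append(index)
        previous = status
    return alerting


history = ["success", "failure", "failure", "success", "failure"]
print(ticks_that_alert(history))  # [1, 4]
```

The second consecutive failure (index 2) does not alert again; a new alert fires only after the schedule or sensor has recovered and then failed once more.
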
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Schedule/Sensor alert** from the dropdown.
</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/schedule-sensor-failure-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when a code location fails to load
You can set up alerts to fire when any code location fails to load due to an error.
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Code location error alert** from the dropdown.
</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>

## Alerting when a Hybrid agent becomes unavailable
:::note
This is only available for [Hybrid](/todo) deployments.
:::

You can set up alerts to fire if your Hybrid agent hasn't sent a heartbeat in the last 5 minutes.
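
The heartbeat condition amounts to a staleness check, sketched below in Python. The function name `agent_is_unavailable` and the timestamps are hypothetical. Only the 5-minute window comes from the text above, and the treatment of a heartbeat that is exactly 5 minutes old is an assumption.

```python
from datetime import datetime, timedelta, timezone

# The 5-minute window is taken from the description above.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def agent_is_unavailable(last_heartbeat: datetime, now: datetime) -> bool:
    """True when the most recent heartbeat is older than the timeout window.

    Assumption: a heartbeat exactly 5 minutes old still counts as available.
    """
    return now - last_heartbeat > HEARTBEAT_TIMEOUT


now = datetime(2024, 8, 28, 12, 0, tzinfo=timezone.utc)
print(agent_is_unavailable(now - timedelta(minutes=2), now))  # False
print(agent_is_unavailable(now - timedelta(minutes=7), now))  # True
```
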
<Tabs groupId="ui_or_code">
<TabItem value='ui' label='In the UI'>
1. In the Dagster UI, click **Deployment**.
2. Click the **Alerts** tab.
3. Click **Add alert policy**.
4. Select **Code location error alert** from the dropdown.
</TabItem>
<TabItem value='code' label='In code'>
Execute the following command to sync the configured alert policy to your Dagster+ deployment.

```bash
dagster-cloud deployment alert-policies sync -a /path/to/alert_policies.yaml
```
<Tabs groupId="notification_service">
<TabItem value='email' label='Email'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-email.yaml" language="yaml" />
</TabItem>
<TabItem value='microsoft_teams' label='Microsoft Teams'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-microsoft_teams.yaml" language="yaml" />
</TabItem>
<TabItem value='pagerduty' label='PagerDuty'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-pagerduty.yaml" language="yaml" />
</TabItem>
<TabItem value='slack' label='Slack'>
<CodeExample filePath="dagster-plus/deployment/alerts/code-location-error-slack.yaml" language="yaml" />
</TabItem>
</Tabs>

</TabItem>
</Tabs>
@@ -0,0 +1,20 @@
# alert_policies.yaml

alert_policies:
  alert_targets:
    - asset_key_target:
        asset_key:
          - s3
          - report
    - asset_group_target:
        asset_group: transformed
        location_name: prod
        repo_name: __repository__
  description: Sends an email when an asset check fails.
  event_types:
    - ASSET_CHECK_SEVERITY_ERROR
  name: asset-check-failed-email
  notification_service:
    email_addresses:
      - [email protected]
      - [email protected]
@@ -0,0 +1,18 @@
# alert_policies.yaml

alert_policies:
  alert_targets:
    - asset_key_target:
        asset_key:
          - s3
          - report
    - asset_group_target:
        asset_group: transformed
        location_name: prod
        repo_name: __repository__
  description: Sends a Microsoft Teams webhook when an asset check fails.
  event_types:
    - ASSET_CHECK_SEVERITY_ERROR
  name: asset-check-failed-microsoft_teams
  notification_service:
    webhook_url: https://yourdomain.webhook.office.com/...
@@ -0,0 +1,18 @@
# alert_policies.yaml

alert_policies:
  alert_targets:
    - asset_key_target:
        asset_key:
          - s3
          - report
    - asset_group_target:
        asset_group: transformed
        location_name: prod
        repo_name: __repository__
  description: Sends a PagerDuty alert when an asset check fails.
  event_types:
    - ASSET_CHECK_SEVERITY_ERROR
  name: asset-check-failed-pagerduty
  notification_service:
    integration_key: <pagerduty_integration_key>