Skip to content

Commit

Permalink
Merge pull request #10 from AndrewFarley/add-more-alarms-allow-custom…
Browse files Browse the repository at this point in the history
…izing-some-alarms

Adding tags, KMS Support, Customizing some periods
  • Loading branch information
dubiety authored Apr 26, 2021
2 parents 87d718e + d77cf02 commit 2451df9
Show file tree
Hide file tree
Showing 5 changed files with 188 additions and 58 deletions.
95 changes: 53 additions & 42 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,30 +3,32 @@
[![Build Status](https://travis-ci.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.svg?branch=master)](https://travis-ci.org/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms)
[![Latest Release](https://img.shields.io/github/release/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.svg)](https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/releases)

Terraform module that configures important elasticsearch alerts using CloudWatch and sends them to an SNS topic.

Create a set of sane Elasticsearch CloudWatch alerts for monitoring the health of an elasticsearch cluster.
Terraform module that configures the [recommended Amazon ElasticSearch Alarms](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/cloudwatch-alarms.html) using CloudWatch and sends alerts to an SNS topic. By default, this module creates an SNS topic, but it can be configured to point to an existing SNS topic (see [example](./examples/use-existing-sns/main.tf))

`v1.x` supports terraform `v0.12` syntax!

This project is inspired by [CloudPosse](https://github.com/cloudposse)

It's 100% Open Source and licensed under the [APACHE2](LICENSE).

## Usage

| area | metric | comparison operator | threshold | rationale |
|------------|---------------------------|---------------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sharding | ClusterStatus.red | `>=` | 1 | At least one primary shard and its replicas are not allocated to a node for 1 minute 1 consecutive time. Threshold should always be 1. |
| Sharding | ClusterStatus.yellow | `>=` | 1 | At least one replica shard is not allocated to a node for 1 minute 1 consecutive time. Threshold should always be 1. |
| Storage | FreeStorageSpace | `<=` | 20480 MB | A node in your cluster is down to 20 GiB of free storage space for 1 minute 1 consecutive time. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node. |
| Storage | ClusterIndexWritesBlocked | `>=` | 1 | The cluster is blocking write requests for 5 minutes 1 consecutive time. Threshold should always be 1. |
| Node Count | Nodes | `<` | `x` | `x` is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. |
| Snapshot | AutomatedSnapshotFailure | `>=` | 1 | An automated snapshot failed for 1 minute 1 consecutive time. This failure is often the result of a red cluster health status. |
| CPU | CPUUtilization | `>=` | 80 % | CPU utilization average is >= 80% for 15 minutes, 3 consecutive times for the node cluster. |
| Memory | JVMMemoryPressure | `>=` | 80 % | JVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time. |
| CPU | MasterCPUUtilization | `>=` | 80 % | Dedicated master nodes' CPU utilization is >= 80% for 15 minutes, 3 consecutive times. |
| Memory | MasterJVMMemoryPressure | `>=` | 80 % | Dedicated master nodes' maximum JVM memory usage is >= 80% for 15 minutes, 1 consecutive time. |
## Metrics and Alarms

| area | metric | operator | threshold | rationale |
|------------|---------------------------|----------|-----------|----------------------------------------------------------------------------------------------------------------------------------------|
| Sharding | ClusterStatus.red | `>=` | 1 | At least one primary shard and its replicas are not allocated to a node |
| Sharding | ClusterStatus.yellow | `>=` | 1 | At least one replica shard is not allocated to a node |
| Storage | FreeStorageSpace | `<=` | 20480 MB | A node in your cluster is down to low storage space. |
| Storage | ClusterIndexWritesBlocked | `>=` | 1 | Your cluster is blocking write requests. |
| Node Count | Nodes | `<` | `x` | This alarm indicates that at least one node in your cluster has been unreachable for one day |
| Snapshot | AutomatedSnapshotFailure | `>=` | 1 | An automated snapshot failed. This failure is often the result of a red cluster health status. |
| CPU | CPUUtilization | `>=` | 80 % | 100% CPU utilization isn't uncommon, but sustained high usage is problematic. Consider using larger instance types or more instances. |
| Memory | JVMMemoryPressure | `>=` | 80 % | The cluster could encounter out of memory errors if usage increases. Consider scaling vertically. |
| CPU | MasterCPUUtilization | `>=` | 80 % | Consider using larger instance types for your dedicated master nodes. |
| Memory | MasterJVMMemoryPressure | `>=` | 80 % | Consider using larger instance types for your dedicated master nodes. |
| KMS | KMSKeyError | `>=` | 1 | The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations |
| Memory | KMSKeyInaccessible | `>=` | 80 % | The KMS encryption key that is used to encrypt data at rest in your domain has been deleted or has revoked its grants to Amazon ES |

For more information please see: [recommended Amazon ElasticSearch Alarms](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/cloudwatch-alarms.html).

## Examples

Expand All @@ -53,6 +55,9 @@ resource "aws_elasticsearch_domain" "es" {
module "es_alarms" {
source = "github::https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.git?ref=master"
domain_name = "example"
tags = {
Domain = "TestDomain"
}
}
```

Expand All @@ -64,37 +69,43 @@ module "es_alarms" {
domain_name = "example"
sns_topic = "arn:aws:sns:us-east-1:123456123456:sns-to-slack" # < Put your full SNS ARN here, if necessary read from var or a resource
create_sns_topic = false
tags = {
Domain = "TestDomain"
}
}
```


## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|:----:|:-----:|:-----:|
| `domain_name` | The Elasticserach domain name you want to monitor. | string | - | yes |
| `alarm_name_postfix` | Alarm name postfix | string | `""` | no |
| `alarm_name_prefix` | Alarm name prefix | string | `""` | no |
| `cpu_utilization_threshold` | The maximum percentage of CPU utilization | string | `80` | no |
| `free_storage_space_threshold` | The minimum amount of available storage space in MiB. | string | `20480` | no |
| `jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for all data nodes in the cluster | string | `80` | no |
| `master_cpu_utilization_threshold` | The maximum percentage of CPU utilization of master nodes | string | `""` | no |
| `master_jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for master nodes in the cluster | string | `""` | no |
| `min_available_nodes` | The minimum available (reachable) nodes to have | string | `1` | no |
| `monitor_automated_snapshot_failure` | Enable monitoring of automated snapshot failure | string | `true` | no |
| `monitor_cluster_index_writes_blocked` | Enable monitoring of cluster index writes being blocked | string | `true` | no |
| `monitor_cluster_status_is_red` | Enable monitoring of cluster status is in red | string | `true` | no |
| `monitor_cluster_status_is_yellow` | Enable monitoring of cluster status is in yellow | string | `true` | no |
| `monitor_cpu_utilization_too_high` | Enable monitoring of CPU utilization is too high | string | `true` | no |
| `monitor_free_storage_space_too_low` | Enable monitoring of cluster average free storage is to low | string | `true` | no |
| `monitor_insufficient_available_nodes` | Enable monitoring insufficient available nodes | string | `false` | no |
| `monitor_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure is too high | string | `true` | no |
| `monitor_master_cpu_utilization_too_high` | Enable monitoring of CPU utilization of master nodes are too high. Only enable this when dedicated master is enabled | string | `false` | no |
| `monitor_master_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure of master nodes are too high. Only enable this wwhen dedicated master is enabled | string | `false` | no |
| `create_sns_topic` | Will create an SNS topic, if you set this to false you MUST set `sns_topic` to a FULL ARN | string | `true` | no |
| `sns_topic` | SNS topic you want to specify. If leave empty, it will use a prefix and a timestamp appended. If `create_sns_topic` is set to false, this MUST be a FULL ARN | string | `""` | no |
| `sns_topic_postfix` | SNS topic postfix | string | `""` | no |
| `sns_topic_prefix` | SNS topic prefix | string | `""` | no |
| Name | Description | Type | Default | Required |
|-----------------------------------------------|-------------|:----:|:-------:|:--------:|
| `domain_name` | The Elasticserach domain name you want to monitor. | string | - | yes |
| `alarm_cluster_status_is_yellow_periods` | The number of periods before triggering the cluster status is yellow, raise this if desired to make less noisy | number | `1` | no |
| `alarm_free_storage_space_too_low_periods` | The number of periods before triggering the disk space is low, raise this if desired to make less noisy | number | `1` | no |
| `alarm_name_postfix` | Alarm name postfix | string | `""` | no |
| `alarm_name_prefix` | Alarm name prefix | string | `""` | no |
| `cpu_utilization_threshold` | The maximum percentage of CPU utilization | string | `80` | no |
| `free_storage_space_threshold` | The minimum amount of available storage space in MiB. | string | `20480` | no |
| `jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for all data nodes in the cluster | string | `80` | no |
| `master_cpu_utilization_threshold` | The maximum percentage of CPU utilization of master nodes | string | `""` | no |
| `master_jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for master nodes in the cluster | string | `""` | no |
| `min_available_nodes` | The minimum available (reachable) nodes to have, set to non-zero to enable alarm | string | `0` | no |
| `monitor_automated_snapshot_failure` | Enable monitoring of automated snapshot failure | bool | `true` | no |
| `monitor_cluster_index_writes_blocked` | Enable monitoring of cluster index writes being blocked | bool | `true` | no |
| `monitor_cluster_status_is_red` | Enable monitoring of cluster status is in red | bool | `true` | no |
| `monitor_cluster_status_is_yellow` | Enable monitoring of cluster status is in yellow | bool | `true` | no |
| `monitor_cpu_utilization_too_high` | Enable monitoring of CPU utilization is too high | bool | `true` | no |
| `monitor_free_storage_space_too_low` | Enable monitoring of cluster average free storage is to low | bool | `true` | no |
| `monitor_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure is too high | bool | `true` | no |
| `monitor_kms` | Enable monitoring of KMS-related metrics, enable if using KMS | bool | `false` | no |
| `monitor_master_cpu_utilization_too_high` | Enable monitoring of CPU utilization of master nodes are too high. Only enable this when dedicated master is enabled | bool | `false` | no |
| `monitor_master_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure of master nodes are too high. Only enable this wwhen dedicated master is enabled | bool | `false` | no |
| `create_sns_topic` | Will create an SNS topic, if you set this to false you MUST set `sns_topic` to a FULL ARN | bool | `true` | no |
| `sns_topic` | SNS topic you want to specify. If leave empty, it will use a prefix and a timestamp appended. If `create_sns_topic` is set to false, this MUST be a FULL ARN | string | `""` | no |
| `sns_topic_postfix` | SNS topic postfix | string | `""` | no |
| `sns_topic_prefix` | SNS topic prefix | string | `""` | no |
| `tags` | Tags to associate with all created resources | map | `{}` | no |

## Outputs

Expand Down
Loading

0 comments on commit 2451df9

Please sign in to comment.