Commit

Change from averages to statistic_values

calh committed Jul 12, 2022
1 parent 479be5f commit b8c310e

Showing 7 changed files with 179 additions and 45 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
deploy.zip
51 changes: 48 additions & 3 deletions README.md
@@ -4,7 +4,7 @@ There are quite a few things going on under the hood of Aurora, some of which might be
consuming extra resources without much explanation.

For each Aurora Postgres instance, there are `RDS processes`, `Aurora Storage Daemon`,
`rsdadmin` background processes, aurora runtimes, and `OS processes`. You can see
`rsdadmin` background processes, aurora runtimes and `OS processes`. You can see
a glimpse of them in the RDS dashboard, under Monitoring -> OS Process List.

After spending months tracking down unexplained CPU utilization, I discovered
@@ -18,6 +18,11 @@ and then publishes custom CloudWatch metrics for a given RDS instance.

(Neat screenshot here)

It also pulls the overall CPU metrics (user, system, IRQ, nice, etc.)
and publishes each of them as a separate metric.

Inspiration for this project was taken from the [rds top script](https://gist.github.com/matheusoliveira/0e9b13d2fca6e7ab993c03e946806503).

While this was written for Aurora Postgres, it could be tailored for MySQL as well.

### First Local Test
@@ -30,12 +35,12 @@ $ bundle install
$ export AWS_ACCESS_KEY_ID=...
$ export AWS_SECRET_ACCESS_KEY=...
$ export AWS_DEFAULT_REGION=...
$ bundle exec ruby runner.rb
$ bundle exec ruby runner.rb my-instance-name
```

Wait a few minutes, and then check out your [CloudWatch custom metrics](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2).

There should be an `RDS_OS_Metrics` custom namespace with everything fun in it.
There should be `RDS_OS_Metrics` and `RDS_CPU_Metrics` custom namespaces with everything fun in them.

### First Deployment

@@ -46,6 +51,9 @@ $ ./script/ci_build
$ ./script/create_function --profile me --region us-east-1 --name rdsosmetrics
```

Note: You only need one function deployed. EventBridge Rules can execute
the same function for many instance names.
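To make the one-function, many-rules idea concrete, here's a minimal sketch of the constant JSON input each EventBridge rule would hand to the shared Lambda (instance names are hypothetical):

```ruby
require 'json'

# One Lambda deployment; each EventBridge rule passes a different constant
# input, so the same function serves every instance.
instances = ["prod-writer", "prod-reader"]  # hypothetical instance names

inputs = instances.map { |i| JSON.generate("instance_id" => i) }
puts inputs.first  # {"instance_id":"prod-writer"}
```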

### Create an EventBridge Rule

[Create a Rule](https://us-east-1.console.aws.amazon.com/events/home?region=us-east-1#/rules/create)
@@ -98,6 +106,10 @@ Everything else I group into an Other category.
}
```

Note: Each metric is published with its minimum, maximum, sum, and sample count;
Average is derived by CloudWatch. Since bursty CPU activity is hard to catch
when averaged over a whole minute, try stat=Maximum or stat=Sum and see which looks better.
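To see why the choice of stat matters, here's a tiny standalone sketch (sample values hypothetical) of the shape published via `statistic_values` and what each stat then shows:

```ruby
# Hypothetical CPU utilization samples from one interval; one short burst
samples = [2.0, 3.0, 1.0, 98.0, 2.5, 3.5]

# The shape published to CloudWatch via statistic_values
stats = {
  sample_count: samples.size,
  sum:          samples.sum,
  minimum:      samples.min,
  maximum:      samples.max
}

average = stats[:sum] / stats[:sample_count]  # what stat=Average derives
puts average          # the burst is diluted to ~18.3
puts stats[:maximum]  # stat=Maximum surfaces the 98.0 burst
```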

And one for memory, although this isn't as interesting:


@@ -120,11 +132,44 @@ And one for memory, although this isn't as interesting:
}
```

Switching to overall CPU metrics, create something like:

```
{
"metrics": [
[ "RDS_CPU_Metrics", "nice", "rds_instance", "prod-writer" ],
[ ".", "irq", ".", "." ],
[ ".", "guest", ".", "." ],
[ ".", "idle", ".", ".", { "visible": false } ],
[ ".", "steal", ".", "." ],
[ ".", "user", ".", "." ],
[ ".", "wait", ".", "." ],
[ ".", "total", ".", "." ],
[ ".", "system", ".", "." ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "Average",
"period": 60
}
```

### Deploying Updated Code

```
$ ./script/ci_build
$ ./script/update_function --profile me --region us-east-1 --name rdsosmetrics
```

### Wishlist Items

If anyone would like to add features to this script, here are a few things
that would be great to have:

* The option to publish per-second granularity metrics for each `RDSOSMetrics` record.
This would let CloudWatch offer richer statistics like percentiles, interquartile mean (IQM), winsorized mean (WM), percentile rank (PR), etc.
* The option to publish more metrics like uptime, disk IO, network IO, and load average. Although
most of these are available in some other form, they're a little more annoying to work with.
(e.g., load average comes from Logs Insights, which is awkward for running stats and aggregations)
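On the first wishlist item: CloudWatch's `PutMetricData` also accepts paired `values`/`counts` arrays (up to 150 distinct values per datum), which is what unlocks percentile-style statistics. A rough sketch, with hypothetical sample data, of reshaping per-second samples into that form:

```ruby
# Hypothetical per-second CPU samples from one RDSOSMetrics record
samples = [1.5, 1.5, 2.0, 97.0, 2.0, 1.5]

tally = samples.tally  # {1.5=>3, 2.0=>2, 97.0=>1}

metric_datum = {
  metric_name: "user",
  unit: "Percent",
  values: tally.keys,   # distinct sample values
  counts: tally.values  # how often each value occurred
}
# An Aws::CloudWatch::Client would then publish it with something like:
#   cw.put_metric_data(namespace: "RDS_CPU_Metrics", metric_data: [metric_datum])
```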
170 changes: 129 additions & 41 deletions handler.rb
@@ -1,4 +1,5 @@
require 'json'
require 'bigdecimal'
require 'aws-sdk-rds'
require 'aws-sdk-cloudwatchlogs'
require 'aws-sdk-cloudwatch'
@@ -39,9 +40,105 @@ def handler(event:, context:)
start_time: (Time.now - ChronicDuration.parse(interval)).to_i * 1000
})

publish_rds_os_metrics(instance_id, events)
publish_rds_cpu_metrics(instance_id, events)

rescue => e
puts "Exception: #{e.message}"
raise e
end

# Take a process name, categorize it and return the
# dimensions of a CW metric for this PID
def parse_process_dimension(instance_id, name)
dimension = [
{ name: "rds_instance", value: instance_id }
]
case name
when /^postgres: postgres/, "postgres"
dimension.push({ name: "service", value: "postgres"})
when /^postgres: rdsadmin/, /^postgres: aurora/
dimension.push({ name: "service", value: "postgres-aurora"})
when /^postgres: /, "pg_controldata"
dimension.push({ name: "service", value: "postgres-background"})
when "Aurora Storage Daemon"
dimension.push({ name: "service", value: "aurora-storage"})
when "RDS processes"
dimension.push({ name: "service", value: "rds-processes"})
when "OS processes"
dimension.push({ name: "service", value: "os-processes"})
else
puts "Can't figure out what this process is: #{name}"
end

dimension
end
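The name-to-service mapping above can be sanity-checked in isolation. Here's a condensed copy of the case logic (returning a plain string, and "other" on the fall-through, which differs slightly from the real function):

```ruby
# Condensed copy of parse_process_dimension's case logic, for testing
# the process-name categorization on its own.
def service_for(name)
  case name
  when /^postgres: postgres/, "postgres"          then "postgres"
  when /^postgres: rdsadmin/, /^postgres: aurora/ then "postgres-aurora"
  when /^postgres: /, "pg_controldata"            then "postgres-background"
  when "Aurora Storage Daemon"                    then "aurora-storage"
  when "RDS processes"                            then "rds-processes"
  when "OS processes"                             then "os-processes"
  else "other"
  end
end

puts service_for("postgres: rdsadmin rdsadmin [local]")  # postgres-aurora
puts service_for("postgres: walwriter")                  # postgres-background
puts service_for("Aurora Storage Daemon")                # aurora-storage
```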

# Publish total CPU metrics for guest, irq, system,
# wait, idle, user, steal, nice, and total.
def publish_rds_cpu_metrics(instance_id, events)
sums = {}
minimums = {}
maximums = {}
event_count = 0

events.events.each do |event|
timestamp = Time.at(event.timestamp / 1000)
data = JSON.parse(event.message)
data['cpuUtilization'].each do |metric, value|
dimension = [
{ name: "rds_instance", value: instance_id },
{ name: "metric", value: metric }
]
sums[dimension] ||= 0
sums[dimension] += value

minimums[dimension] ||= BigDecimal('Infinity')
if value.to_f < minimums[dimension]
minimums[dimension] = value.to_f
end

maximums[dimension] ||= BigDecimal('-Infinity')
if value.to_f > maximums[dimension]
maximums[dimension] = value.to_f
end
end
event_count += 1
end

cw = Aws::CloudWatch::Client.new
sums.keys.each do |dimension|
metric_name = dimension.last[:value]
cw.put_metric_data({
namespace: "RDS_CPU_Metrics",
metric_data: [{
metric_name: metric_name,
timestamp: Time.now,
unit: "Percent",
statistic_values: {
sample_count: event_count,
sum: sums[dimension],
minimum: minimums[dimension],
maximum: maximums[dimension]
},
# divide by event count for average
# NOTE: statistic_values and value are mutually exclusive
#value: (sums[dimension].to_f / event_count.to_f),
dimensions: dimension[0..-2]
}]
})
end

end

# Publish per-process (categorized) CPU and memory
# utilization
def publish_rds_os_metrics(instance_id, events)
# Aggregation of all metrics for this time interval
# [ dimensions ] => value
sums = {}
minimums = {}
maximums = {}
event_count = 0

events.events.each do |event|
@@ -53,66 +150,57 @@ def handler(event:, context:)
# interested in just percentages
sums[ dimension + [{name:"metric",value:"CPU"}] ] ||= 0
sums[ dimension + [{name:"metric",value:"CPU"}] ] += process['cpuUsedPc'].to_f
#if process['cpuUsedPc'].to_f > sums[ dimension + [{name:"metric",value:"CPU"}] ].to_f
# sums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
#end

minimums[ dimension + [{name:"metric",value:"CPU"}] ] ||= BigDecimal('Infinity')
if process['cpuUsedPc'].to_f < minimums[ dimension + [{name:"metric",value:"CPU"}] ]
minimums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
end

maximums[ dimension + [{name:"metric",value:"CPU"}] ] ||= BigDecimal('-Infinity')
if process['cpuUsedPc'].to_f > maximums[ dimension + [{name:"metric",value:"CPU"}] ]
maximums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
end

sums[ dimension + [{name:"metric",value:"Memory"}] ] ||= 0
sums[ dimension + [{name:"metric",value:"Memory"}] ] += process['memoryUsedPc'].to_f
#if process['memoryUsedPc'].to_f > sums[ dimension + [{name:"metric",value:"Memory"}] ]
# sums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
#end

minimums[ dimension + [{name:"metric",value:"Memory"}] ] ||= BigDecimal('Infinity')
if process['memoryUsedPc'].to_f < minimums[ dimension + [{name:"metric",value:"Memory"}] ]
minimums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
end

maximums[ dimension + [{name:"metric",value:"Memory"}] ] ||= BigDecimal('-Infinity')
if process['memoryUsedPc'].to_f > maximums[ dimension + [{name:"metric",value:"Memory"}] ]
maximums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
end

end
event_count += 1
end

# Iterate over the sums and publish average statistics
# for this time interval
cw = Aws::CloudWatch::Client.new
sums.each do |dimension, value|
metric_name = dimension.pop[:value]
sums.keys.each do |dimension|
metric_name = dimension.last[:value]
cw.put_metric_data({
namespace: "RDS_OS_Metrics",
metric_data: [{
metric_name: metric_name,
timestamp: Time.now,
unit: "Percent",
statistic_values: {
sample_count: event_count,
sum: sums[dimension],
minimum: minimums[dimension],
maximum: maximums[dimension]
},
# divide by event count for average
value: (value.to_f / event_count.to_f),
# NOTE: Do we want to use the max instead?
#value: value.to_f,
dimensions: dimension
# NOTE: statistic_values and value are mutually exclusive
#value: (sums[dimension].to_f / event_count.to_f),
dimensions: dimension[0..-2]
}]
})
end

rescue => e
puts "Exception: #{e.message}"
raise e
end

# Take a process name, categorize it and return the
# dimensions of a CW metric for this PID
def parse_process_dimension(instance_id, name)
dimension = [
{ name: "rds_instance", value: instance_id }
]
case name
when /^postgres: postgres/, "postgres"
dimension.push({ name: "service", value: "postgres"})
when /^postgres: rdsadmin/, /^postgres: aurora/
dimension.push({ name: "service", value: "postgres-aurora"})
when /^postgres: /, "pg_controldata"
dimension.push({ name: "service", value: "postgres-background"})
when "Aurora Storage Daemon"
dimension.push({ name: "service", value: "aurora-storage"})
when "RDS processes"
dimension.push({ name: "service", value: "rds-processes"})
when "OS processes"
dimension.push({ name: "service", value: "os-processes"})
else
puts "Can't figure out what this process is: #{name}"
end

dimension
end
2 changes: 1 addition & 1 deletion runner.rb
@@ -3,7 +3,7 @@

# Set this to your db instance name
event = {
"instance_id" => "my-prod-writer"
"instance_id" => ARGV[0]
}

require "./handler.rb"
Binary file added screenshots/cpu_metrics.png
Binary file added screenshots/os_memory_metrics.png
Binary file added screenshots/os_metrics.png
