Commit

Change from averages to statistic_values

calh committed Jul 12, 2022
1 parent 479be5f commit b8c310e

Showing 7 changed files with 179 additions and 45 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
deploy.zip
51 changes: 48 additions & 3 deletions README.md
@@ -4,7 +4,7 @@ There are quite a few things going on under the hood of Aurora, some of which might be
consuming extra resources without much explanation.

For each Aurora Postgres instance, there are `RDS processes`, `Aurora Storage Daemon`,
`rsdadmin` background processes, aurora runtimes, and `OS processes`. You can see
`rsdadmin` background processes, aurora runtimes and `OS processes`. You can see
a glimpse of them in the RDS dashboard, under Monitoring -> OS Process List.

After spending months tracking down unexplained CPU utilization, I discovered
@@ -18,6 +18,11 @@ and then publishes custom CloudWatch metrics for a given RDS instance.

(Neat screenshot here)

It also pulls the overall CPU metrics (user, system, IRQ, nice, etc.)
and publishes each of them as a separate metric.

Inspiration for this project was taken from the [rds top script](https://gist.github.com/matheusoliveira/0e9b13d2fca6e7ab993c03e946806503).

While this was written for Aurora Postgres, it could be tailored for MySQL as well.

### First Local Test
@@ -30,12 +35,12 @@ $ bundle install
$ export AWS_ACCESS_KEY_ID=...
$ export AWS_SECRET_ACCESS_KEY=...
$ export AWS_DEFAULT_REGION=...
$ bundle exec ruby runner.rb
$ bundle exec ruby runner.rb my-instance-name
```

Wait a few minutes, and then check out your [CloudWatch custom metrics](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2).

There should be an `RDS_OS_Metrics` custom namespace with everything fun in it.
There should be `RDS_OS_Metrics` and `RDS_CPU_Metrics` custom namespaces with everything fun in them.

### First Deployment

@@ -46,6 +51,9 @@ $ ./script/ci_build
$ ./script/create_function --profile me --region us-east-1 --name rdsosmetrics
```

Note: You only need one function deployed. EventBridge Rules can execute
the same function for many instance names.
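To make the one-function, many-rules idea concrete, here's a minimal sketch of the constant JSON input each EventBridge rule would hand to the shared Lambda (instance names are hypothetical):

```ruby
require 'json'

# One Lambda deployment; each EventBridge rule passes a different constant
# input, so the same function serves every instance.
instances = ["prod-writer", "prod-reader"]  # hypothetical instance names

inputs = instances.map { |i| JSON.generate("instance_id" => i) }
puts inputs.first  # {"instance_id":"prod-writer"}
```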

### Create an EventBridge Rule

[Create a Rule](https://us-east-1.console.aws.amazon.com/events/home?region=us-east-1#/rules/create)
@@ -98,6 +106,10 @@ Everything else I group into an Other category.
}
```

Note: Each metric is published with its minimum, maximum, sum, and sample count;
Average is derived by CloudWatch. Since bursty CPU activity is hard to catch
when averaged over a whole minute, try stat=Maximum or stat=Sum and see which looks better.
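To see why the choice of stat matters, here's a tiny standalone sketch (sample values hypothetical) of the shape published via `statistic_values` and what each stat then shows:

```ruby
# Hypothetical CPU utilization samples from one interval; one short burst
samples = [2.0, 3.0, 1.0, 98.0, 2.5, 3.5]

# The shape published to CloudWatch via statistic_values
stats = {
  sample_count: samples.size,
  sum:          samples.sum,
  minimum:      samples.min,
  maximum:      samples.max
}

average = stats[:sum] / stats[:sample_count]  # what stat=Average derives
puts average          # the burst is diluted to ~18.3
puts stats[:maximum]  # stat=Maximum surfaces the 98.0 burst
```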

And one for memory, although this isn't as interesting:


@@ -120,11 +132,44 @@ And one for memory, although this isn't as interesting:
}
```

Switching to overall CPU metrics, create something like:

```
{
"metrics": [
[ "RDS_CPU_Metrics", "nice", "rds_instance", "prod-writer" ],
[ ".", "irq", ".", "." ],
[ ".", "guest", ".", "." ],
[ ".", "idle", ".", ".", { "visible": false } ],
[ ".", "steal", ".", "." ],
[ ".", "user", ".", "." ],
[ ".", "wait", ".", "." ],
[ ".", "total", ".", "." ],
[ ".", "system", ".", "." ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "Average",
"period": 60
}
```

### Deploying Updated Code

```
$ ./script/ci_build
$ ./script/update_function --profile me --region us-east-1 --name rdsosmetrics
```

### Wishlist Items

If anyone would like to add features to this script, here are a few things
that would be great to have:

* The option to publish per-second granularity metrics for each `RDSOSMetrics` record.
This would let CloudWatch offer richer statistics like percentiles, interquartile mean (IQM), winsorized mean (WM), percentile rank (PR), etc.
* The option to publish more metrics like uptime, disk IO, network IO, and load average. Although
most of these are available in some other form, they're a little more annoying to work with.
(e.g., load average comes from Logs Insights, which is awkward for running stats and aggregations)
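On the first wishlist item: CloudWatch's `PutMetricData` also accepts paired `values`/`counts` arrays (up to 150 distinct values per datum), which is what unlocks percentile-style statistics. A rough sketch, with hypothetical sample data, of reshaping per-second samples into that form:

```ruby
# Hypothetical per-second CPU samples from one RDSOSMetrics record
samples = [1.5, 1.5, 2.0, 97.0, 2.0, 1.5]

tally = samples.tally  # {1.5=>3, 2.0=>2, 97.0=>1}

metric_datum = {
  metric_name: "user",
  unit: "Percent",
  values: tally.keys,   # distinct sample values
  counts: tally.values  # how often each value occurred
}
# An Aws::CloudWatch::Client would then publish it with something like:
#   cw.put_metric_data(namespace: "RDS_CPU_Metrics", metric_data: [metric_datum])
```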
170 changes: 129 additions & 41 deletions handler.rb
@@ -1,4 +1,5 @@
require 'json'
require 'bigdecimal'
require 'aws-sdk-rds'
require 'aws-sdk-cloudwatchlogs'
require 'aws-sdk-cloudwatch'
@@ -39,9 +40,105 @@ def handler(event:, context:)
start_time: (Time.now - ChronicDuration.parse(interval)).to_i * 1000
})

publish_rds_os_metrics(instance_id, events)
publish_rds_cpu_metrics(instance_id, events)

rescue => e
puts "Exception: #{e.message}"
raise e
end

# Take a process name, categorize it and return the
# dimensions of a CW metric for this PID
def parse_process_dimension(instance_id, name)
dimension = [
{ name: "rds_instance", value: instance_id }
]
case name
when /^postgres: postgres/, "postgres"
dimension.push({ name: "service", value: "postgres"})
when /^postgres: rdsadmin/, /^postgres: aurora/
dimension.push({ name: "service", value: "postgres-aurora"})
when /^postgres: /, "pg_controldata"
dimension.push({ name: "service", value: "postgres-background"})
when "Aurora Storage Daemon"
dimension.push({ name: "service", value: "aurora-storage"})
when "RDS processes"
dimension.push({ name: "service", value: "rds-processes"})
when "OS processes"
dimension.push({ name: "service", value: "os-processes"})
else
puts "Can't figure out what this process is: #{name}"
end

dimension
end
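The name-to-service mapping above can be sanity-checked in isolation. Here's a condensed copy of the case logic (returning a plain string, and "other" on the fall-through, which differs slightly from the real function):

```ruby
# Condensed copy of parse_process_dimension's case logic, for testing
# the process-name categorization on its own.
def service_for(name)
  case name
  when /^postgres: postgres/, "postgres"          then "postgres"
  when /^postgres: rdsadmin/, /^postgres: aurora/ then "postgres-aurora"
  when /^postgres: /, "pg_controldata"            then "postgres-background"
  when "Aurora Storage Daemon"                    then "aurora-storage"
  when "RDS processes"                            then "rds-processes"
  when "OS processes"                             then "os-processes"
  else "other"
  end
end

puts service_for("postgres: rdsadmin rdsadmin [local]")  # postgres-aurora
puts service_for("postgres: walwriter")                  # postgres-background
puts service_for("Aurora Storage Daemon")                # aurora-storage
```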

# Publish total CPU metrics for guest, irq, system,
# wait, idle, user, steal, nice, and total.
def publish_rds_cpu_metrics(instance_id, events)
sums = {}
minimums = {}
maximums = {}
event_count = 0

events.events.each do |event|
timestamp = Time.at(event.timestamp / 1000)
data = JSON.parse(event.message)
data['cpuUtilization'].each do |metric, value|
dimension = [
{ name: "rds_instance", value: instance_id },
{ name: "metric", value: metric }
]
sums[dimension] ||= 0
sums[dimension] += value

minimums[dimension] ||= BigDecimal('Infinity')
if value.to_f < minimums[dimension]
minimums[dimension] = value.to_f
end

maximums[dimension] ||= BigDecimal('-Infinity')
if value.to_f > maximums[dimension]
maximums[dimension] = value.to_f
end
end
event_count += 1
end

cw = Aws::CloudWatch::Client.new
sums.keys.each do |dimension|
metric_name = dimension.last[:value]
cw.put_metric_data({
namespace: "RDS_CPU_Metrics",
metric_data: [{
metric_name: metric_name,
timestamp: Time.now,
unit: "Percent",
statistic_values: {
sample_count: event_count,
sum: sums[dimension],
minimum: minimums[dimension],
maximum: maximums[dimension]
},
# divide by event count for average
# NOTE: statistic_values and value are mutually exclusive
#value: (sums[dimension].to_f / event_count.to_f),
dimensions: dimension[0..-2]
}]
})
end

end

# Publish per-process (categorized) CPU and memory
# utilization
def publish_rds_os_metrics(instance_id, events)
# Aggregation of all metrics for this time interval
# [ dimensions ] => value
sums = {}
minimums = {}
maximums = {}
event_count = 0

events.events.each do |event|
@@ -53,66 +150,57 @@ def handler(event:, context:)
# interested in just percentages
sums[ dimension + [{name:"metric",value:"CPU"}] ] ||= 0
sums[ dimension + [{name:"metric",value:"CPU"}] ] += process['cpuUsedPc'].to_f
#if process['cpuUsedPc'].to_f > sums[ dimension + [{name:"metric",value:"CPU"}] ].to_f
# sums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
#end

minimums[ dimension + [{name:"metric",value:"CPU"}] ] ||= BigDecimal('Infinity')
if process['cpuUsedPc'].to_f < minimums[ dimension + [{name:"metric",value:"CPU"}] ]
minimums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
end

maximums[ dimension + [{name:"metric",value:"CPU"}] ] ||= BigDecimal('-Infinity')
if process['cpuUsedPc'].to_f > maximums[ dimension + [{name:"metric",value:"CPU"}] ]
maximums[ dimension + [{name:"metric",value:"CPU"}] ] = process['cpuUsedPc'].to_f
end

sums[ dimension + [{name:"metric",value:"Memory"}] ] ||= 0
sums[ dimension + [{name:"metric",value:"Memory"}] ] += process['memoryUsedPc'].to_f
#if process['memoryUsedPc'].to_f > sums[ dimension + [{name:"metric",value:"Memory"}] ]
# sums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
#end

minimums[ dimension + [{name:"metric",value:"Memory"}] ] ||= BigDecimal('Infinity')
if process['memoryUsedPc'].to_f < minimums[ dimension + [{name:"metric",value:"Memory"}] ]
minimums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
end

maximums[ dimension + [{name:"metric",value:"Memory"}] ] ||= BigDecimal('-Infinity')
if process['memoryUsedPc'].to_f > maximums[ dimension + [{name:"metric",value:"Memory"}] ]
maximums[ dimension + [{name:"metric",value:"Memory"}] ] = process['memoryUsedPc'].to_f
end

end
event_count += 1
end

# Iterate over the sums and publish average statistics
# for this time interval
cw = Aws::CloudWatch::Client.new
sums.each do |dimension, value|
metric_name = dimension.pop[:value]
sums.keys.each do |dimension|
metric_name = dimension.last[:value]
cw.put_metric_data({
namespace: "RDS_OS_Metrics",
metric_data: [{
metric_name: metric_name,
timestamp: Time.now,
unit: "Percent",
statistic_values: {
sample_count: event_count,
sum: sums[dimension],
minimum: minimums[dimension],
maximum: maximums[dimension]
},
# divide by event count for average
value: (value.to_f / event_count.to_f),
# NOTE: Do we want to use the max instead?
#value: value.to_f,
dimensions: dimension
# NOTE: statistic_values and value are mutually exclusive
#value: (sums[dimension].to_f / event_count.to_f),
dimensions: dimension[0..-2]
}]
})
end

rescue => e
puts "Exception: #{e.message}"
raise e
end

# Take a process name, categorize it and return the
# dimensions of a CW metric for this PID
def parse_process_dimension(instance_id, name)
dimension = [
{ name: "rds_instance", value: instance_id }
]
case name
when /^postgres: postgres/, "postgres"
dimension.push({ name: "service", value: "postgres"})
when /^postgres: rdsadmin/, /^postgres: aurora/
dimension.push({ name: "service", value: "postgres-aurora"})
when /^postgres: /, "pg_controldata"
dimension.push({ name: "service", value: "postgres-background"})
when "Aurora Storage Daemon"
dimension.push({ name: "service", value: "aurora-storage"})
when "RDS processes"
dimension.push({ name: "service", value: "rds-processes"})
when "OS processes"
dimension.push({ name: "service", value: "os-processes"})
else
puts "Can't figure out what this process is: #{name}"
end

dimension
end
2 changes: 1 addition & 1 deletion runner.rb
@@ -3,7 +3,7 @@

# Set this to your db instance name
event = {
"instance_id" => "my-prod-writer"
"instance_id" => ARGV[0]
}

require "./handler.rb"
Binary file added screenshots/cpu_metrics.png
Binary file added screenshots/os_memory_metrics.png
Binary file added screenshots/os_metrics.png
