doctor

A health monitoring daemon for Kava application and node infrastructure, with configurable metric collection backends (e.g. stdout, file, or AWS CloudWatch).

Usage

System Overview

Configuration

$ doctor --help
Usage of doctor:
      --autoheal                                          whether doctor should take active measures to attempt to heal the kava process (e.g. place on standby if it falls significantly behind live)
      --autoheal_blockchain_service_name string           the name of the systemd service running the blockchain. this is the service that gets restarted in the autoheal process (default "kava")
      --autoheal_initial_delay_seconds int                initial delay before autoheal attempts a restart. useful for allowing longer startup time for the chain, like during statesync initialization
      --autoheal_restart_delay_seconds int                number of seconds autohealing routines will wait to restart the endpoint, effective from the last time it was restarted and overriding the values of downtime_restart_threshold_seconds and no_new_blocks_restart_threshold_seconds (default 2700)
      --autoheal_sync_latency_tolerance_seconds int       how far behind live the node is allowed to fall before autohealing actions are attempted (default 120)
      --autoheal_sync_to_live_tolerance_seconds int       how close to the current time the node must resync to before being considered in sync again (default 12)
      --aws_region string                                 aws region to use for sending metrics to CloudWatch (default "us-east-1")
      --config_filepath string                            filepath to json config file to use (default "~/.kava/doctor/config.json")
      --debug                                             controls whether debug logging is enabled
      --default_monitoring_interval_seconds int           default interval doctor will use for the various monitoring routines (default 5)
      --downtime_restart_threshold_seconds int            how many continuous seconds the endpoint being monitored has to be offline or unresponsive before autohealing will be attempted (default 300)
      --health_check_timeout_seconds int                  max number of seconds doctor will wait for a health check response from the endpoint (default 10)
      --interactive                                       controls whether an interactive terminal UI is displayed
      --kava_api_address string                           URL of the endpoint that doctor should monitor (default "https://rpc.data.kava.io")
      --max_metric_samples_to_retain_per_node int         maximum number of metric samples that will be kept in memory per node (default 10000)
      --metric_collectors string                          where to send collected metrics to, multiple collectors can be specified as a comma separated list, supported collectors are [file cloudwatch] (default "file")
      --metric_namespace string                           top level namespace to use for grouping all metrics sent to cloudwatch (default "kava")
      --metric_samples_to_use_for_synthetic_metrics int   number of metric samples to use when calculating synthetic metrics such as the node hash rate (default 60)
      --no_new_blocks_restart_threshold_seconds int       how many continuous seconds the endpoint being monitored has not produced a new block before autohealing will be attempted (default 300)
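
The two sync tolerance flags read like a hysteresis band: a node is flagged once it falls more than `autoheal_sync_latency_tolerance_seconds` behind live, and is only treated as in sync again once it is within `autoheal_sync_to_live_tolerance_seconds`. The Go sketch below illustrates that reading. It is an assumption based on the flag descriptions, not Doctor's actual implementation; the `syncTracker` type, its fields, and the exact comparison operators are hypothetical.

```go
package main

import "fmt"

// syncTracker is a hypothetical type (not part of Doctor) that models the
// two tolerance flags as a hysteresis band.
type syncTracker struct {
	latencyToleranceSeconds int  // assumed to map to --autoheal_sync_latency_tolerance_seconds
	toLiveToleranceSeconds  int  // assumed to map to --autoheal_sync_to_live_tolerance_seconds
	outOfSync               bool // current state
}

// observe records how many seconds behind live the node is and reports
// whether the node is considered out of sync after this observation.
func (t *syncTracker) observe(secondsBehind int) bool {
	switch {
	case !t.outOfSync && secondsBehind > t.latencyToleranceSeconds:
		// Fell too far behind: autohealing actions would start here.
		t.outOfSync = true
	case t.outOfSync && secondsBehind <= t.toLiveToleranceSeconds:
		// Caught up closely enough to count as in sync again.
		t.outOfSync = false
	}
	return t.outOfSync
}

func main() {
	t := &syncTracker{latencyToleranceSeconds: 120, toLiveToleranceSeconds: 12}
	for _, behind := range []int{5, 130, 60, 10} {
		fmt.Println(behind, t.observe(behind))
	}
}
```

With the default values, a node that is 60 seconds behind stays flagged if it was previously 130 seconds behind, because 60 is still above the 12 second resync tolerance.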

Doctor can be configured using any combination of command line flags (detailed above), environment variables, and a JSON configuration file.

By default, Doctor looks for a configuration file at ~/.kava/doctor/config.json.

An example configuration file is provided below:

{
    "kava_api_address": "https://rpc.data.kava.io",
    "debug": true,
    "interactive": true,
    "default_monitoring_interval_seconds": 3,
    "max_metric_samples_to_retain_per_node": 10000,
    "metric_samples_to_use_for_synthetic_metrics": 60,
    "metric_collectors": "file,cloudwatch",
    "metric_namespace": "kava/mainnet-archive",
    "aws_region": "us-east-1",
    "autoheal": true,
    "autoheal_blockchain_service_name": "kava",
    "autoheal_sync_latency_tolerance_seconds": 120,
    "autoheal_sync_to_live_tolerance_seconds": 12,
    "downtime_restart_threshold_seconds": 300,
    "no_new_blocks_restart_threshold_seconds": 300,
    "health_check_timeout_seconds": 10,
    "autoheal_restart_delay_seconds": 2700
}
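
As a rough illustration of how the restart-related settings in this example might interact, the Go sketch below treats `autoheal_restart_delay_seconds` as a cool-down since the last restart that takes priority over the two threshold settings. This is one interpretation of the flag descriptions, not Doctor's actual logic; the `restartPolicy` type and `shouldRestart` method are hypothetical.

```go
package main

import "fmt"

// restartPolicy is a hypothetical struct mirroring three settings from the
// example configuration file.
type restartPolicy struct {
	downtimeThresholdSeconds    int // downtime_restart_threshold_seconds
	noNewBlocksThresholdSeconds int // no_new_blocks_restart_threshold_seconds
	restartDelaySeconds         int // autoheal_restart_delay_seconds
}

// shouldRestart reports whether a restart would be attempted under this
// sketch's assumptions: the cool-down since the last restart is checked
// first, then either threshold being exceeded triggers a restart.
func (p restartPolicy) shouldRestart(offlineSeconds, secondsSinceLastBlock, secondsSinceLastRestart int) bool {
	if secondsSinceLastRestart < p.restartDelaySeconds {
		return false // still inside the cool-down window
	}
	return offlineSeconds >= p.downtimeThresholdSeconds ||
		secondsSinceLastBlock >= p.noNewBlocksThresholdSeconds
}

func main() {
	p := restartPolicy{
		downtimeThresholdSeconds:    300,
		noNewBlocksThresholdSeconds: 300,
		restartDelaySeconds:         2700,
	}
	fmt.Println(p.shouldRestart(0, 10, 3000))   // healthy node
	fmt.Println(p.shouldRestart(400, 400, 600)) // unhealthy, but restarted recently
	fmt.Println(p.shouldRestart(0, 400, 3000))  // stalled and past the cool-down
}
```

Running this prints `false`, `false`, `true`.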

Any configuration provided via environment variables will override file-based configuration:

DOCTOR_DEBUG=false doctor

Command line flags override any settings from the configuration file or environment variables:

doctor --debug=true
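
The precedence described in this section (configuration file, then environment variables, then command line flags) can be sketched in Go using only the standard library. This is a minimal illustration for a single boolean setting, not Doctor's actual configuration loader; the `resolveDebug` helper and its parameters are hypothetical, and the environment value stands in for `DOCTOR_DEBUG` from the example above.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveDebug is a hypothetical helper (not part of Doctor) that applies
// the documented precedence to one setting:
//   - fileValue: the value read from the JSON config file
//   - envValue:  the raw DOCTOR_DEBUG value, empty when unset
//   - flagValue: the --debug flag value, nil when the flag was not passed
func resolveDebug(fileValue bool, envValue string, flagValue *bool) bool {
	result := fileValue
	if envValue != "" {
		// Ignore values that do not parse as booleans (an assumption of this sketch).
		if parsed, err := strconv.ParseBool(envValue); err == nil {
			result = parsed
		}
	}
	if flagValue != nil {
		result = *flagValue
	}
	return result
}

func main() {
	on := true
	fmt.Println(resolveDebug(true, "", nil))      // file only
	fmt.Println(resolveDebug(true, "false", nil)) // environment overrides file
	fmt.Println(resolveDebug(true, "false", &on)) // flag overrides both
}
```

Running this prints `true`, `false`, `true`.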

Interactive Mode

Startup Screen

Interactive mode is still a work in progress; the Sync Metrics, Uptime Metric, and Messages areas are in a v1 state.

Metrics Display

Daemon Mode

$ doctor --debug
https://rpc.data.kava.io uptime 100.000000%
doctor 2022/07/29 15:52:29 cli.go:195: node state {NodeInfo:{Id:06ff9460163caac703c44da1b2e3108e1ba087cd Moniker:kava-outbound-archive} SyncInfo:{LatestBlockHeight:894449 LatestBlockTime:2022-07-29 22:52:22.782040666 +0000 UTC CatchingUp:false}}
https://rpc.data.kava.io node 06ff9460163caac703c44da1b2e3108e1ba087cd is synched up to block 894449, 0 seconds behind live, hashing 0.172968 blocks per second, status check took 284 milliseconds
https://rpc.data.kava.io uptime 100.000000%
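
The "blocks per second" figure in the output above is a synthetic metric derived from recent samples (see `metric_samples_to_use_for_synthetic_metrics`). One plausible way to compute such a rate is sketched below in Go. This is an assumption about the approach, not Doctor's actual code; the `sample` type and `blocksPerSecond` function are hypothetical, timestamps are simplified to Unix seconds, and the sample values are made up for illustration.

```go
package main

import "fmt"

// sample is a hypothetical record of one status check.
type sample struct {
	height      int64 // latest block height reported by the node
	unixSeconds int64 // when the sample was taken
}

// blocksPerSecond estimates the block rate across the most recent `window`
// samples by dividing the height gained by the time elapsed.
func blocksPerSecond(samples []sample, window int) float64 {
	if window > 0 && len(samples) > window {
		samples = samples[len(samples)-window:]
	}
	if len(samples) < 2 {
		return 0 // not enough data to compute a rate
	}
	first, last := samples[0], samples[len(samples)-1]
	elapsed := float64(last.unixSeconds - first.unixSeconds)
	if elapsed <= 0 {
		return 0
	}
	return float64(last.height-first.height) / elapsed
}

func main() {
	// Illustrative values only.
	samples := []sample{
		{height: 100, unixSeconds: 0},
		{height: 105, unixSeconds: 30},
		{height: 109, unixSeconds: 60},
	}
	fmt.Printf("hashing %.2f blocks per second\n", blocksPerSecond(samples, 60))
}
```

A larger window smooths out short pauses between blocks at the cost of reacting more slowly to real changes in block production.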

Development

Dependencies

Building

To build a docker container with the kava and doctor binaries installed:

make build

To build the doctor binary for executing on your local dev machine:

make install

Running

To run a dockerized kava node with doctor monitoring the kava application:

make run

Testing

To run the unit tests:

make test