🚀 Enterprise-grade Ansible automation for AxonOps monitoring and management platform
Overview • Features • Quick Start • Installation • Configuration • Usage • Examples
This repository provides production-ready Ansible playbooks to automate the configuration of AxonOps - a comprehensive management platform for Apache Cassandra® and Apache Kafka®. With these playbooks, you can programmatically configure alerts, dashboards, backups, and monitoring rules without manual GUI interaction.
Note: This project configures AxonOps settings on SaaS or self-hosted installations. For installing AxonOps itself, see the official AxonOps installation documentation.
This automation framework configures:
- 📊 100+ Pre-defined Metric Alerts - CPU, memory, disk, latency, timeouts, and Cassandra/Kafka-specific metrics
- 📝 20+ Log Alert Rules - Node failures, SSL issues, repairs, disk space, and error patterns
- 🔔 Multi-Channel Alert Routing - Slack, PagerDuty, OpsGenie, ServiceNow, Microsoft Teams
- 💾 Automated Backup Schedules - S3, Azure Blob, SFTP with retention policies
- 🏥 Service Health Checks - TCP ports, shell scripts, SSL certificates, system maintenance
- 🔧 Advanced Features - Adaptive repair, commit log archiving, agent tolerance settings
- Multi-Cluster Support - Configure all clusters in your organization or target specific ones
- Hierarchical Configuration - Organization-wide defaults with cluster-specific overrides
- Idempotent Operations - Safe to run multiple times
- YAML Validation - Built-in schema validation for all configurations
- Enterprise Integrations - Native support for major alerting and incident management platforms
- Cross-Platform - Support for both Apache Cassandra and Apache Kafka
Metric Alerts
- CPU usage (warning: 90%, critical: 99%)
- Memory utilization (warning: 85%, critical: 95%)
- Disk usage per mount point (warning: 75%, critical: 90%)
- IO wait times (warning: 20%, critical: 50%)
- Garbage collection duration (warning: 5s, critical: 10s)
- NTP time drift monitoring
- Coordinator read/write latencies (per consistency level)
- Read/write timeouts and unavailables
- Dropped messages (mutations, reads, hints)
- Thread pool congestion (blocked tasks, pending requests)
- Compaction backlogs
- Tombstone scanning thresholds
- SSTable counts and bloom filter efficiency
- Hint creation rates
- Cache hit rates
- Broker availability
- Controller status
- Network processor utilization
- Request queue sizes
- Offline/under-replicated partitions
- Authentication failures
- Metadata errors
Log Alerts
- Node DOWN events
- TLS/SSL handshake failures
- Gossip message drops
- Stream session failures
- SSTable corruption
- Disk space issues
- JVM memory problems
- Large partition warnings
- Repair monitoring
- Jemalloc loading issues
Service Checks
- Schema agreement validation
- Node status monitoring
- SSL certificate expiration
- System reboot requirements
- AWS maintenance events
- CQL connectivity tests
- Custom shell script checks
```bash
# 1. Clone the repository
git clone https://github.com/axonops/axonops-config-automation.git
cd axonops-config-automation

# 2. Set your environment variables
export AXONOPS_ORG='your-organization'
export AXONOPS_TOKEN='your-api-token'

# 3. Run the playbooks
make endpoints        # Configure alert integrations
make routes           # Set up alert routing
make metrics-alerts   # Create metric-based alerts
make log-alerts       # Create log-based alerts
make service-checks   # Configure health checks
make backups          # Set up backup schedules
```
- Ansible >= 2.10
- Python >= 3.8
- make (or use the provided `make.sh` script)
RedHat/RockyLinux (8+)
```bash
sudo dnf -y install epel-release
sudo dnf -y install ansible make
```
Debian/Ubuntu
```bash
sudo apt update
sudo apt -y install ansible make
```
Using Virtualenv
```bash
virtualenv ~/py-axonops
source ~/py-axonops/bin/activate
pip3 install -r requirements.txt
```
Using Pipenv (Recommended)
```bash
pipenv install
export PIPENV=true
```
Configure your environment using the provided template:
```bash
# Copy and edit the environment template
cp export_tokens.sh export_tokens.sh.local
vim export_tokens.sh.local

# Source your configuration
source ./export_tokens.sh.local
```
```bash
# Organization name (mandatory)
export AXONOPS_ORG='example'

# For AxonOps SaaS
export AXONOPS_TOKEN='your-api-token'

# For AxonOps On-Premise
export AXONOPS_URL='https://your-axonops-instance.com'
export AXONOPS_USERNAME='your-username'
export AXONOPS_PASSWORD='your-password'
```
```
config/
└── YOUR_ORG_NAME/                        # Organization-level configs
    ├── alert_endpoints.yml               # Alert integrations (Slack, PagerDuty, etc.)
    ├── metric_alert_rules.yml            # Default metric alerts for all clusters
    ├── log_alert_rules.yml               # Default log alerts for all clusters
    ├── service_checks.yml                # Default service checks for all clusters
    │
    └── YOUR_CLUSTER_NAME/                # Cluster-specific overrides
        ├── metric_alert_rules.yml        # Additional/override metric alerts
        ├── log_alert_rules.yml           # Additional/override log alerts
        ├── service_checks.yml            # Additional/override service checks
        ├── backups.yml                   # Backup configurations
        └── kafka_metrics_alert_rules.yml # Kafka-specific alerts
```
- Organization Level: Configurations in `config/ORG_NAME/` apply to all clusters
- Cluster Level: Configurations in `config/ORG_NAME/CLUSTER_NAME/` override or extend organization settings
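As a sketch of how the hierarchy plays out, a cluster directory can redefine an org-level rule under the same `name` to tighten it for that cluster alone. The org name `example`, the cluster name `prod-cluster`, and all threshold values below are illustrative:

```yaml
# config/example/metric_alert_rules.yml — applies to every cluster
axonops_alert_rules:
  - name: CPU usage per host
    dashboard: System
    chart: CPU usage per host
    operator: '>='
    warning_value: 90
    critical_value: 99
    duration: 1h
    present: true
```

```yaml
# config/example/prod-cluster/metric_alert_rules.yml — prod-only tightening
axonops_alert_rules:
  - name: CPU usage per host
    dashboard: System
    chart: CPU usage per host
    operator: '>='
    warning_value: 80
    critical_value: 95
    duration: 30m
    present: true
```

Because the rule names match, the cluster-level file takes precedence for `prod-cluster` while every other cluster keeps the organization defaults.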
```bash
make help             # Show all available commands
make validate         # Validate YAML configurations
make endpoints        # Configure alert integrations
make routes           # Set up alert routing rules
make metrics-alerts   # Create metric-based alerts
make log-alerts       # Create log-based alerts
make service-checks   # Configure service health checks
make backups          # Set up backup schedules
make check            # Run pre-commit tests
```
You can run playbooks using either environment variables or command-line overrides:
```bash
# Using environment variables (after sourcing export_tokens.sh)
make metrics-alerts

# Using command-line overrides
make metrics-alerts AXONOPS_ORG=myorg AXONOPS_CLUSTER=prod-cluster

# Target all clusters (omit AXONOPS_CLUSTER)
make metrics-alerts AXONOPS_ORG=myorg
```
Always validate your configurations before applying:
```bash
make validate
```
This will check all YAML files against their schemas and report any errors.
Slack Integration
```yaml
# config/YOUR_ORG/alert_endpoints.yml
slack:
  - name: ops-team-alerts
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    present: true
  - name: dev-team-alerts
    webhook_url: https://hooks.slack.com/services/YOUR/OTHER/URL
    present: true
```
PagerDuty Integration
```yaml
# config/YOUR_ORG/alert_endpoints.yml
pagerduty:
  - name: critical-incidents
    integration_key: YOUR-PAGERDUTY-INTEGRATION-KEY
    present: true
```
CPU Usage Alert
```yaml
# config/YOUR_ORG/metric_alert_rules.yml
axonops_alert_rules:
  - name: CPU usage per host
    dashboard: System
    chart: CPU usage per host
    operator: '>='
    critical_value: 99
    warning_value: 90
    duration: 1h
    description: Detected high CPU usage
    present: true
```
Cassandra Latency Alert
```yaml
# config/YOUR_ORG/metric_alert_rules.yml
axonops_alert_rules:
  - name: Read latency critical
    dashboard: Coordinator
    chart: Coordinator Read Latency - LOCAL_QUORUM 99thPercentile
    operator: '>='
    critical_value: 2000000  # 2 seconds in microseconds
    warning_value: 1000000   # 1 second in microseconds
    duration: 15m
    description: High read latency detected
    present: true
```
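Latency thresholds are expressed in microseconds, so it is easy to slip an order of magnitude when editing them. A quick sanity check before committing a rule (plain shell arithmetic, nothing AxonOps-specific; the variable names are illustrative):

```bash
#!/bin/bash
# Convert human-readable second thresholds to the microsecond
# values the latency charts expect (1 s = 1,000,000 µs).
warn_seconds=1
crit_seconds=2
warning_value=$((warn_seconds * 1000000))
critical_value=$((crit_seconds * 1000000))
echo "warning_value: $warning_value"    # 1000000
echo "critical_value: $critical_value"  # 2000000
```

Pasting the computed values into the rule avoids an alert that can never fire (too high) or fires constantly (too low).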
Node Down Detection
```yaml
# config/YOUR_ORG/log_alert_rules.yml
axonops_log_alert_rules:
  - name: Node Down
    content: "is now DOWN"
    source: "/var/log/cassandra/system.log"
    warning_value: 1
    critical_value: 5
    duration: 5m
    description: "Cassandra node marked as DOWN"
    level: error,warning
    present: true
```
CQL Port Check
```yaml
# config/YOUR_ORG/service_checks.yml
tcp_checks:
  - name: cql_client_port
    target: "{{.comp_listen_address}}:{{.comp_native_transport_port}}"
    interval: 3m
    timeout: 1m
    present: true
```
Custom Shell Script
```yaml
# config/YOUR_ORG/service_checks.yml
shell_checks:
  - name: Check schema agreement
    interval: 5m
    timeout: 1m
    present: true
    command: |
      #!/bin/bash
      SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
      source "$SCRIPT_PATH/common.sh"
      schemas=$(nodetool gossipinfo | grep -i schema | awk '{print $2}' | sort -u | wc -l)
      if [[ $schemas -gt 1 ]]; then
          echo "CRITICAL - Multiple schema versions detected: $schemas"
          exit 2
      fi
      echo "OK - Schema agreement confirmed"
      exit 0
```
S3 Backup Schedule
```yaml
# config/YOUR_ORG/YOUR_CLUSTER/backups.yml
backups:
  - name: Daily S3 backup
    remote_type: s3
    datacenters:
      - dc1
    remote_path: my-backup-bucket/cassandra-backups
    local_retention: 10d
    remote_retention: 60d
    tag: "daily-backup"
    timeout: 10h
    remote: true
    schedule: true
    schedule_expr: "0 1 * * *"  # 1 AM daily
    s3_region: us-east-1
    s3_storage_class: STANDARD_IA
    present: true
```
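The `schedule_expr` field uses standard five-field cron syntax: minute, hour, day of month, month, day of week. A few common patterns, shown as illustrative fragments:

```yaml
schedule_expr: "0 1 * * *"    # daily at 01:00
# Other examples:
#   "0 */6 * * *"   - every six hours, on the hour
#   "30 2 * * 0"    - Sundays at 02:30
#   "0 3 1 * *"     - first day of each month at 03:00
```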
Azure Blob Snapshot
```yaml
# config/YOUR_ORG/YOUR_CLUSTER/backups.yml
backups:
  - name: Critical table snapshot
    remote_type: azure
    datacenters:
      - dc1
    remote_path: backups-container/cassandra
    tables:
      - 'critical_keyspace.important_table'
    local_retention: 7d
    remote_retention: 30d
    tag: "critical-data"
    timeout: 2h
    remote: true
    schedule: false  # Immediate snapshot
    azure_account: mystorageaccount
    azure_use_msi: true
    present: true
```
Route Configuration
```yaml
# config/YOUR_ORG/alert_routes.yml
axonops_alert_routes:
  # Send all critical/error alerts to PagerDuty
  - name: critical-to-pagerduty
    endpoint: critical-incidents
    endpoint_type: pagerduty
    severities:
      - error
      - critical
    override: false
    present: true

  # Send warnings to Slack
  - name: warnings-to-slack
    endpoint: ops-team-alerts
    endpoint_type: slack
    severities:
      - warning
    override: false
    present: true

  # Route backup alerts to a dedicated channel
  - name: backup-alerts
    endpoint: backup-notifications
    endpoint_type: slack
    tags:
      - backup
    severities:
      - info
      - warning
      - error
      - critical
    override: true  # Override default routing
    present: true
```
In addition to Ansible playbooks, a Python CLI is available for specific operations:
# Configure adaptive repair
python cli/axonops.py adaptive-repair \
--cluster my-cluster \
--enabled true \
--percentage 20
# View current settings
python cli/axonops.py adaptive-repair \
--cluster my-cluster \
--show
You can override any Ansible variable:
```bash
# Custom API timeout
make metrics-alerts ANSIBLE_EXTRA_VARS="api_timeout=60"

# Dry run mode
make metrics-alerts ANSIBLE_EXTRA_VARS="check_mode=true"
```
Authentication Errors
- Verify your API token has DBA-level access or above
- Check token expiration
- For on-premise, ensure URL includes protocol (https://)
Configuration Not Applied
- Run `make validate` to check YAML syntax
- Ensure `present: true` is set for items you want to create
- Check that cluster names match exactly (case-sensitive)
Module Import Errors
- Ensure you're using Python 3.8+
- Install dependencies: `pip install -r requirements.txt`
- For pipenv users: ensure `PIPENV=true` is exported
- Start with Organization Defaults: Define common alerts at the org level
- Use Cluster Overrides Sparingly: Only for cluster-specific requirements
- Validate Before Applying: Always run `make validate` first
- Version Control: Commit your `config/` directory to track changes
- Test in Non-Production: Apply to test clusters before production
- Regular Reviews: Periodically review and update alert thresholds
- Documentation: AxonOps Docs
- Issues: GitHub Issues
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- 📚 Complete Alert Reference Guide - Detailed documentation of all pre-configured alerts, thresholds, and configurations
- 🔧 AxonOps Documentation - Official AxonOps platform documentation
This project may contain trademarks or logos for projects, products, or services. Any use of third-party trademarks or logos is subject to those third parties' policies. AxonOps is a registered trademark of AxonOps Limited. Apache, Apache Cassandra, Cassandra, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Elasticsearch is a trademark of Elasticsearch B.V., registered in the U.S. and in other countries. Docker is a trademark or registered trademark of Docker, Inc. in the United States and/or other countries.