AxonOps Configuration Automation

🚀 Enterprise-grade Ansible automation for the AxonOps monitoring and management platform

Overview · Features · Quick Start · Installation · Configuration · Usage · Examples


Overview

This repository provides production-ready Ansible playbooks to automate the configuration of AxonOps - a comprehensive management platform for Apache Cassandra® and Apache Kafka®. With these playbooks, you can programmatically configure alerts, dashboards, backups, and monitoring rules without manual GUI interaction.

Note: This project configures AxonOps settings on SaaS or self-hosted installations. For installing AxonOps itself, see the AxonOps installation documentation.

What Gets Configured?

This automation framework configures:

  • 📊 100+ Pre-defined Metric Alerts - CPU, memory, disk, latency, timeouts, and Cassandra/Kafka-specific metrics
  • 📝 20+ Log Alert Rules - Node failures, SSL issues, repairs, disk space, and error patterns
  • 🔔 Multi-Channel Alert Routing - Slack, PagerDuty, OpsGenie, ServiceNow, Microsoft Teams
  • 💾 Automated Backup Schedules - S3, Azure Blob, SFTP with retention policies
  • 🏥 Service Health Checks - TCP ports, shell scripts, SSL certificates, system maintenance
  • 🔧 Advanced Features - Adaptive repair, commit log archiving, agent tolerance settings

Features

🎯 Key Capabilities

  • Multi-Cluster Support - Configure all clusters in your organization or target specific ones
  • Hierarchical Configuration - Organization-wide defaults with cluster-specific overrides
  • Idempotent Operations - Safe to run multiple times
  • YAML Validation - Built-in schema validation for all configurations
  • Enterprise Integrations - Native support for major alerting and incident management platforms
  • Cross-Platform - Support for both Apache Cassandra and Apache Kafka

📋 Pre-Configured Monitoring

Metric Alerts

System & Performance

  • CPU usage (warning: 90%, critical: 99%)
  • Memory utilization (warning: 85%, critical: 95%)
  • Disk usage per mount point (warning: 75%, critical: 90%)
  • IO wait times (warning: 20%, critical: 50%)
  • Garbage collection duration (warning: 5s, critical: 10s)
  • NTP time drift monitoring

Cassandra-Specific

  • Coordinator read/write latencies (per consistency level)
  • Read/write timeouts and unavailables
  • Dropped messages (mutations, reads, hints)
  • Thread pool congestion (blocked tasks, pending requests)
  • Compaction backlogs
  • Tombstone scanning thresholds
  • SSTable counts and bloom filter efficiency
  • Hint creation rates
  • Cache hit rates

Kafka-Specific

  • Broker availability
  • Controller status
  • Network processor utilization
  • Request queue sizes
  • Offline/under-replicated partitions
  • Authentication failures
  • Metadata errors

Log Alerts

  • Node DOWN events
  • TLS/SSL handshake failures
  • Gossip message drops
  • Stream session failures
  • SSTable corruption
  • Disk space issues
  • JVM memory problems
  • Large partition warnings
  • Repair monitoring
  • Jemalloc loading issues

Service Checks

  • Schema agreement validation
  • Node status monitoring
  • SSL certificate expiration
  • System reboot requirements
  • AWS maintenance events
  • CQL connectivity tests
  • Custom shell script checks

Quick Start

# 1. Clone the repository
git clone https://github.com/axonops/axonops-config-automation.git
cd axonops-config-automation

# 2. Set your environment variables
export AXONOPS_ORG='your-organization'
export AXONOPS_TOKEN='your-api-token'

# 3. Run the playbooks
make endpoints          # Configure alert integrations
make routes            # Set up alert routing
make metrics-alerts    # Create metric-based alerts
make log-alerts        # Create log-based alerts
make service-checks    # Configure health checks
make backups          # Set up backup schedules

Installation

Prerequisites

  • Ansible >= 2.10
  • Python >= 3.8
  • make (or use the provided make.sh script)

System-Specific Installation

RedHat/RockyLinux (8+)

sudo dnf -y install epel-release
sudo dnf -y install ansible make

Debian/Ubuntu

sudo apt update
sudo apt -y install ansible make

Using Virtualenv

virtualenv ~/py-axonops
source ~/py-axonops/bin/activate
pip3 install -r requirements.txt

Using Pipenv (Recommended)

pipenv install
export PIPENV=true

Environment Configuration

Configure your environment using the provided template:

# Copy and edit the environment template
cp export_tokens.sh export_tokens.sh.local
vim export_tokens.sh.local

# Source your configuration
source ./export_tokens.sh.local

Required Variables

# Organization name (mandatory)
export AXONOPS_ORG='example'

# For AxonOps SaaS
export AXONOPS_TOKEN='your-api-token'

# For AxonOps On-Premise
export AXONOPS_URL='https://your-axonops-instance.com'
export AXONOPS_USERNAME='your-username'
export AXONOPS_PASSWORD='your-password'

Configuration

Directory Structure

config/
├── YOUR_ORG_NAME/                      # Organization-level configs
│   ├── alert_endpoints.yml             # Alert integrations (Slack, PagerDuty, etc.)
│   ├── metric_alert_rules.yml          # Default metric alerts for all clusters
│   ├── log_alert_rules.yml             # Default log alerts for all clusters
│   ├── service_checks.yml              # Default service checks for all clusters
│   │
│   └── YOUR_CLUSTER_NAME/              # Cluster-specific overrides
│       ├── metric_alert_rules.yml      # Additional/override metric alerts
│       ├── log_alert_rules.yml         # Additional/override log alerts
│       ├── service_checks.yml          # Additional/override service checks
│       ├── backups.yml                 # Backup configurations
│       └── kafka_metrics_alert_rules.yml  # Kafka-specific alerts

Configuration Hierarchy

  1. Organization Level: Configurations in config/ORG_NAME/ apply to all clusters
  2. Cluster Level: Configurations in config/ORG_NAME/CLUSTER_NAME/ override or extend organization settings
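
For example, an organization-wide rule can be tightened for a single cluster by redefining a rule with the same name in the cluster directory. The sketch below reuses the metric-alert fields shown in the Examples section; the chart name and thresholds are illustrative placeholders, and the assumption that a same-named cluster-level rule takes precedence over the organization default should be verified against your setup.

# config/YOUR_ORG/metric_alert_rules.yml -- organization default for all clusters
axonops_alert_rules:
  - name: Disk usage per host
    dashboard: System
    chart: Disk usage per host        # illustrative chart name
    operator: '>='
    critical_value: 90
    warning_value: 75
    duration: 30m
    description: High disk usage
    present: true

# config/YOUR_ORG/YOUR_CLUSTER/metric_alert_rules.yml -- stricter thresholds for this cluster only
axonops_alert_rules:
  - name: Disk usage per host
    dashboard: System
    chart: Disk usage per host        # illustrative chart name
    operator: '>='
    critical_value: 80
    warning_value: 65
    duration: 30m
    description: High disk usage (stricter threshold for this cluster)
    present: true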

Usage

Available Commands

make help              # Show all available commands
make validate          # Validate YAML configurations
make endpoints         # Configure alert integrations
make routes           # Set up alert routing rules
make metrics-alerts   # Create metric-based alerts
make log-alerts       # Create log-based alerts  
make service-checks   # Configure service health checks
make backups          # Set up backup schedules
make check            # Run pre-commit tests

Running Playbooks

You can run playbooks using either environment variables or command-line overrides:

# Using environment variables (after sourcing export_tokens.sh)
make metrics-alerts

# Using command-line overrides
make metrics-alerts AXONOPS_ORG=myorg AXONOPS_CLUSTER=prod-cluster

# Target all clusters (omit AXONOPS_CLUSTER)
make metrics-alerts AXONOPS_ORG=myorg

Validation

Always validate your configurations before applying:

make validate

This will check all YAML files against their schemas and report any errors.

Examples

Alert Endpoints

Slack Integration
# config/YOUR_ORG/alert_endpoints.yml
slack:
  - name: ops-team-alerts
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    present: true
  
  - name: dev-team-alerts
    webhook_url: https://hooks.slack.com/services/YOUR/OTHER/URL
    present: true

PagerDuty Integration
# config/YOUR_ORG/alert_endpoints.yml
pagerduty:
  - name: critical-incidents
    integration_key: YOUR-PAGERDUTY-INTEGRATION-KEY
    present: true
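
Microsoft Teams Integration (illustrative)
Microsoft Teams is listed among the supported channels. The snippet below is only a sketch that assumes Teams endpoints follow the same webhook-style layout as Slack; the teams key and field names are assumptions, so check the schema shipped with this repository before using it.
# config/YOUR_ORG/alert_endpoints.yml
# Assumed layout only -- verify the key and field names against the repository schema
teams:
  - name: ops-team-teams-channel
    webhook_url: https://outlook.office.com/webhook/YOUR/TEAMS/WEBHOOK
    present: true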

Metric Alerts

CPU Usage Alert
# config/YOUR_ORG/metric_alert_rules.yml
axonops_alert_rules:
  - name: CPU usage per host
    dashboard: System
    chart: CPU usage per host
    operator: '>='
    critical_value: 99
    warning_value: 90
    duration: 1h
    description: Detected High CPU usage
    present: true

Cassandra Latency Alert
# config/YOUR_ORG/metric_alert_rules.yml
axonops_alert_rules:
  - name: Read latency critical
    dashboard: Coordinator
    chart: Coordinator Read Latency - LOCAL_QUORUM 99thPercentile
    operator: '>='
    critical_value: 2000000  # 2 seconds in microseconds
    warning_value: 1000000   # 1 second in microseconds
    duration: 15m
    description: High read latency detected
    present: true
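
Kafka Broker Alert (illustrative)
Kafka clusters get their own rule file, kafka_metrics_alert_rules.yml (see the directory layout above). The sketch below assumes the same rule fields as the Cassandra metric alerts; the dashboard and chart names are placeholders and should be matched to the charts in your AxonOps Kafka dashboards.
# config/YOUR_ORG/YOUR_CLUSTER/kafka_metrics_alert_rules.yml
axonops_alert_rules:
  - name: Under-replicated partitions
    dashboard: Kafka Overview            # placeholder dashboard name
    chart: Under Replicated Partitions   # placeholder chart name
    operator: '>='
    critical_value: 5
    warning_value: 1
    duration: 15m
    description: Kafka partitions are under-replicated
    present: true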

Log Alerts

Node Down Detection
# config/YOUR_ORG/log_alert_rules.yml
axonops_log_alert_rules:
  - name: Node Down
    content: "is now DOWN"
    source: "/var/log/cassandra/system.log"
    warning_value: 1
    critical_value: 5
    duration: 5m
    description: "Cassandra node marked as DOWN"
    level: error,warning
    present: true

Service Checks

CQL Port Check
# config/YOUR_ORG/service_checks.yml
tcp_checks:
  - name: cql_client_port
    target: "{{.comp_listen_address}}:{{.comp_native_transport_port}}"
    interval: 3m
    timeout: 1m
    present: true

Custom Shell Script
# config/YOUR_ORG/service_checks.yml
shell_checks:
  - name: Check schema agreement
    interval: 5m
    timeout: 1m
    present: true
    command: |
      #!/bin/bash
      SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
      source $SCRIPT_PATH/common.sh
      schemas=$(nodetool gossipinfo | grep -i schema | awk '{print $2}' | sort | uniq | wc -l)
      if [[ $schemas -gt 1 ]]; then
        echo "CRITICAL - Multiple schema versions detected: $schemas"
        exit 2
      fi
      echo "OK - Schema agreement confirmed"
      exit 0

Backups

S3 Backup Schedule
# config/YOUR_ORG/YOUR_CLUSTER/backups.yml
backups:
  - name: Daily S3 backup
    remote_type: s3
    datacenters: 
      - dc1
    remote_path: my-backup-bucket/cassandra-backups
    local_retention: 10d
    remote_retention: 60d
    tag: "daily-backup"
    timeout: 10h
    remote: true
    schedule: true
    schedule_expr: "0 1 * * *"  # 1 AM daily
    s3_region: us-east-1
    s3_storage_class: STANDARD_IA
    present: true

Azure Blob Snapshot
# config/YOUR_ORG/YOUR_CLUSTER/backups.yml
backups:
  - name: Critical table snapshot
    remote_type: azure
    datacenters:
      - dc1
    remote_path: backups-container/cassandra
    tables:
      - 'critical_keyspace.important_table'
    local_retention: 7d
    remote_retention: 30d
    tag: "critical-data"
    timeout: 2h
    remote: true
    schedule: false  # Immediate snapshot
    azure_account: mystorageaccount
    azure_use_msi: true
    present: true

Alert Routing

Route Configuration
# config/YOUR_ORG/alert_routes.yml
axonops_alert_routes:
  # Send all critical/error to PagerDuty
  - name: critical-to-pagerduty
    endpoint: critical-incidents
    endpoint_type: pagerduty
    severities:
      - error
      - critical
    override: false
    present: true
  
  # Send warnings to Slack
  - name: warnings-to-slack
    endpoint: ops-team-alerts
    endpoint_type: slack
    severities:
      - warning
    override: false
    present: true
  
  # Route backup alerts to dedicated channel
  - name: backup-alerts
    endpoint: backup-notifications
    endpoint_type: slack
    tags:
      - backup
    severities:
      - info
      - warning
      - error
      - critical
    override: true  # Override default routing
    present: true

Advanced Configuration

Using the CLI Tool

In addition to Ansible playbooks, a Python CLI is available for specific operations:

# Configure adaptive repair
python cli/axonops.py adaptive-repair \
  --cluster my-cluster \
  --enabled true \
  --percentage 20

# View current settings
python cli/axonops.py adaptive-repair \
  --cluster my-cluster \
  --show

Custom Ansible Variables

You can override any Ansible variable:

# Custom API timeout
make metrics-alerts ANSIBLE_EXTRA_VARS="api_timeout=60"

# Dry run mode
make metrics-alerts ANSIBLE_EXTRA_VARS="check_mode=true"

Troubleshooting

Common Issues

Authentication Errors

  • Verify your API token has DBA-level access or above
  • Check token expiration
  • For on-premise, ensure URL includes protocol (https://)

Configuration Not Applied

  • Run make validate to check YAML syntax
  • Ensure present: true is set for items you want to create
  • Check that cluster names match exactly (case-sensitive)

Module Import Errors

  • Ensure you're using Python 3.8+
  • Install dependencies: pip install -r requirements.txt
  • For pipenv users: ensure PIPENV=true is exported

Best Practices

  1. Start with Organization Defaults: Define common alerts at the org level
  2. Use Cluster Overrides Sparingly: Only for cluster-specific requirements
  3. Validate Before Applying: Always run make validate first
  4. Version Control: Commit your config/ directory to track changes
  5. Test in Non-Production: Apply to test clusters before production
  6. Regular Reviews: Periodically review and update alert thresholds

Support

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Additional Resources


This project may contain trademarks or logos for projects, products, or services. Any use of third-party trademarks or logos are subject to those third-party's policies. AxonOps is a registered trademark of AxonOps Limited. Apache, Apache Cassandra, Cassandra, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Elasticsearch is a trademark of Elasticsearch B.V., registered in the U.S. and in other countries. Docker is a trademark or registered trademark of Docker, Inc. in the United States and/or other countries.
