---
title: Monitoring and alerting
description: Shows you how to monitor your database for potential issues and how to raise alerts
---

There are many variables to consider when monitoring an active production database. This document is designed to be a starting point, and we plan to expand it over time. If you have any recommendations, feel free to [create an issue](https://github.com/questdb/documentation/issues) or a PR on GitHub.

## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not get enough CPU time to respond in a timely manner, and your load balancer
may flag the instance as dead. In such a case, create an isolated thread pool
just for the health check service (the `min` HTTP server) by setting this
configuration option:

```text
http.min.worker.count=1
```
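
The `min` server is also where QuestDB exposes the Prometheus metrics used in
the rest of this guide, provided that `metrics.enabled=true` is set in
`server.conf`. As a quick sanity check, assuming the default port and path:

```shell
# Assumes metrics.enabled=true and the default min HTTP server port 9003.
curl http://127.0.0.1:9003/metrics | head
```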

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```
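
If your Alertmanager does not run on the same host as QuestDB, you can point
the writer at one or more specific instances. The sketch below assumes the
`w.alert.alertTargets` property described on the logging page; the host names
are placeholders:

```ini title="log.conf"
# Comma-separated host:port pairs of Alertmanager instances (placeholders).
w.alert.alertTargets=alertmanager-1:9093,alertmanager-2:9093
```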

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge has a value above zero.
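
As a minimal sketch, assuming Prometheus already scrapes the `/metrics`
endpoint shown earlier, an alerting rule for this gauge could look like the
following; the group name, alert name, and duration are placeholders to adjust:

```yaml
# Hypothetical Prometheus alerting rule; tune names, labels, and durations.
groups:
  - name: questdb
    rules:
      - alert: QuestDBSuspendedTables
        expr: questdb_suspended_tables > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "One or more QuestDB tables are suspended"
```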

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
the bottleneck. The next step is inserting the data into the table, and this
can take longer if the data is out of order or touches multiple time
partitions. You can monitor the overall performance of this process of
applying the WAL data to tables. QuestDB exposes two Prometheus counters for
this:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these counters grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being
actively ingested, the second counter will lag behind the first one. A steady
difference between them is a sign of a healthy rate of WAL application: the
database is keeping up with the demand. However, if the difference keeps
rising, either a table has become suspended and its WAL data can't be applied,
or QuestDB is not able to keep up with the ingestion rate. All of the data is
still safely stored, but a growing portion of it is not yet visible to
queries.

You can create an alert that detects a steadily increasing difference between
these two counters. It won't tell you which table is experiencing issues, but
it is a low-impact way to detect a problem that needs further diagnosis.
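
One way to express a "steadily increasing" difference in the same hypothetical
rule group as above is to compare the current lag with the lag a few minutes
earlier and only fire when it stays larger for a while; the 10m/15m windows
are arbitrary starting points to tune for your workload:

```yaml
      # Hypothetical rule: fires when the WAL apply backlog keeps growing.
      - alert: QuestDBWalApplyLagGrowing
        expr: >
          (questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total)
          >
          (questdb_wal_apply_seq_txn_total offset 10m - questdb_wal_apply_writer_txn_total offset 10m)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "QuestDB WAL apply lag has been growing"
```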

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.

Read more about query tracing on the
[Concepts page](/docs/concept/query-tracing/).
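
As a sketch of such a check, assuming the trace table records an execution
time in microseconds and a designated timestamp (the column names below are
assumptions; verify them against the query tracing page for your version), you
could list recent statements that took longer than one second:

```questdb-sql
-- Hypothetical column names; check the query tracing docs for the exact schema.
SELECT ts, query_text, execution_micros
FROM _query_trace
WHERE ts > dateadd('h', -1, now())  -- only the last hour
  AND execution_micros > 1000000    -- longer than one second
ORDER BY execution_micros DESC;
```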