---
title: Monitoring and alerting
description: Shows you how to monitor your database for potential issues and how to raise alerts
---

There are many variables to consider when monitoring an active production database. This document is designed to be a starting point, and we plan to expand it over time. If you have any recommendations, feel free to [create an issue](https://github.com/questdb/documentation/issues) or a PR on GitHub.

## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not get enough CPU time to respond in a timely manner, and your load balancer
may flag the instance as dead. In such a case, create an isolated thread pool
just for the health check service (the `min` HTTP server) by setting this
configuration option:

```text
http.min.worker.count=1
```
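
The `min` server is also where QuestDB exposes the Prometheus metrics used in
the rest of this guide, provided that `metrics.enabled=true` is set in
`server.conf`. As a quick sanity check, assuming the default port and path:

```shell
# Assumes metrics.enabled=true and the default min HTTP server port 9003.
curl http://127.0.0.1:9003/metrics | head
```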

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```
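
If your Alertmanager does not run on the same host as QuestDB, you can point
the writer at one or more specific instances. The sketch below assumes the
`w.alert.alertTargets` property described on the logging page; the host names
are placeholders:

```ini title="log.conf"
# Comma-separated host:port pairs of Alertmanager instances (placeholders).
w.alert.alertTargets=alertmanager-1:9093,alertmanager-2:9093
```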

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge has a value above zero.
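
As a minimal sketch, assuming Prometheus already scrapes the `/metrics`
endpoint shown earlier, an alerting rule for this gauge could look like the
following; the group name, alert name, and duration are placeholders to adjust:

```yaml
# Hypothetical Prometheus alerting rule; tune names, labels, and durations.
groups:
  - name: questdb
    rules:
      - alert: QuestDBSuspendedTables
        expr: questdb_suspended_tables > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "One or more QuestDB tables are suspended"
```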

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
the bottleneck. The next step is inserting the data into the table, and this
can take longer if the data is out of order or touches multiple time
partitions. You can monitor the overall performance of this process of
applying the WAL data to tables. QuestDB exposes two Prometheus counters for
this:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these counters grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being
actively ingested, the second counter will lag behind the first one. A steady
difference between them is a sign of a healthy rate of WAL application: the
database is keeping up with the demand. However, if the difference keeps
rising, either a table has become suspended and its WAL data can't be applied,
or QuestDB is not able to keep up with the ingestion rate. All of the data is
still safely stored, but a growing portion of it is not yet visible to
queries.

You can create an alert that detects a steadily increasing difference between
these two counters. It won't tell you which table is experiencing issues, but
it is a low-impact way to detect a problem that needs further diagnosis.
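
One way to express a "steadily increasing" difference in the same hypothetical
rule group as above is to compare the current lag with the lag a few minutes
earlier and only fire when it stays larger for a while; the 10m/15m windows
are arbitrary starting points to tune for your workload:

```yaml
      # Hypothetical rule: fires when the WAL apply backlog keeps growing.
      - alert: QuestDBWalApplyLagGrowing
        expr: >
          (questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total)
          >
          (questdb_wal_apply_seq_txn_total offset 10m - questdb_wal_apply_writer_txn_total offset 10m)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "QuestDB WAL apply lag has been growing"
```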

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.

Read more about query tracing on the
[Concepts page](/docs/concept/query-tracing/).
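
As a sketch of such a check, assuming the trace table records an execution
time in microseconds and a designated timestamp (the column names below are
assumptions; verify them against the query tracing page for your version), you
could list recent statements that took longer than one second:

```questdb-sql
-- Hypothetical column names; check the query tracing docs for the exact schema.
SELECT ts, query_text, execution_micros
FROM _query_trace
WHERE ts > dateadd('h', -1, now())  -- only the last hour
  AND execution_micros > 1000000    -- longer than one second
ORDER BY execution_micros DESC;
```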