Commit d80cce3

mtopolnik and goodroot authored
Document metrics that help detect when WAL apply lag is increasing (#191)
Adds three new metrics to the table of all metrics:

1. counter: `questdb_wal_apply_seq_txn_total`
2. counter: `questdb_wal_apply_writer_txn_total`
3. gauge: `questdb_suspended_tables`

Co-authored-by: goodroot <[email protected]>
1 parent 2474194 commit d80cce3

4 files changed (+93, -7 lines)

documentation/operations/logging-metrics.md

Lines changed: 10 additions & 6 deletions
@@ -6,7 +6,7 @@ description: Configure and understand QuestDB logging and metrics, including log
import { ConfigTable } from "@theme/ConfigTable"
import httpMinimalConfig from "./_http-minimal.config.json"

This page outlines logging in QuestDB. It covers how to configure logs via `log.conf` and expose metrics via Prometheus.

- [Logging](/docs/operations/logging-metrics/#logging)
- [Metrics](/docs/operations/logging-metrics/#metrics)
@@ -48,10 +48,10 @@ QuestDB provides the following types of log information:
For more information, see the
[QuestDB source code](https://github.com/questdb/questdb/blob/master/core/src/main/java/io/questdb/log/LogLevel.java).

### Example log messages

Advisory:

```
2023-02-24T14:59:45.076113Z A server-main Config:
2023-02-24T14:59:45.076130Z A server-main - http.enabled : true
@@ -60,23 +60,27 @@ Advisory:
```

Critical:

```
2022-08-08T11:15:13.040767Z C i.q.c.p.WriterPool could not open [table=`sys.text_import_log`, thread=1, ex=could not open read-write [file=/opt/homebrew/var/questdb/db/sys.text_import_log/_todo_], errno=13]
```

Error:

```
2023-02-24T14:59:45.059012Z I i.q.c.t.t.InputFormatConfiguration loading input format config [resource=/text_loader.json]
2023-03-20T08:38:17.076744Z E i.q.c.l.u.AbstractLineProtoUdpReceiver could not set receive buffer size [fd=140, size=8388608, errno=55]
```

Info:

```
2020-04-15T16:42:32.879970Z I i.q.c.TableReader new transaction [txn=2, transientRowCount=1, fixedRowCount=1, maxTimestamp=1585755801000000, attempts=0]
2020-04-15T16:42:32.880051Z I i.q.g.FunctionParser call to_timestamp('2020-05-01:15:43:21','yyyy-MM-dd:HH:mm:ss') -> to_timestamp(Ss)
```

Debug:

```
2023-03-31T11:47:05.723715Z D i.q.g.FunctionParser call cast(investmentMill,INT) -> cast(Li)
2023-03-31T11:47:05.723729Z D i.q.g.FunctionParser call rnd_symbol(4,4,4,2) -> rnd_symbol(iiii)
@@ -206,10 +210,10 @@ The following configuration options can be set in your `server.conf`:

On systems with
[8 Cores and less](/docs/operations/capacity-planning/#cpu-cores), contention
-for threads might increase the latency of health check service responses. If you use
-a load balancer thinks the QuestDB service is dead with nothing apparent in the
-QuestDB logs, you may need to configure a dedicated thread pool for the health
-check service. To do so, increase `http.min.worker.count` to `1`.
+for threads might increase the latency of health check service responses. If you
+use a load balancer, and it thinks the QuestDB service is dead with nothing
+apparent in the QuestDB logs, you may need to configure a dedicated thread pool
+for the health check service. To do so, increase `http.min.worker.count` to `1`.

:::

documentation/operations/monitoring-alerting.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
title: Monitoring and alerting
description: Shows you how to monitor your database for potential issues, and how to raise alerts
---

There are many variables to consider when monitoring an active production database. This document is designed to be a helpful starting point, and we plan to expand it over time. If you have any recommendations, feel free to [create an issue](https://github.com/questdb/documentation/issues) or open a PR on GitHub.

## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.
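
If you run QuestDB under an orchestrator or behind a load balancer, this same endpoint can back the platform's health probe. A minimal sketch, assuming a Kubernetes deployment (the probe values are illustrative, not prescribed by the QuestDB docs):

```yaml
# Hypothetical Kubernetes container spec fragment; port 9003 is the default
# min HTTP server port, the same endpoint queried with curl above.
livenessProbe:
  httpGet:
    path: /
    port: 9003
  initialDelaySeconds: 30   # give QuestDB time to start before the first probe
  periodSeconds: 10         # probe every 10 seconds
  failureThreshold: 3       # mark the container unhealthy after 3 failed probes
```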

If you allocate 8 vCPUs/cores or less to QuestDB, the HTTP server thread may not
be able to get enough CPU time to respond in a timely manner. Your load balancer
may flag the instance as dead. In such a case, create an isolated thread pool
just for the health check service (the `min` HTTP server), by setting this
configuration option:

```text
http.min.worker.count=1
```

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).
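
On the Alertmanager side, these alerts are routed like any other alert. A minimal, hypothetical `alertmanager.yml` sketch (the receiver name and Slack webhook are placeholders, not part of the QuestDB docs):

```yaml
# Route every alert to a single receiver; swap slack_configs for whatever
# notification channel your team uses (PagerDuty, email, webhook, ...).
route:
  receiver: questdb-critical
receivers:
  - name: questdb-critical
    slack_configs:
      - channel: "#questdb-alerts"
        api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
```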

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can set
up an alert that fires whenever this gauge shows an above-zero value.
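
A minimal Prometheus alerting-rule sketch for this, assuming you already scrape QuestDB's metrics endpoint (the rule name, `for` window, and labels are illustrative):

```yaml
groups:
  - name: questdb-tables
    rules:
      - alert: QuestDBTableSuspended
        # questdb_suspended_tables is a gauge; any value above zero means at
        # least one table has stopped applying new transactions.
        expr: questdb_suspended_tables > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "QuestDB instance {{ $labels.instance }} has suspended tables"
```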

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log. This step is optimized for throughput and usually isn't the
bottleneck. The next step is inserting the data into the table, and this can
take longer if the data is out of order, or touches different time partitions.
You can monitor the overall performance of this process of applying the WAL
data to tables. QuestDB exposes two Prometheus counters for this:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these numbers grow continuously as data is ingested. When
they are equal, all WAL data has been applied to the tables. While data is being
actively ingested, the second counter will lag behind the first one. A steady
difference between them is a sign of a healthy rate of WAL application, with the
database keeping up with demand. However, if the difference continuously
rises, this indicates that either a table has become suspended and WAL can't be
applied to it, or QuestDB is not able to keep up with the ingestion rate. All of
the data is still safely stored, but a growing portion of it is not yet visible
to queries.

You can create an alert that detects a steadily increasing difference between
these two numbers. It won't tell you which table is experiencing issues, but it
is a low-impact way to detect there's a problem which needs further diagnosing.
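
One way to express this is a rule that fires only when the gap between the two counters has kept growing for a sustained period. A sketch, with illustrative window sizes that you would tune to your ingestion pattern:

```yaml
groups:
  - name: questdb-wal
    rules:
      - alert: QuestDBWalApplyLagGrowing
        # Lag = sequencer transactions minus transactions applied to tables.
        # deriv() over a 10-minute subquery estimates its slope; alert only if
        # the slope has stayed positive (lag growing) for 30 minutes.
        expr: >
          deriv(
            (questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total)[10m:1m]
          ) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "WAL apply lag is growing on {{ $labels.instance }}"
```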

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.

Read more on query tracing on the
[Concepts page](/docs/concept/query-tracing/).

documentation/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -470,6 +470,7 @@ module.exports = {
]
},
"operations/logging-metrics",
+"operations/monitoring-alerting",
"operations/data-retention",
"operations/design-for-performance",
"operations/updating-data",

documentation/third-party-tools/prometheus.md

Lines changed: 1 addition & 1 deletion
@@ -278,7 +278,7 @@ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' aler
To run QuestDB and point it towards Alertmanager for alerting, first create a
file `./conf/log.conf` with the following contents. `172.17.0.2` in this case is
the IP address of the docker container for alertmanager that was discovered by
-running the `docker inspect ` command above.
+running the `docker inspect` command above.

```ini title="./conf/log.conf"
# Which writers to enable
