diff --git a/documentation/operations/logging-metrics.md b/documentation/operations/logging-metrics.md
index 07ba5f56..cc4162af 100644
--- a/documentation/operations/logging-metrics.md
+++ b/documentation/operations/logging-metrics.md
@@ -6,7 +6,7 @@ description: Configure and understand QuestDB logging and metrics, including log
 import { ConfigTable } from "@theme/ConfigTable"
 import httpMinimalConfig from "./_http-minimal.config.json"
 
-This page outlines logging in QuestDB. It covers how to configure logs via `log.conf` and expose metrics via Prometheus. 
+This page outlines logging in QuestDB. It covers how to configure logs via `log.conf` and expose metrics via Prometheus.
 
 - [Logging](/docs/operations/logging-metrics/#logging)
 - [Metrics](/docs/operations/logging-metrics/#metrics)
@@ -48,10 +48,10 @@ QuestDB provides the following types of log information:
 For more information, see the
 [QuestDB source code](https://github.com/questdb/questdb/blob/master/core/src/main/java/io/questdb/log/LogLevel.java).
 
-
 ### Example log messages
 
 Advisory:
+
 ```
 2023-02-24T14:59:45.076113Z A server-main Config:
 2023-02-24T14:59:45.076130Z A server-main - http.enabled : true
@@ -60,23 +60,27 @@ Advisory:
 ```
 
 Critical:
+
 ```
 2022-08-08T11:15:13.040767Z C i.q.c.p.WriterPool could not open [table=`sys.text_import_log`, thread=1, ex=could not open read-write [file=/opt/homebrew/var/questdb/db/sys.text_import_log/_todo_], errno=13]
 ```
 
 Error:
+
 ```
 2023-02-24T14:59:45.059012Z I i.q.c.t.t.InputFormatConfiguration loading input format config [resource=/text_loader.json]
 2023-03-20T08:38:17.076744Z E i.q.c.l.u.AbstractLineProtoUdpReceiver could not set receive buffer size [fd=140, size=8388608, errno=55]
 ```
 
 Info:
+
 ```
 2020-04-15T16:42:32.879970Z I i.q.c.TableReader new transaction [txn=2, transientRowCount=1, fixedRowCount=1, maxTimestamp=1585755801000000, attempts=0]
 2020-04-15T16:42:32.880051Z I i.q.g.FunctionParser call to_timestamp('2020-05-01:15:43:21','yyyy-MM-dd:HH:mm:ss') -> to_timestamp(Ss)
 ```
 
 Debug:
+
 ```
 2023-03-31T11:47:05.723715Z D i.q.g.FunctionParser call cast(investmentMill,INT) -> cast(Li)
 2023-03-31T11:47:05.723729Z D i.q.g.FunctionParser call rnd_symbol(4,4,4,2) -> rnd_symbol(iiii)
@@ -206,10 +210,10 @@ The following configuration options can be set in your `server.conf`:
 
 On systems with
 [8 Cores and less](/docs/operations/capacity-planning/#cpu-cores), contention
-for threads might increase the latency of health check service responses. If you use
-a load balancer thinks the QuestDB service is dead with nothing apparent in the
-QuestDB logs, you may need to configure a dedicated thread pool for the health
-check service. To do so, increase `http.min.worker.count` to `1`.
+for threads might increase the latency of health check service responses. If you
+use a load balancer, and it thinks the QuestDB service is dead with nothing
+apparent in the QuestDB logs, you may need to configure a dedicated thread pool
+for the health check service. To do so, increase `http.min.worker.count` to `1`.
 
 :::
 
diff --git a/documentation/operations/monitoring-alerting.md b/documentation/operations/monitoring-alerting.md
new file mode 100644
index 00000000..da7f4762
--- /dev/null
+++ b/documentation/operations/monitoring-alerting.md
@@ -0,0 +1,81 @@
+---
+title: Monitoring and alerting
+description: Shows you how to monitor your database for potential issues, and how to raise alerts
+---
+
+There are many variables to consider when monitoring an active production database.
+This document is designed to be a helpful starting point. We plan to expand
+this guide over time. If you have any recommendations, feel free to
+[create an issue](https://github.com/questdb/documentation/issues) or a PR on
+GitHub.
+
+## Basic health check
+
+QuestDB comes with an out-of-the-box health check HTTP endpoint:
+
+```shell title="GET health status of local instance"
+curl -v http://127.0.0.1:9003
+```
+
+Getting an OK response means the QuestDB process is up and running. This method
+provides no further information.
+
+If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
+not be able to get enough CPU time to respond in a timely manner. Your load
+balancer may then flag the instance as dead. In such a case, create an isolated
+thread pool just for the health check service (the `min` HTTP server) by
+setting this configuration option:
+
+```text
+http.min.worker.count=1
+```
+
+## Alert on critical errors
+
+QuestDB includes a log writer that sends any message logged at the critical
+level to Prometheus Alertmanager over a TCP/IP socket. To configure this
+writer, add it to the `writers` config alongside other log writers. This is the
+basic setup:
+
+```ini title="log.conf"
+writers=stdout,alert
+w.alert.class=io.questdb.log.LogAlertSocketWriter
+w.alert.level=CRITICAL
+```
+
+For more details, see the
+[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).
+
+## Detect suspended tables
+
+QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
+set up an alert that fires whenever this gauge shows a value above zero; an
+example rule is sketched at the end of this page.
+
+## Detect slow ingestion
+
+QuestDB ingests data in two stages: first it records everything to the
+Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
+the bottleneck. The next step is inserting the data into the table, and this
+can take longer if the data is out of order or touches different time
+partitions. You can monitor the overall performance of this process of
+applying WAL data to tables. QuestDB exposes two Prometheus counters for this:
+
+1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
+2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables
+
+Both of these numbers grow continuously as data is ingested. When they are
+equal, all WAL data has been applied to the tables. While data is being
+actively ingested, the second counter will lag behind the first one. A steady
+difference between them is a sign of a healthy rate of WAL application, with
+the database keeping up with demand. However, if the difference continuously
+rises, this indicates that either a table has become suspended and the WAL
+can't be applied to it, or QuestDB is not able to keep up with the ingestion
+rate. All of the data is still safely stored, but a growing portion of it is
+not yet visible to queries.
+
+You can create an alert that detects a steadily increasing difference between
+these two numbers. It won't tell you which table is experiencing issues, but it
+is a low-impact way to detect that there is a problem that needs further
+diagnosis. An example recording rule and alert are sketched at the end of this
+page.
+
+## Detect slow queries
+
+QuestDB maintains a table called `_query_trace`, which records each executed
+query and the time it took. You can query this table to find slow queries, as
+shown in the sketch below.
+
+Read more about query tracing on the
+[Concepts page](/docs/concept/query-tracing/).
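+
+As a rough sketch, the query below lists statements from the last hour that
+took longer than one second. It assumes the `_query_trace` columns `ts`,
+`query_text`, and `execution_micros`; check the query tracing page for the
+exact schema, and tune the time window and threshold to your workload:
+
+```questdb-sql title="Recent queries slower than one second"
+-- Window and threshold are illustrative starting points.
+SELECT ts, query_text, execution_micros
+FROM _query_trace
+WHERE ts > dateadd('h', -1, now())
+  AND execution_micros > 1000000
+ORDER BY execution_micros DESC;
+```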
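+
+Returning to the Prometheus-based checks above: for the suspended-table gauge,
+a minimal alerting rule could look like the sketch below. The group name, alert
+name, `for:` duration, and labels are illustrative assumptions, not
+QuestDB-provided defaults:
+
+```yaml title="questdb-alerts.yml (Prometheus rule file)"
+groups:
+  - name: questdb-tables
+    rules:
+      - alert: QuestDBTableSuspended
+        # The gauge stays above zero for as long as any table is suspended.
+        expr: questdb_suspended_tables > 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "One or more QuestDB tables are suspended"
+```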
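+
+For the WAL apply lag, one low-effort approach is a recording rule for the
+difference between the two counters, plus an alert that fires while that lag
+keeps growing. The 10-minute comparison window and 30-minute `for:` duration
+are assumptions to adjust for your ingestion pattern:
+
+```yaml title="questdb-wal-alerts.yml (Prometheus rule file)"
+groups:
+  - name: questdb-wal
+    rules:
+      # Transactions committed to the WAL minus transactions applied to
+      # tables; zero or a steady value is healthy.
+      - record: questdb:wal_apply_lag
+        expr: questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total
+      - alert: QuestDBWalApplyLagGrowing
+        # Fires when the lag has stayed above its value from 10 minutes
+        # earlier for 30 minutes straight, i.e. tables keep falling behind.
+        expr: questdb:wal_apply_lag > questdb:wal_apply_lag offset 10m
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "QuestDB WAL apply lag is steadily increasing"
+```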
diff --git a/documentation/sidebars.js b/documentation/sidebars.js
index 43e4e087..5cf993eb 100644
--- a/documentation/sidebars.js
+++ b/documentation/sidebars.js
@@ -470,6 +470,7 @@ module.exports = {
       ]
     },
     "operations/logging-metrics",
+    "operations/monitoring-alerting",
     "operations/data-retention",
     "operations/design-for-performance",
     "operations/updating-data",
diff --git a/documentation/third-party-tools/prometheus.md b/documentation/third-party-tools/prometheus.md
index ae87592c..868e8e08 100644
--- a/documentation/third-party-tools/prometheus.md
+++ b/documentation/third-party-tools/prometheus.md
@@ -278,7 +278,7 @@ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' aler
 To run QuestDB and point it towards Alertmanager for alerting, first create a
 file `./conf/log.conf` with the following contents. `172.17.0.2` in this case
 is the IP address of the docker container for alertmanager that was discovered by
-running the `docker inspect ` command above.
+running the `docker inspect` command above.
 
 ```ini title="./conf/log.conf"
 # Which writers to enable