Add observability metrics to OpenWEC server #189

Open
vruello opened this issue Nov 1, 2024 · 18 comments
Labels
enhancement New feature or request

Comments

@vruello
Contributor

vruello commented Nov 1, 2024

Observability metrics need to be produced and exposed by the OpenWEC server.

Which metrics?

I think the following metrics would be interesting to have:

  • The number of HTTP requests received per second, total and per action (enumerate, events, heartbeat)
  • The response time of each HTTP request, total and per action (enumerate, events, heartbeat)
  • The number of events received per second, total and per subscription
  • The number of events that could not be handled (because an output failed) per second (or over a larger interval), total and per subscription. This would help detect a problem with an output (for example, if the file system is full).
  • The total number of machines seen per subscription (already covered by "openwec stats")
  • The number of active machines (received an event "recently") per subscription (already covered by "openwec stats")
  • The number of alive machines (received a heartbeat "recently") per subscription (already covered by "openwec stats")
  • The number of dead machines (didn't receive anything "recently") per subscription (already covered by "openwec stats")

From a developer's point of view, it would also be interesting to optionally add more timing metrics to measure the amount of time spent in specific parts of the code. For example, when we receive a batch of events, it would be interesting to know how much time we spend decrypting, decompressing, parsing XML, formatting events, writing formatted events to each output, and generating and encrypting the response.

Feel free to suggest other metrics!

Which protocol/format?

There are multiple ways to expose/transmit metrics. After a brief survey of the state of the art, I think we need to choose between OpenMetrics (the Prometheus exposition format) and statsd.

Both have pros and cons:

  • OpenMetrics:
    • Pros:
      • No network overhead
      • (maybe) the new standard?
      • Integrates well in Kubernetes environments
    • Cons:
      • Need to expose (another) HTTP server
      • Limited features (especially for timing measurements)
  • statsd:
    • Pros:
      • Advanced features (because everything is calculated on the statsd server)
    • Cons:
      • (maybe) a lot of monitoring traffic?
      • (maybe) impact on performance?

Which library?

  • statsd: Cadence
  • OpenMetrics/Prometheus: prometheus_client
  • both: metrics-rs

I'm currently working on a prototype with prometheus_client where the OpenWEC server would expose an HTTP server dedicated to metrics (different listening addr/port).

@vruello vruello added the enhancement New feature or request label Nov 1, 2024
@tarokkk

tarokkk commented Nov 1, 2024

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

@vruello
Contributor Author

vruello commented Nov 2, 2024

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

The Prometheus documentation states that "The remote write protocol is not intended for use by applications to push metrics to Prometheus remote-write-compatible receivers. It is intended that a Prometheus remote-write-compatible sender scrapes instrumented applications or exporters and sends remote write messages to a server." (https://prometheus.io/docs/specs/remote_write_spec/#background).

An application can push metrics to Prometheus Pushgateway (https://prometheus.io/docs/instrumenting/pushing/). However, the OpenWEC server does not seem to fit in the cases where the pushgateway should be used: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway.

@vruello
Contributor Author

vruello commented Nov 2, 2024

After a little more thought, I think metrics should be split into 2 categories:

  • metrics intended for users
  • metrics intended for developers

Metrics intended for developers should probably be provided using tracing (and perhaps tracing-opentelemetry). This would require some work to transition from logging/log4rs but it would give us a lot of powerful tools and features to analyze the behavior of the server in depth.
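
For illustration, here is a minimal sketch of what span-based timing could look like with tracing; the function and span names below are hypothetical, not actual OpenWEC code:

```rust
// Minimal sketch using the tracing crate (~0.1 API); names are placeholders.
use tracing::{info_span, instrument};

// The attribute macro creates a span covering the whole function call.
#[instrument(skip(body))]
fn handle_events_batch(body: &[u8]) -> Vec<u8> {
    // Each step we want to time gets its own span; a subscriber (for example one
    // backed by tracing-opentelemetry) records how long the span stays entered.
    let decoded = {
        let _guard = info_span!("decrypt_and_decompress").entered();
        decode(body)
    };
    let _parse_guard = info_span!("parse_xml").entered();
    decoded
}

fn decode(body: &[u8]) -> Vec<u8> {
    // Placeholder for the real decryption/decompression step.
    body.to_vec()
}
```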

@vruello
Contributor Author

vruello commented Nov 3, 2024

I have two working prototypes:

  • One with prometheus_client. I am not very happy with it. A lot of boilerplate code is needed to run the HTTP server that exposes the metrics. Also, metric "objects" (counters, histograms, ...) have to be passed into each function call, which is annoying.
  • One with metrics and metrics-exporter-prometheus. I really like the idea that metrics are independent of the exporter (Prometheus here). It is also very easy to use (macros, like the log crate). Maybe it is a bit slower than prometheus_client, but I don't think the difference is significant.

I will test the second prototype in a real environment and see if it affects performance or not.
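
To give an idea of what the second prototype looks like, here is a rough sketch (metrics ~0.23 and metrics-exporter-prometheus ~0.15 APIs; handle_events_request is a hypothetical stand-in for the real handler, and the metric names simply anticipate the ones discussed later in this thread):

```rust
// Sketch only: a dedicated Prometheus scrape endpoint plus a few metrics.
use std::net::SocketAddr;
use std::time::Instant;

use metrics::{counter, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Dedicated scrape endpoint, separate from the WEF listener.
    let addr: SocketAddr = "127.0.0.1:10000".parse().unwrap();
    PrometheusBuilder::new()
        .with_http_listener(addr)
        .install()
        .expect("failed to install the Prometheus exporter");

    handle_events_request("my-subscription", 42);
}

fn handle_events_request(subscription: &str, event_count: u64) {
    let start = Instant::now();

    // ... decrypt, decompress, parse, format and write events here ...

    // No metric objects are passed around: the macros resolve the globally
    // installed recorder, much like the log crate does for loggers.
    counter!("openwec_input_messages_total", "action" => "events").increment(1);
    counter!("openwec_input_events_total", "subscription_name" => subscription.to_string())
        .increment(event_count);
    histogram!("openwec_http_request_duration_seconds").record(start.elapsed().as_secs_f64());
}
```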

@vruello
Contributor Author

vruello commented Nov 22, 2024

After a few weeks of testing "in production", I am quite confident that the production of prometheus metrics using metrics-rs and metrics-exporter-prometheus has a negligible impact on performance.

The default buckets used to store the "request_duration" histogram are not really suitable for openwec. In our production environment, openwecd answers 50% of HTTP requests in less than 1 ms... I think it's fine to keep the "standard" buckets (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) as the default, but it is probably better to change them in the configuration if you need an accurate view.
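
For reference, a sketch of how the buckets could be overridden with the metrics-exporter-prometheus builder (~0.15 API); the metric name and bucket values below are just an example:

```rust
// Sketch: finer sub-millisecond buckets for a server answering most requests in < 1 ms.
use metrics_exporter_prometheus::{Matcher, PrometheusBuilder};

fn exporter_with_custom_buckets() -> PrometheusBuilder {
    PrometheusBuilder::new()
        .set_buckets_for_metric(
            Matcher::Full("openwec_http_request_duration_seconds".to_string()),
            &[0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
        )
        .expect("invalid bucket configuration")
}
```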

@MrAnno
Contributor

MrAnno commented Nov 22, 2024

metrics-rs looks pretty promising, helping manage the registry of metric objects without being too intrusive. It's also protocol-agnostic, which may come in handy later.

metrics intended for users
The number of events received per second, total and per subscription

As far as I can see, openwec_event_received_total will be able to handle everything related to the input side of events, EPS will be just a rate() call on the Prometheus side. The only important thing for this metric is to create small enough partitions in order to allow informative queries on it later. High cardinality in partitions (labels) should be avoided though (especially if they are often active within a reasonable time range), because that would dramatically increase the stored data on the Prometheus side, and would also make queries slower.

All things considered, I think adding subscription to the labels will never be considered high cardinality and is crucial for providing helpful metrics. The next very important partition would be computer/machine/source (probably machine is the good name in the OpenWEC context), because even in the most basic use cases, one wants to know where the events are coming from, how many are there, which client is the loudest in a given time interval, or which group of clients is generating the most events.
But depending on the use case, machine may become high cardinality, so I think that label should be configurable: on by default, but could be disabled in extremely big high-scale environments.

(Just random nitpick: I usually see unitless metric names in plural: openwec_events_received_total or even more popular, openwec_received_events_total. That way, they look more consistent next to metrics that have units, because https://prometheus.io/docs/practices/naming/ says: "should have a suffix describing the unit, in plural form".)

@vruello
Contributor Author

vruello commented Nov 23, 2024

All things considered, I think adding subscription to the labels will never be considered high cardinality and is crucial for providing helpful metrics.

I agree 👍

The next very important partition would be computer/machine/source (probably machine is the good name in the OpenWEC context), because even in the most basic use cases, one wants to know where the events are coming from, how many are there, which client is the loudest in a given time interval, or which group of clients is generating the most events. But depending on the use case, machine may become high cardinality, so I think that label should be configurable: on by default, but could be disabled in extremely big high-scale environments.

I am not sure about this. I agree that this kind of metric might be interesting in some specific cases, but the Prometheus documentation explicitly states not to use labels to store unbounded sets of values (https://prometheus.io/docs/practices/naming/#labels, https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels).
I see many environments with >10 subscriptions, >10000 machines, >2 openwec nodes, which would result in at least 200,000 time series. While this number could probably be handled by Prometheus, I'm not sure that it's worth the cost.

For now, I would consider adding an option to add a "machine" label, but I won't make it on by default. Does this sound acceptable?

@MrAnno
Contributor

MrAnno commented Nov 23, 2024

Of course, it sounds absolutely reasonable. I didn't know that environments of such a huge size were common (which is really cool, by the way).

@vruello
Contributor Author

vruello commented Nov 23, 2024

I updated the pull request.

There is now a monitoring.count_received_events_per_machine option which, when set, adds a "machine" label to openwec_received_events_total (I also changed the name as you suggested).

I have added some metrics related to the amount of data received by openwec (raw HTTP request body size, HTTP request body size after decryption (Kerberos only) and decompression, and received event size). There is an option to add a "machine" label for each metric.

The configuration looks like this:

[monitoring]
listen_address = "127.0.0.1"
listen_port = 10000
# Defaults to false
count_received_events_per_machine = true
# Defaults to false
count_event_size_per_machine = true
# Defaults to false
count_http_request_body_network_size_per_machine = true
# Defaults to false
count_http_request_body_real_size_per_machine = true

As mentioned earlier, enabling the "per_machine" options results in a potentially HUGE increase in metric cardinality.
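
For what it's worth, a rough sketch of how the gating could look in the instrumentation code (metrics ~0.23 macros; the function signature and the per_machine flag are hypothetical):

```rust
// Sketch: the "machine" label is only added when the corresponding monitoring
// option (e.g. count_received_events_per_machine) is enabled.
use metrics::counter;

fn count_received_events(subscription_name: &str, machine: &str, n: u64, per_machine: bool) {
    if per_machine {
        counter!(
            "openwec_received_events_total",
            "subscription_name" => subscription_name.to_string(),
            "machine" => machine.to_string()
        )
        .increment(n);
    } else {
        counter!(
            "openwec_received_events_total",
            "subscription_name" => subscription_name.to_string()
        )
        .increment(n);
    }
}
```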

@MrAnno feel free to suggest name changes for metrics, I'm not used to Prometheus metric naming patterns.

@vruello
Contributor Author

vruello commented Nov 23, 2024

I'm thinking about removing the "method" label, since "POST" is the only HTTP method supported by WEF and thus by openwec (https://github.com/cea-sec/openwec/blob/main/server/src/lib.rs#L76). Does anyone see a reason to keep it?

@MrAnno
Contributor

MrAnno commented Nov 23, 2024

I'm thinking about removing the "method" label, since "POST" is the only HTTP method supported by WEF

Keeping the method label may be useful if you use the non-prefixed (no namespace) metric name http_request_*, because that metric will likely combine all kinds of HTTP-based services, and it may make things easier on the query side.
But adding the openwec_ prefix to all metrics would solve such dilemmas.

I'm not a Prometheus expert either, so I'm just thinking out loud:

openwec_?_events_total{subscription_name, subscription_uuid, machine}
openwec_?_event_bytes_total{subscription_name, subscription_uuid, machine}
openwec_?_messages_total{action}
openwec_?_event_failures_total{subscription_uuid, subscription_name, driver="file/kafka/etc"}

(openwec_)?http_request_duration_seconds{method, status, uri or path}
(openwec_)?http_requests_total{method, status, uri or path}
openwec_http_request_body_real_size_bytes_total{method, machine, uri or path}
openwec_http_request_body_network_size_bytes_total{method, machine, uri or path}

You may want to publish metrics for the output side later, so the first 4 metrics may require something to distinguish between the input and output sides.

For example, in AxoSyslog, we use input_events and output_events; there are more popular phrases like ingress and egress or ingested and emitted, but received/sent sounds good too in my opinion.

I saw you already introduced openwec_event_output_failures_total, I personally like that direction :)

@vruello
Contributor Author

vruello commented Nov 24, 2024

Keeping the method label may be useful if you use the non-prefixed (no namespace) metric name http_request_*, because that metric will likely combine all kinds of HTTP-based services, and it may make things easier on the query side. But adding the openwec_ prefix to all metrics would solve such dilemmas.

I think it is clearer to prefix any metric generated by openwec with openwec_*, so we can keep only relevant labels. Regarding path vs uri, I would prefer to use uri since this term is already used in the openwec terminology.

Thanks for your name suggestions!

openwec_?_event_failures_total{subscription_uuid, subscription_name, driver="file/kafka/etc"}

openwec_event_(output_)failures_total only has subscription_uuid and subscription_name labels because it counts events that could not be successfully sent to every output. Maybe it should be renamed to openwec_?_failed_events_total? Note that we should also increment this counter if an event cannot be formatted.

Knowing which driver failed is interesting though, so I would be tempted to add another metric that counts the number of failures for each driver (one failure for each batch of events that could not be sent), perhaps openwec_?_driver_failures_total (by subscription_name, subscription_uuid, driver). driver would be a string that looks like Kafka(KafkaConfiguration { topic: "openwec", options: {"bootstrap.servers": "localhost:19092"} }).

As mentioned earlier, formatting can also fail in some cases, and it is only reported as a warning by openwec (it is better to lose a strange event rather than to block the machine's event stream). We should add a metric that counts the number of format failures by subscription and format, let's say openwec_?_format_failures_total (by subscription_name, subscription_uuid, format).

You may want to publish metrics for the output side later, so the first 4 metrics may require something to distinguish between the input and output sides.

I was thinking about this, but since openwec has no filtering capabilities, I don't see in what situation the number of events "in" would be different from the number of events "out" for a given subscription, except when an output fails (and in that case we have openwec_?_failed_events_total). Do you have a specific use case in mind?

I don't have a strong opinion about "input" vs "received" vs "ingress"... "received" felt natural, but "sent" does not work well with errors. I would go for "input"/"output".

That would lead to:

openwec_input_events_total{subscription_name, subscription_uuid, machine}
openwec_input_event_bytes_total{subscription_name, subscription_uuid, machine}
openwec_input_messages_total{action}

# Total number of events that could not be sent by subscription
openwec_output_failed_events_total{subscription_uuid, subscription_name}
# Total number of output driver failures by subscription and driver
openwec_output_driver_failures_total{subscription_uuid, subscription_name, driver}
# Total number of output format failures by subscription and format
openwec_output_format_failures_total{subscription_uuid, subscription_name, format}

openwec_http_request_duration_seconds{status, uri}
openwec_http_requests_total{status, uri}
openwec_http_request_body_real_size_bytes_total{machine, uri}
openwec_http_request_body_network_size_bytes_total{machine, uri}

@MrAnno
Contributor

MrAnno commented Nov 24, 2024

events that could not be successfully sent to every output.

Ah, I understand it now.

I was thinking about this, but since openwec has no filtering capabilities, I don't see in what situation the number of events "in" would be different from the number of events "out" for a given subscription, except when an output fails

I see. In OpenWEC, 1 subscription can be wired to multiple outputs, and those outputs may have different formats. There is a 1->N mapping and no filtering, but failures are possible. In that case, failure metrics are enough to provide full observability of what is happening inside.

Naming those metrics cleanly seems difficult though; maybe we could try merging two of the mentioned metrics in a way that a sum() over them would make sense:

openwec_output_event_failures_total{subscription_name, reason="format/delivery", driver}

Naming the "could not be sent to ALL outputs" metric seems more difficult. That metric somewhat belongs to the input side because it should be compared against openwec_input_events_total and not the output side, where everything is multiplied by the number of outputs. Maybe openwec_input_events_incomplete_delivery_total or openwec_event_partial_delivery_total? They sound strange, I don't know :(

That would lead to:

Everything looks nice and neat in my opinion :)

@vruello
Contributor Author

vruello commented Nov 25, 2024

I see. In OpenWEC, 1 subscription can be wired to multiple outputs, and those outputs may have different formats.

Yes. In OpenWEC terminology, an output is equal to a driver AND a format. One subscription is wired to multiple outputs.

Naming those metrics cleanly seems difficult though, maybe we could try merging two of the mentioned metrics in a way that the sum() of it would make sense:

I don't think merging the two metrics is a good idea because they don't have the same impact:

  • Counting format errors can be used to understand why some events are missing in some outputs. However, there is nothing you can do to fix this except call MS for help. To be honest, I don't think there's any real interest in following this counter closely, since these events are invalid by nature and nothing can be done...

  • On the other hand, a driver failure does not mean that the events are lost. Because output drivers are fallible (network error, output service unavailable, disk full...), the batch of events won't be acked unless openwec has successfully sent events to all outputs (drivers). Thus, counting these errors is crucial to signal that there might be a problem with an output driver. A large continuous increase in this counter should be considered a red flag.

Moreover, I can imagine scenarios where a batch of events causes multiple format failures (it contains invalid events) and multiple driver failures (many output drivers fail). So I don't think it makes sense to do sum() or avg() on these two counters.

Last but not least, I would say that you want to know which format "failed" when a format error occurs (it might be used by multiple outputs), so you want the "format" label. Similarly, you want to know which driver "failed" when a driver failure occurs, so you want the "driver" label.

Naming the "could not be sent to ALL outputs" metric seems more difficult. That metric somewhat belongs to the input side because it should be compared against openwec_input_events_total and not the output side, where everything is multiplied by the number of outputs.

It seems odd to call it "input" when it counts events that could not be successfully sent to outputs. However, I think I understand what you mean. One might think that openwec_output_failed_events_total is "per output", but it is not.

After a bit more thought, I don't think this metric is of any interest when we already have openwec_output_driver_failures_total. More specifically, a "global" output failure of "X" events can only be caused by (at least) 1 output driver failure, so we just need to count the failures and we are good. Besides, the number of events is irrelevant: as said before, they are not lost, they will be sent back later...

So that would lead to:

openwec_input_events_total{subscription_name, subscription_uuid, machine}
openwec_input_event_bytes_total{subscription_name, subscription_uuid, machine}
openwec_input_messages_total{action}

# Total number of output driver failures by subscription and driver
openwec_output_driver_failures_total{subscription_uuid, subscription_name, driver}
# Total number of output format failures by subscription and format
openwec_output_format_failures_total{subscription_uuid, subscription_name, format}

openwec_http_request_duration_seconds{status, uri}
openwec_http_requests_total{status, uri}
openwec_http_request_body_real_size_bytes_total{machine, uri}
openwec_http_request_body_network_size_bytes_total{machine, uri}

@MrAnno
Contributor

MrAnno commented Nov 25, 2024

The metrics you proposed look consistent and understandable to me. 👍🏻

So if I understand it correctly, openwec_output_driver_failures_total can be used as an indicator that something is wrong with a given output (unreliable, misconfigured, etc.), but this won't indicate event loss.

If one's question is "how many events have I lost?", openwec_output_format_failures_total should be the metric to check.

My thought process is something like this:

  • If we want to monitor and display statistics about the usual event flow, we can use openwec_http_* and openwec_input_*, their labels will provide perfect granularity to make informative and beautiful diagrams.
  • If we want to monitor and alert on different kinds of anomalies and malfunctions, we'll need information about
    • the output's health (openwec_output_driver_failures_total)
    • dropped events (openwec_output_format_failures_total, or something like openwec_dropped_events_total may help with proper labeling if other message loss scenarios are possible within OpenWEC)
    • active/alive/dead clients
    • processing delay (openwec_http_request_duration_seconds)

@vruello
Contributor Author

vruello commented Nov 25, 2024

So if I understand it correctly, openwec_output_driver_failures_total can be used as an indicator that something is wrong with a given output (unreliable, misconfigured, etc.), but this won't indicate event loss.

Yes! However, if the counter keeps increasing, it probably means that openwec will stop accepting events from that subscription until the problem is fixed. In this situation, there is no guarantee that the Windows machines will not drop events that have not yet been sent. An acceptable retention period for Windows event logs should be configured to avoid event loss.

If one's question is "how many events have I lost?", openwec_output_format_failures_total should be the metric to check.

dropped events (openwec_output_format_failures_total, or something like openwec_dropped_events_total may help with proper labeling if other message loss scenarios are possible within OpenWEC)

I just checked the code, and the behavior is a little different than what I said earlier.

It all happens in the get_formatted_events function (https://github.com/cea-sec/openwec/blob/main/server/src/logic.rs#L289):

  • We are given a list of events and a list of formats.
  • We create an EventData instance for each event. If at least one format requires the events to be parsed, the EventData constructor will try to parse the event. The parsing cannot fail (we implement FromStr): it will always return something, but it might be almost empty if the event is malformed. If something bad happens, the parsed event will contain metadata about what happened and a copy of the original event.
  • We then build a hashmap formatted_events that associates each format with a list of strings (formatted events) to send. To do this, for each format, we iterate through the list of EventData instances and call the format.format() method to retrieve an Option<String>. If the result is None, then the event could not be formatted and there is nothing to do but ignore it (so this is THE case where we drop an event).

We may want to count:

  • when an event cannot be parsed normally. The event is not dropped but it may be incomplete.
  • when an event cannot be formatted successfully. The event is dropped. It probably means that there is a problem in the formatter and it should not happen in production.

So, to be precise, we might also want to count parsing errors!

# Total number of event parsing failures by subscription
# Type is derived from https://github.com/cea-sec/openwec/blob/main/server/src/event.rs#L51
openwec_input_event_parsing_failures_total{subscription_uuid, subscription_name, type}

# Total number of output driver failures by subscription and driver
openwec_output_driver_failures_total{subscription_uuid, subscription_name, driver}
# Total number of output format failures by subscription and format
openwec_output_format_failures_total{subscription_uuid, subscription_name, format}
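
To make the formatting path concrete, here is a simplified sketch of where the format-failure counter would be incremented (the types and function names below are placeholders, not the real get_formatted_events code):

```rust
// Sketch: an event whose formatting returns None is dropped and counted.
use metrics::counter;

struct EventData; // placeholder for the real parsed-event type

fn format_all(events: &[EventData], format_name: &str, subscription_name: &str) -> Vec<String> {
    events
        .iter()
        .filter_map(|event| {
            let formatted = format_event(format_name, event);
            if formatted.is_none() {
                // THE case where an event is dropped: count it per subscription and format.
                counter!(
                    "openwec_output_format_failures_total",
                    "subscription_name" => subscription_name.to_string(),
                    "format" => format_name.to_string()
                )
                .increment(1);
            }
            formatted
        })
        .collect()
}

fn format_event(_format: &str, _event: &EventData) -> Option<String> {
    // Placeholder for the real formatter, which may fail and return None.
    Some(String::new())
}
```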

active/alive/dead clients

Currently, we retrieve these numbers from the database. It means that they are common to all openwec nodes, which is not true for the other metrics. I don't know if this is a problem, but I don't see how we could do it any other way.

In Prometheus, these numbers are represented as gauges. With metrics-rs, the easiest way to implement this would be to add a background task that frequently queries the database and updates the gauge values.

I would also add that these numbers are interesting for monitoring the usual event flow, especially the sum of active + alive clients which basically tells you how many machines are currently sending events. They can also be used to detect a problem that would prevent machines from talking to openwec (network, authentication, ...).

I think that these three numbers could be represented as a unique gauge openwec_machines with a "status" label (it makes sense to sum them to get the total number of machines).
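
A sketch of what that background task could look like (metrics ~0.23 and tokio ~1.x; count_machines_by_status is a hypothetical stand-in for the real database query):

```rust
// Sketch: a periodic task that refreshes the openwec_machines gauge per status.
use std::time::Duration;

use metrics::gauge;

async fn refresh_machine_gauges() {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        // Hypothetical query returning (status, count) pairs,
        // e.g. [("active", 812.0), ("alive", 93.0), ("dead", 4.0)].
        for (status, count) in count_machines_by_status().await {
            gauge!("openwec_machines", "status" => status).set(count);
        }
    }
}

async fn count_machines_by_status() -> Vec<(&'static str, f64)> {
    // Placeholder standing in for the real database query.
    vec![("active", 0.0), ("alive", 0.0), ("dead", 0.0)]
}
```

The task would be spawned (e.g. with tokio::spawn) at server startup.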

@vruello
Contributor Author

vruello commented Nov 26, 2024

I have updated the PR.

| Metric | Type | Labels | Description |
|---|---|---|---|
| openwec_input_events_total | Counter | subscription_uuid, subscription_name, machine (optional*) | The total number of events received by openwec |
| openwec_input_event_bytes_total | Counter | subscription_uuid, subscription_name, machine (optional*) | The total size of all events received by openwec |
| openwec_input_messages_total | Counter | action (one of "enumerate", "heartbeat", "events") | The total number of messages received by openwec |
| openwec_input_event_parsing_failures_total | Counter | subscription_uuid, subscription_name, type | The total number of event parsing failures |
| openwec_http_requests_total | Counter | uri, code | The total number of HTTP requests handled by openwec |
| openwec_http_request_duration_seconds | Histogram | uri | Histogram of response duration for HTTP requests |
| openwec_http_request_body_network_size_bytes_total | Counter | uri, machine (optional*) | The total size of all HTTP request bodies received by openwec |
| openwec_http_request_body_real_size_bytes_total | Counter | uri, machine (optional*) | The total size of all HTTP request bodies received by openwec, after decryption and decompression |
| openwec_output_driver_failures_total | Counter | subscription_uuid, subscription_name, driver | The total number of output driver failures |
| openwec_output_format_failures_total | Counter | subscription_uuid, subscription_name, format | The total number of output format failures |

@MrAnno
Contributor

MrAnno commented Nov 26, 2024

add a background task that frequently queries the database and updates the gauge values.

It sounds reasonable, hopefully not too expensive:)

I think that these three numbers could be represented as a unique gauge openwec_machines with a "status" label (it makes sense to sum them to get the total number of machines).

Sounds perfect!
