Add observability metrics to OpenWEC server #189
Prometheus (I'm not sure about the Rust client) also supports a push-based model via remote_write.
The Prometheus documentation states that "The remote write protocol is not intended for use by applications to push metrics to Prometheus remote-write-compatible receivers. It is intended that a Prometheus remote-write-compatible sender scrapes instrumented applications or exporters and sends remote write messages to a server." (https://prometheus.io/docs/specs/remote_write_spec/#background). An application can push metrics to Prometheus Pushgateway (https://prometheus.io/docs/instrumenting/pushing/). However, the OpenWEC server does not seem to fit in the cases where the pushgateway should be used: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway.
After a little more thought, I think metrics should be split into two categories: […]

Metrics intended for developers should probably be provided using […]
I have two working prototypes: […]
I will test the second prototype in a real environment and see whether it affects performance.
After a few weeks of testing "in production", I am quite confident that producing Prometheus metrics does not noticeably affect performance.

The default buckets used to store the "request_duration" histogram are not really suitable for openwec. In our production environment, openwecd answers 50% of HTTP requests in less than 1 ms... I think it's fine to keep the "standard" buckets (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) as the default, but it is probably better to change them in the configuration if you need an accurate view.
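For illustration, here is a minimal sketch of how such a bucket override could look, assuming a recent metrics-exporter-prometheus crate is used; the histogram name, listener address and bucket values below are only examples, not openwec's actual names or defaults:

```rust
use metrics_exporter_prometheus::{Matcher, PrometheusBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Expose metrics on a dedicated listener and give one histogram finer,
    // sub-millisecond buckets than the defaults.
    PrometheusBuilder::new()
        .with_http_listener(([127, 0, 0, 1], 10000))
        .set_buckets_for_metric(
            Matcher::Full("openwec_http_request_duration_seconds".to_string()),
            &[0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
        )?
        .install()?;

    // Keep the process alive so the /metrics endpoint stays reachable; a real
    // server would run its own work here instead.
    std::thread::park();
    Ok(())
}
```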
As far as I can see, […]

All things considered, I think adding […]

(Just a random nitpick: I usually see unitless metric names in plural: […])
I agree 👍
I am not sure about this. I agree that this kind of metric might be interesting in some specific cases, but the Prometheus documentation explicitly states not to use labels to store unbounded sets of values (https://prometheus.io/docs/practices/naming/#labels, https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels). For now, I would consider adding an option to add a "machine" label, but I won't enable it by default. Does this sound acceptable?
Of course, it sounds absolutely reasonable. I didn't know that environments of such a huge size were common (which is really cool at the same time).
I updated the pull request. There is now a […]

I have added some metrics related to the amount of data received by openwec (raw HTTP request body size, HTTP request body size after decryption (Kerberos only) and decompression, received events size). There is an option to add a "machine" label for each metric. The configuration looks like this:

```toml
[monitoring]
listen_address = "127.0.0.1"
listen_port = 10000
# Defaults to false
count_received_events_per_machine = true
# Defaults to false
count_event_size_per_machine = true
# Defaults to false
count_http_request_body_network_size_per_machine = true
# Defaults to false
count_http_request_body_real_size_per_machine = true
```

As mentioned earlier, enabling the "per_machine" options results in a potentially HUGE increase in metric cardinality.

@MrAnno feel free to suggest name changes for metrics, I'm not used to Prometheus metric naming patterns.
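To make the per-machine cardinality trade-off mentioned above concrete, here is a rough sketch of what emitting such a counter might look like with the metrics crate (assuming the 0.22+ macro style); the metric name, label names and helper function are illustrative, not the actual openwec code:

```rust
use metrics::counter;

// Hypothetical helper: count received events for a subscription, optionally
// adding a per-machine label depending on the monitoring configuration.
fn count_received_events(subscription: &str, machine: &str, count: u64, per_machine: bool) {
    if per_machine {
        // One time series per (subscription, machine) pair: with tens of
        // thousands of machines, this multiplies cardinality accordingly.
        counter!(
            "openwec_received_events_total",
            "subscription" => subscription.to_string(),
            "machine" => machine.to_string()
        )
        .increment(count);
    } else {
        counter!(
            "openwec_received_events_total",
            "subscription" => subscription.to_string()
        )
        .increment(count);
    }
}
```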
I'm thinking about removing the "method" label, since "POST" is the only HTTP method supported by WEF and thus by openwec (https://github.com/cea-sec/openwec/blob/main/server/src/lib.rs#L76). Does anyone see a reason to keep it? |
Keeping the method label may be useful if you use the non-prefixed (no namespace) metric name […]

I'm not a Prometheus expert either, so I'm just thinking out loud:
You may want to publish metrics for the output side later, so the first 4 metrics may require something to […]

For example, in AxoSyslog, we use […]

I saw you already introduced […]
I think it is clearer to prefix any metric generated by openwec with […]

Thanks for your name suggestions!
Knowing which driver failed is interesting though, so I would be tempted to add another metric that counts the number of failures for each driver (one failure for each batch of events that could not be sent), perhaps […]

As mentioned earlier, formatting can also fail in some cases, and it is only reported as a warning by openwec (it is better to lose a strange event than to block the machine's event stream). We should add a metric that counts the number of format failures by subscription and format, let's say […]
I was thinking about this, but since openwec has no filtering capabilities, I don't see in what situation the number of events "in" would be different from the number of events "out" for a given subscription, except when an output fails (and in that case we have […]).

I don't have a strong opinion about "input" vs "received" vs "ingress"... "received" felt natural, but "sent" does not work well with errors. I would go for "input"/"output". That would lead to: […]
Ah, I understand it now.
I see. In OpenWEC, one subscription can be wired to multiple outputs, and those outputs may have different formats. There is a 1->N mapping and no filtering, but failures are possible. In that case, failure metrics are enough to provide full observability of what is happening inside.

Naming those metrics cleanly seems difficult though; maybe we could try merging two of the mentioned metrics in a way that the […]
Naming the "could not be sent to ALL outputs" metric seems more difficult. That metric somewhat belongs to the input side because it should be compared against […]
Everything looks nice and neat in my opinion :)
Yes. In OpenWEC terminology, an output is equal to a driver AND a format. One subscription is wired to multiple outputs.
I don't think merging the two metrics is a good idea because they don't have the same impact: a format failure means only the offending event is lost (it is reported as a warning), whereas a driver failure means a whole batch of events could not be sent, and openwec will stop accepting events from that subscription until the problem is fixed.
Moreover, I can imagine scenarios where a batch of events causes multiple format failures (it contains invalid events) and multiple driver failures (many output drivers fail). So I don't think it makes sense to do sum() or avg() on these two counters. Last but not least, I would say that you want to know which format "failed" when a format error occurs (it might be used by multiple outputs), so you want the "format" label. Similarly, you want to know which driver "failed" when a driver failure occurs, so you want the "driver" label.
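As an illustration of keeping the two failure counters separate, each with the label that identifies what actually failed, here is a sketch using the metrics crate macros; the metric and label names are placeholders, not the final openwec names:

```rust
use metrics::counter;

// Hypothetical reporting helpers: one counter per failure kind.
fn report_format_failure(subscription: &str, format: &str) {
    counter!(
        "openwec_format_failures_total",
        "subscription" => subscription.to_string(),
        "format" => format.to_string()
    )
    .increment(1);
}

fn report_driver_failure(subscription: &str, driver: &str) {
    counter!(
        "openwec_driver_failures_total",
        "subscription" => subscription.to_string(),
        "driver" => driver.to_string()
    )
    .increment(1);
}
```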
It seems odd to call it "input" when it counts events that could not be successfully sent to outputs. However, I think I understand what you mean. One might think that […]

After a bit more thought, I don't think this metric is of any interest when we already have […]

So that would lead to: […]
The metrics you proposed look consistent and understandable to me. 👍🏻

So if I understand it correctly, […]

If one's question is "how many events have I lost?", […]

My thought process is something like this: […]
Yes! However, if the counter keeps increasing, it probably means that openwec will stop accepting events from that subscription until the problem is fixed. In this situation, there is no guarantee that the Windows machines will not drop events before having sent them. An acceptable retention period for Windows event logs should be configured to avoid event loss.
I just checked the code, and the behavior is a little different from what I said earlier. It all happens in the […]
We may want to count: […]
So, to be precise, we might also want to count parsing errors!
Currently, we retrieve these numbers from the database. It means that they are common to all openwec nodes, which is not true for the other metrics. I don't know if this is a problem, but I don't see how we could do it any other way.

In Prometheus, these numbers are represented as gauges. With metrics-rs, the easiest way to implement this would be to add a background task that frequently queries the database and updates the gauge values.

I would also add that these numbers are interesting for monitoring the usual event flow, especially the sum of active + alive clients, which basically tells you how many machines are currently sending events. They can also be used to detect a problem that would prevent machines from talking to openwec (network, authentication, ...). I think that these three numbers could be represented as a unique gauge […]
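A minimal sketch of such a background task with metrics-rs could look like the following; the database trait, metric name, refresh period and state labels are assumptions for illustration, not the actual openwec code:

```rust
use std::time::Duration;

use metrics::gauge;

// Hypothetical stand-in for openwec's database layer.
trait MachinesDb {
    fn count_machines(&self, state: &str) -> u64;
}

// Periodically refresh a single machine-count gauge from the database, with
// one time series per state (state names are illustrative).
async fn refresh_machine_gauges(db: impl MachinesDb, period: Duration) {
    let mut ticker = tokio::time::interval(period);
    loop {
        ticker.tick().await;
        for state in ["active", "alive", "dead"] {
            let value = db.count_machines(state) as f64;
            gauge!("openwec_machines", "state" => state).set(value);
        }
    }
}
```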
I have updated the PR.
It sounds reasonable, hopefully not too expensive :)
Sounds perfect!
Observability metrics need to be produced and exposed by the OpenWEC server.
Which metrics?
I think the following metrics would be interesting to have: […]
From a developer's point of view, it would also be interesting to optionally add more timing metrics to measure the amount of time spent in different parts of the code. For example, when we receive a batch of events, it would be interesting to know how much time we spend decrypting, decompressing, parsing XML, formatting events, writing formatted events to each output, generating the response and encrypting it.
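As a sketch of what such developer timing metrics could look like, one histogram labelled by processing stage would cover all of these; the helper and metric name below are hypothetical (this uses the metrics crate macros), not a committed design:

```rust
use std::time::Instant;

use metrics::histogram;

// Hypothetical helper: run one processing stage and record how long it took
// in a histogram labelled with the stage name.
fn timed_stage<T>(stage: &'static str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    histogram!("openwec_stage_duration_seconds", "stage" => stage)
        .record(start.elapsed().as_secs_f64());
    result
}

// Usage sketch inside the request handling path (functions are placeholders):
// let decompressed = timed_stage("decompress", || decompress(&body));
// let events = timed_stage("parse_xml", || parse_events(&decompressed));
```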
Feel free to suggest other metrics!
Which protocol/format?
There are multiple ways to expose/transmit metrics. After a brief look at the state of the art, I think we need to choose between: […]

Both have pros and cons: […]
Which library?
I'm currently working on a prototype with prometheus_client, where the OpenWEC server would expose an HTTP server dedicated to metrics (different listening addr/port).
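For context, a minimal sketch of what such a prototype could look like with a recent prometheus_client crate; the metric name and help text are illustrative, and the dedicated HTTP listener that would serve the encoded text on its own addr/port is left out:

```rust
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn main() {
    // Build a registry and register an illustrative counter in it.
    let mut registry = Registry::default();
    let received_events: Counter = Counter::default();
    registry.register(
        "openwec_received_events",
        "Number of events received by the server",
        received_events.clone(),
    );

    // Somewhere in the request handling path, the counter would be incremented.
    received_events.inc();

    // The dedicated HTTP listener would serve this text on its /metrics route.
    let mut body = String::new();
    encode(&mut body, &registry).expect("encoding metrics should not fail");
    println!("{body}");
}
```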