-
Notifications
You must be signed in to change notification settings - Fork 527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate replacing libbeat monitoring usage with OpenTelemetry instrumentation #14488
Comments
For context below, APM Server monitoring builds on libbeat monitoring, and relies on the There's a few approaches we could take to migrating to the OTel API, which have varying trade-offs.
(1) is probably the least amount of work. It will not remove any of the limitations that come with the existing libbeat monitoring template. Moreover, since the OTel API enables users to provide dimensions (attributes) and complex metrics, developers may be surprised when these are dropped or don't behave as they should. |
I would say the main risks of not implementing this are:
The second issue is more pressing IMO. I'll see if I can do a quick spike on (1) from #14488 (comment), as I think we could get something working reasonably quickly to address that issue. |
It's fairly hacky, but I got something working here: https://github.com/axw/apm-server/tree/otel-adapter I've updated all the apm-server code that was registering metrics with libbeat monitoring to use opentelemetry-go metrics instead. Then I've configured a ManualReader, and registered libbeat monitoring functions to use that and adapt otel metrics to libbeat metrics. The adaptation works by translating go-docappender metrics (a bit brittle, would need comprehensive unit tests) to libbeat equivalents. It also directly copies across any metrics that fit the following constraints:
This should address the symptoms of #8383 since the metrics will be persistent across reloads, whereas previously we would recreate all the metrics on each reload. @marclop I'd be keen to hear what you think about this approach. It's not the most robust thing we could do, but I think it'll be an incremental improvement and I expect we should be able to complete it in a week or so. |
To add to the risks section: a libbeat dependency was also the cause for us not being able to release EDOT on Windows which indicates to me that there are some headaches we can avoid by working on this. Have we estimated the time required for a production-ready implementation? |
@axw just scanned the linked branch and the approach looks good to me. I had forgotten how rudimentary libbeat metric registries are and how they don't have dimensions. I think that estimate is accurate and we should be able to have unit tests and test it with a decent amount of confidence to make the change. |
This feels like hiding the libbeat dependency instead of replacing it with otel (which was the goal of this issue) :(
fully agree 👍 Mostly reposting here what we discussed last week but would it make sense to produce the otel metrics both to the libbeat monitoring and directly ? |
The ultimate goal is indeed to remove it, but I don't think we can commit to that amount of work for 9.0. To remove libbeat monitoring we will need to migrate to the OTel API. Once we're ready to fully switch off it, we'll just remove (or perhaps change) the adapter code. It is hiding (aka encapsulating) the use of libbeat monitoring. That is a legitimate design pattern that will simplify its removal later.
@kruskall what would "and directly" mean here? We already capture some metrics with the OTel API, but they go through |
We have had issues related to libbeat monitoring plaguing us for years, notably #8383.
We should consider removing our dependency on libbeat for monitoring apm-server, and instrumenting APM Server with the OpenTelemetry API instead. We would need to arrange for the metrics to be written to the monitoring Elasticsearch cluster with the same schema as produced by libbeat; we would also need to expose these metrics via the stats API, and reported in logs.
Hat tip to @marclop for the idea.
The text was updated successfully, but these errors were encountered: