Stable configuration - OOM killed, CPU Usage #38901
vkrebs-wktaa started this conversation in General
Hello,
We have a collector instance running which regularly crashes because it gets OOM killed by the OS.
I am running collector version 0.121.0.
I could increase the memory, but I want to understand the memory handling of the OpenTelemetry collector.
In my understanding, it should be possible to run it so that it drops incoming items instead of crashing.
Here is my config:
As you can see, I am sending the collector's internal metrics back to the instance itself, which I first suspected as the cause of the problem, but it was not.
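For context, the self-scraping is done roughly like this; the job name and scrape interval are illustrative rather than my exact settings, and 8888 is the collector's default internal metrics port:

```yaml
receivers:
  # Prometheus receiver scraping the collector's own internal metrics
  # (illustrative values, not the exact config from my instance)
  prometheus:
    config:
      scrape_configs:
        - job_name: otelcol-internal
          scrape_interval: 30s
          static_configs:
            - targets: ["127.0.0.1:8888"]
```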
The instance has 2 GB RAM and 4 CPUs, and the memory_limiter seems to detect the memory pressure correctly:
There are 5 OpenTelemetry Java agents sending data to this collector. The metric otelcol_processor_incoming_items says there are around 300k items per hour (roughly 83 items per second). Sometimes there are peaks with more items and the CPU usage goes to 100% for about 5 minutes.
Then I can see in the corresponding metric that spans are refused.
In the logs, I can see that GC is happening quite often. But occasionally, the collector gets OOM killed.
I played around with the memory_limiter settings but could not find stable values (see the sketch below for the kind of values I tried).
My assumption would be that it should be possible to run a (perhaps undersized) instance without crashes, with the collector simply dropping items instead. But 10 million items a day (roughly 115 items per second on average) on a 2 GB / 4 CPU VM should be no problem, I think.
I run everything in a containerized environment on Azure VMs.
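For reference, this is roughly the shape of the memory_limiter setup I have been experimenting with; the limit values and the receiver/exporter names are placeholders, not my exact configuration:

```yaml
processors:
  memory_limiter:
    # check memory usage frequently so spikes are caught early
    check_interval: 1s
    # hard limit comfortably below the 2 GB of the VM
    limit_mib: 1400
    # soft limit = limit_mib - spike_limit_mib; above the soft limit new data
    # is refused, above the hard limit the collector forces a GC
    spike_limit_mib: 400
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter comes first so data is refused before it
      # accumulates in later processors
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```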
Does anybody have any hints or comments?
Thank You
Volker