Stable configuration - OOM killed, CPU Usage #38901
vkrebs-wktaa started this conversation in General
Hello,
We have a collector instance running which regularly crashes because it gets OOM killed by the OS.
I am running collector version 0.121.0.
I could increase the memory, but I want to understand the memory handling of the OpenTelemetry collector.
In my understanding, it should be possible to run it so that it drops incoming items instead of crashing.
Here is my config:
As you can see, I am sending the collector's internal metrics back to the instance itself, which I first suspected as the cause of the problem, but it was not.
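For context, the self-scraping is done roughly like this; the job name and scrape interval are illustrative rather than my exact settings, and 8888 is the collector's default internal metrics port:

```yaml
receivers:
  # Prometheus receiver scraping the collector's own internal metrics
  # (illustrative values, not the exact config from my instance)
  prometheus:
    config:
      scrape_configs:
        - job_name: otelcol-internal
          scrape_interval: 30s
          static_configs:
            - targets: ["127.0.0.1:8888"]
```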
The instance has 2 GB RAM and 4 CPUs, and the memory_limiter seems to detect the memory pressure correctly:
There are 5 OpenTelemetry Java agents sending data to this collector. The metric otelcol_processor_incoming_items says there are around 300k items per hour (roughly 83 items per second). Sometimes there are peaks with more items and the CPU usage goes to 100% for about 5 minutes.
Then I can see in the corresponding metric that spans are refused.
In the logs, I can see that GC is happening quite often. But occasionally, the collector gets OOM killed.
I played around with the memory_limiter settings but could not find stable values (see the sketch below for the kind of values I tried).
My assumption would be that it should be possible to run a (perhaps undersized) instance without crashes, with the collector simply dropping items instead. But 10 million items a day (roughly 115 items per second on average) on a 2 GB / 4 CPU VM should be no problem, I think.
I run everything in a containerized environment on Azure VMs.
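For reference, this is roughly the shape of the memory_limiter setup I have been experimenting with; the limit values and the receiver/exporter names are placeholders, not my exact configuration:

```yaml
processors:
  memory_limiter:
    # check memory usage frequently so spikes are caught early
    check_interval: 1s
    # hard limit comfortably below the 2 GB of the VM
    limit_mib: 1400
    # soft limit = limit_mib - spike_limit_mib; above the soft limit new data
    # is refused, above the hard limit the collector forces a GC
    spike_limit_mib: 400
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter comes first so data is refused before it
      # accumulates in later processors
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```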
Does anybody have any hints or comments?
Thank You
Volker