Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

valeriansaliou · 2024-12-10T16:18:57Z

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

Debian 12
Running in a single-node Nomad setup, no Consul, no Vault. Simple 1 server + 1 client setup on same server.

Issue

I am periodically seeing those logs in my Nomad client logs:

    2024-12-10T13:08:03.571-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character '_' looking for beginning of value"
    2024-12-10T13:08:15.973-0300 [DEBUG] client: received stale allocation information; retrying: index=16 min_index=16
    2024-12-10T13:08:33.572-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'v' looking for beginning of value"
    2024-12-10T13:09:03.573-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'd' in exponent of numeric literal"
    2024-12-10T13:09:33.574-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'o' looking for beginning of value"

When those logs appear, metrics in Nomad's Web UI for allocations are all stuck at 0% CPU and 0% MB RAM used.

When using the default telemetry collection interval of 1s, those typically work for a few hours / days and then start randomly breaking on a per-allocation basis, and then coming back up.

When raising the telemetry collection interval to eg. 15s, those metrics all break instantly right after starting the Nomad client. The sole purpose of raising this value here is to make reproduction of the issue faster / easier:

telemetry {
  collection_interval = "15s"
}

I've investigated and identified this to come from: https://github.com/hashicorp/nomad/blob/main/drivers/docker/stats.go#L117

I've tried to reproduce locally with some quick Go code and found out that if not waiting for exactly 1 second, then Docker's streaming API would return multiple JSON responses separated by '\n' characters. This is expected since we're dealing with the streaming API here.

This indicates that in the event of a time drift or a telemetry collection interval higher than the default 1s, then Nomad should split each line in the response buffer and pick the last one for JSON parsing (?). Failing to do so, will result in JSON parse errors since we're passing multiple lines of JSON to the parser — which is effectively invalid JSON.

package main

import (
	containerapi "github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"fmt"
	"context"
	"io"
	"time"
	"bytes"
	"encoding/json"
)

func main() {
	var stats *containerapi.Stats

	newClient, _ := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())

	ctx, _ := context.WithTimeout(context.Background(), 5 * time.Second)

	statsReader, err := newClient.ContainerStats(ctx, "8ba3d594adf4", true)
	if err != nil && err != io.EOF {
		fmt.Printf("error collecting stats from container", "error", err)
		return
	}

	buf := new(bytes.Buffer)
	buf.ReadFrom(statsReader.Body)
	statsReader.Body.Close()

	fmt.Printf("got raw stats from container = %s", buf.String())

	// (This section will not parse anything since the stats have already been read for debug above)
	err = json.NewDecoder(statsReader.Body).Decode(&stats)

	if err != nil && err != io.EOF {
		fmt.Printf("error decoding stats data from container", "error", err)
		return
	}

	if stats == nil {
		fmt.Printf("error decoding stats data: stats were nil")
		return
	}

	fmt.Printf("got stats from container = %v", *stats)
}

The text was updated successfully, but these errors were encountered:

valeriansaliou · 2024-12-10T16:20:52Z

Referencing this issue which is linked: #24339 and closing that one. This issue serves as a reproduction log for your team. Thank you!

valeriansaliou added the type/bug label Dec 10, 2024

valeriansaliou closed this as completed Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

valeriansaliou commented Dec 10, 2024

valeriansaliou commented Dec 10, 2024

Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

Comments

valeriansaliou commented Dec 10, 2024

Nomad version

Operating system and Environment details

Issue

valeriansaliou commented Dec 10, 2024