Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker metrics collector fail to parse JSON from Docker API due to buffering multiple JSON payloads #24639

Closed
valeriansaliou opened this issue Dec 10, 2024 · 1 comment
Labels

Comments

@valeriansaliou
Copy link

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

  • Debian 12
  • Running in a single-node Nomad setup, no Consul, no Vault. Simple 1 server + 1 client setup on same server.

Issue

I am periodically seeing those logs in my Nomad client logs:

    2024-12-10T13:08:03.571-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character '_' looking for beginning of value"
    2024-12-10T13:08:15.973-0300 [DEBUG] client: received stale allocation information; retrying: index=16 min_index=16
    2024-12-10T13:08:33.572-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'v' looking for beginning of value"
    2024-12-10T13:09:03.573-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'd' in exponent of numeric literal"
    2024-12-10T13:09:33.574-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'o' looking for beginning of value"

When those logs appear, metrics in Nomad's Web UI for allocations are all stuck at 0% CPU and 0% MB RAM used.

When using the default telemetry collection interval of 1s, those typically work for a few hours / days and then start randomly breaking on a per-allocation basis, and then coming back up.

When raising the telemetry collection interval to eg. 15s, those metrics all break instantly right after starting the Nomad client. The sole purpose of raising this value here is to make reproduction of the issue faster / easier:

telemetry {
  collection_interval = "15s"
}

I've investigated and identified this to come from: https://github.com/hashicorp/nomad/blob/main/drivers/docker/stats.go#L117

I've tried to reproduce locally with some quick Go code and found out that if not waiting for exactly 1 second, then Docker's streaming API would return multiple JSON responses separated by '\n' characters. This is expected since we're dealing with the streaming API here.

This indicates that in the event of a time drift or a telemetry collection interval higher than the default 1s, then Nomad should split each line in the response buffer and pick the last one for JSON parsing (?). Failing to do so, will result in JSON parse errors since we're passing multiple lines of JSON to the parser — which is effectively invalid JSON.

package main

import (
	containerapi "github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"fmt"
	"context"
	"io"
	"time"
	"bytes"
	"encoding/json"
)

func main() {
	var stats *containerapi.Stats

	newClient, _ := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())

	ctx, _ := context.WithTimeout(context.Background(), 5 * time.Second)

	statsReader, err := newClient.ContainerStats(ctx, "8ba3d594adf4", true)
	if err != nil && err != io.EOF {
		fmt.Printf("error collecting stats from container", "error", err)
		return
	}

	buf := new(bytes.Buffer)
	buf.ReadFrom(statsReader.Body)
	statsReader.Body.Close()

	fmt.Printf("got raw stats from container = %s", buf.String())

	// (This section will not parse anything since the stats have already been read for debug above)
	err = json.NewDecoder(statsReader.Body).Decode(&stats)

	if err != nil && err != io.EOF {
		fmt.Printf("error decoding stats data from container", "error", err)
		return
	}

	if stats == nil {
		fmt.Printf("error decoding stats data: stats were nil")
		return
	}

	fmt.Printf("got stats from container = %v", *stats)
}
@valeriansaliou
Copy link
Author

Referencing this issue which is linked: #24339 and closing that one. This issue serves as a reproduction log for your team. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant