Running a single-node Nomad setup with no Consul and no Vault: one server and one client on the same host.
Issue
I am periodically seeing these entries in my Nomad client logs:
2024-12-10T13:08:03.571-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character '_' looking for beginning of value"
2024-12-10T13:08:15.973-0300 [DEBUG] client: received stale allocation information; retrying: index=16 min_index=16
2024-12-10T13:08:33.572-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'v' looking for beginning of value"
2024-12-10T13:09:03.573-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'd' in exponent of numeric literal"
2024-12-10T13:09:33.574-0300 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=8ba3d594adf45cb3b8e1f441c54b47e59b53393fd61ed194d1f2af37ff6aaffe driver=docker error="invalid character 'o' looking for beginning of value"
When these logs appear, the allocation metrics in Nomad's Web UI are all stuck at 0% CPU and 0 MB RAM used.
With the default telemetry collection interval of 1s, the metrics typically work for a few hours or days, then start breaking randomly on a per-allocation basis, and later recover.
When raising the telemetry collection interval to e.g. 15s, the metrics all break immediately after the Nomad client starts. The sole purpose of raising this value here is to make the issue faster and easier to reproduce.
I've tried to reproduce this locally with some quick Go code and found that, unless the reader waits exactly 1 second between reads, Docker's streaming API returns multiple JSON responses separated by '\n' characters. This is expected, since we're dealing with the streaming API here.
This suggests that, in the event of time drift or a telemetry collection interval higher than the default 1s, Nomad should split the response buffer into lines and pick the last one for JSON parsing (?). Otherwise the parser receives multiple lines of JSON at once, which is effectively invalid JSON and results in parse errors.
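The "split and pick the last line" idea can be sketched as follows. This is only an illustration, not Nomad's code: the `sample` type is a hypothetical, cut-down stand-in for Docker's stats payload, and `lastFrame` and `decodeLatest` are helper names made up for this example. It assumes the buffer ends on a complete frame, which may not hold for a live stream.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// sample is a hypothetical stand-in for the real stats structure.
type sample struct {
	Read string `json:"read"`
}

// lastFrame returns the last non-empty line of a newline-separated buffer.
func lastFrame(buf []byte) ([]byte, error) {
	lines := bytes.Split(bytes.TrimSpace(buf), []byte("\n"))
	last := bytes.TrimSpace(lines[len(lines)-1])
	if len(last) == 0 {
		return nil, errors.New("no stats frame in buffer")
	}
	return last, nil
}

// decodeLatest parses only the most recent frame in the buffer.
func decodeLatest(buf []byte) (*sample, error) {
	frame, err := lastFrame(buf)
	if err != nil {
		return nil, err
	}
	var s sample
	if err := json.Unmarshal(frame, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	// Three frames accumulated in one buffer, as seen with a slow reader.
	raw := []byte("{\"read\":\"t1\"}\n{\"read\":\"t2\"}\n{\"read\":\"t3\"}\n")

	// Handing the whole buffer to json.Unmarshal is rejected.
	var whole sample
	if err := json.Unmarshal(raw, &whole); err != nil {
		fmt.Println("parsing the whole buffer fails:", err)
	}

	s, err := decodeLatest(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("latest frame read at:", s.Read)
}
```

Older frames are simply discarded here, which seems acceptable for a gauge-style metric where only the most recent reading matters.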
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"time"

	containerapi "github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	var stats *containerapi.Stats

	newClient, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		fmt.Printf("error creating docker client: %v\n", err)
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	statsReader, err := newClient.ContainerStats(ctx, "8ba3d594adf4", true)
	if err != nil && err != io.EOF {
		fmt.Printf("error collecting stats from container: %v\n", err)
		return
	}

	// Buffer whatever the stream returns (the read ends when the 5s context expires) and print it raw.
	buf := new(bytes.Buffer)
	buf.ReadFrom(statsReader.Body)
	statsReader.Body.Close()
	fmt.Printf("got raw stats from container = %s\n", buf.String())

	// (This section will not parse anything since the stats have already been read for debug above)
	err = json.NewDecoder(statsReader.Body).Decode(&stats)
	if err != nil && err != io.EOF {
		fmt.Printf("error decoding stats data from container: %v\n", err)
		return
	}
	if stats == nil {
		fmt.Printf("error decoding stats data: stats were nil\n")
		return
	}
	fmt.Printf("got stats from container = %v\n", *stats)
}
I've investigated and identified this to come from: https://github.com/hashicorp/nomad/blob/main/drivers/docker/stats.go#L117