Some BMCs start sending garbage values after a while #29

gebn · 2019-09-17T14:30:28Z

Firmware version changes (with a minor version >99, which is impossible for a 1 byte BCD value)
GUID changes

Sometimes its the whole field, sometimes it's just a few bytes.
Definitely a bug, as if it were corruption, the checksum validation would fail (is the checksum definitely there for these packets?).

If a BMC shows strangeness, it's fine to treat it with care (e.g. new connection each scrape) until the process is killed, even between Close()s on the collector.

Could be an old socket full stop - not just the connection on it.
bmc_up and/or chassis_cooling_fault and/or chassis_powered_on flaps. You only need the first one to identify this. If bmc_up == 0, close the session before finishing collecting, so it is re-established next scrape.
The debug mode in #23 would help here. Particularly the error returned by the command exec attempt.

The text was updated successfully, but these errors were encountered:

gebn · 2019-09-17T14:55:31Z

No need to do fancy stuff with checking if values have changed (that's unreliable anyway).
One or more commands failing quickly (with an error that we don't currently log) is the reliable indicator of issues.
On receipt of an error that is not a timeout, establish a new session and re-try the collection (up to the ctx expiry, with back-off).
Need to create an internal collector that we can back-off. We cannot send the same metric twice to Collect()'s channel, so we need to build up a struct of the scrape result, then call .Send(ch) on it when we're happy with it.

A badly behaved BMC will look like:

Establish session, collect values.
Session exists, collect values, error, re-establish, collect values.
Session exists, collect values, error, re-establish, collect values.
...

We never treat a BMC "differently", e.g. by putting it into a special mode - we just handle what we see, which is a better approach.

gebn · 2019-09-27T01:23:44Z

Disable command retries, increase timeout.

gebn · 2019-10-06T19:07:08Z

Increasing the per-command timeout to 1m fixed this, so it's caused by sending retries. We could extend the timeout, however the packet will likely be sent along the same path. Probably better to let it fail and re-establish the session (which may also fail, but that's better than leaving it in a bad state).

gebn · 2019-10-07T22:23:43Z

Having the option to disable retries in a session (setting the per-command timeout to 0) could be a solution here. Or just disable them full-stop - it shouldn't be possible to hold the library wrong (much). Or is something like a 5s timeout enough to mitigate the behaviour? Shame to remove this feature when most BMCs handle it correctly. The 60s timeout hasn't led to a reduction in scrape success rate.

gebn · 2019-10-09T16:21:46Z

Scrape arrives
- Scrape timeout starts in /bmc handler
Eventually we get to the Collect() method
- Collect timeout starts at beginning of Collect()
Assuming first scrape, we establish a session; session-less commands are re-sent after the default 1s timeout
Session is established, session-based commands begin
Each command is sent with a 2s context (controls end-to-end, including retries which are each 1s; retries use an exponential back-off). If it fails, we retry once with a new session.

Retry if: no response, malformed response, or completion code is CompletionCodeNodeBusy.
Close (and maybe re-establish session) if: no response, malformed/truncated response.

gebn · 2019-10-12T16:31:41Z

Given the suspected buffer implementation, it would likely affect session-less commands as well as those sent within a session, so the timeout would have to be applied to both. Need to try spamming an affected BMC with session-less commands to verify behaviour.

gebn closed this as completed in f7fc93f Sep 26, 2019

gebn reopened this Sep 27, 2019

gebn added the bug Something isn't working label Oct 7, 2019

gebn added this to the Stable milestone Oct 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some BMCs start sending garbage values after a while #29

Some BMCs start sending garbage values after a while #29

gebn commented Sep 17, 2019 •

edited

Loading

gebn commented Sep 17, 2019

gebn commented Sep 27, 2019

gebn commented Oct 6, 2019

gebn commented Oct 7, 2019 •

edited

Loading

gebn commented Oct 9, 2019 •

edited

Loading

gebn commented Oct 12, 2019

Some BMCs start sending garbage values after a while #29

Some BMCs start sending garbage values after a while #29

Comments

gebn commented Sep 17, 2019 • edited Loading

gebn commented Sep 17, 2019

gebn commented Sep 27, 2019

gebn commented Oct 6, 2019

gebn commented Oct 7, 2019 • edited Loading

gebn commented Oct 9, 2019 • edited Loading

gebn commented Oct 12, 2019

gebn commented Sep 17, 2019 •

edited

Loading

gebn commented Oct 7, 2019 •

edited

Loading

gebn commented Oct 9, 2019 •

edited

Loading