Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some BMCs start sending garbage values after a while #29

Open
gebn opened this issue Sep 17, 2019 · 6 comments
Open

Some BMCs start sending garbage values after a while #29

gebn opened this issue Sep 17, 2019 · 6 comments
Labels
bug Something isn't working
Milestone

Comments

@gebn
Copy link
Owner

gebn commented Sep 17, 2019

  • Firmware version changes (with a minor version >99, which is impossible for a 1 byte BCD value)
  • GUID changes

Sometimes its the whole field, sometimes it's just a few bytes.
Definitely a bug, as if it were corruption, the checksum validation would fail (is the checksum definitely there for these packets?).

If a BMC shows strangeness, it's fine to treat it with care (e.g. new connection each scrape) until the process is killed, even between Close()s on the collector.

Could be an old socket full stop - not just the connection on it.
bmc_up and/or chassis_cooling_fault and/or chassis_powered_on flaps. You only need the first one to identify this. If bmc_up == 0, close the session before finishing collecting, so it is re-established next scrape.
The debug mode in #23 would help here. Particularly the error returned by the command exec attempt.

@gebn
Copy link
Owner Author

gebn commented Sep 17, 2019

No need to do fancy stuff with checking if values have changed (that's unreliable anyway).
One or more commands failing quickly (with an error that we don't currently log) is the reliable indicator of issues.
On receipt of an error that is not a timeout, establish a new session and re-try the collection (up to the ctx expiry, with back-off).
Need to create an internal collector that we can back-off. We cannot send the same metric twice to Collect()'s channel, so we need to build up a struct of the scrape result, then call .Send(ch) on it when we're happy with it.

A badly behaved BMC will look like:

  1. Establish session, collect values.
  2. Session exists, collect values, error, re-establish, collect values.
  3. Session exists, collect values, error, re-establish, collect values.
    ...

We never treat a BMC "differently", e.g. by putting it into a special mode - we just handle what we see, which is a better approach.

@gebn gebn closed this as completed in f7fc93f Sep 26, 2019
@gebn gebn reopened this Sep 27, 2019
@gebn
Copy link
Owner Author

gebn commented Sep 27, 2019

Disable command retries, increase timeout.

@gebn
Copy link
Owner Author

gebn commented Oct 6, 2019

Increasing the per-command timeout to 1m fixed this, so it's caused by sending retries. We could extend the timeout, however the packet will likely be sent along the same path. Probably better to let it fail and re-establish the session (which may also fail, but that's better than leaving it in a bad state).

@gebn gebn added the bug Something isn't working label Oct 7, 2019
@gebn
Copy link
Owner Author

gebn commented Oct 7, 2019

Having the option to disable retries in a session (setting the per-command timeout to 0) could be a solution here. Or just disable them full-stop - it shouldn't be possible to hold the library wrong (much). Or is something like a 5s timeout enough to mitigate the behaviour? Shame to remove this feature when most BMCs handle it correctly. The 60s timeout hasn't led to a reduction in scrape success rate.

@gebn gebn added this to the Stable milestone Oct 9, 2019
@gebn
Copy link
Owner Author

gebn commented Oct 9, 2019

  1. Scrape arrives
    • Scrape timeout starts in /bmc handler
  2. Eventually we get to the Collect() method
    • Collect timeout starts at beginning of Collect()
  3. Assuming first scrape, we establish a session; session-less commands are re-sent after the default 1s timeout
  4. Session is established, session-based commands begin
  5. Each command is sent with a 2s context (controls end-to-end, including retries which are each 1s; retries use an exponential back-off). If it fails, we retry once with a new session.

Retry if: no response, malformed response, or completion code is CompletionCodeNodeBusy.
Close (and maybe re-establish session) if: no response, malformed/truncated response.

@gebn
Copy link
Owner Author

gebn commented Oct 12, 2019

Given the suspected buffer implementation, it would likely affect session-less commands as well as those sent within a session, so the timeout would have to be applied to both. Need to try spamming an affected BMC with session-less commands to verify behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant