Some BMCs start sending garbage values after a while #29
Comments
No need to do fancy stuff with checking if values have changed (that's unreliable anyway). A badly behaved BMC will look like:
We never treat a BMC "differently", e.g. by putting it into a special mode - we just handle what we see, which is a better approach.
Disable command retries, increase timeout.
Increasing the per-command timeout to 1m fixed this, so it's caused by sending retries. We could extend the timeout, however the packet will likely be sent along the same path. Probably better to let it fail and re-establish the session (which may also fail, but that's better than leaving it in a bad state).
Having the option to disable retries in a session (setting the per-command timeout to 0) could be a solution here. Or just disable them full-stop - it shouldn't be possible to hold the library wrong (much). Or is something like a 5s timeout enough to mitigate the behaviour? Shame to remove this feature when most BMCs handle it correctly. The 60s timeout hasn't led to a reduction in scrape success rate.
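A rough sketch of the "timeout of 0 disables retries" option, written in Python for brevity rather than in the library's own language. `send_with_retries`, `send_once`, and the backoff values are hypothetical names chosen for illustration, not the library's API.

```python
import time
from typing import Callable, Optional


def send_with_retries(
    send_once: Callable[[], Optional[bytes]],
    timeout: float,
    backoff: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Send a command, retrying with exponential backoff until `timeout` seconds pass.

    `send_once` (hypothetical) returns the response bytes, or None if nothing
    usable came back. A timeout of 0 means exactly one attempt, so retries are
    effectively disabled.
    """
    deadline = clock() + timeout
    while True:
        response = send_once()
        if response is not None:
            return response
        # Give up if the next retry could not start before the deadline.
        if clock() + backoff > deadline:
            raise TimeoutError("no usable response; re-establish the session")
        sleep(backoff)
        backoff *= 2
```

With `timeout=0` the first failure raises immediately, which leaves the caller to let the command fail and re-establish the session instead of sending a retry.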
Retry if: no response, malformed response, or completion code is
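The retry conditions listed above could be captured in a single predicate, sketched here in Python. The `Response` type and `should_retry` are hypothetical, and because the specific completion code is not given above, the set of retryable codes is passed in rather than assumed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Response:
    """Hypothetical parsed response; not the library's type."""
    valid: bool           # False if the packet could not be parsed
    completion_code: int  # completion code carried in the response


def should_retry(response: Optional[Response], retryable_codes: FrozenSet[int]) -> bool:
    """Return True if the command should be sent again."""
    if response is None:    # no response arrived in time
        return True
    if not response.valid:  # malformed response
        return True
    # Otherwise retry only for completion codes the caller marked as retryable.
    return response.completion_code in retryable_codes
```

Passing an empty set for `retryable_codes` restricts retries to the "no response" and "malformed response" cases.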
Given the suspected buffer implementation, it would likely affect session-less commands as well as those sent within a session, so the timeout would have to be applied to both. Need to try spamming an affected BMC with session-less commands to verify behaviour.
Sometimes it's the whole field, sometimes it's just a few bytes.
Definitely a bug, as if it were corruption, the checksum validation would fail (is the checksum definitely there for these packets?).
If a BMC shows strangeness, it's fine to treat it with care (e.g. new connection each scrape) until the process is killed, even between `Close()`s on the collector. Could be an old socket full stop - not just the connection on it.
`bmc_up` and/or `chassis_cooling_fault` and/or `chassis_powered_on` flaps. You only need the first one to identify this. If `bmc_up == 0`, close the session before finishing collecting, so it is re-established next scrape. The debug mode in #23 would help here. Particularly the error returned by the command exec attempt.
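A minimal sketch of the "close the session when `bmc_up == 0`" idea, in Python rather than the exporter's own language. `Collector`, `new_session`, `get_chassis_status()`, and `close()` are hypothetical stand-ins for illustration only.

```python
from typing import Any, Callable, Dict, Optional


class Collector:
    """Keeps one session per BMC and drops it whenever a scrape fails."""

    def __init__(self, new_session: Callable[[], Any]) -> None:
        self._new_session = new_session  # hypothetical session factory
        self._session: Optional[Any] = None

    def collect(self) -> Dict[str, float]:
        metrics: Dict[str, float] = {"bmc_up": 0.0}
        try:
            if self._session is None:
                self._session = self._new_session()
            metrics.update(self._session.get_chassis_status())
            metrics["bmc_up"] = 1.0
        except Exception:
            # Any failure leaves bmc_up at 0; a real collector would log the error.
            pass
        if metrics["bmc_up"] == 0 and self._session is not None:
            # Close before finishing so the next scrape establishes a fresh session.
            try:
                self._session.close()
            except Exception:
                pass
            self._session = None
        return metrics
```

A healthy BMC keeps its session across scrapes, while a failed scrape costs one session re-establishment on the next attempt.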