node monitoring: report errors through respondd #3320

Djfe · 2024-07-19T11:52:20Z

#3319 led me to an idea:
respondd should be able to report that the device is in an error state.
Possible error states:

/overlay is read-only
/overlay doesn't have enough erase-blocks
VPN doesn't connect despite having link and an IP address on WAN
lost connection to gateway
the last update failed to install
autoupdater cannot find device in manifest
manifest doesn't exist
autoupdater failed to connect to update server y
a Wi-Fi interface failed to come up
lost WAN connection (info string)
DFS event detected at time y (info string)
NAND flash has y badblocks (info string)

This information would also be available when you are connected to an offline node via the Wi-Fi mesh but don't yet have a key on the affected device. You could read out the error state before rebooting, which deletes all logs.

Communities could define custom errors like these:

available space on VM is smaller than y (updates cannot be installed)
RAM is smaller than y and therefore insufficient (VM)
EdgeRouter X: there are bad blocks/there are no bad blocks (and maybe a further info string: needs to be updated manually)
custom error (mcu timeout, some ath10k error, etc.)

These are just examples. Some communities have already done similar things in the past by using a package to change the release name or the hostname, but those approaches were fairly limited.
Example device: https://map.freifunk-winterberg.net/#!/de/map/fcecda7cc036

These error messages could be displayed on the map and evaluated further in Grafana (including timestamps of when a device started showing an error, and how often it occurs).
They could also help in evaluating major version updates.

I'm not sure how verbose we want to make these messages, since they are queried constantly. We could define error codes instead of sending full strings over the air.
Or we could define a new data type, in addition to the current ones, through which the full message can be queried:

nodeinfo: 158
statistics: 159
neighbours: 160
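
As a rough sketch of the error-code idea mentioned above (the alternative to sending full strings): the node would send only small integers, and the map or collector would translate them back into text. The code numbers, the names `ErrorCode`, `MESSAGES` and `describe` are all hypothetical; nothing like this exists in Gluon or respondd today.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    # Hypothetical numbering, chosen only for this illustration
    OVERLAY_READ_ONLY = 1
    VPN_NOT_CONNECTED = 2
    GATEWAY_LOST = 3
    UPDATE_FAILED = 4


# Lookup table kept on the receiving side (map, collector), not on the node
MESSAGES = {
    ErrorCode.OVERLAY_READ_ONLY: "/overlay is read-only",
    ErrorCode.VPN_NOT_CONNECTED: "vpn doesn't connect despite having link and an IP on WAN",
    ErrorCode.GATEWAY_LOST: "lost connection to gateway",
    ErrorCode.UPDATE_FAILED: "the last update failed to install",
}


def describe(code):
    """Translate a received code into text; unknown codes stay visible."""
    try:
        return MESSAGES[ErrorCode(code)]
    except ValueError:
        return f"unknown error code {code}"


# What a receiver might do with codes reported by a node
received = [1, 3, 99]
for code in received:
    print(code, describe(code))
```

The unknown-code fallback matters for the custom, community-defined errors: a receiver that doesn't know a code can still show that something was reported.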

What are your thoughts on my idea?

Djfe · 2024-07-19T12:19:41Z

Some format suggestions:

Since designing a system that resolves errors would be too complex, the node would always report all errors since the last boot.
Edit: a few error types could emit "resolved" messages that clear earlier messages of a specific type (so the map or Grafana only shows current errors).

The respondd format would be a list of items, each containing:

error-type: info/warning/error
error-message: string
error-count: number of reports since the last boot
error-date: timestamp of the last occurrence

This list is sorted in the order in which the errors appeared. If an error reappears, it is moved to the end of the list.
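
To make the list format and its ordering rule more concrete, here is a minimal sketch. It assumes that two reports count as the same error when type and message match, and it uses JSON for the output; the class names `ErrorEntry` and `ErrorLog` are made up for this example and are not part of respondd.

```python
from dataclasses import dataclass, asdict
import json
import time


@dataclass
class ErrorEntry:
    error_type: str      # "info", "warning" or "error"
    error_message: str
    error_count: int     # number of reports since the last boot
    error_date: int      # Unix timestamp of the last occurrence


class ErrorLog:
    """Hypothetical in-memory error list; it would start empty on every boot."""

    def __init__(self):
        # dicts keep insertion order, which gives us the list ordering
        self._entries = {}

    def report(self, error_type, message, now=None):
        now = int(time.time()) if now is None else now
        key = (error_type, message)   # assumption: this pair identifies an error
        # Removing and re-adding moves a repeated error to the end of the list
        old = self._entries.pop(key, None)
        count = old.error_count + 1 if old else 1
        self._entries[key] = ErrorEntry(error_type, message, count, now)

    def to_json(self):
        items = []
        for entry in self._entries.values():
            # Use the hyphenated field names from the proposal
            items.append({k.replace("_", "-"): v for k, v in asdict(entry).items()})
        return json.dumps(items)


log = ErrorLog()
log.report("error", "/overlay is read-only", now=100)
log.report("warning", "lost wan connection", now=200)
log.report("error", "/overlay is read-only", now=300)   # reappears, moves to the end
print(log.to_json())
```

Messages that embed a changing value (such as "at time y") would never match an earlier entry under this assumption, so a real design would probably key on an error identifier rather than the full string.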

nodeinfo could carry an ID counter for the errors, so yanic only queries respondd for new errors if the counter has changed since the last request (either this or a timestamp of the last error, but a timestamp could lead to race conditions).
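
The counter idea could work roughly like this on the collector side. This is only a sketch: `FakeNode` stands in for a real node, the field name `error_counter` is an assumption, and it does not use yanic's or respondd's actual interfaces.

```python
class FakeNode:
    """Stand-in for a node; a real collector would query respondd instead."""

    def __init__(self):
        self.error_counter = 0
        self.errors = []
        self.error_queries = 0   # how often the full error list was fetched

    def add_error(self, message):
        self.errors.append(message)
        self.error_counter += 1

    def nodeinfo(self):
        # Cheap request that is polled anyway; only carries the counter
        return {"error_counter": self.error_counter}

    def query_errors(self):
        # Expensive request returning the full error list
        self.error_queries += 1
        return list(self.errors)


class Collector:
    def __init__(self):
        self.last_counter = {}   # node id -> counter seen at the last fetch
        self.cached = {}         # node id -> last fetched error list

    def poll(self, node_id, node):
        counter = node.nodeinfo()["error_counter"]
        # "!=" instead of ">" so a counter reset after a reboot is noticed too
        if counter != self.last_counter.get(node_id):
            self.cached[node_id] = node.query_errors()
            self.last_counter[node_id] = counter
        return self.cached.get(node_id, [])


node = FakeNode()
collector = Collector()
collector.poll("node-a", node)           # first poll always fetches
collector.poll("node-a", node)           # counter unchanged, no fetch
node.add_error("lost connection to gateway")
current = collector.poll("node-a", node) # counter changed, fetch again
print(current, node.error_queries)
```

Comparing for inequality rather than "greater than" is deliberate here, since the errors are only kept since the last boot and the counter would presumably restart as well.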

This should not interfere with existing systems like yanic, since it only adds a new feature.

It's useful for node monitoring and for systems like Node Alarm https://nodealarm.freifunk-stuttgart.de/sign-in and Node Monitor https://play.google.com/store/apps/details?id=net.freifunk.darmstadt.nodewhisperer&hl=en_US
It might also be useful for automatic release testing.

freifunk-gluon locked and limited conversation to collaborators Jul 19, 2024

blocktrron converted this issue into discussion #3321 Jul 19, 2024
