This issue was moved to a discussion.

node monitoring: report errors through respondd #3320

Closed
Djfe opened this issue Jul 19, 2024 · 1 comment

Comments

@Djfe
Contributor

Djfe commented Jul 19, 2024

#3319 led me to an idea:
respondd should be able to report that the device is in an error state.
Possible error states:

  • /overlay is read-only
  • /overlay doesn't have enough erase-blocks
  • vpn doesn't connect despite having link and an IP on WAN
  • lost connection to gateway
  • the last update failed to install
  • autoupdater cannot find device in manifest
  • manifest doesn't exist
  • autoupdater failed to connect to update server y
  • a wifi interface failed to come up
  • lost wan connection (info string)
  • dfs event detected at time y (info string)
  • NAND flash has y badblocks (info string)

The info would also be available when you are connected to an offline node via the wifi mesh but don't yet have your key on the affected device. You could read out the error state before rebooting, since a reboot deletes all logs.

Communities could define custom errors like these:

  • available space on VM is smaller than y (updates cannot be installed)
  • RAM is smaller than y and not enough (VM)
  • edgerouter x: there are bad blocks/there are no bad blocks. (and maybe a further info string: needs to be updated manually)
  • custom error (mcu timeout, some ath10k error, etc.)

These are just examples. Some communities have already done similar things in the past by having a package change the release name or the hostname. But those approaches were fairly limited.
example device: https://map.freifunk-winterberg.net/#!/de/map/fcecda7cc036

These error messages could be displayed on the map and also evaluated further in Grafana (including timestamps of when the device started showing an error, and how often it occurred).
They could help in evaluating major version updates.

I'm not sure how verbose we want to make these messages, since they are queried constantly. We could define error codes instead of sending full strings over the air.
Or we could define a new data type, in addition to the current ones, through which the full message can be queried:

  • nodeinfo: 158
  • statistics: 159
  • neighbours: 160
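To illustrate the error-code idea, here is a minimal sketch of how numeric codes could replace full strings on the wire. The enum name, the code numbers and the JSON encoding are assumptions made for this example; nothing above assigns actual codes.

```python
import json
from enum import IntEnum


class NodeError(IntEnum):
    # Hypothetical numeric codes; the proposal does not assign any numbers.
    OVERLAY_READ_ONLY = 1
    VPN_NOT_CONNECTING = 2
    GATEWAY_LOST = 3
    UPDATE_FAILED = 4


# Full messages would live on the consumer side (map, Grafana), not on the wire.
MESSAGES = {
    NodeError.OVERLAY_READ_ONLY: "/overlay is read-only",
    NodeError.VPN_NOT_CONNECTING: "vpn doesn't connect despite having link and an IP on WAN",
    NodeError.GATEWAY_LOST: "lost connection to gateway",
    NodeError.UPDATE_FAILED: "the last update failed to install",
}


def encode_compact(codes):
    """Serialize only the numeric codes."""
    return json.dumps([int(c) for c in codes], separators=(",", ":"))


def encode_verbose(codes):
    """Serialize the full message strings, for comparison."""
    return json.dumps([MESSAGES[c] for c in codes], separators=(",", ":"))


def decode_compact(payload):
    """Translate received codes back into their messages."""
    return [MESSAGES[NodeError(c)] for c in json.loads(payload)]
```

With this table, `encode_compact([NodeError.OVERLAY_READ_ONLY, NodeError.GATEWAY_LOST])` yields `[1,3]`, which the consumer maps back to the two message strings. The trade-off is that community-defined custom errors would need either a reserved code range or a fallback to strings.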

What are your thoughts on my idea?

@Djfe
Contributor Author

Djfe commented Jul 19, 2024

Some format suggestions:

Since it would be too complex to design a system that resolves errors, the node will always report all errors since the last boot.
edit: there could be a few errors that emit "resolved" messages, which resolve previous messages of a specific type (so the map or Grafana only shows current errors).
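As a rough sketch of how a consumer such as the map could apply "resolved" messages: replay the reports in order and drop a kind once it is resolved. The `(kind, state)` tuple layout and the state names `raised`/`resolved` are assumptions for illustration only.

```python
def current_errors(reports):
    """Return the error kinds still open after replaying reports in order.

    Each report is a (kind, state) tuple; state is "raised" or "resolved".
    Both the tuple layout and the state names are hypothetical.
    """
    open_kinds = {}  # dicts keep insertion order, so the latest raise is last
    for kind, state in reports:
        if state == "resolved":
            open_kinds.pop(kind, None)
        elif state == "raised":
            open_kinds.pop(kind, None)
            open_kinds[kind] = True
        else:
            raise ValueError("unknown state: " + state)
    return list(open_kinds)
```

For example, replaying `[("wan-lost", "raised"), ("overlay-ro", "raised"), ("wan-lost", "resolved")]` leaves only `overlay-ro` open.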

the respondd format would be a list containing items of:

  • error-type: info/warning/error
  • error-message: string
  • error-count: counting the number of reports since the last boot
  • error-date: timestamp of the last occurrence

This list is sorted in the order in which the errors appeared. If an error reappears, it is moved to the end of the list.
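The list format and its ordering rule could be sketched like this. The field names come from the list above; the class name `ErrorLog`, keying entries by message text, and the use of Unix timestamps are assumptions made for the example.

```python
import time


class ErrorLog:
    """In-memory list of reports since boot, least recently seen first."""

    def __init__(self):
        # message -> entry; Python dicts preserve insertion order.
        self._entries = {}

    def report(self, error_type, message, now=None):
        if error_type not in ("info", "warning", "error"):
            raise ValueError("unknown error-type: " + error_type)
        timestamp = int(time.time() if now is None else now)
        # Remove any previous entry so the re-insert lands at the end.
        previous = self._entries.pop(message, None)
        count = previous["error-count"] + 1 if previous else 1
        self._entries[message] = {
            "error-type": error_type,
            "error-message": message,
            "error-count": count,
            "error-date": timestamp,
        }

    def as_list(self):
        """Return the entries in the order respondd would send them."""
        return list(self._entries.values())
```

Reporting "/overlay is read-only", then "lost wan connection", then "/overlay is read-only" again gives a list with the WAN entry first and the overlay entry last, with an error-count of 2.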

nodeinfo could carry an id counter for the errors, so that yanic only queries respondd for new errors if the counter has changed since the last request. (Either this or a timestamp of the last error, but a timestamp could lead to race conditions.)
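On the collector side, the counter check might look roughly like the following. This is not yanic code: the `error_counter` field name, the `ErrorPoller` class and the `fetch_errors` callback are all hypothetical.

```python
class ErrorPoller:
    """Fetches the full error list only when the advertised counter changes."""

    def __init__(self, fetch_errors):
        self._fetch_errors = fetch_errors  # callable(node_id) -> list of errors
        self._last_seen = {}               # node_id -> last counter value
        self._cache = {}                   # node_id -> last fetched error list

    def on_nodeinfo(self, node_id, nodeinfo):
        # "error_counter" is an assumed field name, not part of nodeinfo today.
        counter = nodeinfo.get("error_counter", 0)
        if self._last_seen.get(node_id) != counter:
            self._cache[node_id] = self._fetch_errors(node_id)
            self._last_seen[node_id] = counter
        return self._cache.get(node_id, [])
```

Comparing with `!=` rather than `>` means a reboot that resets the counter to a lower value still triggers a fresh query.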

This should not interfere with existing systems like yanic, since it only adds a new feature.

It's useful for node monitoring and systems like Node Alarm https://nodealarm.freifunk-stuttgart.de/sign-in and Node Monitor https://play.google.com/store/apps/details?id=net.freifunk.darmstadt.nodewhisperer&hl=en_US
It might be useful for automatic release testing.

@freifunk-gluon freifunk-gluon locked and limited conversation to collaborators Jul 19, 2024
@blocktrron blocktrron converted this issue into discussion #3321 Jul 19, 2024
