Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator crash when adding a device #82004

Closed
mguaylam opened this issue Nov 12, 2022 · 73 comments
Closed

Coordinator crash when adding a device #82004

mguaylam opened this issue Nov 12, 2022 · 73 comments
Assignees

Comments

@mguaylam
Copy link

mguaylam commented Nov 12, 2022

The problem

Issue

Coordinator crash when a new devices joins or re-join the network.

ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart

Steps to reproduce the issue

Add a new device to the network or reapply mains power to it.

Controller fails immediately. May take more or less time for Home-Assistant to be able to reset the controller. Can end up stuck indefinitely when adding a new device until it is reset. Usually helps when Home-Assistant is restarted entirely.

Effect

Philips Hue lights start a flashing patern in sync while the controler is failed. Similar with IKEA devices but stroboscopic effect.
The coordinator is not reachable thus, it is not possible to control devices.

ezgif-4-fa36b502b4

Coordinator

Generic Aliexpress EFR32MG21 : https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p

Hue light

What version of Home Assistant Core has the issue?

core-2022.11.2

What type of installation are you running?

Home Assistant Container

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

home-assistant.log
Here’s my network backup as I wonder if something is not wrong with it :
ZHA backup 2022-11-12T16-58-07.216Z.txt

Example YAML snippet

zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    #source_routing: true
    ezsp_config:
      #CONFIG_SOURCE_ROUTE_TABLE_SIZE: 150 // Used to have source routing but disabling it does not change the behaviour.
      CONFIG_APS_ACK_TIMEOUT: 8000
      CONFIG_ADDRESS_TABLE_SIZE: 8
      CONFIG_APS_UNICAST_MESSAGE_COUNT: 12
    ota:
      ikea_provider: true
      otau_directory: /config/OTAU/
@home-assistant
Copy link

Hey there @dmulcahey, @Adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of zha can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Change the title of the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign zha Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


zha documentation
zha source
(message by IssueLinks)

@MattWestb
Copy link
Contributor

Its the first gen SonOff MG21 coordinator and it shall working with the standard firmware for the repro.

Is the network long in production or is in new formed ?
Its sound for my its somthing bad stored in the key storage in the flash that is being triggered then the coordinator is trying adding new devices in the tocken storage (it shall not being needed / done after forming the network the bellows is hashed TC-Link Keys).

Sonoff have making one hot fix then have running EZSP 7.X that is extending the NVM for token storage and going back to EZSP 6.10.3.0 or earlier is the coordinator crashing.
They is trying making one updated firmware but as hot fix they have making one GBL file that is deleting the NVM file (writion it over so the NCP is making one new clean one) and 6.x firmware is working OK.

If you is not having to many device try the fix by flashing the https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl. I think it shall being safe letting ZHA restoring the network after have flashing it.

The issue and the GBL file xsp1989/zigbeeFirmware#28 (comment).

@mguaylam
Copy link
Author

mguaylam commented Nov 13, 2022

Hey there @MattWestb ! Thank you for looking into my issue! 😃
To answer your questions : my network has been for a long time in production, now over 2 years. It started with a HUSBZB-1 but that obviously did not cut it out as the network grew. That’s where I went to the EFR32MG21 chipset about 6 months ago and saw the issue first appear 1 month ago when I purchased a new Philips Hue bulb.

For the NVM, is it something that is written by the network backup? Because I can observe this issue with 2 different coordinators with different firmware provider. I can destroy the NVM portion in question with the proposed firmware but considering I see this issue with 2 different coordinator, could that be involved in the issue? If I erase this portion, what would be the consequences?

I can do a network analysis if needed but for now, I’m not entirely sure where the issue reside. Would it definitively be the coordinator or Home-Assistant could be involved? I overcame lot’s of issues on my network over the time but this one hits me quite hard as it renders my whole network unusable and I can’t pinpoint where the issue reside. It also can happen when you don’t expect it with the problem that lights start to flash as well.

@Adminiuga
Copy link
Contributor

You need to provide the debug logs.
The very same chipset just works fine with Yellow and SkyBlue hardware. So it leaves two possibilities:

  • it's something specific to your setup
  • it's something specific to the dongle hardware/firmware (although firmware is pretty much standard between for this chipset)

@mguaylam
Copy link
Author

I posted this in the initial post but I might be misunderstanding what is meant by debug log, sorry if it is not what you are asking for.

home-assistant(1).log

This was generated with the following :

logger:
  default: info
  logs:
    homeassistant.core: debug
    homeassistant.components.zha: debug
    bellows.zigbee.application: debug
    bellows.ezsp: debug
    zigpy: debug
    zigpy_deconz.zigbee.application: debug
    zigpy_deconz.api: debug
    zigpy_xbee.zigbee.application: debug
    zigpy_xbee.api: debug
    zigpy_zigate: debug
    zigpy_znp: debug
    zhaquirks: debug

If you need other occurrences, I have several to look at. 😸

@Adminiuga
Copy link
Contributor

The dongle crashes with RESET_ASSERT, zha starts initialization and in middle of re-initialization it gets another assert error from the dongle. And stops responding.
Don't know if they have a newer firmware, I'd try that first.

@puddly
Copy link
Contributor

puddly commented Nov 13, 2022

Can you try this configuration?

zha:
  zigpy_config:
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE:              16  # FW:  32, ZHA:  16
      CONFIG_MULTICAST_TABLE_SIZE:             8  # FW:   8, ZHA:  16
      CONFIG_PACKET_BUFFER_COUNT:            250  # FW: 250, ZHA: 255
      CONFIG_SOURCE_ROUTE_TABLE_SIZE:         16  # FW: 200, ZHA:  16
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE:  2  # FW:   0, ZHA:   2

Related: itead/Sonoff_Zigbee_Dongle_Firmware#10

If this indeed fixes the problem and you can reliably reproduce the crash, it would be super helpful if you could help us narrow down which of the config options solves the problem. Something about Sonoff's build of EmberZNet is unstable.

@mguaylam
Copy link
Author

mguaylam commented Nov 14, 2022

So, currently I am using the https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p with the https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl firmware.

Just tried the configuration you gave me. Unfortunately, it still did it.
They mention those parameters in the firmware :
https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP

Configuration Parameter Value
Part EFR32MG21A020F768IM32
Version EZSP 6.10.3.0
CTUNE value 128
Address Table Size 32
Child Table Size 32
Source Routes 200
TX PB01
RX PB00

Here is the log :
home-assistant(config puddly).log

I can see that the NCP failed :
NCP entered failed state. Requesting APP controller restart

But also that there is no memory available for this configuration :
Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=250 configuration value: EzspStatus.ERROR_OUT_OF_MEMORY

As for the firmware, I can always try any of the following :
https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP
There is the 655 (Zigbee EmberZNet 6.5.5) which is on the EZSP v7.

Thank you for your kind help! 😺

@MattWestb
Copy link
Contributor

MattWestb commented Nov 14, 2022

If installing the linked EZSP 7.X and like going back to one lower version you must erasing the NVM / tocke with the GBL file i meted before or the NCP is crashing.

You can also trying one of Garys cooked for sonoff ZBB that have the same Zigbee-module and shall being pin compatible (its for hacked ZBB without signing that only the ZBB is using and shall working on the sticks 2).
https://github.com/grobasoz/zigbee-firmware/tree/master/Sonoff-ZBBridge.

PS. EZSP 6.5.5 is working but is not recommended the its have many bad bugs also with 6.8.X and 6.9.X and working good / recommended ones is 6.10.X and late 6.7.X.

@TheJulianJES
Copy link
Member

Does the issue only happen with that particular bulb? Although it very likely isn't the issue, were you ever able to pair it via Bluetooth to your phone? Firmware version 1.76.11 is almost two years old and there should be newer firmware available.
I wasn't able to find the firmware for your bulb, as the image_type_id : 65535 you mentioned doesn't exist.
But you should be able to update it through the Philips Hue app using your phone.

@mguaylam
Copy link
Author

mguaylam commented Nov 19, 2022

@TheJulianJES at first I believed it was this bulb in particular but then anything I pair (ex : an IKEA bulb) it does the same thing.
I was able to connect the Philips Hue bulb to my phone and update it thru Bluetooth, it now says it is at the latest version. It now has the firmware : 1.93.11.

The issue still persist, I am to believe the issue is not from the bulb but rather the coordinator.

Now, how I can diagnose this, it’s a little bit harder, I'm not sure where to look.

I can capture the network activity, but I don’t know if my answers will be there.

One thing I found interesting, someone was kind enough to show me their coordinator backup and I found one difference between mine and his, I have to note that a while back, I did migrate from one coordinator to another with zigpy since it was not yet implemented in Home-Assistant.

I can see that mine is missing this portion :

"stack_specific": {
            "ezsp": {
                "hashed_tclk": "**redacted**"

Would that be normal?

I can see in my first ever back-up the tc_link_key was indeed present. Is it why new devices have such a hard time joining but not older ones?

@MattWestb
Copy link
Contributor

I think your backup is done with normal TC-Link keys stored in the chip NVM / token storage in the flash chip and is getting problems then restoring it on the new coordinator that is using hashed TC-Link keys.

@puddly Is the coordinator restore from one other coordinator (not EZSP) to one EZSP working OK or if the backup of EZSP network was formed with one EZSP that was not using hashed TC-Link keys (old install with EM53X coordinator) and restoring the backup on one new with not formed network in the chip = forming one Hashed TC-Link key network ??

Trying ZHA Toolkit Service: ezsp_clear_keys for deleting the saved TC-Link keys ?
(= rejoining all devices that was having TC-Link keys (all ZB3 devices))

@dmulcahey
Copy link
Contributor

You can retroactively apply hashed link key settings. I’ve done it. I’ll dig the commands up later

@puddly
Copy link
Contributor

puddly commented Nov 19, 2022

All you need to do to upgrade to hashed link keys is to click the "Migrate" button and reconfigure the current radio. If you restore the most recent backup, it'll upgrade you to a hashed link key automatically when re-forming the network.

@mguaylam
Copy link
Author

mguaylam commented Nov 19, 2022

@dmulcahey i’d be very happy if that’s what I need to solve my issue!
@puddly that is strange, because even if I did migrate several times with the latest Home-Assistant, I did not see the hashed link key appear in my back-up. Is it supposed to?

@puddly
Copy link
Contributor

puddly commented Nov 19, 2022

Ah, I forgot it won't actually perform a restore if the current settings are identical to the new settings.

You will have to leave the current network first, either by:

  • purposefully migrating and clicking "Form a new network", and then migrating again and restoring from a backup
  • or by using one of the CLIs:
    • pip install git+https://github.com/zigpy/zigpy-cli.git && zigpy radio --baudrate 115200 ezsp /dev/... reset
    • pip install bellows && bellows --baudrate 115200 -d /dev/... leave

ZHA will auto-restore in the second scenario.

@mguaylam
Copy link
Author

Hey @puddly ! Thanks for helping me out!
I just did a reset with zigpy cli but when it formed the network again, the key is still absent from the backup.
bash-5.1# zigpy radio --baudrate 115200 ezsp /dev/ttyUSB0 reset

I’m not sure why.
"stack_specific": {},

Since I have my original key, could I write it in the backup and restore from it?

@puddly
Copy link
Contributor

puddly commented Nov 20, 2022

Something isn't adding up. Can you post a full debug log of the backup and restore?

$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 backup -z > backup.json
$ cat backup.json
$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 restore backup.json

@mguaylam
Copy link
Author

@puddly
Copy link
Contributor

puddly commented Nov 20, 2022

Thanks. According to the restore, one was written:

... stack_specific={'ezsp': {'hashed_tclk': 'a2473867c61c6c4e43e764b18dc95164'}} ...

Can you do another backup to confirm?

@mguaylam
Copy link
Author

mguaylam commented Nov 20, 2022

Sooo strange. Now it did :

        "stack_specific": {
            "ezsp": {
                "hashed_tclk": "a2473867c61c6c4e43e764b18dc95164"
            }
        },

Is something broken in Zigpy?

@puddly
Copy link
Contributor

puddly commented Nov 20, 2022

The exact same code is used by network formation, backup restoration, and the ZHA config flow so I think either the original network was never cleared or your browser may have cached the downloaded backup.

@Adminiuga
Copy link
Contributor

Does it still crash?

@mguaylam
Copy link
Author

mguaylam commented Nov 20, 2022

It does. 😞 I removed the power from the newer bulb and re-applied it and the coordinator crashed :
NCP entered failed state. Requesting APP controller restart

ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 643, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 665, in _reset_controller
    await self.initialize()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 76, in initialize
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 257, in load_network_info
    brd_manuf, brd_name, version = await self._get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 117, in _get_board_info
    return await self._ezsp.get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 299, in get_board_info
    (value,) = await self.getMfgToken(token)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

home-assistant_crash_2022_11_20.log

There is no crash when I do the same for older devices of the network.

I also can’t add new devices to the coordinator following the restore by zigpy cli.

@MattWestb : I’ll try erasing the NVM portion tomorrow as you recommended. You say it is used to store keys?

@MattWestb
Copy link
Contributor

@mguaylam If Puddly is not finding any other way i think flashing the NVM fix is one way also reflashing the EZSP 6.10.3 at the same time you have hocked up i think can being good (but the EZSP first and then the NVM fix).

The NVM fix is writing one empty file over the aria the token storage is in the flash.
Then the SOC is booting after the flash is making one new clean NVM that shall being OK and all old tokens is away.

I can see that you have writing one new IEEE then changing from the EM358X coordinator and it shall not being any problems as long the old coordinator is not online in your radio range.

Fast look in the log i finding little strange that the system is reading the manufacture tokens many times (first time its having problems) then the normal is only doing then initializing the coordinator and perhaps then (our Puddly) is doing one new backup.
Also its many timeouts but is not easy knowing if its the coordinator or slow system that is making that. Also source-routing is not working OK then the coordinator is not 100% communicating with all device and cant getting router records from not online devices (they is in the network and having the network key but have not syncing the frame counter with the coordinator and the TC-Link key can being wrong).

@mguaylam
Copy link
Author

mguaylam commented Nov 21, 2022

I flashed the 6.10.3 firmware from : https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl then https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl but the coordinator was not responding correctly on the serial connection. I did the inverse and it worked.

I still can’t add new devices now, it’s very strange. I also confirmed the coordinator still crash :
home-assistant_after_erasing_NVM+crash.log

I indeed came from a EM358X (HUSBZB-1) which I have get rid of a while ago. My system is a full fledged server so I’d be surprised there would be performance issues from there.

@mguaylam
Copy link
Author

mguaylam commented Nov 23, 2022

It is a HPE Tower Server running Fedora with pods built on Podman. It only has USB 2 ports. I tried an extension cable this morning to be sure it wasn’t that. My Wi-Fi access point is on channel 11 and my ZigBee network is on channel 15. No overlapping.

The coordinator I'm using is the easyIoT on v.6.10.3 built by xsp1989 but I tried the SonOff-E coordinator with ITEAD v.6.10.3 firmware as well in case the coordinator itself was the issue.

I also tried the easyIoT with v.6.7.9 built by xsp1989 in the hope it would be a bug in the 6.10.3 firmware.

As for restoring, I did a restore on both of these coordinators, never a new network. I tried a form a new network on the easyIoT before restoring but the result was the same.

@mguaylam
Copy link
Author

mguaylam commented Nov 28, 2022

I think I'm onto something. I need to analyze more; I can send also the capture file in private.
But first thing I see, when I unscrew and re-screw the bulb as shown in the GIF after it tries to join the network (0xbd0f) :
image

Then I can see an insane amount of network conflicts coming from my 4 Sinope thermostats and it goes very fast, i’m wondering if that’s what crash my coordinator :
image

The thing is, now I’m wondering if my issues started exactly after I added those thermostats to my network. But I never heard of anyone having such issue with their Sinope thermostats.
What is also strange is that those address conflicts only comes from my Sinope thermostats, nothing else.

Then I see a lot of beacons and I’m not sure where they are coming from, I see some from smart meters from my electricity company (but only a few) :
image

Then a non-tree link failure from my unscrewed bulb :
image

Then the network seem’s to start :
image
image

Do Philips bulbs flash like that when there is an address conflict? Would the address conflict broadcast-ed by the thermostats be the reason the coordinator crash?

@MattWestb
Copy link
Contributor

If i remember right is old HUE blinking then there is rejoining the network (Its in the black box of bad Zigbee devices then is not deleting children that have jumping / leave the network).

I think its not the coordinator that is the problem for the moment i think its the network that is stalling then so many device is out of sync (all is having the network key but the frame counter is not OK and cant updating all things then the TC-Link key is new for all devices).
With so much address and routing problems and the mesh network cant syncing / healing.
Is the TRVs router, Sleeping or no sleeping end devices ?
If end devices they is not making routing at all only talking with there parents.
But as with my HUE issue is they have jumping and the HUE lights is reporting them belonging to more then one routers then you is having one real address conflict (of HUEs).

If i was you i should setting up one new network with different PAN-ID, Extended PAN-ID and network key on one other Chanel (channel is not one must but is more safe) and adding / moving only some devices and see if they is working OK.
If sniffing you can see if you is getting strange routing problems or network conflicts in the mesh.

One other variant is shutting down all devices and power on only some at the time and trying getting them working / syncing but i think its the same work if all is going well but is much more if not working OK compared with doing one new network.

Also all devices need getting the new TC-Link key and must being resettled for getting it = the same work as making one new network.

@Adminiuga
Copy link
Contributor

Do Philips bulbs flash like that when there is an address conflict?

When there's an address conflict, the conflicting device would leave and re-join the network -> this would cause the light to blink because of the identify command.

Then I can see an insane amount of network conflicts coming from my 4 Sinope thermostats and it goes very fast, i’m wondering if that’s what crash my coordinator :

The "nwk conflict" is a network broadcast and it is possible it is causing a crash, because you have reported quite high setting for the broadcast table, unless CONFIG_BROADCAST_TABLE_SIZE: 15 setting was accepted successfuly.

Before you get the 1st address conflict, what is the command which triggers it? Usually it is one very close to the 1st address conflict messages in the batch.

What Address is reported to be conflicting? I see two scenarios for the address conflict:

  1. either the device(s) reporting the conflict have wrong information about IEEE<>NWK addr mapping when it hears a request
  2. the conflicting device is indeed using a conflicting NWK address. Check the packet capture if there was any other traffic from the same NWK

@Adminiuga
Copy link
Contributor

I used to have a lot of "NWK Conflicts" in the network and for some reason it was caused by Interference. After I moved coordinator a little bit, it just started working.

In the packet captures I was seeing this triggering the storm of conflicts:

Frame 72838: 37 bytes on wire (296 bits), 37 bytes captured (296 bits)
IEEE 802.15.4 Data, Dst: 0x2d79, Src: 0xccfa
    Frame Control Field: 0x8841, Frame Type: Data, PAN ID Compression, Destination Addressing Mode: Short/16-bit, Frame Version: IEEE Std 802.15.4-2003, Source Addressing Mode: Short/16-bit
    Sequence Number: 78
    Destination PAN: 0xbeef
    Destination: 0x2d79
    Source: 0xccfa
    [Extended Source: GreenPea_00:b4:4c:aa:c5 (00:15:5f:00:b4:4c:aa:c5)]
    [Origin: 72838]
    FCS: 0x1a6e (Correct)
ZigBee Network Layer Command, Dst: 0x2d79, Src: 0xccfa
    Frame Control Field: 0x1809, Frame Type: Command, Discover Route: Suppress, Destination, Extended Source Command
    Destination: 0x2d79
    Source: 0xccfa
    Radius: 32
    Sequence Number: 96
    Destination: 00:00:00_00:00:1f:da:0f (00:00:00:00:00:1f:da:0f)
    Extended Source: GreenPea_00:b4:4c:aa:c5 (00:15:5f:00:b4:4c:aa:c5)
    Command Frame: Rejoin Request
        Command Identifier: Rejoin Request (0x06)
        Capability Information: 0x00

Notice the destination address of Destination: 00:00:00_00:00:1f:da:0f (00:00:00:00:00:1f:da:0f) This is obviously wrong and I have no idea how on earth it was getting this address.

@MattWestb
Copy link
Contributor

Nice Pan Destination PAN: 0xbeef !!

Corrupted NVM in the device or buggy firmware in 0xccfa in Alexie case ?

I think then more or less all devices is not having one OK TC-Link key its no way getting the network back then devices cant doing private communication with TC and they need being retested and rejoined for getting it OK.

@TheJulianJES
Copy link
Member

Due to a couple of reasons, my main network is also running on a ZBDongle-E at the moment with this "custom" firmware.

While it has mostly been working fine (coming from a CC2652 where older firmware had sub-optimal performance but newer firmware crashed after some time), I experienced some weird behavior yesterday that may or may not be related to this issue. A bit of a backstory is required though:

HA version: 2022.12.0.dev20221127 (I also had some (similar) issues with 2022.11.4)

  • at 00:02, some lights started briefly flickering (Hue + LEDVANCE + GLEDOPTO Pro lights which were on) (Hue lights that were off didn't flicker) (note: the identify effect when a device joins the network was disabled)
  • two servers that have an individual Zigbee power measurement plug each also turned off briefly (relay of both plugs clicked off and back on shortly after) ("restore last state") (no UPS for the servers) (one of the servers is running my main HA instance at the moment)
  • At first, I thought this was a very brief power outage of a second or so maybe. Other PCs in the house kept running fine and lights which were off didn't seem to have any flickering (a possible explanation would be that the capacitors in the lights power supplies were still charged enough during the entire power loss, as nothing was really drawing power from them)
  • However, now I think that it's very possible that there was never an outage and multiple lights + power plugs randomly decided to toggle themselves very briefly for some reason. (also, the lights/plugs aren't in a group and no remote was triggered during that time)
  • After this incident, the servers rebooted and eventually came back up. Two GLEDOPTO Pro controllers didn't respond though. One of them seems to have reset itself somewhat (switched to a different color which should only come up when it's reset, as the device was set to "restore last state"). The other one is a dimmer only model, so I'm not sure if it was also reset. (It's also possible they didn't fully reset, but it definitely wasn't normal power cycle behavior.) (IIRC this light was also off during the possible "power loss" and the power supply it's hooked up to should have continued providing enough power on the LV side during the HV power loss of a second or so in that case.)
  • Some time ago, I had a similar issue where another controller stopped reacting and wouldn't re-join the network, even though it was well within distance of other routers. Restarting HA also always seemed to stop that device from working for some time (possibly because "routes are deleted" on newer EZSP version restarts? and they weren't correctly re-discovered?)
  • back to yesterday: Whilst trying to re-join the "GLEDOPTO Pro" controller, I was unable too (and by re-plugging the barrel connector too many times, I also somehow broke that device, as it's now producing high-pitched noises and getting very hot (duh))
  • so I grabbed an identical spare controller with the exact same firmware and tried to pair that as a replacement but was also unable to do so, even though the place is well covered with other Zigbee (3.0) routers (and the coordinator is also near that place)
  • Eventually, I brought it in the basement to do some testing and for some reason, it was instantly discovered. It did seem to re-join the network multiple times with different network ids though, before finally joining properly and getting interviewed.
  • Bringing it back upstairs (near the coordinator), it was also working fine now and I could control the device. I decided to remove it again (which also resets the device and allows it to be discovered again), but it either didn't appear for pairing or didn't complete the interview process at all.
  • A similar issue was also seen on 2022.11.4 where I was unable to pair a Philips Hue bulb in another place (with a few Hue bulbs around) and I had to bring it into the basement (further away from the coordinator, but near a bunch of other Hue lights). There, it paired, but also re-joined the network a couple of times (changing the network id).
  • A video of a "GLEDOPTO Pro Zigbee 3.0 CC/WW GL-C-006P (EZSP-based)" and a Philips Hue bulb (also likely EZSP-based) (failing) to join the network is attached below:
interview.mp4

I have very little time at the moment, so I didn't really get to debugging this (video was also recorded on a phone). I should have saved the debug logs from that pairing to look at them later though.

I used to have a lot of "NWK Conflicts" in the network and for some reason it was caused by Interference. After I moved coordinator a little bit, it just started working.

This is why I even wrote up my interesting story of yesterday night. The coordinator is already on a USB extension and so on (not near any USB 3 plugs, APs, ...) and if the network works, it performs very well (light scene changes, ...).
It's interesting to me that I now experience NwkId conflicts and also only have pairing working in some places. (those were never problems with my CC2652 coordinator that was at the same location)
The flashing of some lights (with the identify effect being disabled) is also weird.

So yeah, uhm, not sure where exactly I was going with this, but maybe it contains some useful information to debug this problem. Also not sure how much longer my main network is going to be run off a ZBDongle-E.

@Adminiuga
Copy link
Contributor

Do you get NWKID conflicts in the logs now?

@MattWestb
Copy link
Contributor

I think both was running network with stored TC-Link key and have merger to one EZSP that is using hashed Link-Keys. Also standard EZSP is only having 12 tokens (or less) for TC-Link so if merging from TI coordinator and have many Zigbee 3 devices its not enough token to storing all in the flash.

I think in both cases is the network routing badly broken so the sending permit join the network cant broadcasting it to all devices then some / many router is blocked with some false network settings.

@TheJulianJES Did you flashing the EZSP with the IEEE from the coordinator before then merger the network or is it running on the chip default ?

And the XSP1989 firmware is from the original Sonoff ZBB but is shall working well on the new E-Dongle then the the pins used is the same and also EZSP 6.10.3 shall being stable on this devices and is more secure running then 7.X if getting problem.

@MattWestb
Copy link
Contributor

MattWestb commented Nov 29, 2022

Here the third is coming !!

One of my Billy RCP was having problems with IKEA controllers so i "converting" back to NCP 6.10.3.0 by stooping Silabs addon and flashing the chip with NCP firmware and pointing ZHA to the comport and restarting.

I have one IKEA GU10 WS2 and one STOFTMOLN (Light engine 3 WS) as routers and one wireless dimmer and one Symphonisk.
I cant getting both controller joining OK if deleting them or trying rejoining them but with little luck i can getting one doing checkins but then i getting the other working is the first not responsible in the network.

So i doing little sniffing then trying joining the controllers and i is getting the same error with tree error. Sadly i was not saving the sniff.
Trying repower the router but is not helping.
Trying initiating one new network from ZHA and have the network parameters in config but i getting one error but ZHA is starting OK.
The network is not working so somthing is changed (the network key is in wireshark and perhaps more) so i resetting the 2 routers and adding them /without deleting them) and they is working OK.
Then doing the same with the 2 controller and they is going in without any problems and also reconfigure is working OK.
Also they is doing checkins very nicely every hour now :-)))
So its looks being one way getting back resetting the network and readding all devices (without deleting them in ZHA).

The legend is that the RCP test have bin running on different chips but ZHA have restoring the network all the time and looks working OK.
Then the coordinator IEEE is changes i think its broken the hashed link key (or more exactly the hash is not working then its being made from the original IEEEE and is wrong with the new one).

In my test setup it was looking like unicast to one joining sleeper was not working and ZHA is timing out all the time after the device have getting the network key and little more.

I have not looking wot commands the TC is using the TC-Link key but 100% with routing things and some direct unicast. Most of the network traffic is only using network key for protection so shall not being any problems.

I think if having this problems with one large production network is the best forming one new one with network key, PAN-ID and extended pan-id changed and start resetting / adding routers near the coordinator and extending the routers so getting full coverage and last adding the end devices.

I shall doing more testing by flashing RCP firmware and letting ZHA attaching the network from Silabs addon to the chip (its normally working OK) and see if its making more problems with the TC-Link key then adding devices.
As long the IEEE is not changed it shall working OK only the manufacture tokens and the firmware version is being changed and the network, TC-Link key and the hash seed shall not being touched.

Edit: The RCP / NCP have being changed some times then it was one IKEA Markus and ZHA have merger it OK all the swapping but somthing have going wrong with the TC-Link key.

@MattWestb
Copy link
Contributor

In 3.2.2 Trust Center Link Key is the answer then the TC-Link key is being used and one Zigbee 3 network cant working stable if its not being sett OK in all (router) device and also not (re)joining devices OK.
an1233-zigbee-security.pdf

@mguaylam
Copy link
Author

mguaylam commented Nov 29, 2022

Do Philips bulbs flash like that when there is an address conflict?

When there's an address conflict, the conflicting device would leave and re-join the network -> this would cause the light to blink because of the identify command.

Weirdly, I have disabled the identify effect upon joining but still get that effect from the GIF in the first post. It start as soon as the Sinopé thermostats declares an address conflict. Also happens then the coordinator is unplugged.

Then I can see an insane amount of network conflicts coming from my 4 Sinope thermostats and it goes very fast, i’m wondering if that’s what crash my coordinator :

The "nwk conflict" is a network broadcast and it is possible it is causing a crash, because you have reported quite high setting for the broadcast table, unless CONFIG_BROADCAST_TABLE_SIZE: 15 setting was accepted successfuly.

I can try but this happens even with the coordinator unplugged (basically offline).

Before you get the 1st address conflict, what is the command which triggers it? Usually it is one very close to the 1st address conflict messages in the batch.

I can’t see a direct conflict to a certain message but it seem’s to always be after 0xbd0f which is my new Philips Bulb. 0x965b is one of my thermostat. This screenshot is the first network conflict since unplugging my coordinator. It took about 10 minutes before the thermostats went crazy and all my (still on) lights started flashing. I have to note that it does not seem to be particular to this bulb, I tried another Philips Hue bulb of the same model with the same result, even an IKEA bulb does the same.

image

What Address is reported to be conflicting? I see two scenarios for the address conflict:

So, from my understanding, if a device see’s a network address conflict, it will broadcast a network conflict and the joining device is supposed to leave and try to come back with another address? Because I can’t see an indication of the culpit in the message. There is also no one else in the capture with 0x965b except : 50:0b:91:40:00:02:bc:68

Content of the shown frame selected above.


Frame 31851: 113 bytes on wire (904 bits), 113 bytes captured (904 bits) on interface \Device\NPF_Loopback, id 0
Null/Loopback
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.1
User Datagram Protocol, Src Port: 17754, Dst Port: 17754
ZigBee Encapsulation Protocol, Channel: 15, Length: 49
IEEE 802.15.4 Data, Dst: Broadcast, Src: 0x9c44
    Frame Control Field: 0x8841, Frame Type: Data, PAN ID Compression, Destination Addressing Mode: Short/16-bit, Frame Version: IEEE Std 802.15.4-2003, Source Addressing Mode: Short/16-bit
    Sequence Number: 230
    Destination PAN: 0xeade
    Destination: 0xffff
    Source: 0x9c44
    [Extended Source: PhilipsL_01:09:04:58:0e (00:17:88:01:09:04:58:0e)]
    [Origin: 36]
    TI CC24xx-format metadata: FCS OK
ZigBee Network Layer Command, Dst: Broadcast, Src: 0x965b
    Frame Control Field: 0x1209, Frame Type: Command, Discover Route: Suppress, Security, Extended Source Command
    Destination: 0xfffd
    Source: 0x965b
    Radius: 9
    Sequence Number: 180
    Extended Source: IEEERegi_40:00:02:bc:68 (50:0b:91:40:00:02:bc:68)
    ZigBee Security Header
    Command Frame: Network Status
        Command Identifier: Network Status (0x03)
        Status Code: Address Conflict (0x0d)
        Destination: 0x0000
1. either the device(s) reporting the conflict have wrong information about IEEE<>NWK addr mapping when it hears a request

That would mean somehow my 4 thermostats are affected as they are the only ones reporting conflicts and when 1 starts, all 4 goes. How can I verify that they have the right information?

2. the conflicting device is indeed using a conflicting NWK address. Check the packet capture if there was any other traffic from the same NWK

Unfortunately, I can’t see any conflict unless I misunderstand how the conflict process works. Devices reporting a conflict are alone using their address, I don’t see any other MAC using them.

@Adminiuga
Copy link
Contributor

Can you post the packet # 31825 here? The route request one? And also post the IEEE address of your coordinator. The original one and the new one if you overwrote it with backup.

I had some devices using incorrect coordinator IEEE address for some reason to even though it was overwritten with the new one.

@mguaylam
Copy link
Author

mguaylam commented Nov 29, 2022

Packet # 31825

Frame 31825: 123 bytes on wire (984 bits), 123 bytes captured (984 bits) on interface \Device\NPF_Loopback, id 0
Null/Loopback
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.1
User Datagram Protocol, Src Port: 17754, Dst Port: 17754
ZigBee Encapsulation Protocol, Channel: 15, Length: 59
    Protocol ID String: EX
    Protocol Version: 2
    Type: Data (1)
    Channel ID: 15
    Device ID: 56440
    LQI/CRC Mode: LQI
    Link Quality Indication: 252
    Timestamp: Nov 29, 2022 15:29:18.593999999 Est
    Sequence Number: 30068
    .011 1011 = Length: 59 bytes
IEEE 802.15.4 Data, Dst: Broadcast, Src: 0xbd0f
    Frame Control Field: 0x8841, Frame Type: Data, PAN ID Compression, Destination Addressing Mode: Short/16-bit, Frame Version: IEEE Std 802.15.4-2003, Source Addressing Mode: Short/16-bit
    Sequence Number: 127
    Destination PAN: 0xeade
    Destination: 0xffff
    Source: 0xbd0f
    [Extended Source: PhilipsL_01:0c:e5:bf:ab (00:17:88:01:0c:e5:bf:ab)]
    [Origin: 22]
    TI CC24xx-format metadata: FCS OK
        FCS Valid: True
        RSSI: -37 dB
        LQI Correlation Value: 126
ZigBee Network Layer Command, Dst: Broadcast, Src: 0xbd0f
    Frame Control Field: 0x1209, Frame Type: Command, Discover Route: Suppress, Security, Extended Source Command
        .... .... .... ..01 = Frame Type: Command (0x1)
        .... .... ..00 10.. = Protocol Version: 2
        .... .... 00.. .... = Discover Route: Suppress (0x0)
        .... ...0 .... .... = Multicast: False
        .... ..1. .... .... = Security: True
        .... .0.. .... .... = Source Route: False
        .... 0... .... .... = Destination: False
        ...1 .... .... .... = Extended Source: True
        ..0. .... .... .... = End Device Initiator: False
    Destination: 0xfffc
    Source: 0xbd0f
    Radius: 30
    Sequence Number: 191
    Extended Source: PhilipsL_01:0c:e5:bf:ab (00:17:88:01:0c:e5:bf:ab)
    ZigBee Security Header
        Security Control Field: 0x28, Key Id: Network Key, Extended Nonce
            ...0 1... = Key Id: Network Key (0x1)
            ..1. .... = Extended Nonce: True
        Frame Counter: 28717356
        Extended Source: PhilipsL_01:0c:e5:bf:ab (00:17:88:01:0c:e5:bf:ab)
        Key Sequence Number: 0
        Message Integrity Code: ee4fbd61
        [Key: 2478a58e6c32cc0067bc1620049082ec]
        [Key Label: ]
    Command Frame: Route Request
        Command Identifier: Route Request (0x01)
        Command Options: 0x20, Extended Destination, Many-to-One Discovery: Not Many-to-One
            .0.. .... = Multicast: False
            ..1. .... = Extended Destination: True
            ...0 0... = Many-to-One Discovery: Not Many-to-One (0x0)
        Route ID: 4
        Destination: 0x0000
        Path Cost: 0
        Extended Destination: SiliconL_ff:fe:0f:dd:1e (60:a4:23:ff:fe:0f:dd:1e)

Current IEEE of the coordinator

00:0d:6f:00:11:f5:dc:78

Which concord's with my original HUSBZB-1 back-up : 00:0d:6f:00:11:f5:dc:78

As for the previous IEEE, do you know if we can see the one we overwritten? i didn’t take note of it.

Interestingly, the network conflicts started around the first route request of 0xbd0f after some filtering.
Also when looking at the multiple route requests, when the broadcast request pass thru a thermostat I see a changed extended destination?
image
But thru another Philips Hue bulb :
image

Just a word to thank you for looking into my issue.

@TheJulianJES
Copy link
Member

Did you flashing the EZSP with the IEEE from the coordinator before then merger the network or is it running on the chip default ?

Yes, I did. The original IEEE should still come from an Elelabs ELU013 -> CC2652 -> ZBDongle-E.

Do you get NWKID conflicts in the logs now?

I did get a lot of "joined the network"/"changed id" logs during the unsuccessful pairing from the video:

2022-11-28 00:55:09.902 DEBUG (MainThread) [bellows.zigbee.application] Received trustCenterJoinHandler frame with [0x1f8d, 58:8e:81:ff:fe:d3:90:2c, <EmberDeviceUpdate.STANDARD_SECURITY_UNSECURED_JOIN: 1>, <EmberJoinDecision.USE_PRECONFIGURED_KEY: 0>, 0x7866]
2022-11-28 00:55:09.902 INFO (MainThread) [zigpy.application] Device 0x1f8d (58:8e:81:ff:fe:d3:90:2c) joined the network
2022-11-28 00:55:09.902 DEBUG (MainThread) [zigpy.application] Device 58:8e:81:ff:fe:d3:90:2c changed id (0xaad7 => 0x1f8d)

However, now that my network is back to running "normal" (still unable to pair devices in some places and seem to get multiple rejoins/nwk id changes in other places), I do not get any rejoins/nwk id changes during normal operation.

@TheJulianJES
Copy link
Member

Possibly completely unrelated but bellows apparently had an issue with APS link key generation when restoring from a backup: zigpy/bellows@b7a3925
Not sure if the frame counter(s) would make a difference though(?)

@Adminiuga
Copy link
Contributor

It sends the route request to destination SiliconL_ff:fe:0f:dd:1e (60:a4:23:ff:fe:0f:dd:1e) nwk id 0x0000 but the eui should be 00:0d:6f:00:11:f5:dc:78 and that's why the thermostat reports the nwk conflict.

I guess the 60:a4:23:ff:fe:0f:dd:1e is the original IEEE address? I have no ideas how devices get the wind of the original IEEE.

In your backup, are there any TC link keys with that IEEE?

@mguaylam
Copy link
Author

mguaylam commented Nov 30, 2022

Unfortunately I have no trace of the previous IEEE address of the coordinator before I wrote on it. So I can’t know for sure it come’s from there.

I've checked for that address but it doesn't appear in any backup I have made.

I do still have 2 new EasyIoT I could write to if need be (and look at their IEEE address before writing 00:0d:6f:00:11:f5:dc:78 on it) but I would keep that as a last resort because i’m burning thru coordinators like crazy. 😹

I looked at my first ever back-up of my HUSBZB-1 I have :

"tc_link_key": {
"bitmask": 26, "type": 1, "key": [90, 105, 103, 66, 101, 101, 65, 108, 108, 105, 97, 110, 99, 101, 48, 57],
"outgoingFrameCounter": 98304,
"incomingFrameCounter": 0,
"sequenceNumber": 0,
"partnerEUI64": "00:0d:6f:00:11:f5:dc:78"},

And the current EasyIoT:

        "tc_link_key": {
            "key": "5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39",
            "tx_counter": 1376256,
            "rx_counter": 0,
            "seq": 0,
            "partner_ieee": "00:0d:6f:00:11:f5:dc:78"
        },

@Adminiuga
Copy link
Contributor

No, i mean the current backup of EZSP coordinator as of now? Does it have any partner eui of  60:a4:23:ff:fe:0f:dd:1e ?

@Adminiuga
Copy link
Contributor

No, i mean the current backup of EZSP coordinator as of now? Does it have any partner eui of  60:a4:23:ff:fe:0f:dd:1e ?

Or is the EasyIoT current EZSP coordinator?
Can you try to factory reset and rejoin the Philips bulbs?

@MattWestb
Copy link
Contributor

I was checking the ghost IEEE 50:0b:91:40:00:02:bc:68 and is not given out so looks being somthing the device have self made and not one real device in your network.
And one more from the joining try 58:8e:81:ff:fe:d3:90:2c.

@mguaylam can you post the first 6 numbers (or all) of the IEEE for one or all then im interested of the chip / Zigbee stack they is using then its smells firmware bugs.

@mguaylam
Copy link
Author

mguaylam commented Dec 1, 2022

@Adminiuga sorry for the confusion, my main coordinator is the EasyIoT. The Sonoff one was for testing.
That means, the current back-up would be this one, it dosnt have any trace of 60:a4:23:ff:fe:0f:dd:1e 😕 :

{
    "backup_time": "2022-11-30T01:45:33.959957+00:00",
    "network_info": {
        "extended_pan_id": "57:e9:61:a5:1b:f1:b5:ec",
        "pan_id": "EADE",
        "nwk_update_id": 0,
        "nwk_manager_id": "0000",
        "channel": 15,
        "channel_mask": [
            11,
            12,
            13,
            14,
            15,
            16,
            17,
            18,
            19,
            20,
            21,
            22,
            23,
            24,
            25,
            26
        ],
        "security_level": 5,
        "network_key": {
            "key": "**redacted**",
            "tx_counter": 76832682,
            "rx_counter": 0,
            "seq": 0,
            "partner_ieee": "ff:ff:ff:ff:ff:ff:ff:ff"
        },
        "tc_link_key": {
            "key": "5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39",
            "tx_counter": 1376256,
            "rx_counter": 0,
            "seq": 0,
            "partner_ieee": "00:0d:6f:00:11:f5:dc:78"
        },
        "key_table": [],
        "children": [
            "00:0d:6f:00:0f:a0:6b:07"
        ],
        "nwk_addresses": {
            "00:0d:6f:00:0f:a0:6b:07": "92ED",
            "00:17:88:01:09:6e:5f:15": "D10A",
            "00:17:88:01:09:6e:5f:af": "5E33"
        },
        "stack_specific": {
            "ezsp": {
                "hashed_tclk": "a2473867c61c6c4e43e764b18dc95164"
            }
        },
        "metadata": {
            "ezsp": {
                "manufacturer": "",
                "board": "",
                "version": "6.10.3.0 build 297",
                "stack_version": 8,
                "can_write_custom_eui64": false
            }
        },
        "source": "[email protected]"
    },
    "node_info": {
        "nwk": "0000",
        "ieee": "00:0d:6f:00:11:f5:dc:78",
        "logical_type": "coordinator"
    }
}

As for the Philips bulbs, I have quite a few, I’ll do that as soon as I can and come back to you. 😸

@MattWestb : I’m having trouble understanding you. 😢 Are you on the Discord of Home-Assistant?

@Adminiuga
Copy link
Contributor

It could be that the Philips joined while you were running the other coordinator and it got wrong tclink key cached.
Factory reset that Philips bulb and any other bulb/device which uses wrong IEEE for coordinator and causes address conflicts.

@MattWestb
Copy link
Contributor

@mguaylam Sorry no HA Discord :-((
I think Alexei is on the right track then your routers is having problem with routing and old IEEE and old TC-Link key / not current link-key key hash is the key things that is making all the routing problems and as long the routing is not working the network is not working OK or partly not working.

@mguaylam can you post the first 6 numbers (or all) of the IEEE for one or all then im interested of the chip / Zigbee stack they is using then its smells firmware bugs.

This is the second track and i like to knowing if the TRVs is having "normal" firmware or some more "exotic" ones that can doing problems.

@mguaylam
Copy link
Author

mguaylam commented Dec 7, 2022

So I repaired all routers of the network. At first, I could not pair anything except directly thru the coordinator but I think that’s fixed now, I had to pair directly with the coordinator or a device with the new trust center link key to make it work.

I then did some tests ;

  • Unplugging the bulb.
  • Disconnecting the coordinator.
  • Pair new devices.

All of which used to make the thermostats think there is a network conflict and make all my lights flash. I think it’s quite an annoying feature from Philips to make lights flash that horrible pattern upon network conflicts but oh well.

Interestingly it suddenly stopped when the problematic Philips Hue bulb got its proper configuration. So the issue resolved before I reset the thermostats. Not sure why since I taught they were the issue.

I’m gonna hope it’s the good one. I will close this issue now in the hope I will not need to open it again.

To conclude, it appears the missing trust center link key was the issue here. I’m not sure why mine disappeared since I always had one.

Thanks Puddly, Adminiuga, MattWestb, TheJulianJES and Dmulcahey for the help provided in this issue here and on other platforms. It really helped! I hope you all know how your time is valued here.

I hope you all have a nice day! ☀️

@mguaylam mguaylam closed this as completed Dec 7, 2022
@MattWestb
Copy link
Contributor

Great work done !!

Keep looking in the log for routing problems and also if some end devices (sleeper or not) is having problems and can need being repaired for getting one valid TC-Link key and working 100% OK in the network.
If i was you and was having time i wold repairing all devices so i knowing all is OK for the future and not getting problems later.

Then your backup was having TC-Link keys it was likely forming the network before ZHA was start using hashed TC-Link keys and was working OK until you was restoring one backup and some device (new added ones) was starting using hashed ones and the old not => routing and joining problems.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants