Node discovery takes 3 minutes for a pairing #1153
Comments
Some further investigations on this, as I think I found the reason why discoveries are started on the startup of ZSS: once we set the network to ONLINE (Lines 1330 to 1337 in 84042f7), then, because of Lines 252 to 254 in 84042f7, we start a new discovery for this node. And we add the task defined in Lines 82 to 89 in 84042f7 here: Line 295 in 84042f7.
Now I am wondering what we should do about this comment:
If we always request the network address per node, we will add one task to the thread pool per device on startup. If there are multiple nodes that are currently turned off (no power, for example), we will try to reach them multiple times, due to our retries, until we eventually give up. During these retries there are some chances for other tasks to run, however, they might take some time, see my comment above. If we have 6 or more unreachable devices, the situation becomes even worse, because then all threads from the pool might be occupied, and if the user wants to start a pairing in that time span, chances are high that it will take a very long time (and the application that initiated the pairing might even have timed out by then). So the questions would be:
The theory was that the NWK address can change. This is especially true for battery devices, where they can change parent, or could leave/rejoin for a number of reasons, and this will result in them having a different address. All that said, I'm happy to discuss any suggestions here @triller-telekom, as I do think we're probably being too conservative. Probably we need a more optimistic approach - e.g. not to rediscover the network so often, and not to assume that the NWK address has changed, and only perform these sorts of checks on exception (e.g. if we have a transaction failure or something). These concepts came originally from the ZigBee4Java project, and possibly the ZigBee4Osgi before that, but I do think there is a better way to manage the discovery of devices than crawling the neighbour tables etc., and reducing some of this "noise" would likely improve things... Ok... So... I guess there are 2 issues here?
I'm open to suggestions / PRs here :)
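To make the "optimistic" idea a bit more concrete, here is a minimal sketch. The names (OptimisticAddressCache, onTransactionFailure, requestNwkAddressOverTheAir) are hypothetical and not part of the actual library API: the cached NWK address is trusted until a transaction to that node fails, and only then is a rediscovery scheduled.

```java
// Minimal sketch of the "optimistic" approach (hypothetical names, not the
// actual com.zsmartsystems.zigbee API): keep using the cached NWK address and
// only schedule a network-address rediscovery when a transaction actually fails.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OptimisticAddressCache {
    private final ConcurrentHashMap<Long, Integer> ieeeToNwk = new ConcurrentHashMap<>();
    private final ExecutorService rediscoveryPool = Executors.newSingleThreadExecutor();

    /** Returns the cached NWK address without triggering any discovery traffic. */
    public Integer getNwkAddress(long ieeeAddress) {
        return ieeeToNwk.get(ieeeAddress);
    }

    /** Called by the transaction layer when a request to a node times out or fails. */
    public void onTransactionFailure(long ieeeAddress) {
        // Only now do we suspect the NWK address changed and pay the cost of rediscovery.
        rediscoveryPool.submit(() -> {
            Integer fresh = requestNwkAddressOverTheAir(ieeeAddress); // hypothetical helper
            if (fresh != null) {
                ieeeToNwk.put(ieeeAddress, fresh);
            }
        });
    }

    // Placeholder for the actual ZDO NWK_addr_req; details depend on the stack in use.
    private Integer requestNwkAddressOverTheAir(long ieeeAddress) {
        return null;
    }
}
```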
I agree with that. But I also think that there is nothing wrong with being conservative, because this way we have a "working" system, even if something changes in the network.
I had something like this in mind while analyzing it; it sounds plausible to me. To your 2 points:
Back to my questions above, as I think they might be a first (and easier) step towards a slimmer discovery: Do you think it's feasible to remove the network address request task from the mesh discovery completely, as we do add that task later anyway in the case where we do not know the NWK address? If not, then we certainly should build in a mechanism to skip this on startup, so as not to flood the network or our transaction manager with too many requests.
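One possible shape for such a skip-on-startup mechanism, as a minimal sketch with hypothetical names (StartupAwareDiscoverer, shouldRequestNetworkAddress) that do not exist in the library: nodes whose NWK address is already known are not re-queried during a grace period after the network goes ONLINE, while nodes with an unknown address are still queried immediately.

```java
// Rough sketch of a "skip on startup" rule (hypothetical class, not library
// code): during a warm-up window after the network comes ONLINE, the per-node
// network address request is skipped unless the address is actually unknown.
import java.time.Duration;
import java.time.Instant;

public class StartupAwareDiscoverer {
    private final Instant networkOnlineSince = Instant.now();
    private final Duration startupGracePeriod = Duration.ofMinutes(2);

    /** Decide whether to enqueue the network address request task for a node. */
    public boolean shouldRequestNetworkAddress(Integer knownNwkAddress) {
        boolean inStartupWindow =
                Duration.between(networkOnlineSince, Instant.now()).compareTo(startupGracePeriod) < 0;
        // Always request if we do not know the address at all; otherwise skip
        // the request while we are still inside the startup window.
        return knownNwkAddress == null || !inStartupWindow;
    }

    public static void main(String[] args) {
        StartupAwareDiscoverer discoverer = new StartupAwareDiscoverer();
        System.out.println("Known address, right after startup: "
                + discoverer.shouldRequestNetworkAddress(0x1234)); // expected: false
        System.out.println("Unknown address, right after startup: "
                + discoverer.shouldRequestNetworkAddress(null));   // expected: true
    }
}
```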
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am currently investigating a problem where it takes a very long time to discover a ZigBee device (3 minutes and 9 seconds), i.e. reading out its descriptors, endpoints, etc.
What I have found out so far is that there is a huge gap between scheduling a task and when it actually starts:
This continues for basically all of the above tasks, so it adds up to the 3 minutes.
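For reference, a minimal, self-contained way to make such gaps visible (this helper is just an illustration, not part of the library) is to wrap each Runnable so that the time it spent waiting in the queue is logged when it finally starts:

```java
// Standalone helper to log queue wait time per task (illustration only).
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueLatencyLogger {
    public static Runnable timed(String name, Runnable task) {
        final long submittedAt = System.nanoTime();
        return () -> {
            long waitedMs = (System.nanoTime() - submittedAt) / 1_000_000;
            System.out.println(name + " waited " + waitedMs + " ms in the queue");
            task.run();
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // The second task has to wait for the first one, so a non-zero wait is logged.
        pool.submit(timed("task-1", () -> sleep(500)));
        pool.submit(timed("task-2", () -> sleep(10)));
        pool.shutdown();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```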
The only explanation I have so far: there are only 6 threads available for such tasks to run in parallel.
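To illustrate the effect (a standalone sketch, not code from the library; the pool size of 6 and the retry behaviour are taken from the description above): if 6 long-running tasks, e.g. retries against unreachable nodes, occupy a fixed pool of 6 threads, any newly scheduled task, such as the discovery for a new pairing, has to wait until one of them gives up.

```java
// Standalone illustration of the saturation effect described above (not library
// code): a fixed pool of 6 threads is filled with long-running "retry" tasks,
// so a 7th task - standing in for the discovery of a newly paired device -
// only starts once one of the retry tasks finally gives up.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(6);

        // Six tasks that keep retrying an unreachable node before giving up.
        for (int node = 1; node <= 6; node++) {
            final int id = node;
            pool.submit(() -> {
                for (int attempt = 1; attempt <= 3; attempt++) {
                    System.out.println("node " + id + ": attempt " + attempt + " timed out");
                    sleep(1000); // stands in for the per-attempt timeout
                }
                System.out.println("node " + id + ": giving up");
            });
        }

        long queuedAt = System.currentTimeMillis();
        pool.submit(() -> System.out.println("new pairing discovery started after "
                + (System.currentTimeMillis() - queuedAt) + " ms"));

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```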
On the startup of the ZigbeeDiscoveryExtension, we create a ZigBeeNetworkDiscoverer and tell it to start a node discovery from network node "0". We collect all associated devices and add those to the network manager. I am assuming that the network state must be ONLINE at this point, because I assume the listener for nodeAdded will be triggered and thus we start a discovery (with tasks occupying threads from the pool mentioned above) for all "associated nodes". That is because ZigbeeDiscoveryExtension.startDiscoveryIfNecessary() does not yet have a discoverer for each node and thus it creates them.

The other scenario where we could start a discovery for all nodes is when we load the nodes from the storage; however, I think the network state is not ONLINE at that point in time and thus no discovery will be triggered.

Also: I have identified 4 "broken" ZigBeeNodes in the particular system, which are nodes that exist but only have an IEEE address and no endpoints, descriptors, etc. So they are leftovers from a broken pairing/deletion of a device, whatever. Those 4 devices would take 4 threads (continuously failing because they are not reachable), and I am wondering why the 2 other threads are also occupied. The only explanation I have is what I wrote above: that we start a discovery for ALL nodes on startup.

So, I think we might run into a problem if there are 6 devices in the network that are not reachable at startup. Because if they occupy all threads and run into timeouts -> retries -> timeouts, it will take a long time until we are able to start a discoverer for a pairing of a new device.
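As a rough sketch of the startup behaviour described above, with simplified, hypothetical types (the real ZigbeeDiscoveryExtension and node classes look different): a nodeAdded callback that unconditionally starts a discovery per node will enqueue one task for every known node as soon as the network goes ONLINE, including the broken nodes that only have an IEEE address.

```java
// Simplified sketch of the startup flood described above (hypothetical types,
// not the actual ZigbeeDiscoveryExtension): every nodeAdded callback enqueues
// one discovery task, including for broken nodes with only an IEEE address.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StartupDiscoverySketch {
    // Minimal stand-in for a node: only an IEEE address, endpoints may be missing.
    record Node(long ieeeAddress, boolean hasEndpoints) {}

    private final ExecutorService discoveryPool = Executors.newFixedThreadPool(6);

    // Stand-in for the nodeAdded listener: one discovery task per node, unconditionally.
    void nodeAdded(Node node) {
        discoveryPool.submit(() -> {
            System.out.println("Discovering node " + Long.toHexString(node.ieeeAddress())
                    + (node.hasEndpoints() ? "" : " (broken node, will time out and retry)"));
        });
    }

    public static void main(String[] args) {
        StartupDiscoverySketch extension = new StartupDiscoverySketch();
        List<Node> nodesOnStartup = List.of(
                new Node(0x1111L, true),
                new Node(0x2222L, false),   // broken: IEEE address only
                new Node(0x3333L, false));  // broken: IEEE address only
        // When the network state becomes ONLINE, every known node triggers nodeAdded.
        nodesOnStartup.forEach(extension::nodeAdded);
        extension.discoveryPool.shutdown();
    }
}
```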