Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTA Upgrade: Respect that CurrentFileVersion attribute is optional #1091

Conversation

triller-telekom
Copy link
Contributor

Fixes #1087

Signed-off-by: Stefan Triller [email protected]

@cdjackson
Copy link
Member

Thanks @triller-telekom - this looks good. I assume that you tested this with a successful completion. Did you also test the UNSUPPORTED_ATTRIBUTE response by changing the attribute ID (or with a device the natively doesn't support the firmware version)?

@triller-telekom
Copy link
Contributor Author

No I did not test this so far. the problem is that the device that had the problem ran into #1088 which is why the request failed :(

I can try to find a different device where our firmware infrastructure is offering an update too...

@triller-telekom
Copy link
Contributor Author

In addition: I just realized that my eclipse did not report me any compilation errors, but the maven build fails with some error that i would have expected already on your PR when you introduced the new sendCommand method:

[ERROR] /home/stefan/openhab-master-new2/git/com.zsmartsystems.zigbee/com.zsmartsystems.zigbee/src/test/java/com/zsmartsystems/zigbee/app/seclient/ZclKeyEstablishmentClientTest.java:[115,69] reference to sendCommand is ambiguous
  both method sendCommand(com.zsmartsystems.zigbee.zcl.clusters.general.ZclGeneralCommand) in com.zsmartsystems.zigbee.zcl.ZclCluster and method sendCommand(com.zsmartsystems.zigbee.zcl.clusters.keyestablishment.ZclKeyEstablishmentCommand) in com.zsmartsystems.zigbee.zcl.clusters.ZclKeyEstablishmentCluster match

@cdjackson
Copy link
Member

I didn't spot this - I'll update these tests now.

@triller-telekom
Copy link
Contributor Author

I have the fixes in a branch and can create a PR, give me 2mins...

@triller-telekom
Copy link
Contributor Author

PR is here: #1093

@cdjackson
Copy link
Member

Thanks I missed your mail - I also provided a similar PR in #1094. Any preferences?

@triller-telekom triller-telekom force-pushed the otaUpgradeUnsupportedAttribute branch from d3172db to 8ea6349 Compare July 24, 2020 10:48
@cdjackson
Copy link
Member

@triller-telekom now that #1094 is fixed, please can you confirm once you've had the chance to test that this at least works fine with a standard OTA transfer where the version attribute is supported - so that at least we don't introduce a regression.

Thanks.

@triller-telekom
Copy link
Contributor Author

I just did some testing with the changes. A firmware update with an osram bulb that was successful previously is now failing.

I don't know why, yet. Still investigating.

@triller-telekom
Copy link
Contributor Author

@cdjackson : Here is what I found out, but I think I need some help to explain why this is happening:

Up until ZclOtaUpgradeServer.completeUpgrade() line updateStatus(ZigBeeOtaServerStatus.OTA_UPGRADE_FIRMWARE_RESTARTING); everything runs smooth.

We then tell the device that the update was finished:

2020-07-27_09:56:43:106	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.ZigBeeNetworkManager] - TX CMD: UpgradeEndResponse [Ota Upgrade: 0000/0 -> F5B3/3, cluster=0019, TID=B7, manufacturerCode=4364, imageType=6, fileVersion=16909330, currentTime=0, upgradeTime=0]

This transaction becomes marked as ACKED pretty soon too (when a message from a different device arrives we see it in the logs):

2020-07-27_09:56:45:484	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager] - notifyTransactionCommand: QueryNextImageCommand [Ota Upgrade: DAB8/4 -> 0000/1, cluster=0019, TID=1A, fieldControl=0, manufacturerCode=10132, imageType=257, fileVersion=48, hardwareVersion=null] ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=2382, state=ACKED, sendCnt=1, command=UpgradeEndResponse [Ota Upgrade: 0000/0 -> F5B3/3, cluster=0019, TID=B7, manufacturerCode=4364, imageType=6, fileVersion=16909330, currentTime=0, upgradeTime=0]]

However, after the device has restarted, the transaction is STILL queued:

2020-07-27_09:56:45:894	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager] - notifyTransactionCommand: DeviceAnnounce [F5B3/0 -> 0000/0, cluster=0013, TID=00, nwkAddrOfInterest=F5B3, ieeeAddr=8418260000DA9DB6, capability=8E] ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=2792, state=ACKED, sendCnt=1, command=UpgradeEndResponse [Ota Upgrade: 0000/0 -> F5B3/3, cluster=0019, TID=B7, manufacturerCode=4364, imageType=6, fileVersion=16909330, currentTime=0, upgradeTime=0]]

In theory we should block the thread on the get() here:

CommandResult response = cluster.sendCommand(new UpgradeEndResponse(otaFile.getManufacturerCode(),
                            otaFile.getImageType(), otaFile.getFileVersion(), 0, 0)).get();
                    if (!(response.isSuccess() || response.isTimeout())) {
                        updateStatus(ZigBeeOtaServerStatus.OTA_UPGRADE_FAILED);
                        return;
                    }

The newly introduced request for the CurrentFileVersion attribute of the OTA cluster, seems to be using the wrong cluster, i.e. not OTA = 19, but here "0000" (but I cannot explain how the local variale cluster inside ZclOtaUpgradeServer could have changed. Must have something to do with the node changes that we notice below, maybe?):

2020-07-27_09:57:05:112	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionQueue] - 8418260000DA9DB6: Added transaction to queue, len=2, transaction=ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=0, state=WAITING, sendCnt=0, command=ReadAttributesCommand [Basic: 0000/0 -> F5B3/3, cluster=0000, TID=--, identifiers=[2]]]

Also the UpgradeEndResponse that we have sent to the device to make it finish the update and restart, gets send out again now, whereas I thought everything would be blocked in the get() above...

2020-07-27_09:57:05:113	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager] - F5B3/3: Sending ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=22011, state=WAITING, sendCnt=1, command=UpgradeEndResponse [Ota Upgrade: 0000/0 -> F5B3/3, cluster=0019, TID=B7, manufacturerCode=4364, imageType=6, fileVersion=16909330, currentTime=0, upgradeTime=0]]

Some node changes are noticed here...

2020-07-27_09:57:05:386	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.ZigBeeNetworkManager] - 8418260000DA9DB6: Endpoint 3. Unknown output cluster 0000

Then I see the output of the newly introduced debug logging about a received DefaultResponse 10 times are its printed in the for loop:

2020-07-27_09:57:05:491	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer] - F5B3/3 : Received DefaultResponse: DefaultResponse [Basic: F5B3/3 -> 0000/1, cluster=0000, TID=CA, commandIdentifier=0, statusCode=FAILURE]

And since its a failure and not UNSUPPORTED_ATTRIBUTE I wait 10 times for 10 seconds, before the failure is announced here:

2020-07-27_09:58:37:182	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer] - F5B3/3 : Received DefaultResponse: DefaultResponse [Basic: F5B3/3 -> 0000/1, cluster=0000, TID=DC, commandIdentifier=0, statusCode=FAILURE]
2020-07-27_09:58:37:182	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager] - Transaction complete: ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=172, state=COMPLETE, sendCnt=1, command=ReadAttributesCommand [Basic: 0000/0 -> F5B3/3, cluster=0000, TID=DC, identifiers=[2]]]
2020-07-27_09:58:37:182	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer] - F5B3/3 : All attempts to reach the device failed after it should have rebooted, taking this as a failure.
2020-07-27_09:58:37:183	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer] - F5B3/3 OTA status updated to OTA_UPGRADE_FAILED.
2020-07-27_09:58:37:183	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionQueue] - 8418260000DA9DB6: transactionComplete COMPLETE 0
2020-07-27_09:58:37:184	NA	DEBUG:	[symbolicName=org.openhab.binding.zigbee.core, className=org.openhab.zigbee.handler.ZigBeeThingHandler] - 8418260000DA9DB6: OTA transfer status update OTA_UPGRADE_FAILED, percent=100
2020-07-27_09:58:37:225	NA	DEBUG:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer] - F5B3/3 OTA status updated to OTA_CANCELLED.
2020-07-27_09:58:37:226	NA	DEBUG:	[symbolicName=org.openhab.binding.zigbee.core, className=org.openhab.binding.zigbee.handler.ZigBeeThingHandler] - 8418260000DA9DB6: OTA transfer status update OTA_CANCELLED, percent=100

Unfortunately the logger I was using swallows the message of exceptions, but this one is thrown directly afterwards:

2020-07-27_09:58:37:240	NA	WARN:	[symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.ZigBeeExecutors] - Uncaught exception in thread NotificationService-thread-18
	at org.eclipse.smarthome.core.thing.internal.firmware.ProgressCallbackImpl.canceled(ProgressCallbackImpl.java:161)
	at org.openhab.binding.zigbee.handler.ZigBeeThingHandler.lambda$updateFirmware$3(ZigBeeThingHandler.java:1301)
	at com.zsmartsystems.zigbee.app.otaserver.ZclOtaUpgradeServer$2.run(ZclOtaUpgradeServer.java:545)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source:-1)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source:-1)
	at java.lang.Thread.run(Unknown Source:-1)

Where ZigBeeThingHandler.java:1301 is progressCallback.canceled(); in the case branch:

case OTA_CANCELLED:
                    progressCallback.canceled();
                    break;

So for now I cannot explain why the wrong cluster is used and also why the UpgradeEndResponse is not treated as finished. Maybe the restart of the device has something to do with it once it receives the message.

Attached is the "full" log of the end of the firmware update process. I "grep"ed the log case-insensitively for everything with F5B3, 8418260000DA9DB6, and ZclOtaUpgradeServer to scale it down to the relevant parts.

Do you have an idea what might go wrong here?

github-device-firmware-update.log

@cdjackson
Copy link
Member

So, there's a lot going on here ;)

This transaction becomes marked as ACKED pretty soon too (when a message from a different device arrives we see it in the logs):

The message is marked as ACKED not because of this message, but because of this message ->

EzspMessageSentHandler [networkId=0, type=EMBER_OUTGOING_DIRECT, indexOrDestination=F5B3, apsFrame=EmberApsFrame [profileId=0104, clusterId=0019, sourceEndpoint=1, destinationEndpoint=3, options=[EMBER_APS_OPTION_RETRY, EMBER_APS_OPTION_ENABLE_ROUTE_DISCOVERY, EMBER_APS_OPTION_ENABLE_ADDRESS_DISCOVERY], groupId=0, sequence=A9], messageTag=B7, status=EMBER_SUCCESS, messageContents=]

This is the acknowledgement that the command was sent to the device. This is fine. ACKED just means that the remote device acknowledged that it received the command.

The transaction doesn't complete until here - and it has failed -:

2020-07-27_09:56:53:106 NA DEBUG: [symbolicName=com.zsmartsystems.zigbee, className=com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager] - Transaction complete: ZigBeeTransaction [ieeeAddress=8418260000DA9DB6 queueTime=10005, state=FAILED, sendCnt=1, command=UpgradeEndResponse [Ota Upgrade: 0000/0 -> F5B3/3, cluster=0019, TID=B7, manufacturerCode=4364, imageType=6, fileVersion=16909330, currentTime=0, upgradeTime=0]]

In theory we should block the thread on the get() here:

Agreed - this should still be blocking I think.

The newly introduced request for the CurrentFileVersion attribute of the OTA cluster, seems to be using the wrong cluster

This is fixed now in #1098

Also the UpgradeEndResponse that we have sent to the device to make it finish the update and restart, gets send out again now, whereas I thought everything would be blocked in the get() above...

This just shows that the transaction is still running - this is a retry since the transaction didn't complete (even if we think it did as it does seem it passed by the get()). The sendCnt is now up to 1, and the queue time is 22 seconds. I think this is fine as the TransactionManager will think it's not complete as it didn't get a response (only the APS ACK, but no ZCL ACK) - the open question is why this hasn't blocked on the get().

So for now I cannot explain why the wrong cluster is used

As above, this is fixed now in #1098 - this was simply that the cluster ID is not set for the general commands - it wasn't required when we were only sending specific commands.

also why the UpgradeEndResponse is not treated as finished.

You mean why it IS treated as finished? Right? Ie it didn't seem to stop at the .get(). The transaction isn't completed I think as it never receives a ZCL ACK, but it does seem that it passed by the get() even though the transaction is still running and is still retrying.

I'm a bit curious if there's something happening before this log starts though, as I also see this attribute being requested at 09:56:43.095 - this is before the UpgradeEndResponse - maybe this comes from somewhere else - I can't see the request in the log though.

So... I do agree that this get is a bit strange from what I see here and I can't explain that unless there was another process running somewhere. I'd suggest we part that for a minute and you rebase the code on the current master which ought to solve the cluster id issue. Let's see what that looks like and if there are still strange things happening, please post another log...

@triller-telekom triller-telekom force-pushed the otaUpgradeUnsupportedAttribute branch from 8ea6349 to 1b3f1bb Compare July 31, 2020 11:23
@triller-telekom triller-telekom force-pushed the otaUpgradeUnsupportedAttribute branch from 1b3f1bb to 7b27abc Compare July 31, 2020 11:24
@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

Merging #1091 into master will decrease coverage by 0.01%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1091      +/-   ##
============================================
- Coverage     30.21%   30.20%   -0.02%     
- Complexity     4019     4020       +1     
============================================
  Files          1440     1440              
  Lines         56161    56179      +18     
  Branches       2484     2489       +5     
============================================
- Hits          16970    16969       -1     
- Misses        38450    38470      +20     
+ Partials        741      740       -1     
Impacted Files Coverage Δ Complexity Δ
...tems/zigbee/app/otaserver/ZclOtaUpgradeServer.java 27.47% <0.00%> (-1.68%) 25.00 <0.00> (ø)
...bee/dongle/ember/internal/ash/AshFrameHandler.java 58.25% <0.00%> (-0.50%) 47.00% <0.00%> (ø%)
...s/zigbee/transaction/ZigBeeTransactionManager.java 94.33% <0.00%> (+0.40%) 70.00% <0.00%> (+1.00%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39a8c7c...7b27abc. Read the comment docs.

@triller-telekom
Copy link
Contributor Author

Thank you for your detailed analysis and fixing the bug with the wrong cluster so quickly. I think it would have taken me a lot longer to find it there.

I have now rebased the PR against the latest master and tested 2 cases:

  1. Firmware update like before the fix with a device that supports the attribute CurrentFileVersion -> this is still working
  2. I requested the invalid attribute 15 from the OTA cluster from that device and got a ReadAttributesResponse as well, instead of a DefaultResponse. However, in the content one can see that the attribute requested was not available. So I have adjusted the PR in this regard and it worked.

So I really hope that with a device that does not support the CurrentFileVersion attribute, the update should succeed with this PR.

I still don't know why the get() call wasn't blocking. Maybe we should keep this in mind for the future if we see it again. In my tests above the firmware update was always successful, so no serious issue there.

@cdjackson Please have a look at PR again, so we can merge it soon, thanks.

I guess finding a solution for #1088 is up next, to make the firmware update work for even more devices.

@cdjackson
Copy link
Member

cdjackson commented Jul 31, 2020 via email

@cdjackson
Copy link
Member

Thanks for this @triller-telekom - looks good so merging.

@cdjackson cdjackson merged commit dfd35fb into zsmartsystems:master Aug 2, 2020
@triller-telekom triller-telekom deleted the otaUpgradeUnsupportedAttribute branch August 3, 2020 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZclOtaUpgradeServer should not enforce querying the current file version after update
2 participants