Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcu_mgmt: Memory corruption (cborattr suspected) - test case with smp_svr #7924

Closed
oliviermartin opened this issue May 25, 2018 · 18 comments
Closed
Assignees
Labels
area: Device Management bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@oliviermartin
Copy link

I was trying to enable FOTA on my Nordic nRF52832 based device. I initially tested it with zephyr/samples/subsys/mgmt/mcumgr/smp_svr and it was working fine. But when trying to integrate the mcumgr module into my Zephyr application, the update was crashing with ***** Stack Check Fail! *****.
I took me a while to understand why it works with smp_svr. But after enabling CONFIG_STACK_CANARIES in smp_svr it was crashing as well with the same ***** Stack Check Fail! *****. I tried to narrow down the code responsible and I am suspecting zephyr/samples/subsys/mgmt/mcumgr/cboattr.

To duplicate the issue I added to zephyr/samples/subsys/mgmt/mcumgr/smp_svr/prj.con:

# Enable Stack protection
CONFIG_STACK_CANARIES=y
CONFIG_STACK_CHK_GUARD=0x30564402

The mcumgr command that seems to trigger the issue is image testfor me (whatever the image signature).

mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test 62dea2fa9c0813445db62d61eeaa2351f2766f53951758d16e0ca937ca0fc2d7

The stack corruption occurs in img_mgmt_state_write. But I tried to comment some code in cbor_internal_read_object and following the lines I commented, there was or there was not a stack corruption. It looks like the type CborAttrByteStringType might cause the issue (and may other variable size CBOR type, to be confirmed).

I noticed a recent raised issue related to mcumgr and memory corruption: #7722
But from the comments it was looking like a false positive.

Maybe this issue #7613 might also be due to the memory corruption. If it is really a zephyr/samples/subsys/mgmt/mcumgr/cboattr issue then all the MCUMgr commands will be affected.

@MaureenHelm MaureenHelm added bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug area: Device Management labels May 25, 2018
@MaureenHelm
Copy link
Member

@ccollins476ad

@carlescufi
Copy link
Member

@oliviermartin first thing I would try here is to increase your stack sizes. Try to replicate at the very least the sizes that you get when building the smp_svr sample. Search its generated .config file for STACK_SIZE

@nvlsianpu
Copy link
Collaborator

@oliviermartin ^^ Any update on that?

@oliviermartin
Copy link
Author

I am on holiday for the next week. But increasing the stack to hide the memory corruption seems to be a bad idea to me.

@nvlsianpu
Copy link
Collaborator

Are you sure whether it is memory corruption or app runs out of stack available?

@oliviermartin
Copy link
Author

I am almost sure it is memory corruption. I checked with gdb the state of the the worker thread and there are plenty of space. When adding state canaries, it is not the stack of the current function that is corrupted but other function stacks.

@nvlsianpu
Copy link
Collaborator

Unfortunately I couldn't reproduce exactly this behavior due problems with BLE interfaces on my desktop (I'm using VM with linux as quest, and for some reason VM stooped to connect to any of BLE interfaces It used to - so I was unable to resolve this malfunction during few hours). So I tried to reproduce this behavior via serial connection - but I didn't get exactly the same - I observed the timeout of image upload command, but not a landing in stack fault handler (or any else fault handler). After that the app was deaf for further commands. Will debug this further at the Friday.

@oliviermartin
Copy link
Author

Have you enabled stack canaries? I suspect the reason the device could not get process more command is because of the memory corruption. I was lucky in my case the cborattr function overwrote the stack canaries otherwise I would not have seen the issue was from the Zephyr''s code.

I have not acces to the code. If you could explain (or even better add comments in the code) how memory is allocated for CborAttrByteStringType in cbor_internal_read_object.

@nvlsianpu
Copy link
Collaborator

@oliviermartin - can you try to extract the problem? I have tried (and I will) - but due other duties I have only limited amount of time to act.

@oliviermartin
Copy link
Author

I potentially have one fix:

--- a/ext/lib/mgmt/mcumgr/cmd/img_mgmt/src/img_mgmt_state.c
+++ b/ext/lib/mgmt/mcumgr/cmd/img_mgmt/src/img_mgmt_state.c
@@ -251,7 +251,11 @@ img_mgmt_state_read(struct mgmt_ctxt *ctxt)
 int
 img_mgmt_state_write(struct mgmt_ctxt *ctxt)
 {
-    uint8_t hash[IMAGE_HASH_LEN];
+    /*
+     * We add 1 to the 32-byte hash buffer as _cbor_value_copy_string() adds
+     * a null character at the end of the buffer.
+     */
+    uint8_t hash[IMAGE_HASH_LEN + 1];
     size_t hash_len;
     bool confirm;
     int slot;

As the comment says _cbor_value_copy_string() adds a null-character to the byte string - see: https://github.com/zephyrproject-rtos/zephyr/blob/master/ext/lib/encoding/tinycbor/src/cborparser.c#L1307
Not adding this additional byte to the buffer obviously corrupts the memory. This changes fixes my issue.

In its current implementation, the null-character is added at the end of the buffer (and not after the end of the byte string). In our case, the image hash is fixed (32-byte long) but in case of variable length byte string we might have an issue.
I had a quick look to the code to make the change myself, but the code is still obscure to me.

I still have a timing issue (in my Zephyr application, I do not know if it exists samples/subsys/mgmt/mcumgr/smp_svr) during the image test process (the issue was hidden as I added a lot of pritnf log during my investigation). When I add k_sleep(500) after https://github.com/zephyrproject-rtos/zephyr/blob/master/ext/lib/mgmt/mcumgr/smp/src/smp.c#L186
then the update process works otherwise it crashes.

JoeHut pushed a commit to workaroundgmbh/zephyr_public that referenced this issue Jul 4, 2018
This update to the latest master of mcumgr fixes a memory corruption in
the image management and updates the readme.

Fixes zephyrproject-rtos#7924

Origin: mcumgr
License: Apache 2.0
URL: https://github.com/apache/mynewt-mcumgr
commit: a837a731b94927c6198e39744cd6d979be23942a
Purpose: Fix memory corruption
Maintained-by: External

Signed-off-by: Johannes Hutter <[email protected]>
@oliviermartin
Copy link
Author

@carlescufi As I mentionned earlier, my fix does not fix the issue. It does not still work. For some reason, I cannot re-open the issue. Should I create a new one?

@carlescufi carlescufi reopened this Jul 5, 2018
@carlescufi
Copy link
Member

@oliviermartin no need, reopened now

@nvlsianpu
Copy link
Collaborator

nvlsianpu commented Jul 19, 2018

@oliviermartin - can you recheck whether it is still visible after newest mcumgr fixes (#8937, #8711 - apache/mynewt-mcumgr#5 ) - so the master. I was unable to reproduce using this version.

@oliviermartin
Copy link
Author

@nvlsianpu I saw your patch, I was thinking to test it to see if it fixes my issue. I will try to test it in the next 10 days 👍 I will leave a message in this issue and hopefully close it!

@Olivier-ProGlove
Copy link
Contributor

@nvlsianpu At least with the latest fixes I do not see a crash anymore.

I have a strange issue but it has nothing to do with this specific github ticket. I will investigate it later. This github ticket can be closed (for some reason I cannot close it myself).

FYI, here is my issue:

  • I copied the same image through mcumgr
  • I want to test the verification
  • I tried with a wrong hash and it does not work - as expected (I am assuming the error 3 is what it means)
  • I tried with the correct hash and it does not still work (Error 1)- not expected.
sudo ~/go/bin/mcumgr --conntype ble --connstring 'peer_name=Zephyr' image list
Images:
 slot=0
    version: 0.0.0
    bootable: true
    flags: active confirmed
    hash: 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
 slot=1
    version: 0.0.0
    bootable: true
    flags: 
    hash: 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
Split status: N/A (0)
sudo ~/go/bin/mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test ea365efc2f89674a7ff319f13d1479771b523a80c569003da5b4839c1f4ef051
Error: 3
sudo ~/go/bin/mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
Error: 1

@nvlsianpu
Copy link
Collaborator

nvlsianpu commented Jul 30, 2018

what you had observed is the expected behavior, see the very last lines from doc:
http://docs.zephyrproject.org/samples/subsys/mgmt/mcumgr/smp_svr/README.html#smp-svr-sample

@nvlsianpu
Copy link
Collaborator

@oliviermartin ^^

@Olivier-ProGlove
Copy link
Contributor

I reported an issue to mcumgr project to make error message clearer (ie: plain text message rather than obscure non documented error code).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Device Management bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

5 participants