mcu_mgmt: Memory corruption (cborattr suspected) - test case with smp_svr #7924

oliviermartin · 2018-05-25T14:01:04Z

I was trying to enable FOTA on my Nordic nRF52832-based device. I initially tested it with zephyr/samples/subsys/mgmt/mcumgr/smp_svr and it worked fine. But when I integrated the mcumgr module into my Zephyr application, the update crashed with ***** Stack Check Fail! *****.
It took me a while to understand why it worked with smp_svr. After enabling CONFIG_STACK_CANARIES in smp_svr, it crashed as well with the same ***** Stack Check Fail! *****. I tried to narrow down the code responsible and I suspect zephyr/samples/subsys/mgmt/mcumgr/cborattr.

To reproduce the issue, I added the following to zephyr/samples/subsys/mgmt/mcumgr/smp_svr/prj.conf:

# Enable Stack protection
CONFIG_STACK_CANARIES=y
CONFIG_STACK_CHK_GUARD=0x30564402

The mcumgr command that seems to trigger the issue for me is image test (whatever the image signature).

mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test 62dea2fa9c0813445db62d61eeaa2351f2766f53951758d16e0ca937ca0fc2d7

The stack corruption occurs in img_mgmt_state_write. When I commented out parts of cbor_internal_read_object, the stack corruption appeared or disappeared depending on which lines were commented out. It looks like the type CborAttrByteStringType might cause the issue (and maybe other variable-size CBOR types, to be confirmed).

I noticed a recently raised issue related to mcumgr and memory corruption: #7722
But from the comments it looked like a false positive.

Issue #7613 might also be due to this memory corruption. If it is really a zephyr/samples/subsys/mgmt/mcumgr/cborattr issue, then all the mcumgr commands will be affected.


MaureenHelm · 2018-05-25T14:51:46Z

@ccollins476ad

carlescufi · 2018-05-28T09:51:51Z

@oliviermartin the first thing I would try here is to increase your stack sizes. Try to replicate at the very least the sizes that you get when building the smp_svr sample. Search its generated .config file for STACK_SIZE.

nvlsianpu · 2018-05-30T06:53:31Z

@oliviermartin ^^ Any update on that?

oliviermartin · 2018-05-30T07:31:35Z

I am on holiday for the next week. But increasing the stack to hide the memory corruption seems to be a bad idea to me.

nvlsianpu · 2018-05-30T07:35:53Z

Are you sure it is memory corruption and not the app running out of available stack?

oliviermartin · 2018-05-30T07:40:11Z

I am almost sure it is memory corruption. I checked the state of the worker thread with gdb and there is plenty of space. With stack canaries enabled, it is not the stack frame of the current function that is corrupted but the stack frames of other functions.
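
For readers unfamiliar with stack canaries, the detection mechanism mentioned above can be modelled in plain C. This is only a sketch under stated assumptions: struct fake_frame, frame_init, frame_write and canary_intact are hypothetical names invented for illustration, the "frame" is an ordinary struct rather than a real compiler-generated stack frame, and the guard value is simply the one from the prj.conf fragment earlier in this thread. It shows why a write that goes one byte past a 32-byte buffer is enough to trip the check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Guard value taken from the prj.conf fragment in this thread. */
#define GUARD_VALUE 0x30564402u
#define BUF_LEN     32

/* Hypothetical model of a frame: a local buffer followed by the canary. */
struct fake_frame {
    uint8_t  buf[BUF_LEN];
    uint32_t canary;
};

/* Set up the frame the way a function prologue would place the guard. */
void frame_init(struct fake_frame *f)
{
    memset(f->buf, 0xAA, BUF_LEN);
    f->canary = GUARD_VALUE;
}

/*
 * Write n zero bytes starting at the beginning of the frame. The write goes
 * through the object representation, so n > BUF_LEN spills into the canary
 * without leaving the struct.
 */
void frame_write(struct fake_frame *f, size_t n)
{
    unsigned char *raw = (unsigned char *)f;

    if (n > sizeof(*f)) {
        n = sizeof(*f);
    }
    memset(raw, 0, n);
}

/* The check a function epilogue would perform before returning. */
bool canary_intact(const struct fake_frame *f)
{
    return f->canary == GUARD_VALUE;
}
```

Writing exactly BUF_LEN bytes leaves the canary untouched, while writing BUF_LEN + 1 bytes changes it. Without canaries, the same stray byte would silently land in whatever happens to follow the buffer.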

nvlsianpu · 2018-05-30T16:58:15Z

Unfortunately I couldn't reproduce exactly this behavior due to problems with the BLE interfaces on my desktop (I'm using a VM with Linux as guest, and for some reason the VM stopped connecting to any of the BLE interfaces it used to, and I was unable to resolve this malfunction after a few hours). So I tried to reproduce this behavior via a serial connection, but I didn't get exactly the same result: I observed a timeout of the image upload command, but the app did not land in the stack fault handler (or any other fault handler). After that the app was deaf to further commands. I will debug this further on Friday.

oliviermartin · 2018-05-30T19:29:52Z

Have you enabled stack canaries? I suspect the reason the device could not process more commands is the memory corruption. I was lucky that in my case the cborattr function overwrote the stack canaries; otherwise I would not have seen that the issue came from Zephyr's code.

I do not have access to the code. It would help if you could explain (or, even better, add comments in the code) how memory is allocated for CborAttrByteStringType in cbor_internal_read_object.

nvlsianpu · 2018-06-07T14:38:03Z

@oliviermartin - can you try to isolate the problem? I have tried (and I will continue to), but due to other duties I have only a limited amount of time to work on it.

oliviermartin · 2018-06-12T14:26:41Z

I potentially have one fix:

--- a/ext/lib/mgmt/mcumgr/cmd/img_mgmt/src/img_mgmt_state.c
+++ b/ext/lib/mgmt/mcumgr/cmd/img_mgmt/src/img_mgmt_state.c
@@ -251,7 +251,11 @@ img_mgmt_state_read(struct mgmt_ctxt *ctxt)
 int
 img_mgmt_state_write(struct mgmt_ctxt *ctxt)
 {
-    uint8_t hash[IMAGE_HASH_LEN];
+    /*
+     * We add 1 to the 32-byte hash buffer as _cbor_value_copy_string() adds
+     * a null character at the end of the buffer.
+     */
+    uint8_t hash[IMAGE_HASH_LEN + 1];
     size_t hash_len;
     bool confirm;
     int slot;

As the comment says, _cbor_value_copy_string() adds a null character to the byte string - see: https://github.com/zephyrproject-rtos/zephyr/blob/master/ext/lib/encoding/tinycbor/src/cborparser.c#L1307
Not adding this additional byte to the buffer obviously corrupts the memory. This change fixes my issue.

In its current implementation, the null character is added at the end of the buffer (and not after the end of the byte string). In our case, the image hash has a fixed length (32 bytes), but with variable-length byte strings we might have an issue.
I had a quick look at the code to make the change myself, but the code is still obscure to me.
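
The off-by-one described above can be illustrated with a small bounds-checked helper. This is a sketch only: copy_bytes_with_nul is a hypothetical function written for this example and is not the tinycbor API. It models a copy routine that appends a null character after the copied bytes, which is the behavior reported here for _cbor_value_copy_string(), and it refuses the copy when the destination has no room for that extra byte instead of writing past the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_HASH_LEN 32

/*
 * Hypothetical helper (not part of tinycbor): copy src_len bytes into dst
 * and append a null character. The destination therefore needs room for
 * src_len + 1 bytes. Returns 0 on success, -1 if dst is too small.
 */
int copy_bytes_with_nul(uint8_t *dst, size_t dst_cap,
                        const uint8_t *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);

    /* A buffer of exactly src_len bytes has no room for the terminator. */
    if (src_len >= dst_cap) {
        return -1;
    }

    memcpy(dst, src, src_len);
    dst[src_len] = 0;
    return 0;
}
```

With a 32-byte hash, a destination declared as uint8_t hash[IMAGE_HASH_LEN] is rejected by this helper, while uint8_t hash[IMAGE_HASH_LEN + 1] is accepted. An unchecked copy routine would instead write the terminator one byte past the smaller array.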

I still have a timing issue during the image test process (in my Zephyr application; I do not know whether it also exists in samples/subsys/mgmt/mcumgr/smp_svr). The issue was hidden because I had added a lot of printf logging during my investigation. When I add k_sleep(500) after https://github.com/zephyrproject-rtos/zephyr/blob/master/ext/lib/mgmt/mcumgr/smp/src/smp.c#L186
the update process works; otherwise it crashes.

This update to the latest master of mcumgr fixes a memory corruption in the image management and updates the readme.

Fixes zephyrproject-rtos#7924

Origin: mcumgr
License: Apache 2.0
URL: https://github.com/apache/mynewt-mcumgr
commit: a837a731b94927c6198e39744cd6d979be23942a
Purpose: Fix memory corruption
Maintained-by: External

Signed-off-by: Johannes Hutter <[email protected]>

oliviermartin · 2018-07-05T09:19:02Z

@carlescufi As I mentioned earlier, my fix does not fix the issue. It still does not work. For some reason, I cannot re-open the issue. Should I create a new one?

carlescufi · 2018-07-05T09:20:16Z

@oliviermartin no need, reopened now

nvlsianpu · 2018-07-19T15:23:18Z

@oliviermartin - can you recheck whether it is still visible after the newest mcumgr fixes (#8937, #8711 - apache/mynewt-mcumgr#5), i.e. on master? I was unable to reproduce it using this version.

oliviermartin · 2018-07-19T19:53:31Z

@nvlsianpu I saw your patch and was planning to test it to see if it fixes my issue. I will try to test it in the next 10 days 👍 I will leave a message in this issue and hopefully close it!

Olivier-ProGlove · 2018-07-24T16:59:09Z

@nvlsianpu At least with the latest fixes I do not see a crash anymore.

I have a strange issue, but it has nothing to do with this specific GitHub ticket. I will investigate it later. This GitHub ticket can be closed (for some reason I cannot close it myself).

FYI, here is my issue:

I copied the same image through mcumgr
I want to test the verification
I tried with a wrong hash and it does not work, as expected (I am assuming that is what error 3 means)
I tried with the correct hash and it still does not work (Error 1), which is not expected.

sudo ~/go/bin/mcumgr --conntype ble --connstring 'peer_name=Zephyr' image list
Images:
 slot=0
    version: 0.0.0
    bootable: true
    flags: active confirmed
    hash: 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
 slot=1
    version: 0.0.0
    bootable: true
    flags: 
    hash: 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
Split status: N/A (0)
sudo ~/go/bin/mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test ea365efc2f89674a7ff319f13d1479771b523a80c569003da5b4839c1f4ef051
Error: 3
sudo ~/go/bin/mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' image test 6abcbbec6486a4237a964ec12cc6153be28fb517a85f9fe7a103f74b49755acb
Error: 1

nvlsianpu · 2018-07-30T14:09:25Z

What you observed is the expected behavior; see the very last lines of the doc:
http://docs.zephyrproject.org/samples/subsys/mgmt/mcumgr/smp_svr/README.html#smp-svr-sample

nvlsianpu · 2018-07-30T14:11:26Z

@oliviermartin ^^

Olivier-ProGlove · 2018-08-16T16:18:59Z

I reported an issue to the mcumgr project to make the error message clearer (i.e. a plain-text message rather than an obscure, undocumented error code).

MaureenHelm added the bug, priority: medium, and area: Device Management labels May 25, 2018

MaureenHelm assigned carlescufi May 25, 2018

carlescufi assigned nvlsianpu May 25, 2018

carlescufi closed this as completed in 414291c Jul 4, 2018

carlescufi reopened this Jul 5, 2018

nvlsianpu closed this as completed Jul 30, 2018

Olivier-ProGlove mentioned this issue Aug 16, 2018

Print plain text error message rather than error code apache/mynewt-mcumgr-cli#4

Open

Olivier-ProGlove mentioned this issue Aug 17, 2018

Print plain text error message rather than error code apache/mynewt-mcumgr#10

Open

jimparis mentioned this issue Oct 4, 2019

tinycbor buffer overflow causing mcumgr image upload failure #19629

Closed
