This section covers the essential information for developing with NNoM. Please read it carefully before you start.
Why this is important:
Firstly, the aim of NNoM is narrow: it focuses only on low-resource platforms. The ops it supports will therefore always be a subset of those in comprehensive tools (e.g. TensorFlow, PyTorch), which means you must train your model in a restricted set of configurations/forms for NNoM to be able to run it.
Secondly, NNoM is different from other inference libraries, which only act as a parser of an existing protobuf or model file. NNoM instead has its own representation of a model, which needs no runtime memory allocation or runtime parsing and so brings maximum performance (less than 1us average switch time between layers' backend ops).
It is always recommended that the developer who trains the Keras model is the same one who deploys it to NNoM, so that they can balance configuration against performance.
As said, NNoM only supports a subset of the Keras layers; please refer to the list on the main page for details.
An activation embedded in a layer's configuration is not supported, i.e. x = Conv2D(..., activation='relu')(x).
The form below is supported.
x = Conv2D(...)(x)
x = ReLU()(x)
Do not change the name of a layer. The current script still uses the name to recognise a layer's type (using type() is not stable in Keras), so a renamed layer cannot be recognised.
Again, for all layers, keep the default name, or, if you do rename a layer, make sure the keyword is kept in the new name.
Including Conv1D, Conv2D, DepthwiseConv1D, DepthwiseConv2D
Supported config: filters, kernel_size, dilation_rate, strides, padding, depth_multiplier.
Unrelated configs have no effect: the bias/weight constraints, regularizers and initializers.
Restriction: use_bias must be True
Examples that work:
x = Conv2D(16, dilation_rate=(1,1), kernel_size=(5, 5), strides=(1, 1), padding="valid")(x)
x = DepthwiseConv2D(depth_multiplier=2, dilation_rate=(2,2), kernel_size=(3, 3), padding="same")(x)
The generic Separable Conv is not supported. Please use a Depthwise Conv followed by a regular Conv (Pointwise Conv) instead.
Example that works:
x = DepthwiseConv2D(depth_multiplier=2, kernel_size=(3, 3), padding="same")(x)
x = Conv2D(16, kernel_size=(1, 1), strides=(1, 1), padding="same")(x)
NNoM is a higher-level inference framework. Its most obvious feature is the human-understandable interface.
- It is also a layer-based framework rather than an operator-based one. A layer might contain a few operators.
- It natively supports complex model structures. High-efficiency networks usually benefit from complex structures.
- It provides layer-to-layer analysis to help developers optimize their models.
The well-known pre-trained models are mostly aimed at image processing. They are efficient on devices such as mobile phones, but they are still too bulky if the MCU doesn't provide at least 250K RAM and a hardware neural network accelerator.
MobileNet V1 model with depth multi-plier (0.25x) ... STM32 F746 ... CMSIS-NN kernels to program the depthwise and pointwise convolutions ... approximately 0.75 frames/sec
Source: Visual Wake Words Dataset
In most cases, MCUs should not really do image processing without a hardware accelerator. The data they normally process is a few channels of time-sequence measurements. For example, accelerometer data consists of a 3-axis (3-channel) measurement per timestamp.
When dealing with such data, building an ad-hoc model for each application is the only option.
Building an ad-hoc model is easy with NNoM since most of the code is generated automatically.
It provides easy-to-use and easy-to-evaluate inference tools for fast neural network development.
As embedded engineers, we might not know well how a neural network works or how to optimize it for the MCU. NNoM together with Keras can help you start practising within half an hour. There is no need to learn other ML libs from scratch. Deployment can be done with one line of Python code after you have trained a model using Keras.
Other than building a model, NNoM also provides a set of evaluation methods. These evaluation methods will give the developer a layer-to-layer performance evaluation of the model.
Developers can then modify the ad-hoc model to increase efficiency or to lower the memory cost. (Please check the following Performance sections for detail.)
As mentioned in many other docs, NNoM uses a layer-based structure. The main benefit is that the model structure can be seen directly from the code.
It also makes model conversion from other layer-based libs (Keras, TensorLayer, Caffe) to an NNoM model very straightforward. When you use generate_model(model, x_test, name='weights.h') to generate an NNoM model, it simply reads the configuration out and rewrites it as C code.
NNoM uses a compiler to manage the layer structure and other resources. After compiling, all layers inside the model are put into a shortcut list in running order. In addition, arguments are filled in and memory is allocated to each layer (memory is reused between layers). Therefore, no memory allocation is performed at runtime, and performance is the same as running the backend functions directly.
NNoM is mostly about managing the higher-level structure, context arguments and memory. The actual arithmetic is done by the backend functions.
Currently, NNoM supports a pure C backend and a CMSIS-NN backend. CMSIS-NN is a highly optimized low-level NN core for ARM Cortex-M microcontrollers. Please check the optimization guide for how to use it.
NNoM currently only supports 8-bit weights and 8-bit activations. The model is quantised during model conversion by generate_model(model, x_test, name='weights.h').
The input data (activations) needs to be quantised before it is fed to the model.
Please see any of the examples for how to quantise the input data.
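As a rough illustration of what quantising input data to 8 bits can look like, here is a minimal sketch that assumes a symmetric power-of-two (Q-format) scale. The helper name `quantize_q7` and its `frac_bits` parameter are hypothetical and are not part of NNoM; the scale your model actually expects should be taken from the examples mentioned above.

```python
import math


def quantize_q7(values, frac_bits=None):
    """Quantise floats to signed 8-bit integers (hypothetical helper).

    Assumes a power-of-two scale: q = round(value * 2**frac_bits),
    saturated to the int8 range [-128, 127].
    """
    max_abs = max(abs(v) for v in values)
    if frac_bits is None:
        # Integer bits needed for the largest magnitude; the rest of the
        # 7 value bits are used as fractional bits.
        int_bits = math.ceil(math.log2(max_abs)) if max_abs > 0 else 0
        frac_bits = 7 - int_bits
    scale = 2 ** frac_bits
    quantized = [max(-128, min(127, round(v * scale))) for v in values]
    return quantized, frac_bits


# One 3-axis accelerometer sample in g (made-up numbers for illustration).
sample_q7, sample_frac_bits = quantize_q7([-2.0, 0.25, 1.5])
print(sample_q7, sample_frac_bits)
```

Choosing `frac_bits` from the largest magnitude keeps the full input range representable at the cost of resolution; in practice the same `frac_bits` would have to be used for every sample fed to a given model.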
CMSIS-NN can provide up to 5 times the performance of the pure C backend on Cortex-M MCUs. It maximises performance by using SIMD and other instructions (__SSAT, ...).
These optimizations come with different constraints. This is why CMSIS-NN provides many variants of one operator (such as 1x1 convolution, RGB convolution, non-square/square; they are all convolutions, only with different routines).
NNoM automatically selects the best operator for a layer when one is available. Sometimes it is not possible to use CMSIS-NN because the conditions are not met, and CMSIS-NN only provides a subset of the operators in the local pure C backend. When CMSIS-NN cannot be used, NNoM runs the layer with the C backend instead. Whether CMSIS-NN or the C backend is used varies from layer to layer.
Example conditions for convolutions are listed below:
Operation | Input Channel | Output Channel |
---|---|---|
Convolution | multiple of 4 | multiple of 2 |
Pointwise Convolution | multiple of 4 | multiple of 2 |
Depthwise Convolution | multiple of 2 | multiple of 2 |
The full details can be found in CMSIS-NN's source code and documentation. Some of them can be further optimized for square shapes; however, that optimization is less significant.
Tip: if you keep the channel size a multiple of 4, it should work in most cases.
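The channel conditions in the table can be checked before training with a few lines of Python. This is only a sketch: the `CONDITIONS` dictionary and the function `meets_channel_condition` are hypothetical names, and the values are copied from the table above rather than read from CMSIS-NN, which may apply further conditions.

```python
# Channel multiples copied from the table above: (input multiple, output multiple).
# These are illustrative and not queried from CMSIS-NN itself.
CONDITIONS = {
    "conv": (4, 2),
    "pointwise": (4, 2),
    "depthwise": (2, 2),
}


def meets_channel_condition(op, in_ch, out_ch):
    """Return True if the channel counts satisfy the table's condition for `op`."""
    in_multiple, out_multiple = CONDITIONS[op]
    return in_ch % in_multiple == 0 and out_ch % out_multiple == 0


# An RGB input (3 channels) misses the condition; a 32 -> 64 convolution meets it.
print(meets_channel_condition("conv", 3, 32))
print(meets_channel_condition("conv", 32, 64))
```

Running such a check over the planned layer list is a cheap way to spot layers that are likely to fall back to the C backend.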
If you are not sure whether the optimization is working, simply use model_stat() in the Evaluation API to print the performance of each layer. The comparison is shown in the following sections.
Fully connected layers and pooling layers are less constrained.
Performances vary from chip to chip. Efficiencies are more constant.
We can use multiply-accumulate operations (MAC) per Hz (MACops/Hz) to evaluate efficiency. It simply means how many MACs can be done in one cycle.
Currently, NNoM only counts MAC operations in Convolution layers and Dense layers, since other layers (pooling, padding) cost much less.
Running a model on CMSIS-NN and NNoM will have the same performance when a model is fully compliant with CMSIS-NN and running on Cortex-M4/7/33/35P. ("compliant" means it meets the optimization condition in the above discussion).
For example, in the CMSIS-NN paper, the authors used an STM32F746@216MHz to run a model with 24.7M ops, which took 99.1ms in total.
The runtime of each layer was recorded. The table below adds an efficiency column that is not shown in the paper (refer to Table 1 in the paper).
Layer | Type | Input ch | Output ch | Ops | Runtime | Efficiency (MACops/Hz) |
---|---|---|---|---|---|---|
Layer 1 | Conv | 3 | 32 | 4.9M | 31.4ms | 0.36 |
Layer 3 | Conv | 32 | 32 | 13.1M | 42.8ms | 0.71 |
Layer 5 | Conv | 32 | 64 | 6.6M | 22.6ms | 0.68 |
Layer 7 | Dense | 1024 | 10 | 20k | 0.1ms | 0.93 |
Total | | | | 24.7M | 99.1ms | 0.58 |
Note: ops = 2 x MACops. The listed layers sum to less than the total because other layers such as activation and pooling are omitted; please check the paper for the full table.
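The efficiency column can be recomputed from the other columns. The sketch below assumes the 216MHz clock quoted above and ops = 2 x MACops; the function name `mac_per_hz` is hypothetical. Layer 7 is left out because its runtime is given to only one significant figure, which is too coarse to recompute from.

```python
CLOCK_HZ = 216e6  # STM32F746 clock quoted in the text


def mac_per_hz(ops, runtime_ms, clock_hz=CLOCK_HZ):
    """Efficiency in MACops/Hz: MACs performed per clock cycle."""
    mac_ops = ops / 2                      # ops = 2 x MACops
    cycles = runtime_ms / 1000.0 * clock_hz
    return mac_ops / cycles


# (ops, runtime in ms) taken from the table.
rows = {
    "Layer 1": (4.9e6, 31.4),
    "Layer 3": (13.1e6, 42.8),
    "Layer 5": (6.6e6, 22.6),
    "Total": (24.7e6, 99.1),
}

efficiency = {name: round(mac_per_hz(ops, ms), 2) for name, (ops, ms) in rows.items()}
for name, value in efficiency.items():
    print(f"{name}: {value} MACops/Hz")
```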
In the table, layers 3 and 5 are both Convolution layers whose input and output channel sizes are multiples of 4, while layer 1 has an input channel size of 3.
You can already see the efficiency difference. When the input channel is 3, the convolution is performed by arm_convolve_HWC_q7_RGB(). This method is only partially optimized since the input channel is not a multiple of 4, while layer 3 and layer 5 are fully optimized. The efficiency difference is already huge (0.36 vs 0.71/0.68).
To achieve high efficiency, you should keep the input channel a multiple of 4 and the output channel a multiple of 2.
What does this number mean? You can use it to estimate the best model size to fit the target MCU.
In typical applications:
Use a motion sensor to recognise human activity. A model takes 9 channels of time-sequence data and 0.67M MACops; an STM32F746 will take around 0.67M/0.58/216MHz = 5.3ms to do one inference.
Use a microphone to spot keyword commands. A model takes 63 x 12 x 1 MFCC data and 2.09M MACops; an STM32F746 will take around 2.09M/0.58/216MHz = 16.7ms to do one inference.
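The two estimates above follow the same formula, time = MACops / efficiency / clock. A minimal sketch, with the hypothetical helper name `estimate_inference_ms`:

```python
def estimate_inference_ms(mac_ops, efficiency_mac_per_hz, clock_hz):
    """Estimate one inference in milliseconds from the MAC count,
    the measured efficiency (MACops/Hz) and the MCU clock (Hz)."""
    cycles = mac_ops / efficiency_mac_per_hz
    return cycles / clock_hz * 1000.0


# Values from the two applications above (STM32F746 at 216MHz, 0.58 MACops/Hz).
activity_ms = estimate_inference_ms(0.67e6, 0.58, 216e6)
keyword_ms = estimate_inference_ms(2.09e6, 0.58, 216e6)
print(f"{activity_ms:.1f} ms, {keyword_ms:.1f} ms")
```

The result is only as good as the efficiency figure: a model whose layers miss the CMSIS-NN conditions will run below 0.58 MACops/Hz and take correspondingly longer.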
Note: MACops/Hz in NNoM is lower than for CMSIS-NN in the paper because NNoM considers the operator and its following activation as one single layer. For example, the running time of the convolution layer is the time cost by operator(Conv) + the time cost by activation(ReLU).
Evaluation is equally important to building the model.
In NNoM, we provide a few different methods to evaluate the model. The details are listed in Evaluation Methods. If your system supports printing through a console (such as a serial port), the evaluation can be printed on the console.
Firstly, the model structure is printed during compiling in model_compile(), which is normally called in nnom_model_create().
Secondly, the runtime performance is printed by model_stat().
Thirdly, there is a set of prediction_*() APIs to validate a set of testing data and print out Top-K accuracy, a confusion matrix and other info.
This is what a typical model looks like in weights.h or model.h or whatever you name it. This code is generated by the script.
In the user's main(), calling nnom_model_create() will create and compile the model.
...
/* nnom model */
static int8_t nnom_input_data[784];
static int8_t nnom_output_data[10];
static nnom_model_t* nnom_model_create(void)
{
    static nnom_model_t model;
    nnom_layer_t* layer[20];
    new_model(&model);
    layer[0] = Input(shape(28, 28, 1), nnom_input_data);
    layer[1] = model.hook(Conv2D(12, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_1_w, &conv2d_1_b), layer[0]);
    layer[2] = model.active(act_relu(), layer[1]);
    layer[3] = model.hook(MaxPool(kernel(2, 2), stride(2, 2), PADDING_SAME), layer[2]);
    layer[4] = model.hook(Cropping(border(1,2,3,4)), layer[3]);
    layer[5] = model.hook(Conv2D(24, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_2_w, &conv2d_2_b), layer[4]);
    layer[6] = model.active(act_relu(), layer[5]);
    layer[7] = model.hook(MaxPool(kernel(4, 4), stride(4, 4), PADDING_SAME), layer[6]);
    layer[8] = model.hook(ZeroPadding(border(1,2,3,4)), layer[7]);
    layer[9] = model.hook(Conv2D(24, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_3_w, &conv2d_3_b), layer[8]);
    layer[10] = model.active(act_relu(), layer[9]);
    layer[11] = model.hook(UpSample(kernel(2, 2)), layer[10]);
    layer[12] = model.hook(Conv2D(48, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_4_w, &conv2d_4_b), layer[11]);
    layer[13] = model.active(act_relu(), layer[12]);
    layer[14] = model.hook(MaxPool(kernel(2, 2), stride(2, 2), PADDING_SAME), layer[13]);
    layer[15] = model.hook(Dense(64, &dense_1_w, &dense_1_b), layer[14]);
    layer[16] = model.active(act_relu(), layer[15]);
    layer[17] = model.hook(Dense(10, &dense_2_w, &dense_2_b), layer[16]);
    layer[18] = model.hook(Softmax(), layer[17]);
    layer[19] = model.hook(Output(shape(10,1,1), nnom_output_data), layer[18]);
    model_compile(&model, layer[0], layer[19]);
    return &model;
}
This is an example printed by model_compile(), which is normally called by nnom_model_create().
Start compiling model...
Layer(#) Activation output shape ops(MAC) mem(in, out, buf) mem blk lifetime
-------------------------------------------------------------------------------------------------
#1 Input - - ( 28, 28, 1) ( 784, 784, 0) 1 - - - - - - -
#2 Conv2D - ReLU - ( 28, 28, 12) 84k ( 784, 9408, 36) 1 1 1 - - - - -
#3 MaxPool - - ( 14, 14, 12) ( 9408, 2352, 0) 1 1 1 - - - - -
#4 Cropping - - ( 11, 7, 12) ( 2352, 924, 0) 1 1 - - - - - -
#5 Conv2D - ReLU - ( 11, 7, 24) 199k ( 924, 1848, 432) 1 1 1 - - - - -
#6 MaxPool - - ( 3, 2, 24) ( 1848, 144, 0) 1 1 1 - - - - -
#7 ZeroPad - - ( 6, 9, 24) ( 144, 1296, 0) 1 1 - - - - - -
#8 Conv2D - ReLU - ( 6, 9, 24) 279k ( 1296, 1296, 864) 1 1 1 - - - - -
#9 UpSample - - ( 12, 18, 24) ( 1296, 5184, 0) 1 - 1 - - - - -
#10 Conv2D - ReLU - ( 12, 18, 48) 2.23M ( 5184, 10368, 864) 1 1 1 - - - - -
#11 MaxPool - - ( 6, 9, 48) ( 10368, 2592, 0) 1 1 1 - - - - -
#12 Dense - ReLU - ( 64, 1, 1) 165k ( 2592, 64, 5184) 1 1 1 - - - - -
#13 Dense - - ( 10, 1, 1) 640 ( 64, 10, 128) 1 1 1 - - - - -
#14 Softmax - - ( 10, 1, 1) ( 10, 10, 0) 1 1 - - - - - -
#15 Output - - ( 10, 1, 1) ( 10, 10, 0) 1 - - - - - - -
-------------------------------------------------------------------------------------------------
Memory cost by each block:
blk_0:5184 blk_1:2592 blk_2:10368 blk_3:0 blk_4:0 blk_5:0 blk_6:0 blk_7:0
Total memory cost by network buffers: 18144 bytes
Compling done in 179 ms
It shows the run order, layer names, activations, the output shape of each layer, the operation counts, the buffer sizes, and the memory block assignments.
Later, it prints the maximum memory cost of each memory block. Since memory blocks are shared between layers, the model only uses 3 memory blocks, which altogether give a total memory cost of 18144 bytes.
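The total can be checked directly from the "Memory cost by each block" line of the log. The following sketch parses that line and sums the blocks; `parse_block_sizes` is a hypothetical helper for working with a captured console log, not an NNoM API.

```python
def parse_block_sizes(line):
    """Parse 'blk_0:5184 blk_1:2592 ...' into a {name: bytes} dictionary."""
    sizes = {}
    for item in line.split():
        name, size = item.split(":")
        sizes[name] = int(size)
    return sizes


# Line copied from the model_compile() output above.
log_line = "blk_0:5184 blk_1:2592 blk_2:10368 blk_3:0 blk_4:0 blk_5:0 blk_6:0 blk_7:0"

block_sizes = parse_block_sizes(log_line)
used_blocks = {name: size for name, size in block_sizes.items() if size > 0}
total_bytes = sum(block_sizes.values())

print(len(used_blocks), "blocks in use,", total_bytes, "bytes in total")
```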
This is an example printed by model_stat().
This method requires a microsecond timestamp to be ported; check the porting guide.
Print running stat..
Layer(#) - Time(us) ops(MACs) ops/us
--------------------------------------------------------
#1 Input - 11
#2 Conv2D - 5848 84k 14.47
#3 MaxPool - 698
#4 Cropping - 16
#5 Conv2D - 3367 199k 59.27
#6 MaxPool - 346
#7 ZeroPad - 36
#8 Conv2D - 4400 279k 63.62
#9 UpSample - 116
#10 Conv2D - 33563 2.23M 66.72
#11 MaxPool - 2137
#12 Dense - 2881 165k 57.58
#13 Dense - 16 640 40.00
#14 Softmax - 3
#15 Output - 1
Summary:
Total ops (MAC): 2970208(2.97M)
Prediction time :53439us
Efficiency 55.58 ops/us
NNOM: Total Mem: 20236
Calling this method prints the time cost of each layer and the efficiency (in MACops/us) of that layer.
This is very important when designing your ad-hoc model.
For example, layer #2 achieves only 14.47 MACops/us, while #5, #8 and #10 are around 60 MACops/us. This is because the input channel of layer #2 is 1, which cannot fulfil the optimisation conditions of CMSIS-NN. One simple optimization strategy is to minimize the complexity of layer #2 by reducing its output channel size.
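The MAC counts and the summary efficiency in the log can be reproduced by hand. The sketch below assumes the usual counting rule (one MAC per kernel element, per input channel, per output value) and takes the shapes from the model_compile() output shown earlier; the helper names `conv2d_macs` and `dense_macs` are hypothetical.

```python
def conv2d_macs(out_h, out_w, out_ch, k_h, k_w, in_ch):
    """MACs of a standard 2D convolution: one per kernel element,
    input channel and output value."""
    return out_h * out_w * out_ch * k_h * k_w * in_ch


def dense_macs(in_units, out_units):
    """MACs of a fully connected layer."""
    return in_units * out_units


# Output shapes and input channel counts read from the compile log.
layer_macs = {
    "#2 Conv2D": conv2d_macs(28, 28, 12, 3, 3, 1),
    "#5 Conv2D": conv2d_macs(11, 7, 24, 3, 3, 12),
    "#8 Conv2D": conv2d_macs(6, 9, 24, 3, 3, 24),
    "#10 Conv2D": conv2d_macs(12, 18, 48, 3, 3, 24),
    "#12 Dense": dense_macs(2592, 64),
    "#13 Dense": dense_macs(64, 10),
}

total_macs = sum(layer_macs.values())
prediction_time_us = 53439  # "Prediction time" from the model_stat() log
overall_efficiency = total_macs / prediction_time_us

print(total_macs, round(overall_efficiency, 2))
```

The same functions can be used the other way round: changing an output channel count in `layer_macs` shows how much work a candidate model change would remove before retraining.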
As mentioned, NNoM allocates memory to the layers during the compiling phase. A memory block is the minimum unit a layer can request. For example, convolution layers normally request one block for input data, one block for output data and one block for the intermediate data buffer.
Layer(#) Activation output shape ops(MAC) mem(in, out, buf) mem blk lifetime
-------------------------------------------------------------------------------------------------
#2 Conv2D - ReLU - ( 28, 28, 12) 84k ( 784, 9408, 36) 1 1 1 - - - - -
The example shows an input buffer size of 784, an output buffer size of 9408 and an intermediate buffer size of 36. The following mem blk lifetime column shows how long each memory block lasts. All three blocks last only one step and are freed after the layer. In NNoM, the output memory is passed directly to the next layer(s) as the input buffer, so there is no memory copy or memory allocation between layers.