Grovety

Arm-optimized NNs

for Arm Virtual Hardware Corstone-300

EfficientPose I Lite

Scalable single-person pose estimation
Model                   Resolution  Parameters  FLOPs    PCKh@50     PCKh@10     PCKh@50      PCKh@10
                                                         (MPII val)  (MPII val)  (MPII test)  (MPII test)
EfficientPose RT Lite*  224x224     0.40M       0.86G    80.6        23.1        -            -
EfficientPose RT        224x224     0.46M       0.87G    80.6        23.6        84.8         24.2
EfficientPose I Lite*   256x256     0.59M       1.54G    83.7        27.7        -            -
EfficientPose I         256x256     0.72M       1.67G    85.2        26.5        -            -
EfficientPose II Lite*  368x368     1.46M       7.25G    87.1        30.8        -            -
EfficientPose II        368x368     1.73M       7.70G    88.2        30.2        -            -
EfficientPose III       480x480     3.23M       23.35G   89.5        30.9        -            -
EfficientPose IV        600x600     6.56M       72.89G   89.8        35.6        91.2         36.0
OpenPose (Cao et al.)   368x368     25.94M      160.36G  87.6        22.8        88.8         22.5
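
To put the table above in perspective, the following minimal sketch computes how many times fewer parameters and FLOPs EfficientPose I Lite needs than OpenPose, and the accuracy it gives up in exchange. It uses only figures from the table; the variable and function names are illustrative.

```python
# Figures copied from the table above
# (parameters in millions, FLOPs in billions, PCKh@50 on MPII val).
models = {
    "EfficientPose I Lite": {"params_m": 0.59, "gflops": 1.54, "pckh50_val": 83.7},
    "OpenPose (Cao et al.)": {"params_m": 25.94, "gflops": 160.36, "pckh50_val": 87.6},
}


def reduction_factor(baseline: float, candidate: float) -> float:
    """Return how many times smaller `candidate` is than `baseline`."""
    return baseline / candidate


lite = models["EfficientPose I Lite"]
openpose = models["OpenPose (Cao et al.)"]

param_ratio = reduction_factor(openpose["params_m"], lite["params_m"])
flop_ratio = reduction_factor(openpose["gflops"], lite["gflops"])
accuracy_gap = openpose["pckh50_val"] - lite["pckh50_val"]

print(f"Parameters: {param_ratio:.1f}x fewer")
print(f"FLOPs:      {flop_ratio:.1f}x fewer")
print(f"PCKh@50 gap on MPII val: {accuracy_gap:.1f} points")
```

With these figures, the Lite model is roughly 44x smaller and about 104x cheaper in FLOPs, at a cost of 3.9 PCKh@50 points on the validation set.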

Results

Arm Model Zoo Networks Performance Test

Source of models:

All models are optimized for use on Arm NPUs.

Benchmark tools:

Performance measurement parameters:

Anomaly Detection

Network               Type   Score (AUC)  SRAM (MB)  Flash (MB)  Inference time (ms)
MicroNet Large INT8   INT8   0.968        0.4        0.46        4.8
MicroNet Medium INT8  INT8   0.963        0.27       0.47        4.83
MicroNet Small INT8   INT8   0.955        0.12       0.25        2.51
Dataset:
DCASE 2020 Task 2 Slide Rail

Image Classification

Network                      Type   Score (Top 1 Accuracy)  SRAM (MB)  Flash (MB)  Inference time (ms)
MobileNet v2 1.0 224 INT8 *  INT8   0.697                   1.47       3.57        43.13
MobileNet v2 1.0 224 UINT8   UINT8  0.708                   1.47       3.27        40.19
Dataset:
ILSVRC 2012
Benchmark reports were generated by the Arm Vela compiler, configured for the Alif Ensemble E7 HP core.
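
As a rough illustration of how such a report might be produced, the sketch below assembles a Vela command line using the accelerator, system configuration, and memory mode names that appear in the report headers below. The model file name, the `alif_e7.ini` configuration file, and the output directory are assumptions made for the example; the system configuration `Ethos_U55_Alif_HP` must be defined in whatever configuration file is actually passed to Vela.

```python
import shlex
from typing import List


def build_vela_command(model_path: str, config_path: str,
                       output_dir: str = "vela_out") -> List[str]:
    """Assemble a Vela invocation matching the settings shown in the reports."""
    return [
        "vela",
        model_path,
        "--accelerator-config", "ethos-u55-256",
        "--config", config_path,                 # hypothetical .ini with the Alif system config
        "--system-config", "Ethos_U55_Alif_HP",  # name taken from the report header
        "--memory-mode", "Shared_Sram",          # name taken from the report header
        "--output-dir", output_dir,              # assumed output location
    ]


# File names here are placeholders, not paths from the original benchmark setup.
command = build_vela_command("mobilenet_v2_1.0_224_INT8.tflite", "alif_e7.ini")
print(shlex.join(command))

# To actually run it (requires the ethos-u-vela package to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```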
Network summary for mobilenet_v2_1.0_224_INT8
Accelerator configuration               Ethos_U55_256
System configuration                Ethos_U55_Alif_HP
Memory mode                               Shared_Sram
Accelerator clock                                 400 MHz
Design peak SRAM bandwidth                       1.60 GB/s
Design peak Off-chip Flash bandwidth             0.10 GB/s
 
Total SRAM used                               1474.22 KiB
Total Off-chip Flash used                     3576.78 KiB
 
CPU operators = 0 (0.0%)
NPU operators = 95 (100.0%)
 
Average SRAM bandwidth                           0.60 GB/s
Input   SRAM bandwidth                          11.75 MB/batch
Weight  SRAM bandwidth                           6.95 MB/batch
Output  SRAM bandwidth                           6.99 MB/batch
Total   SRAM bandwidth                          25.86 MB/batch
Total   SRAM bandwidth            per input     25.86 MB/inference (batch size 1)
 
Average Off-chip Flash bandwidth                 0.08 GB/s
Input   Off-chip Flash bandwidth                 0.00 MB/batch
Weight  Off-chip Flash bandwidth                 3.46 MB/batch
Output  Off-chip Flash bandwidth                 0.00 MB/batch
Total   Off-chip Flash bandwidth                 3.47 MB/batch
Total   Off-chip Flash bandwidth  per input      3.47 MB/inference (batch size 1)
 
Neural network macs                         304452946 MACs/batch
Network Tops/s                                   0.01 Tops/s
 
NPU cycles                                   10635874 cycles/batch
SRAM Access cycles                            5024963 cycles/batch
DRAM Access cycles                                  0 cycles/batch
On-chip Flash Access cycles                         0 cycles/batch
Off-chip Flash Access cycles                  4959164 cycles/batch
Total cycles                                 17252122 cycles/batch
 
Batch Inference time                43.13 ms,   23.19 inferences/s (batch size 1)
Network summary for mobilenet_v2_1.0_224_quantized_1_default_1
Accelerator configuration               Ethos_U55_256
System configuration                Ethos_U55_Alif_HP
Memory mode                               Shared_Sram
Accelerator clock                                 400 MHz
Design peak SRAM bandwidth                       1.60 GB/s
Design peak Off-chip Flash bandwidth             0.10 GB/s
 
Total SRAM used                               1474.03 KiB
Total Off-chip Flash used                     3279.23 KiB
 
CPU operators = 0 (0.0%)
NPU operators = 64 (100.0%)
 
Average SRAM bandwidth                           0.62 GB/s
Input   SRAM bandwidth                          11.73 MB/batch
Weight  SRAM bandwidth                           6.06 MB/batch
Output  SRAM bandwidth                           6.97 MB/batch
Total   SRAM bandwidth                          24.94 MB/batch
Total   SRAM bandwidth            per input     24.94 MB/inference (batch size 1)
 
Average Off-chip Flash bandwidth                 0.08 GB/s
Input   Off-chip Flash bandwidth                 0.00 MB/batch
Weight  Off-chip Flash bandwidth                 3.16 MB/batch
Output  Off-chip Flash bandwidth                 0.00 MB/batch
Total   Off-chip Flash bandwidth                 3.17 MB/batch
Total   Off-chip Flash bandwidth  per input      3.17 MB/inference (batch size 1)
 
Neural network macs                         304450944 MACs/batch
Network Tops/s                                   0.02 Tops/s
 
NPU cycles                                    9635618 cycles/batch
SRAM Access cycles                            5013209 cycles/batch
DRAM Access cycles                                  0 cycles/batch
On-chip Flash Access cycles                         0 cycles/batch
Off-chip Flash Access cycles                  4773911 cycles/batch
Total cycles                                 16074423 cycles/batch
 
Batch Inference time                40.19 ms,   24.88 inferences/s (batch size 1)
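
The headline numbers in the two reports above follow from a handful of raw figures. The sketch below recomputes inference time, throughput, Tops/s, and average bandwidth from the total cycle count, the accelerator clock, the MAC count, and the per-inference traffic. It assumes that one MAC counts as two operations and that the average bandwidth is the per-inference traffic divided by the inference time; the class and field names are illustrative, not part of Vela.

```python
from dataclasses import dataclass


@dataclass
class VelaSummary:
    name: str
    clock_mhz: float    # "Accelerator clock"
    total_cycles: int   # "Total cycles"
    macs: int           # "Neural network macs"
    sram_mb: float      # "Total SRAM bandwidth" per inference
    flash_mb: float     # "Total Off-chip Flash bandwidth" per inference

    @property
    def inference_ms(self) -> float:
        # cycles / (cycles per second) gives seconds; scale to milliseconds
        return self.total_cycles / (self.clock_mhz * 1e6) * 1e3

    @property
    def inferences_per_s(self) -> float:
        return 1e3 / self.inference_ms

    @property
    def tops(self) -> float:
        # Assumption: one MAC is counted as two operations
        return 2 * self.macs * self.inferences_per_s / 1e12

    def avg_bandwidth_gb_s(self, mb_per_inference: float) -> float:
        # MB per millisecond is numerically equal to GB per second
        return mb_per_inference / self.inference_ms


# Figures copied from the two reports above.
int8 = VelaSummary("mobilenet_v2_1.0_224_INT8",
                   400, 17252122, 304452946, 25.86, 3.47)
uint8 = VelaSummary("mobilenet_v2_1.0_224_quantized_1_default_1",
                    400, 16074423, 304450944, 24.94, 3.17)

for report in (int8, uint8):
    print(report.name)
    print(f"  inference time : {report.inference_ms:.2f} ms")
    print(f"  throughput     : {report.inferences_per_s:.2f} inferences/s")
    print(f"  network        : {report.tops:.2f} Tops/s")
    print(f"  avg SRAM bw    : {report.avg_bandwidth_gb_s(report.sram_mb):.2f} GB/s")
    print(f"  avg flash bw   : {report.avg_bandwidth_gb_s(report.flash_mb):.2f} GB/s")
```

Under these assumptions the recomputed values agree with the reported ones to two decimal places, for example 17252122 cycles at 400 MHz gives 43.13 ms and 23.19 inferences/s.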

Keyword Spotting

Network                  Type  Score (Accuracy)  SRAM (MB)  Flash (MB)  Inference time (ms)
CNN Large INT8 *         INT8  0.931             0.17       0.45        4.64
CNN Medium INT8 *        INT8  0.911             0.16       0.16        1.69
CNN Small INT8 *         INT8  0.912             0.05       0.07        0.76
DNN Large INT8 *         INT8  0.863             0.0009     0.46        5.34
DNN Medium INT8 *        INT8  0.844             0.0005     0.19        1.89
DNN Small INT8 *         INT8  0.825             0.0003     0.09        1.44
DS-CNN Clustered INT8 *  INT8  0.940             0.11       0.45        4.37
DS-CNN Large INT8 *      INT8  0.946             0.12       0.52        5.05
DS-CNN Medium INT8 *     INT8  0.946             0.12       0.52        5.05
DS-CNN Small INT8 *      INT8  0.935             0.02       0.04        0.36
MicroNet Large INT8      INT8  0.965             0.21       0.55        5.94
MicroNet Medium INT8     INT8  0.958             0.1        0.15        1.68
MicroNet Small INT8      INT8  0.953             0.07       0.09        1.01
Dataset:
Google Speech Commands Test Set
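
One way to use the keyword spotting table above is as a lookup for choosing a model under a resource budget. The sketch below picks the most accurate network that fits a given flash size and inference time. The budget values (0.2 MB flash, 2 ms) are arbitrary examples, and the function name is illustrative.

```python
from typing import List, Optional, Tuple

# (network, accuracy, SRAM MB, flash MB, inference time ms), copied from the table above.
Row = Tuple[str, float, float, float, float]

KWS_RESULTS: List[Row] = [
    ("CNN Large INT8 *", 0.931, 0.17, 0.45, 4.64),
    ("CNN Medium INT8 *", 0.911, 0.16, 0.16, 1.69),
    ("CNN Small INT8 *", 0.912, 0.05, 0.07, 0.76),
    ("DNN Large INT8 *", 0.863, 0.0009, 0.46, 5.34),
    ("DNN Medium INT8 *", 0.844, 0.0005, 0.19, 1.89),
    ("DNN Small INT8 *", 0.825, 0.0003, 0.09, 1.44),
    ("DS-CNN Clustered INT8 *", 0.940, 0.11, 0.45, 4.37),
    ("DS-CNN Large INT8 *", 0.946, 0.12, 0.52, 5.05),
    ("DS-CNN Medium INT8 *", 0.946, 0.12, 0.52, 5.05),
    ("DS-CNN Small INT8 *", 0.935, 0.02, 0.04, 0.36),
    ("MicroNet Large INT8", 0.965, 0.21, 0.55, 5.94),
    ("MicroNet Medium INT8", 0.958, 0.1, 0.15, 1.68),
    ("MicroNet Small INT8", 0.953, 0.07, 0.09, 1.01),
]


def best_under_budget(rows: List[Row], max_flash_mb: float,
                      max_ms: float) -> Optional[Row]:
    """Return the most accurate row within both limits, or None if none fits."""
    candidates = [r for r in rows if r[3] <= max_flash_mb and r[4] <= max_ms]
    return max(candidates, key=lambda r: r[1]) if candidates else None


# Example budget (hypothetical): 0.2 MB of flash and 2 ms per inference.
choice = best_under_budget(KWS_RESULTS, max_flash_mb=0.2, max_ms=2.0)
print(choice)
```

For this example budget the selection is MicroNet Medium INT8, the highest-accuracy entry among those that fit.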

Noise Suppression

Network         Type  Score (PESQ)  SRAM (MB)  Flash (MB)  Inference time (ms)
RNNoise INT8 *  INT8  2.945         0.0009     0.12        1.45
Dataset:
Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models

Speech Recognition

Network                        Type  Score (LER)  SRAM (MB)  Flash (MB)  Inference time (ms)
Wav2letter INT8                INT8  0.0877       1.77       21.3        217.95
Wav2letter Pruned INT8 *       INT8  0.0783       1.25       13.56       139.7
Tiny Wav2letter INT8 *         INT8  0.0348       0.71       3.67        37.46
Tiny Wav2letter Pruned INT8 *  INT8  0.0283       0.5        2.34        23.75
Dataset:
LibriSpeech, Fluent Speech

Visual Wake Words

Network              Type  Score (Accuracy)  SRAM (MB)  Flash (MB)  Inference time (ms)
MicroNet VWW-2 INT8  INT8  0.768             0.03       0.18        1.51
MicroNet VWW-3 INT8  INT8  0.855             0.13       0.42        4.65
MicroNet VWW-4 INT8  INT8  0.822             0.12       0.37        4.15
Dataset:
Visual Wake Words

Plans