
Classify Images on an FPGA Using a Quantized DAG Network

In this example, you use Deep Learning HDL Toolbox™ to deploy a quantized deep convolutional neural network and classify an image. The example uses the pretrained ResNet-18 convolutional neural network to demonstrate transfer learning, quantization, and deployment of the quantized network. Use MATLAB® to retrieve the prediction results.

ResNet-18 has been trained on over a million images and can classify images into 1000 object categories (such as keyboard, coffee mug, pencil, and many animals). The network has learned rich feature representations for a wide range of images. The network takes an image as input and outputs a label for the object in the image together with the probabilities for each of the object categories.

Required Products

For this example, you need:

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model for ResNet-18 Network

  • Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC Devices

  • Image Processing Toolbox™

  • Deep Learning Toolbox Model Quantization Library

  • MATLAB® Coder™ Interface for Deep Learning Libraries

Transfer Learning Using ResNet-18

To perform classification on a new set of images, you fine-tune a pretrained ResNet-18 convolutional neural network by transfer learning. In transfer learning, you can take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network with randomly initialized weights from scratch. You can quickly transfer learned features to a new task using a smaller number of training images.

Load Pretrained ResNet-18 Network

To load the pretrained network ResNet-18, enter:

net = resnet18;

To view the layers of the pretrained network, enter:

analyzeNetwork(net);

The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.

inputSize = net.Layers(1).InputSize;
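
As an optional check that the network loaded correctly, classify a stock image with the pretrained network before retraining it. This is a minimal sketch: it assumes the sample image peppers.png that ships with MATLAB is on the path, and the predicted label comes from the original 1000 ImageNet classes.

% Optional check (sketch): classify a sample image with the pretrained network
I = imread('peppers.png');           % sample image shipped with MATLAB
I = imresize(I,inputSize(1:2));      % resize to the network input size
classify(net,I)                      % returns one of the 1000 ImageNet labels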

Define Training and Validation Data Sets

This example uses the MathWorks MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).

curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
validationData_FPGA = subset(imdsValidation,1);
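
To confirm that the split preserves all five classes, you can optionally count the labels in each datastore.

% Optional check: number of images per class in each split
countEachLabel(imdsTrain)
countEachLabel(imdsValidation)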

Replace Final Layers

The fully connected layer and the classification layer of the pretrained network net are configured for 1000 classes. These two layers, fc1000 and ClassificationLayer_predictions in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. These two layers must be fine-tuned for the new classification problem. Extract the layer graph from the pretrained network and replace these two layers with new layers adapted to the new data set.

lgraph = layerGraph(net)
lgraph = 
  LayerGraph with properties:

         Layers: [71×1 nnet.cnn.layer.Layer]
    Connections: [78×2 table]
     InputNames: {'data'}
    OutputNames: {'ClassificationLayer_predictions'}

numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
newLearnableLayer = fullyConnectedLayer(numClasses, ...
'Name','new_fc', ...
'WeightLearnRateFactor',10, ...
'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);
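
To confirm that the replacement succeeded, you can optionally display the final layers of the modified layer graph, or call analyzeNetwork(lgraph) for an interactive view. This sketch assumes the classification layer is the last entry of the Layers array, as it is for the standard resnet18 layer graph.

% Optional check (sketch): the last three layers should now be new_fc, prob, and new_classoutput
lgraph.Layers(end-2:end)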

Train Network

The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');

Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available (requires Parallel Computing Toolbox™ and a supported GPU device; for more information, see GPU Computing Requirements (Parallel Computing Toolbox)). Otherwise, trainNetwork uses the CPU (requires MATLAB Coder Interface for Deep Learning Libraries). You can also specify the execution environment by using the 'ExecutionEnvironment' name-value argument of trainingOptions.

netTransfer = trainNetwork(augimdsTrain,lgraph,options);
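
You can optionally check the floating-point accuracy of the retrained network on the validation images before quantizing it. This check is a sketch and is not required for deployment.

% Optional check (sketch): floating-point validation accuracy of the retrained network
YPred = classify(netTransfer,augimdsValidation);
accuracy = mean(YPred == imdsValidation.Labels)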

Quantize the Network

Create a dlquantizer object and specify the network to quantize.

dlquantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');

Calibrate the Quantized Network Object

Use the calibrate function to exercise the network with sample inputs and collect the range information. The calibrate function exercises the network and collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The calibrate function returns a table. Each row of the table contains range information for a learnable parameter of the quantized network.

dlquantObj.calibrate(augimdsTrain)
ans=95×5 table
       Optimized Layer Name       Network Layer Name    Learnables / Activations    MinValue    MaxValue
    __________________________    __________________    ________________________    ________    ________

    {'conv1_Weights'         }    {'conv1'         }           "Weights"            -0.56885    0.65166 
    {'conv1_Bias'            }    {'conv1'         }           "Bias"               -0.66869    0.67504 
    {'res2a_branch2a_Weights'}    {'res2a_branch2a'}           "Weights"            -0.46037    0.34327 
    {'res2a_branch2a_Bias'   }    {'res2a_branch2a'}           "Bias"               -0.82446     1.3337 
    {'res2a_branch2b_Weights'}    {'res2a_branch2b'}           "Weights"             -0.8002    0.60524 
    {'res2a_branch2b_Bias'   }    {'res2a_branch2b'}           "Bias"                -1.3954     1.7536 
    {'res2b_branch2a_Weights'}    {'res2b_branch2a'}           "Weights"            -0.33991     0.3503 
    {'res2b_branch2a_Bias'   }    {'res2b_branch2a'}           "Bias"                -1.1367     1.5317 
    {'res2b_branch2b_Weights'}    {'res2b_branch2b'}           "Weights"             -1.2616    0.93491 
    {'res2b_branch2b_Bias'   }    {'res2b_branch2b'}           "Bias"               -0.86662     1.2352 
    {'res3a_branch2a_Weights'}    {'res3a_branch2a'}           "Weights"            -0.19675    0.23903 
    {'res3a_branch2a_Bias'   }    {'res3a_branch2a'}           "Bias"                -0.5063    0.69182 
    {'res3a_branch2b_Weights'}    {'res3a_branch2b'}           "Weights"             -0.5385    0.74078 
    {'res3a_branch2b_Bias'   }    {'res3a_branch2b'}           "Bias"               -0.66884     1.2152 
    {'res3a_branch1_Weights' }    {'res3a_branch1' }           "Weights"            -0.66715    0.98369 
    {'res3a_branch1_Bias'    }    {'res3a_branch1' }           "Bias"               -0.97269    0.83073 
      ⋮

Create Target Object

Use the dlhdl.Target class to create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2020.2. To set the Xilinx Vivado tool path, enter:

% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
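
If your board does not respond at the default address, you can pass its IP address when you create the target object. This line is a sketch: 192.168.1.101 is the address the board responds at later in this example, so substitute the address of your own board.

% Optional (sketch): specify the board IP address for the Ethernet interface
% hTarget = dlhdl.Target('Xilinx','Interface','Ethernet','IPAddress','192.168.1.101');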

Create Workflow Object

Use the dlhdl.Workflow class to create an object. When you create the object, specify the network and the bitstream name. Specify the quantized network object dlquantObj as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board, and the bitstream uses an int8 data type.

hW = dlhdl.Workflow('Network', dlquantObj, 'Bitstream', 'zcu102_int8','Target',hTarget);
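
You can optionally ask the workflow object for a performance estimate for the target bitstream before compiling and deploying. This call is shown only as a commented-out sketch; the estimator results that appear later in this example come from the validate step.

% Optional (sketch): estimate network performance on the target bitstream
% hW.estimate('Performance');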

Compile the netTransfer DAG Network

To compile the netTransfer DAG network, run the compile method of the dlhdl.Workflow object. You can optionally specify the maximum number of input frames.

dn = hW.compile('InputFrameNumberLimit',15)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 Convolution                  64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'bn_conv1'              Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     4   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     5   'pool1'                 Max Pooling                  3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     6   'res2a_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     7   'bn2a_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     8   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     9   'res2a_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    10   'bn2a_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    11   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    12   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    13   'res2b_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    14   'bn2b_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    15   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    16   'res2b_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    17   'bn2b_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    18   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    19   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    20   'res3a_branch2a'        Convolution                  128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    21   'bn3a_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    22   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3a_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'bn3a_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    25   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    26   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    27   'res3a_branch1'         Convolution                  128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    28   'bn3a_branch1'          Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    29   'res3b_branch2a'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    30   'bn3b_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    31   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    32   'res3b_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'bn3b_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    34   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    35   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    36   'res4a_branch2a'        Convolution                  256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    37   'bn4a_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    38   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res4a_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'bn4a_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    41   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res4a_branch1'         Convolution                  256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    44   'bn4a_branch1'          Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    45   'res4b_branch2a'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'bn4b_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    47   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    48   'res4b_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    49   'bn4b_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    50   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    51   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    52   'res5a_branch2a'        Convolution                  512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    53   'bn5a_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    54   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    55   'res5a_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    56   'bn5a_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    57   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    58   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    59   'res5a_branch1'         Convolution                  512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    60   'bn5a_branch1'          Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    61   'res5b_branch2a'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    62   'bn5b_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    63   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    64   'res5b_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    65   'bn5b_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    66   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    67   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    68   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    69   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    70   'prob'                  Softmax                      softmax                                                               (HW Layer)
    71   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                                  
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: new_fc ...
### Compiling layer group: new_fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "8.0 MB"        
    "OutputResultOffset"        "0x00800000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00c00000"     "4.0 MB"        
    "SystemBufferOffset"        "0x01000000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02c00000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03000000"     "16.0 MB"       
    "FCWeightDataOffset"        "0x04000000"     "4.0 MB"        
    "EndOffset"                 "0x04400000"     "Total: 68.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {}

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function programs the FPGA device, displays progress messages, and displays the time it takes to deploy the network.

hW.deploy
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Jan-2022 08:45:22
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 20-Jan-2022 08:45:22

Load Image for Prediction

Load the example image.

imgFile = fullfile(pwd,'MerchData','MathWorks Cube','Mathworks cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)

Run Prediction for One Image

Execute the predict method on the dlhdl.Workflow object, and then show the label in the MATLAB Command Window.

[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7389695                  0.02956                       1            7392277             33.8
    conv1                  1115359                  0.00446 
    pool1                   237742                  0.00095 
    res2a_branch2a          269669                  0.00108 
    res2a_branch2b          270019                  0.00108 
    res2a                   103095                  0.00041 
    res2b_branch2a          269716                  0.00108 
    res2b_branch2b          269895                  0.00108 
    res2b                   102385                  0.00041 
    res3a_branch1           156246                  0.00062 
    res3a_branch2a          227373                  0.00091 
    res3a_branch2b          245201                  0.00098 
    res3a                    52543                  0.00021 
    res3b_branch2a          244793                  0.00098 
    res3b_branch2b          244952                  0.00098 
    res3b                    51176                  0.00020 
    res4a_branch1           135788                  0.00054 
    res4a_branch2a          135745                  0.00054 
    res4a_branch2b          237464                  0.00095 
    res4a                    25612                  0.00010 
    res4b_branch2a          237244                  0.00095 
    res4b_branch2b          237242                  0.00095 
    res4b                    25952                  0.00010 
    res5a_branch1           311610                  0.00125 
    res5a_branch2a          311719                  0.00125 
    res5a_branch2b          596194                  0.00238 
    res5a                    13191                  0.00005 
    res5b_branch2a          595890                  0.00238 
    res5b_branch2b          596795                  0.00239 
    res5b                    14141                  0.00006 
    pool5                    36932                  0.00015 
    new_fc                   17825                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
[val, idx] = max(prediction);
dlquantObj.NetworkObject.Layers(end).ClassNames{idx}
ans = 
'MathWorks Cube'
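
To see all five class scores rather than only the top prediction, you can pair the score vector with the class names. This sketch assumes that the order of the scores in prediction matches the ClassNames order of the output layer.

% Optional (sketch): tabulate all class scores for the test image
classNames = dlquantObj.NetworkObject.Layers(end).ClassNames;
table(classNames(:),prediction(:),'VariableNames',{'Class','Score'})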

Performance Comparison

Compare the performance of the quantized network to the single data type network.

options_FPGA = dlquantizationOptions('Bitstream','zcu102_int8','Target',hTarget);
prediction_FPGA = dlquantObj.validate(imdsValidation,options_FPGA)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 Convolution                  64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'bn_conv1'              Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     4   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     5   'pool1'                 Max Pooling                  3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     6   'res2a_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     7   'bn2a_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     8   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     9   'res2a_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    10   'bn2a_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    11   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    12   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    13   'res2b_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    14   'bn2b_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    15   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    16   'res2b_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    17   'bn2b_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    18   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    19   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    20   'res3a_branch2a'        Convolution                  128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    21   'bn3a_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    22   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3a_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'bn3a_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    25   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    26   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    27   'res3a_branch1'         Convolution                  128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    28   'bn3a_branch1'          Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    29   'res3b_branch2a'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    30   'bn3b_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    31   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    32   'res3b_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'bn3b_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    34   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    35   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    36   'res4a_branch2a'        Convolution                  256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    37   'bn4a_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    38   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res4a_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'bn4a_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    41   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res4a_branch1'         Convolution                  256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    44   'bn4a_branch1'          Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    45   'res4b_branch2a'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'bn4b_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    47   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    48   'res4b_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    49   'bn4b_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    50   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    51   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    52   'res5a_branch2a'        Convolution                  512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    53   'bn5a_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    54   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    55   'res5a_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    56   'bn5a_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    57   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    58   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    59   'res5a_branch1'         Convolution                  512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    60   'bn5a_branch1'          Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    61   'res5b_branch2a'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    62   'bn5b_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    63   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    64   'res5b_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    65   'bn5b_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    66   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    67   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    68   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    69   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    70   'prob'                  Softmax                      softmax                                                               (HW Layer)
    71   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                                  
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: new_fc ...
### Compiling layer group: new_fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "12.0 MB"       
    "OutputResultOffset"        "0x00c00000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x01000000"     "4.0 MB"        
    "SystemBufferOffset"        "0x01400000"     "28.0 MB"       
    "InstructionDataOffset"     "0x03000000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03400000"     "16.0 MB"       
    "FCWeightDataOffset"        "0x04400000"     "4.0 MB"        
    "EndOffset"                 "0x04800000"     "Total: 72.0 MB"

### Network compilation complete.

### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Jan-2022 08:46:40
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 20-Jan-2022 08:46:40
### Finished writing input activations.
### Running in multi-frame mode with 20 inputs.


              Deep Learning Processor Bitstream Build Info

Resource                   Utilized           Total        Percentage
------------------        ----------      ------------    ------------
LUTs (CLB/ALM)*              248190            274080           90.55
DSPs                            384              2520           15.24
Block RAM                       581               912           63.71
* LUT count represents Configurable Logic Block(CLB) utilization in Xilinx devices and Adaptive Logic Module (ALM) utilization in Intel devices.

### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   23871185                  0.10851                       1           23871185              9.2
    ____data_norm_add       210750                  0.00096 
    ____data_norm           210750                  0.00096 
    ____conv1              2165372                  0.00984 
    ____pool1               646226                  0.00294 
    ____res2a_branch2a      966221                  0.00439 
    ____res2a_branch2b      966221                  0.00439 
    ____res2a               210750                  0.00096 
    ____res2b_branch2a      966221                  0.00439 
    ____res2b_branch2b      966221                  0.00439 
    ____res2b               210750                  0.00096 
    ____res3a_branch1       540749                  0.00246 
    ____res3a_branch2a      763860                  0.00347 
    ____res3a_branch2b      919117                  0.00418 
    ____res3a               105404                  0.00048 
    ____res3b_branch2a      919117                  0.00418 
    ____res3b_branch2b      919117                  0.00418 
    ____res3b               105404                  0.00048 
    ____res4a_branch1       509261                  0.00231 
    ____res4a_branch2a      509261                  0.00231 
    ____res4a_branch2b      905421                  0.00412 
    ____res4a                52724                  0.00024 
    ____res4b_branch2a      905421                  0.00412 
    ____res4b_branch2b      905421                  0.00412 
    ____res4b                52724                  0.00024 
    ____res5a_branch1      1046605                  0.00476 
    ____res5a_branch2a     1046605                  0.00476 
    ____res5a_branch2b     2005197                  0.00911 
    ____res5a                26368                  0.00012 
    ____res5b_branch2a     2005197                  0.00911 
    ____res5b_branch2b     2005197                  0.00911 
    ____res5b                26368                  0.00012 
    ____pool5                54594                  0.00025 
    ____new_fc               22571                  0.00010 
 * The clock frequency of the DL processor is: 220MHz




              Deep Learning Processor Bitstream Build Info

Resource                   Utilized           Total        Percentage
------------------        ----------      ------------    ------------
LUTs (CLB/ALM)*              168836            274080           61.60
DSPs                            800              2520           31.75
Block RAM                       453               912           49.67
* LUT count represents Configurable Logic Block(CLB) utilization in Xilinx devices and Adaptive Logic Module (ALM) utilization in Intel devices.

### Finished writing input activations.
### Running single input activation.
prediction_FPGA = struct with fields:
       NumSamples: 20
    MetricResults: [1×1 struct]
       Statistics: [2×7 table]

prediction_FPGA.Statistics.FramesPerSecond
ans = 2×1

    9.2161
   33.8157

The first number is the frames per second performance of the single data type network, and the second number is the frames per second performance of the quantized network.
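
The validate function also reports an accuracy metric for both implementations. To view it, inspect the MetricResults field of the output struct; this sketch assumes the default metric and the Result table returned by dlquantizer validation.

% Optional (sketch): compare the accuracy metric of both implementations
prediction_FPGA.MetricResults.Result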