Image Classification Using Neural Network on FPGA

This example uses:

Deep Learning HDL Toolbox
Deep Learning Toolbox
Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices
Deep Learning Toolbox Model for ResNet-18 Network

This example shows how to train, compile, and deploy a dlhdl.Workflow object that has a ResNet-18 neural network to an FPGA, and how to use MATLAB® to retrieve the prediction results.

Load Pretrained Network

Load the pretrained ResNet-18 network:

net = imagePretrainedNetwork('resnet18');

View the layers of the pretrained network:

deepNetworkDesigner(net);

The first layer, the image input layer, requires input images of size 224-by-224-by-3, where three is the number of color channels.

inputSize = net.Layers(1).InputSize;

Load Data

This example uses the MathWorks® MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).

curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');

Specify Training and Validation Sets

Divide the data into training and validation data sets so that 70% of the images go to the training data set and 30% of the images go to the validation data set. splitEachLabel splits the datastore imds into two new datastores, imdsTrain and imdsValidation.

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
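
As a quick sanity check on the split, you can count the images per class in each datastore. This is a minimal sketch that assumes imdsTrain and imdsValidation from the call above are in the workspace. The names trainCounts, valCounts, and trainFraction are introduced here for illustration only.

```matlab
% Count the images per label in each datastore.
% countEachLabel returns a table with the variables Label and Count.
trainCounts = countEachLabel(imdsTrain);
valCounts = countEachLabel(imdsValidation);
disp(trainCounts)
disp(valCounts)

% Fraction of all images that ended up in the training set.
numTrain = sum(trainCounts.Count);
numVal = sum(valCounts.Count);
trainFraction = numTrain/(numTrain + numVal);
fprintf('Training fraction: %.2f\n',trainFraction);
```

Because splitEachLabel assigns whole files per label, the resulting fraction can differ slightly from the proportion passed to the function.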

Replace Final Layers

To retrain ResNet-18 to classify new images, replace the last fully connected layer of the network, named 'fc1000'. This layer is configured for 1000 classes and contains information on how to combine the features that the network extracts into class probabilities. It must be fine-tuned for the new classification problem. Create a new fullyConnectedLayer to be configured with the new set of classes and replace 'fc1000'.

classNames = categories(imds.Labels)

classNames = 5×1 cell array
    "'MathWorks Cap'"
    "'MathWorks Cube'"
    "'MathWorks Playing Cards'"
    "'MathWorks Screwdriver'"
    "'MathWorks Torch'"

numClasses = numel(classNames)

numClasses = 
5

newLearnableLayer = fullyConnectedLayer(numClasses, ...
'Name','new_fc', ...
'WeightLearnRateFactor',10, ...
'BiasLearnRateFactor',10);
net = replaceLayer(net,'fc1000',newLearnableLayer);
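
To confirm that the replacement took effect, you can look up the new layer by name. The following is a sketch that assumes net and numClasses from the code above are in the workspace. The variable names layerNames and idx are illustrative.

```matlab
% Collect the layer names and locate the new fully connected layer.
layerNames = {net.Layers.Name};
idx = find(strcmp(layerNames,'new_fc'));
disp(net.Layers(idx))

% The original 1000-class layer should no longer be present.
assert(~any(strcmp(layerNames,'fc1000')))
% The new layer should produce one output per class.
assert(net.Layers(idx).OutputSize == numClasses)
```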

Prepare Data for Training

The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
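
To see the effect of resizing and augmentation, you can preview a few images from the augmented training datastore. This sketch assumes augimdsTrain and inputSize from the code above are in the workspace. The variable name minibatch is illustrative.

```matlab
% preview returns a table of preprocessed samples; the images are stored
% in the 'input' variable of that table.
minibatch = preview(augimdsTrain);

% Show the previewed images as a tiled montage.
imshow(imtile(minibatch.input));

% Each preprocessed image should match the network input height and width.
firstImageSize = size(minibatch.input{1});
assert(isequal(firstImageSize(1:2),inputSize(1:2)))
```

Because the augmentations are random, the preview differs each time you run it.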

Specify Training Options

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');

Train Network

Train the network that consists of the transferred and new layers. By default, trainnet uses a GPU if one is available. Using this function on a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Computing Requirements (Parallel Computing Toolbox). If a GPU is not available, trainnet uses a CPU (requires MATLAB® Coder™ Interface for Deep Learning). You can also specify the execution environment by using the ExecutionEnvironment name-value argument of trainingOptions.

netTransfer = trainnet(augimdsTrain,net,"crossentropy",options);
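
Before deploying, it can be useful to check the accuracy of the retrained network in software. The following sketch assumes a release that provides minibatchpredict and scores2label, and that netTransfer, augimdsValidation, imdsValidation, and classNames are in the workspace. The variable names scores, YPred, and accuracy are illustrative.

```matlab
% Compute class scores for the validation images in mini-batches.
scores = minibatchpredict(netTransfer,augimdsValidation);

% Convert the scores to categorical labels using the class names.
YPred = scores2label(scores,classNames);

% Fraction of validation images that the network classifies correctly.
accuracy = mean(YPred == imdsValidation.Labels);
fprintf('Validation accuracy: %.2f\n',accuracy);
```

The result varies between runs because the data split and the training are randomized.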

Define FPGA Board Interface

Define the target FPGA board programming interface by using the dlhdl.Target object. Create a programming interface with a custom name for your target device and an Ethernet interface that connects the target device to the host computer.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Prepare Network for Deployment

Prepare the network for deployment by creating a dlhdl.Workflow object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® UltraScale+™ MPSoC ZCU102 board and the bitstream uses the single data type.

hW = dlhdl.Workflow(Network=netTransfer,Bitstream='zcu102_single',Target=hTarget);

Compile Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment.

dn = compile(hW,'InputFrameNumberLimit',15)

### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_prob' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    50   'prob'                  Softmax                      softmax                                                               (SW Layer)
    51   'Output1_prob'          Regression Output            mean-squared-error                                                    (SW Layer)
                                                                                                                                  
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_prob' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: new_fc ...
### Compiling layer group: new_fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "11.5 MB"       
    "OutputResultOffset"        "0x00b7c000"     "4.0 kB"        
    "SchedulerDataOffset"       "0x00b7d000"     "4.1 MB"        
    "SystemBufferOffset"        "0x00f94000"     "6.1 MB"        
    "InstructionDataOffset"     "0x015b9000"     "2.4 MB"        
    "ConvWeightDataOffset"      "0x0181d000"     "49.5 MB"       
    "FCWeightDataOffset"        "0x04995000"     "20.0 kB"       
    "EndOffset"                 "0x0499a000"     "Total: 73.6 MB"

### Network compilation complete.

dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {[1×2 cell]  [1×200704 single]}
             ddrInfo: [1×1 struct]
       resourceTable: [6×2 table]
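
The allocated_space column in the memory table can be reproduced from the offset addresses, because each buffer ends where the next one begins. The sketch below copies the offsets from the log above and uses only base MATLAB functions. The variable names are illustrative. The final check is an observation about the numbers; reading it as padding of the three color channels to four is an assumption, not something the log states.

```matlab
% Offsets copied from the compile log, in the order printed.
names = {'InputDataOffset','OutputResultOffset','SchedulerDataOffset', ...
    'SystemBufferOffset','InstructionDataOffset','ConvWeightDataOffset', ...
    'FCWeightDataOffset'};
offsets = hex2dec({'00000000','00b7c000','00b7d000','00f94000', ...
    '015b9000','0181d000','04995000','0499a000'});

% Each buffer extends from its own offset to the start of the next one.
sizesBytes = diff(offsets);
for k = 1:numel(names)
    fprintf('%-24s %10d bytes (%.3f MB)\n',names{k},sizesBytes(k),sizesBytes(k)/2^20);
end

% The end offset gives the total allocation (1 MB taken as 2^20 bytes).
totalMB = offsets(end)/2^20;
fprintf('Total: %.1f MB\n',totalMB);

% Observation: the input buffer equals 15 frames (InputFrameNumberLimit)
% of 224*224*4 single-precision values at 4 bytes each.
assert(sizesBytes(1) == 15*224*224*4*4)
```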

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx Zynq UltraScale+ MPSoC ZCU102 hardware, run the deploy method of the dlhdl.Workflow object. This method programs the FPGA board by using the output of the compile method and the programming file, downloads the network weights and biases, and displays progress messages and the time it takes to deploy the network.

deploy(hW)

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Jun-2024 11:41:03
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 20-Jun-2024 11:41:04

Test Network

Load the example image.

imgFile = fullfile('MathWorks_cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)

Classify the image on the FPGA by using the predict method of the dlhdl.Workflow object and display the results.

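% Wrap the image as a single-precision dlarray with the dimension labels
% spatial, spatial, channel, batch.
inputImg = dlarray(single(inputImg), 'SSCB');
% Run inference on the FPGA and collect per-layer profiling data.
[prediction,speed] = predict(hW,inputImg,'Profile','on');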

### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   25195250                  0.11452                       1           25197887              8.7
    data_norm_add           351843                  0.00160 
    data_norm               351897                  0.00160 
    conv1                  2227067                  0.01012 
    pool1                   505559                  0.00230 
    res2a_branch2a          974578                  0.00443 
    res2a_branch2b          974411                  0.00443 
    res2a                   374362                  0.00170 
    res2b_branch2a          974305                  0.00443 
    res2b_branch2b          973993                  0.00443 
    res2b                   374342                  0.00170 
    res3a_branch1           539328                  0.00245 
    res3a_branch2a          542216                  0.00246 
    res3a_branch2b          909793                  0.00414 
    res3a                   187275                  0.00085 
    res3b_branch2a          909728                  0.00414 
    res3b_branch2b          910170                  0.00414 
    res3b                   186815                  0.00085 
    res4a_branch1           490994                  0.00223 
    res4a_branch2a          494815                  0.00225 
    res4a_branch2b          894077                  0.00406 
    res4a                    93534                  0.00043 
    res4b_branch2a          894273                  0.00406 
    res4b_branch2b          894030                  0.00406 
    res4b                    93504                  0.00043 
    res5a_branch1          1131745                  0.00514 
    res5a_branch2a         1134469                  0.00516 
    res5a_branch2b         2211712                  0.01005 
    res5a                    46862                  0.00021 
    res5b_branch2a         2212049                  0.01005 
    res5b_branch2b         2211664                  0.01005 
    res5b                    46922                  0.00021 
    pool5                    73827                  0.00034 
    new_fc                    2904                  0.00001 
 * The clock frequency of the DL processor is: 220MHz
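
The latency in seconds and the frame rate in the profiler table follow from the cycle count and the reported clock frequency. The sketch below copies a few values from the table above and uses only base MATLAB. The variable names are illustrative.

```matlab
% Values copied from the profiler table above.
networkCycles = 25195250;   % LastFrameLatency(cycles) for the whole network
clockHz = 220e6;            % reported DL processor clock frequency

% Convert cycles to seconds and estimate single-frame throughput.
latencySeconds = networkCycles/clockHz;
framesPerSecond = 1/latencySeconds;
fprintf('Latency: %.5f s, throughput: %.1f frames/s\n',latencySeconds,framesPerSecond);

% Share of the network latency spent in four of the slowest layers.
slowLayers = {'conv1','res5a_branch2b','res5b_branch2a','res5b_branch2b'};
slowCycles = [2227067 2211712 2212049 2211664];
share = 100*slowCycles/networkCycles;
for k = 1:numel(slowLayers)
    fprintf('%-16s %5.1f%%\n',slowLayers{k},share(k));
end
```

A breakdown like this shows which layers dominate the runtime and are therefore the most promising candidates for optimization.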

[val,idx] = max(prediction);
classNames{idx}

ans = 
'MathWorks Cube'
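
Beyond the single best class, it is often helpful to list the highest-scoring classes together with their scores. The sketch below uses hypothetical scores for illustration only; they are not results from the FPGA. In the example, you would use the prediction output and classNames instead of exampleScores and exampleNames.

```matlab
% Hypothetical scores for illustration only (not measured results).
exampleScores = [0.02 0.90 0.03 0.04 0.01];
exampleNames = {'MathWorks Cap','MathWorks Cube','MathWorks Playing Cards', ...
    'MathWorks Screwdriver','MathWorks Torch'};

% Sort the scores from highest to lowest and list the top three classes.
[sortedScores,order] = sort(exampleScores,'descend');
for k = 1:3
    fprintf('%-26s %.2f\n',exampleNames{order(k)},sortedScores(k));
end
topClass = exampleNames{order(1)};
```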