Image Classification Using DAG Network Deployed to FPGA
This example shows how to train, compile, and deploy a dlhdl.Workflow object that has ResNet-18 as the network object by using the Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC. Use MATLAB® to retrieve the prediction results from the target device.
Required Products
For this example, you need:
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model for ResNet-18 Network
Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices
Image Processing Toolbox™
Load Pretrained Network
To load the pretrained network ResNet-18, enter:
snet = resnet18;
To view the layers of the pretrained network, enter:
analyzeNetwork(snet);
The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.
inputSize = snet.Layers(1).InputSize;
Define Training and Validation Data Sets
This example uses the MathWorks MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).
curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
Replace Final Layers
The fully connected layer and the classification layer of the pretrained network snet are configured for 1000 classes. These two layers, fc1000 and ClassificationLayer_predictions in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. They must be replaced and fine-tuned for the new classification problem. Convert the pretrained network to a layer graph, then replace these two layers with new layers that match the number of classes in the new data set.
lgraph = layerGraph(snet)
lgraph =
  LayerGraph with properties:
         Layers: [71×1 nnet.cnn.layer.Layer]
    Connections: [78×2 table]
     InputNames: {'data'}
    OutputNames: {'ClassificationLayer_predictions'}
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
newLearnableLayer = fullyConnectedLayer(numClasses, ...
    'Name','new_fc', ...
    'WeightLearnRateFactor',10, ...
    'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);
Train Network
The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.
options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');
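The following is a rough sketch, in plain MATLAB, of how the mini-batch size, epoch count, and validation frequency translate into iteration counts. The training-set size numTrain = 55 is an assumption: this example does not state it, and the value only holds if the 75 images are spread evenly over the five classes and 11 images per class land in imdsTrain. The sketch also relies on trainNetwork discarding observations that do not fill a complete mini-batch in each epoch.

```matlab
numTrain            = 55;   % ASSUMPTION: size of imdsTrain, not stated in this example
miniBatchSize       = 10;   % 'MiniBatchSize' from the training options
maxEpochs           = 6;    % 'MaxEpochs' from the training options
validationFrequency = 3;    % 'ValidationFrequency' from the training options

% Observations that do not fill a complete mini-batch are dropped each epoch.
itersPerEpoch = floor(numTrain / miniBatchSize);
totalIters    = itersPerEpoch * maxEpochs;

% Iterations at which the ValidationFrequency setting schedules a validation pass.
validationIters = validationFrequency:validationFrequency:totalIters;

fprintf('%d iterations per epoch, %d in total\n', itersPerEpoch, totalIters);
fprintf('Validation scheduled at iterations: %s\n', mat2str(validationIters));
```

To use the actual value instead of the assumed one, replace numTrain with numel(imdsTrain.Files) after running the data-split step.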
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Computing Requirements (Parallel Computing Toolbox). Otherwise, trainNetwork uses a CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify the execution environment by using the 'ExecutionEnvironment' name-value argument of trainingOptions.
netTransfer = trainNetwork(augimdsTrain,lgraph,options);
Create Target Object
Use the dlhdl.Target class to create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2020.2. To set the Xilinx Vivado toolpath, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create Workflow Object
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network and the bitstream name. Specify netTransfer as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses the single data type.
hW = dlhdl.Workflow('Network', netTransfer, 'Bitstream', 'zcu102_single','Target',hTarget);
Compile the netTransfer DAG Network
To compile the netTransfer DAG network, run the compile method of the dlhdl.Workflow object. You can optionally specify the maximum number of input frames.
dn = hW.compile('InputFrameNumberLimit',15)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single ...
### The network includes the following layers:
     1   'data'                 Image Input             224×224×3 images with 'zscore' normalization  (SW Layer)
     2   'conv1'                Convolution             64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3]  (HW Layer)
     3   'bn_conv1'             Batch Normalization     Batch normalization with 64 channels  (HW Layer)
     4   'conv1_relu'           ReLU                    ReLU  (HW Layer)
     5   'pool1'                Max Pooling             3×3 max pooling with stride [2 2] and padding [1 1 1 1]  (HW Layer)
     6   'res2a_branch2a'       Convolution             64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
     7   'bn2a_branch2a'        Batch Normalization     Batch normalization with 64 channels  (HW Layer)
     8   'res2a_branch2a_relu'  ReLU                    ReLU  (HW Layer)
     9   'res2a_branch2b'       Convolution             64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    10   'bn2a_branch2b'        Batch Normalization     Batch normalization with 64 channels  (HW Layer)
    11   'res2a'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    12   'res2a_relu'           ReLU                    ReLU  (HW Layer)
    13   'res2b_branch2a'       Convolution             64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    14   'bn2b_branch2a'        Batch Normalization     Batch normalization with 64 channels  (HW Layer)
    15   'res2b_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    16   'res2b_branch2b'       Convolution             64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    17   'bn2b_branch2b'        Batch Normalization     Batch normalization with 64 channels  (HW Layer)
    18   'res2b'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    19   'res2b_relu'           ReLU                    ReLU  (HW Layer)
    20   'res3a_branch2a'       Convolution             128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1]  (HW Layer)
    21   'bn3a_branch2a'        Batch Normalization     Batch normalization with 128 channels  (HW Layer)
    22   'res3a_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    23   'res3a_branch2b'       Convolution             128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    24   'bn3a_branch2b'        Batch Normalization     Batch normalization with 128 channels  (HW Layer)
    25   'res3a'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    26   'res3a_relu'           ReLU                    ReLU  (HW Layer)
    27   'res3a_branch1'        Convolution             128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    28   'bn3a_branch1'         Batch Normalization     Batch normalization with 128 channels  (HW Layer)
    29   'res3b_branch2a'       Convolution             128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    30   'bn3b_branch2a'        Batch Normalization     Batch normalization with 128 channels  (HW Layer)
    31   'res3b_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    32   'res3b_branch2b'       Convolution             128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    33   'bn3b_branch2b'        Batch Normalization     Batch normalization with 128 channels  (HW Layer)
    34   'res3b'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    35   'res3b_relu'           ReLU                    ReLU  (HW Layer)
    36   'res4a_branch2a'       Convolution             256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1]  (HW Layer)
    37   'bn4a_branch2a'        Batch Normalization     Batch normalization with 256 channels  (HW Layer)
    38   'res4a_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    39   'res4a_branch2b'       Convolution             256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    40   'bn4a_branch2b'        Batch Normalization     Batch normalization with 256 channels  (HW Layer)
    41   'res4a'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    42   'res4a_relu'           ReLU                    ReLU  (HW Layer)
    43   'res4a_branch1'        Convolution             256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    44   'bn4a_branch1'         Batch Normalization     Batch normalization with 256 channels  (HW Layer)
    45   'res4b_branch2a'       Convolution             256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    46   'bn4b_branch2a'        Batch Normalization     Batch normalization with 256 channels  (HW Layer)
    47   'res4b_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    48   'res4b_branch2b'       Convolution             256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    49   'bn4b_branch2b'        Batch Normalization     Batch normalization with 256 channels  (HW Layer)
    50   'res4b'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    51   'res4b_relu'           ReLU                    ReLU  (HW Layer)
    52   'res5a_branch2a'       Convolution             512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1]  (HW Layer)
    53   'bn5a_branch2a'        Batch Normalization     Batch normalization with 512 channels  (HW Layer)
    54   'res5a_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    55   'res5a_branch2b'       Convolution             512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    56   'bn5a_branch2b'        Batch Normalization     Batch normalization with 512 channels  (HW Layer)
    57   'res5a'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    58   'res5a_relu'           ReLU                    ReLU  (HW Layer)
    59   'res5a_branch1'        Convolution             512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    60   'bn5a_branch1'         Batch Normalization     Batch normalization with 512 channels  (HW Layer)
    61   'res5b_branch2a'       Convolution             512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    62   'bn5b_branch2a'        Batch Normalization     Batch normalization with 512 channels  (HW Layer)
    63   'res5b_branch2a_relu'  ReLU                    ReLU  (HW Layer)
    64   'res5b_branch2b'       Convolution             512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]  (HW Layer)
    65   'bn5b_branch2b'        Batch Normalization     Batch normalization with 512 channels  (HW Layer)
    66   'res5b'                Addition                Element-wise addition of 2 inputs  (HW Layer)
    67   'res5b_relu'           ReLU                    ReLU  (HW Layer)
    68   'pool5'                Global Average Pooling  Global average pooling  (HW Layer)
    69   'new_fc'               Fully Connected         5 fully connected layer  (HW Layer)
    70   'prob'                 Softmax                 softmax  (SW Layer)
    71   'new_classoutput'      Classification Output   crossentropyex with 'MathWorks Cap' and 4 other classes  (SW Layer)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
5 Memory Regions created.
Skipping: data
Compiling leg: conv1>>pool1 ...
Compiling leg: conv1>>pool1 ... complete.
Compiling leg: res2a_branch2a>>res2a_branch2b ...
Compiling leg: res2a_branch2a>>res2a_branch2b ... complete.
Compiling leg: res2b_branch2a>>res2b_branch2b ...
Compiling leg: res2b_branch2a>>res2b_branch2b ... complete.
Compiling leg: res3a_branch1 ...
Compiling leg: res3a_branch1 ... complete.
Compiling leg: res3a_branch2a>>res3a_branch2b ...
Compiling leg: res3a_branch2a>>res3a_branch2b ... complete.
Compiling leg: res3b_branch2a>>res3b_branch2b ...
Compiling leg: res3b_branch2a>>res3b_branch2b ... complete.
Compiling leg: res4a_branch1 ...
Compiling leg: res4a_branch1 ... complete.
Compiling leg: res4a_branch2a>>res4a_branch2b ...
Compiling leg: res4a_branch2a>>res4a_branch2b ... complete.
Compiling leg: res4b_branch2a>>res4b_branch2b ...
Compiling leg: res4b_branch2a>>res4b_branch2b ... complete.
Compiling leg: res5a_branch1 ...
Compiling leg: res5a_branch1 ... complete.
Compiling leg: res5a_branch2a>>res5a_branch2b ...
Compiling leg: res5a_branch2a>>res5a_branch2b ... complete.
Compiling leg: res5b_branch2a>>res5b_branch2b ...
Compiling leg: res5b_branch2a>>res5b_branch2b ... complete.
Compiling leg: pool5 ...
Compiling leg: pool5 ... complete.
Compiling leg: new_fc ...
Compiling leg: new_fc ... complete.
Skipping: prob
Skipping: new_classoutput
Creating Schedule...
...........................
Creating Schedule...complete.
Creating Status Table...
..........................
Creating Status Table...complete.
Emitting Schedule...
..........................
Emitting Schedule...complete.
Emitting Status Table...
............................
Emitting Status Table...complete.
### Allocating external memory buffers:
          offset_name          offset_address     allocated_space
    _______________________    ______________    _________________
    "InputDataOffset"           "0x00000000"     "12.0 MB"
    "OutputResultOffset"        "0x00c00000"     "4.0 MB"
    "SchedulerDataOffset"       "0x01000000"     "4.0 MB"
    "SystemBufferOffset"        "0x01400000"     "28.0 MB"
    "InstructionDataOffset"     "0x03000000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x03400000"     "52.0 MB"
    "FCWeightDataOffset"        "0x06800000"     "4.0 MB"
    "EndOffset"                 "0x06c00000"     "Total: 108.0 MB"
### Network compilation complete.
dn = struct with fields:
weights: [1×1 struct]
instructions: [1×1 struct]
registers: [1×1 struct]
syncInstructions: [1×1 struct]
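The external memory table in the compile log can be cross-checked with a few lines of plain MATLAB. The sketch below is only an illustration: it copies the region names and sizes from the log and assumes that "MB" means 2^20 bytes and that each region starts where the previous one ends. Under those assumptions, the cumulative sums reproduce the logged offset addresses.

```matlab
% Region names and sizes (in MB) copied from the compile log.
names   = {'InputDataOffset','OutputResultOffset','SchedulerDataOffset', ...
           'SystemBufferOffset','InstructionDataOffset', ...
           'ConvWeightDataOffset','FCWeightDataOffset'};
sizesMB = [12 4 4 28 4 52 4];

% ASSUMPTION: "MB" in the log means 2^20 bytes.
bytesPerMB = 2^20;

% ASSUMPTION: regions are packed back to back, starting at address 0.
endOffsets   = cumsum(sizesMB) * bytesPerMB;
startOffsets = [0, endOffsets(1:end-1)];

for k = 1:numel(names)
    fprintf('%-24s 0x%s  %5.1f MB\n', names{k}, ...
        lower(dec2hex(startOffsets(k),8)), sizesMB(k));
end
fprintf('%-24s 0x%s  Total: %.1f MB\n', 'EndOffset', ...
    lower(dec2hex(endOffsets(end),8)), sum(sizesMB));
```

A check like this helps when you change InputFrameNumberLimit or the network and want to confirm that the total stays within the memory available on your board.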
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device, displays progress messages, and the time it takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
Load Image for Prediction
Load the example image.
imgFile = fullfile(pwd,'MerchData','MathWorks Cube','Mathworks cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)
Run Prediction for One Image
Execute the predict method on the dlhdl.Workflow object and then show the label in the MATLAB command window.
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
### Finished writing input activations.
### Running single input activations.
              Deep Learning Processor Profiler Performance Results
                      LastFrameLatency(cycles)  LastFrameLatency(seconds)  FramesNum  Total Latency  Frames/s
                      ------------------------  -------------------------  ---------  -------------  --------
Network                               23470681                    0.10668          1       23470681       9.4
    conv1                              2224133                    0.01011
    pool1                               573009                    0.00260
    res2a_branch2a                      972706                    0.00442
    res2a_branch2b                      972715                    0.00442
    res2a                               210584                    0.00096
    res2b_branch2a                      972670                    0.00442
    res2b_branch2b                      973171                    0.00442
    res2b                               210235                    0.00096
    res3a_branch1                       538433                    0.00245
    res3a_branch2a                      746681                    0.00339
    res3a_branch2b                      904757                    0.00411
    res3a                               104923                    0.00048
    res3b_branch2a                      904442                    0.00411
    res3b_branch2b                      904234                    0.00411
    res3b                               105019                    0.00048
    res4a_branch1                       485689                    0.00221
    res4a_branch2a                      486053                    0.00221
    res4a_branch2b                      880357                    0.00400
    res4a                                52814                    0.00024
    res4b_branch2a                      880122                    0.00400
    res4b_branch2b                      880268                    0.00400
    res4b                                52492                    0.00024
    res5a_branch1                      1056215                    0.00480
    res5a_branch2a                     1056269                    0.00480
    res5a_branch2b                     2057399                    0.00935
    res5a                                26272                    0.00012
    res5b_branch2a                     2057349                    0.00935
    res5b_branch2b                     2057639                    0.00935
    res5b                                26409                    0.00012
    pool5                                71402                    0.00032
    new_fc                               24650                    0.00011
 * The clock frequency of the DL processor is: 220MHz
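The columns of the profiler report are related by the processor clock frequency. As a small sketch in plain MATLAB, using only figures copied from the report, the latency in seconds is the cycle count divided by the clock frequency, and with a single frame the throughput is the reciprocal of that latency. The handful of layers picked for the per-layer share is an arbitrary selection for illustration.

```matlab
% Figures copied from the profiler report.
clockHz       = 220e6;      % DL processor clock frequency (220 MHz)
networkCycles = 23470681;   % LastFrameLatency of the whole network, in cycles

latencySeconds  = networkCycles / clockHz;   % about 0.10668 s
framesPerSecond = 1 / latencySeconds;        % about 9.4 frames/s for one frame

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', ...
    latencySeconds, framesPerSecond);

% A few layer latencies (cycles) from the same report, to see where time goes.
layerNames  = {'conv1','res5a_branch2b','res5b_branch2a','res5b_branch2b','new_fc'};
layerCycles = [2224133 2057399 2057349 2057639 24650];
share = 100 * layerCycles / networkCycles;
for k = 1:numel(layerNames)
    fprintf('%-16s %5.1f%% of the frame latency\n', layerNames{k}, share(k));
end
```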
[val, idx] = max(prediction);
netTransfer.Layers(end).ClassNames{idx}
ans = 'MathWorks Cube'
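If you want more than the single best label, the same index lookup extends to a ranked list. The sketch below runs in plain MATLAB without the board. The score values are made up for illustration, and only 'MathWorks Cap' and 'MathWorks Cube' are class names that appear in this example's output; the other three names are placeholders. With real data, use the output of hW.predict and netTransfer.Layers(end).ClassNames instead.

```matlab
% HYPOTHETICAL scores for the five classes; real values come from hW.predict.
scores = single([1.2 8.7 0.4 -0.3 2.1]);

% ASSUMPTION: only the first two names are confirmed by this example's output;
% the remaining three are placeholders.
classNames = {'MathWorks Cap','MathWorks Cube','MathWorks Playing Cards', ...
              'MathWorks Screwdriver','MathWorks Torch'};

% Rank the classes from highest to lowest score.
[sortedScores, order] = sort(scores,'descend');

topK = 3;
for k = 1:topK
    fprintf('%-26s %.2f\n', classNames{order(k)}, sortedScores(k));
end

% The first entry matches what max(prediction) returns.
bestLabel = classNames{order(1)};
```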