Classify Videos Using Deep Learning with Custom Training Loop

This example uses:

This example shows how to create a network for video classification by combining a pretrained image classification model and a sequence classification network.

You can perform video classification without using a custom training loop by using the trainNetwork function. For an example, see Classify Videos Using Deep Learning. However, If trainingOptions does not provide the options you need (for example, a custom learning rate schedule), then you can define your own custom training loop as shown in this example.

To create a deep learning network for video classification:

Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as GoogLeNet, to extract features from each frame.
Train a sequence classification network on the sequences to predict the video labels.
Assemble a network that classifies videos directly by combining layers from both networks.

The following diagram illustrates the network architecture:

To input image sequences to the network, use a sequence input layer.
To extract features from the image sequences, use convolutional layers from the pretrained GoogLeNet network.
To classify the resulting vector sequences, include the sequence classification layers.

Video Classification Drawing (5).png

When training this type of network with the trainNetwork function (not done in this example), you must use sequence folding and unfolding layers to process the video frames independently. When you train this type of network with a dlnetwork object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the dlarray dimension labels.

Load Pretrained Convolutional Network

To convert frames of videos to feature vectors, use the activations of a pretrained network.

Load a pretrained GoogLeNet model using the googlenet function. This function requires the Deep Learning Toolbox™ Model for GoogLeNet Network support package. If this support package is not installed, then the function provides a download link.

netCNN = googlenet;

Load Data

Download the HMBD51 data set from HMDB: a large human motion database and extract the RAR file into a folder named "hmdb51_org". The data set contains about 2 GB of video data for 7000 clips over 51 classes, such as "drink", "run", and "shake_hands".

After extracting the RAR file, make sure that the folder hmdb51_org contains subfolders named after the body motions. If it contains RAR files, you need to extract them as well. Use the supporting function hmdb51Files to get the file names and the labels of the videos. To speed up training at the cost of accuracy, specify a fraction in the range [0 1] to read only a random subset of files from the database. If the fraction input argument is not specified, the function hmdb51Files reads the full dataset without changing the order of the files.

dataFolder = "hmdb51_org";
fraction = 1;
[files,labels] = hmdb51Files(dataFolder,fraction);

Read the first video using the readVideo helper function, defined at the end of this example, and view the size of the video. The video is an H-by-W-by-C-by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.

idx = 1;
filename = files(idx);
video = readVideo(filename);
size(video)

ans = 1×4

   240   352     3   115

View the corresponding label.

labels(idx)

ans = categorical
     shoot_ball

To view the video, loop over the individual frames and use the image function. Alternatively, you can use the implay function (requires Image Processing Toolbox).

numFrames = size(video,4);
figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    image(frame);
    xticklabels([]);
    yticklabels([]);
    drawnow
end

Convert Frames to Feature Vectors

Use the convolutional network as a feature extractor: input video frames to the network and extract the activations. Convert the videos to sequences of feature vectors, where the feature vectors are the output of the activations function on the last pooling layer of the GoogLeNet network ("pool5-7x7_s1").

This diagram illustrates the data flow through the network.

Read the video data using the readVideo function, defined at the end of this example, and resize it to match the input size of the GoogLeNet network. Note that this step can take a long time to run. After converting the videos to sequences, save the sequences and corresponding labels in a MAT file in the tempdir folder. If the MAT file already exists, then load the sequences and labels from the MAT file directly. In case a MAT file already exists but you want to overwrite it, set the variable overwriteSequences to true.

inputSize = netCNN.Layers(1).InputSize(1:2);
layerName = "pool5-7x7_s1";

tempFile = fullfile(tempdir,"hmdb51_org.mat");

overwriteSequences = false;

if exist(tempFile,'file') && ~overwriteSequences
    load(tempFile)
else
    numFiles = numel(files);
    sequences = cell(numFiles,1);
    
    for i = 1:numFiles
        fprintf("Reading file %d of %d...\n", i, numFiles)
        
        video = readVideo(files(i));
        video = imresize(video,inputSize);
        sequences{i,1} = activations(netCNN,video,layerName,'OutputAs','columns');
        
    end
    % Save the sequences and the labels associated with them.
    save(tempFile,"sequences","labels","-v7.3");
end

View the sizes of the first few sequences. Each sequence is a D-by-T array, where D is the number of features (the output size of the pooling layer) and T is the number of frames of the video.

sequences(1:10)

ans=10×1 cell array
    {1024×115 single}
    {1024×227 single}
    {1024×180 single}
    {1024×40  single}
    {1024×60  single}
    {1024×156 single}
    {1024×83  single}
    {1024×42  single}
    {1024×82  single}
    {1024×110 single}

Prepare Training Data

Prepare the data for training by partitioning the data into training and validation partitions and removing any long sequences.

Create Training and Validation Partitions

Partition the data. Assign 90% of the data to the training partition and 10% to the validation partition.

numObservations = numel(sequences);
idx = randperm(numObservations);
N = floor(0.9 * numObservations);

idxTrain = idx(1:N);
sequencesTrain = sequences(idxTrain);
labelsTrain = labels(idxTrain);

idxValidation = idx(N+1:end);
sequencesValidation = sequences(idxValidation);
labelsValidation = labels(idxValidation);

Remove Long Sequences

Sequences that are much longer than typical sequences in the networks can introduce lots of padding into the training process. Having too much padding can negatively impact the classification accuracy.

Get the sequence lengths of the training data and visualize them in a histogram of the training data.

numObservationsTrain = numel(sequencesTrain);
sequenceLengths = zeros(1,numObservationsTrain);

for i = 1:numObservationsTrain
    sequence = sequencesTrain{i};
    sequenceLengths(i) = size(sequence,2);
end

figure
histogram(sequenceLengths)
title("Sequence Lengths")
xlabel("Sequence Length")
ylabel("Frequency")

Only a few sequences have more than 400 time steps. To improve the classification accuracy, remove the training sequences that have more than 400 time steps along with their corresponding labels.

maxLength = 400;
idx = sequenceLengths > maxLength;
sequencesTrain(idx) = [];
labelsTrain(idx) = [];

Create Datastore for Data

Create an arrayDatastore object for the sequences and the labels, and then combine them into a single datastore.

dsXTrain = arrayDatastore(sequencesTrain,'OutputType','same');
dsYTrain = arrayDatastore(labelsTrain,'OutputType','cell');

dsTrain = combine(dsXTrain,dsYTrain);

Determine the classes in the training data.

classes = categories(labelsTrain);

Create Sequence Classification Network

Next, create a sequence classification network that can classify the sequences of feature vectors representing the videos.

Define the sequence classification network architecture. Specify the following network layers:

A sequence input layer with an input size corresponding to the feature dimension of the feature vectors.
A BiLSTM layer with 2000 hidden units with a dropout layer afterwards. To output only one label for each sequence, set the 'OutputMode' option of the BiLSTM layer to 'last'.
A dropout layer with a probability of 0.5.
A fully connected layer with an output size corresponding to the number of classes and a softmax layer.

numFeatures = size(sequencesTrain{1},1);
numClasses = numel(categories(labelsTrain));

layers = [
    sequenceInputLayer(numFeatures,'Name','sequence')
    bilstmLayer(2000,'OutputMode','last','Name','bilstm')
    dropoutLayer(0.5,'Name','drop')
    fullyConnectedLayer(numClasses,'Name','fc')
    softmaxLayer('Name','softmax')
    ];

Convert the layers to a layerGraph object.

lgraph = layerGraph(layers);

Create a dlnetwork object from the layer graph.

dlnet = dlnetwork(lgraph);

Specify Training Options

Train for 15 epochs and specify a mini-batch size of 16.

numEpochs = 15;
miniBatchSize = 16;

Specify the options for Adam optimization. Specify an initial learning rate of 1e-4 with a decay of 0.001, a gradient decay of 0.9, and a squared gradient decay of 0.999.

initialLearnRate = 1e-4;
decay = 0.001;
gradDecay = 0.9;
sqGradDecay = 0.999;

Visualize the training progress in a plot.

plots = "training-progress";

Train Sequence Classification Network

Create a minibatchqueue object that processes and manages mini-batches of sequences during training. For each mini-batch:

Use the custom mini-batch preprocessing function preprocessLabeledSequences (defined at the end of this example) to convert the labels to dummy variables.
Format the vector sequence data with the dimension labels 'CTB' (channel, time, batch). By default, the minibatchqueue object converts the data to dlarray objects with underlying type single. Do not add a format to the class labels.
Train on a GPU if one is available. By default, the minibatchqueue object converts each output to a gpuArray object if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox).

mbq = minibatchqueue(dsTrain,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn', @preprocessLabeledSequences,...
    'MiniBatchFormat',{'CTB',''});

Initialize the training progress plot.

if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end

Initialize the average gradient and average squared gradient parameters for the Adam solver.

averageGrad = [];
averageSqGrad = [];

Train the model using a custom training loop. For each epoch, shuffle the data and loop over mini-batches of data. For each mini-batch:

Evaluate the model gradients, state, and loss using dlfeval and the modelGradients function and update the network state.
Determine the learning rate for the time-based decay learning rate schedule: for each iteration, the solver uses the learning rate given by $ρ_{t} = \frac{ρ_{0}}{1 + k t}$ , where t is the iteration number, $ρ_{0}$ is the initial learning rate, and k is the decay.
Update the network parameters using the adamupdate function.
Display the training progress.

Note that training can take a long time to run.

iteration = 0;
start = tic;

% Loop over epochs.
for epoch = 1:numEpochs
    % Shuffle data.
    shuffle(mbq);
    
    % Loop over mini-batches.
    while hasdata(mbq)
        iteration = iteration + 1;
        
        % Read mini-batch of data.
        [dlX, dlY] = next(mbq);
        
        % Evaluate the model gradients, state, and loss using dlfeval and the
        % modelGradients function.
        [gradients,state,loss] = dlfeval(@modelGradients,dlnet,dlX,dlY);
        
        % Determine learning rate for time-based decay learning rate schedule.
        learnRate = initialLearnRate/(1 + decay*iteration);
        
        % Update the network parameters using the Adam optimizer.
        [dlnet,averageGrad,averageSqGrad] = adamupdate(dlnet,gradients,averageGrad,averageSqGrad, ...
            iteration,learnRate,gradDecay,sqGradDecay);
        
        % Display the training progress.
        if plots == "training-progress"
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            addpoints(lineLossTrain,iteration,double(gather(extractdata(loss))))
            title("Epoch: " + epoch + " of " + numEpochs + ", Elapsed: " + string(D))
            drawnow
        end
    end
end

Test Model

Test the classification accuracy of the model by comparing the predictions on the validation set with the true labels.

After training is complete, making predictions on new data does not require the labels.

To create a minibatchqueue object for testing:

Create an array datastore containing only the predictors of the test data.
Specify the same mini-batch size used for training.
Preprocess the predictors using the preprocessUnlabeledSequences helper function, listed at the end of the example.
For the single output of the datastore, specify the mini-batch format 'CTB' (channel, time, batch).

dsXValidation = arrayDatastore(sequencesValidation,'OutputType','same');
mbqTest = minibatchqueue(dsXValidation, ...
    'MiniBatchSize',miniBatchSize, ...
    'MiniBatchFcn',@preprocessUnlabeledSequences, ...
    'MiniBatchFormat','CTB');

Loop over the mini-batches and classify the images using the modelPredictions helper function, listed at the end of the example.

predictions = modelPredictions(dlnet,mbqTest,classes);

Evaluate the classification accuracy by comparing the predicted labels to the true validation labels.

accuracy = mean(predictions == labelsValidation)

accuracy = 0.6721

Assemble Video Classification Network

To create a network that classifies videos directly, assemble a network using layers from both of the created networks. Use the layers from the convolutional network to transform the videos into vector sequences and the layers from the sequence classification network to classify the vector sequences.

The following diagram illustrates the network architecture:

To input image sequences to the network, use a sequence input layer.
To use convolutional layers to extract features, that is, to apply the convolutional operations to each frame of the videos independently, use the GoogLeNet convolutional layers.
To classify the resulting vector sequences, include the sequence classification layers.

Video Classification Drawing (5).png

When training this type of network with the trainNetwork function (not done in this example), you have to use sequence folding and unfolding layers to process the video frames independently. When training this type of network with a dlnetwork object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the dlarray dimension labels.

Add Convolutional Layers

First, create a layer graph of the GoogLeNet network.

cnnLayers = layerGraph(netCNN);

Remove the input layer ("data") and the layers after the pooling layer used for the activations ("pool5-drop_7x7_s1", "loss3-classifier", "prob", and "output").

layerNames = ["data" "pool5-drop_7x7_s1" "loss3-classifier" "prob" "output"];
cnnLayers = removeLayers(cnnLayers,layerNames);

Add Sequence Input Layer

Create a sequence input layer that accepts image sequences containing images of the same input size as the GoogLeNet network. To normalize the images using the same average image as the GoogLeNet network, set the 'Normalization' option of the sequence input layer to 'zerocenter' and the 'Mean' option to the average image of the input layer of GoogLeNet.

inputSize = netCNN.Layers(1).InputSize(1:2);
averageImage = netCNN.Layers(1).Mean;

inputLayer = sequenceInputLayer([inputSize 3], ...
    'Normalization','zerocenter', ...
    'Mean',averageImage, ...
    'Name','input');

Add the sequence input layer to the layer graph. Connect the output of the input layer to the input of the first convolutional layer ("conv1-7x7_s2").

lgraph = addLayers(cnnLayers,inputLayer);
lgraph = connectLayers(lgraph,"input","conv1-7x7_s2");

Add Sequence Classification Layers

Add the previously trained sequence classification network layers to the layer graph and connect them.

Take the layers from the sequence classification network and remove the sequence input layer.

lstmLayers = dlnet.Layers;
lstmLayers(1) = [];

Add the sequence classification layers to the layer graph. Connect the last convolutional layer pool5-7x7_s1 to the bilstm layer.

lgraph = addLayers(lgraph,lstmLayers);
lgraph = connectLayers(lgraph,"pool5-7x7_s1","bilstm");

Convert to dlnetwork

To be able to do predictions, convert the layer graph to a dlnetwork object.

dlnetAssembled = dlnetwork(lgraph)

dlnetAssembled = 
  dlnetwork with properties:

         Layers: [144×1 nnet.cnn.layer.Layer]
    Connections: [170×2 table]
     Learnables: [119×3 table]
          State: [2×3 table]
     InputNames: {'input'}
    OutputNames: {'softmax'}
    Initialized: 1

Classify Using New Data

Unzip the file pushup_mathworker.zip.

unzip("pushup_mathworker.zip")

The extracted pushup_mathworker folder contains a video of a push-up. Create a file datastore for this folder. Use a custom read function to read the videos.

ds = fileDatastore("pushup_mathworker", ...
    'ReadFcn',@readVideo);

Read the first video from the datastore. To be able to read the video again, reset the datastore.

video = read(ds);
reset(ds);

To view the video, loop over the individual frames and use the image function. Alternatively, you can use the implay function (requires Image Processing Toolbox).

numFrames = size(video,4);
figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    image(frame);
    xticklabels([]);
    yticklabels([]);
    drawnow
end

To preprocess the videos to have the input size expected by the network, use the transform function and apply the imresize function to each image in the datastore.

dsXTest = transform(ds,@(x) imresize(x,inputSize));

To manage and process the unlabeled videos, create a minibatchqueue:

Specify a mini-batch size of 1.
Preprocess the videos using the preprocessUnlabeledVideos helper function, listed at the end of the example.
For the single output of the datastore, specify the mini-batch format 'SSCTB' (spatial, spatial, channel, time, batch).

mbqTest = minibatchqueue(dsXTest,...
    'MiniBatchSize',1,...
    'MiniBatchFcn', @preprocessUnlabeledVideos,...
    'MiniBatchFormat',{'SSCTB'});

Classify the videos using the modelPredictions helper function, defined at the end of this example. The function expects three inputs: a dlnetwork object, a minibatchqueue object, and a cell array containing the network classes.

[predictions] = modelPredictions(dlnetAssembled,mbqTest,classes)

predictions = categorical
     pushup

Helper Functions

Video Reading Function

The readVideo function reads the video in filename and returns an H-by-W-by-C-by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.

function video = readVideo(filename)

vr = VideoReader(filename);
H = vr.Height;
W = vr.Width;
C = 3;

% Preallocate video array
numFrames = floor(vr.Duration * vr.FrameRate);
video = zeros(H,W,C,numFrames,'uint8');

% Read frames
i = 0;
while hasFrame(vr)
    i = i + 1;
    video(:,:,:,i) = readFrame(vr);
end

% Remove unallocated frames
if size(video,4) > i
    video(:,:,:,i+1:end) = [];
end

end

Model Gradients Function

The modelGradients function takes as input a dlnetwork object dlnet and a mini-batch of input data dlX with corresponding labels Y, and returns the gradients of the loss with respect to the learnable parameters in dlnet, the network state, and the loss. To compute the gradients automatically, use the dlgradient function.

function [gradients,state,loss] = modelGradients(dlnet,dlX,Y)

    [dlYPred,state] = forward(dlnet,dlX);
    
    loss = crossentropy(dlYPred,Y);
    gradients = dlgradient(loss,dlnet.Learnables);

end

Model Predictions Function

The modelPredictions function takes as input a dlnetwork object dlnet, a minibatchqueue object of input data mbq, and the network classes, and computes the model predictions by iterating over all data in the mini-batch queue. The function uses the onehotdecode function to find the predicted class with the highest score. The function returns the predicted labels.

function [predictions] = modelPredictions(dlnet,mbq,classes)
    predictions = [];
    
    while hasdata(mbq)
        
        % Extract a mini-batch from the minibatchqueue and pass it to the
        % network for predictions
        [dlXTest] = next(mbq);
        dlYPred = predict(dlnet,dlXTest);
        
        % To obtain categorical labels, one-hot decode the predictions 
        YPred = onehotdecode(dlYPred,classes,1)';
        predictions = [predictions; YPred];
    end
end

Labeled Sequence Data Preprocessing Function

The preprocessLabeledSequences function preprocesses the sequence data using the following steps:

Use the padsequences function to pad the sequences in the time dimension and concatenate them in the batch dimension.
Extract the label data from the incoming cell array and concatenate into a categorical array.
One-hot encode the categorical labels into numeric arrays.
Transpose the array of one-hot encoded labels to match the shape of the network output.

function [X, Y] = preprocessLabeledSequences(XCell,YCell)
    % Pad the sequences with zeros in the second dimension (time) and concatenate along the third
    % dimension (batch)
    X = padsequences(XCell,2);
    
    % Extract label data from cell and concatenate
    Y = cat(1,YCell{1:end});
    
    % One-hot encode labels
    Y = onehotencode(Y,2);
    
    % Transpose the encoded labels to match the network output
    Y = Y';
end

Unlabeled Sequence Data Preprocessing Function

The preprocessUnlabeledSequences function preprocesses the sequence data using the padsequences function. This function pads the sequences with zeros in the time dimension and concatenates the result in the batch dimension.

function [X] = preprocessUnlabeledSequences(XCell)
    % Pad the sequences with zeros in the second dimension (time) and concatenate along the third
    % dimension (batch)
    X = padsequences(XCell,2);
end

Unlabeled Video Data Preprocessing Function

The preprocessUnlabeledVideos function preprocesses unlabeled video data using the padsequences function. This function pads the videos with zero in the time dimension and concatenates the result in the batch dimension.

function [X] = preprocessUnlabeledVideos(XCell)
    % Pad the sequences with zeros in the fourth dimension (time) and
    % concatenate along the fifth dimension (batch)
    X = padsequences(XCell,4);
end