Why is my code running slower on the GPU?

Question

AlexRD on 30 Mar 2021

Direct link to this question: https://de.mathworks.com/matlabcentral/answers/788519-why-is-my-code-running-slower-on-the-gpu

Commented: AlexRD on 5 Apr 2021

Hi,

I've been writing a deep learning neural network model from scratch, so I can build an intuitive understanding of how these networks work. The code I've written works fine, and I've spent a great deal of time optimizing it, but I seem to have hit a bottleneck in the GPU code. I've implemented a dynamic network using struct arrays, with each element of the array representing one layer. This model uses sigmoid activation functions and a cross-entropy cost function.

First things first, there are three files: The main script, and the backprop and feed_forward functions.

The main script

clc;
clear all; close all;
%% Load Data
load ('numbers.mat');
for i=1:length(numbers)
    temp = numbers(i).label;
    numbers(i).label = zeros(1,10);
    numbers(i).label(temp+1) = 1;
end
validation = numbers(1:10000);
training = numbers(10001:end);
%% Hyperparameters
batch_size = 10;
numEpochs = 5;
rateFunc = interp1(0.5 ./ [1:20], linspace(1, 20, numEpochs));
numInput = size(training(1).data, 1) * size(training(1).data, 2);
%% Initialization
net = create_net([numInput 100 10]);
numLayers = length(net);
average = [];
%% Main
for epoch=1:numEpochs
    tic;
    
    %% Backprop
    randIndex = randperm(size(training,2));
    for i=1:batch_size:length(training)-batch_size
        [net, gradient] = backprop(net, training(randIndex(i:i+batch_size-1)), rateFunc(epoch));
    end
    
    %% Validate Net
    fprintf ('Epoch(%d): %fs', epoch, toc);
    [average(end+1), error] = validate_net(net, validation);
    if mod(epoch, 5) == 0
        train_error = validate_net(net, training);
        fprintf ('\nError(Training): %f\n', train_error);
        if train_error >= 0.99, break; end
    end
    fprintf ('\nError: %f', average(end));
    fprintf ('\n---------------\n');
    
end
%% Functions
function [average, error] = validate_net(net, inputData)
error = [];
for i=1:size(inputData,2)
    layer = feed_forward(net, inputData(i).data);
    [~,ix] = max(layer(end).a);
    [~,iy] = max(inputData(i).label);
    error = [error; [ix-1, iy-1]];
    average = mean(error(:,1) == error(:,2));
end
end
function net = create_net(structure)
numLayers = length(structure) - 1;
net = struct('b', [], 'w', cell(1, numLayers));
for i=1:numLayers
    net(i).w = (randn(structure(i), structure(i+1))/sqrt(structure(i)));
    net(i).b = (randn(1, structure(i+1)));
end
end

Backprop

function [net, gradient] = backprop(net, inputData, rate)
numLayers = length(net);
delta = struct('b', [], 'w', cell(1, length(net)));
gradient = struct('b', 0, 'w', num2cell(zeros(1, length(net))));
for i=1:length(inputData)
    layer = feed_forward(net, inputData(i).data);
    
    delta(numLayers).b = layer(numLayers).a - inputData(i).label;
    delta(numLayers).w = layer(numLayers-1).a' * delta(numLayers).b;
    
    for L=numLayers-1:-1:2
        delta(L).b = (delta(L+1).b * net(L+1).w') .* 1./(1 + exp(-layer(L).z)) .* (1 - 1./(1 + exp(-layer(L).z)));
        delta(L).w = layer(L-1).a' * delta(L).b;
    end
    
    delta(1).b = (delta(2).b * net(2).w') .* 1./(1 + exp(-layer(1).z)) .* (1 - 1./(1 + exp(-layer(1).z)));
    delta(1).w = inputData(i).data' * delta(1).b;
    
    for L=1:numLayers
        gradient(L).b = gradient(L).b + delta(L).b;
        gradient(L).w = gradient(L).w + delta(L).w;
    end
end
for L=1:numLayers
    net(L).b = net(L).b - rate/length(inputData)*gradient(L).b;
    net(L).w = net(L).w - rate/length(inputData)*gradient(L).w;
end
end

Feed_forward

function layer = feed_forward(net, inputData)
layer = struct('z', [], 'a', cell(1, length(net)));
layer(1).z = inputData * net(1).w + net(1).b;
layer(1).a = 1./ (1 + exp(-layer(1).z));
for i=2:length(net)
    layer(i).z = layer(i-1).a * net(i).w + net(i).b;
    layer(i).a = 1./ (1 + exp(-layer(i).z));
end
end

The dataset I'm using is the classic MNIST digit recognition problem, and I've been able to get close to 98% accuracy on it. An epoch takes roughly 5 seconds to run, but on the GPU it takes about six times as long. I use the GPU by changing the create_net function, like so:

function net = create_net(structure)
numLayers = length(structure) - 1;
net = struct('b', [], 'w', cell(1, numLayers));
for i=1:numLayers
    net(i).w = gpuArray(randn(structure(i), structure(i+1))/sqrt(structure(i)));
    net(i).b = gpuArray(randn(1, structure(i+1)));
end
end

Am I doing something wrong here? I would appreciate any feedback on optimizing the code, and on how to solve this GPU issue.

Thanks for reading

3 Comments

Joss Knight on 31 Mar 2021

Well done on implementing your own neural net in MATLAB! I can't see anything obviously wrong. But of course, the GPU is only really effective when it's fully utilized. For a network processing data like MNIST, which typically has inputs of size 28x28, a batch size of 10 means the GPU is only processing about 10000 numbers at once - barely scratching the surface really. What happens when you increase the batch size to something like 256? Or 1024...?
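As a rough way to check the utilization point above, the following sketch times one sigmoid layer's forward pass for several batch sizes on the CPU and, if one is available, on the GPU. It is not from the thread: the layer sizes (784 inputs, 100 hidden units) mirror the question's network, while the variable names, the list of batch sizes, and the use of single precision are assumptions made for illustration. `timeit`, `gputimeit`, and `canUseGPU` are standard MATLAB / Parallel Computing Toolbox functions.

```matlab
% Sketch: how does the time for one layer scale with batch size?
numInput  = 784;                       % 28x28 MNIST image, flattened
numHidden = 100;                       % hidden layer size from the question
W = randn(numInput, numHidden, 'single') / sqrt(numInput);
b = randn(1, numHidden, 'single');
sigmoid = @(z) 1 ./ (1 + exp(-z));

batchSizes = [10 100 1000 10000];      % assumed values, chosen for illustration
for n = batchSizes
    X = rand(n, numInput, 'single');   % one row per sample, as in the question
    tCpu = timeit(@() sigmoid(X * W + b));
    fprintf('batch %5d: CPU %.3e s', n, tCpu);
    if canUseGPU
        % Copy the data to the GPU once, outside the timed region.
        Xg = gpuArray(X); Wg = gpuArray(W); bg = gpuArray(b);
        tGpu = gputimeit(@() sigmoid(Xg * Wg + bg));
        fprintf(', GPU %.3e s', tGpu);
    end
    fprintf('\n');
end
```

If the GPU time stays roughly flat while the batch grows, the GPU is probably not yet fully utilized at those sizes.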

AlexRD on 31 Mar 2021

Increasing the batch size has little effect on the time it takes to finish an epoch in my algorithm. I think that's because the number of calculations per epoch is fixed, but the time it takes to train the network increases significantly.

Changing it from 10 to 100 actually gives me a better time per epoch, since I imagine there are fewer function calls to backprop (from ~5s to ~4.5s on the CPU, and the same for the GPU), but the time it takes for the network to fully finish training increases in proportion to the batch size.
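The "fewer function calls" reasoning can be made concrete by counting the backprop calls per epoch with the loop bounds from the main script. This is a minimal sketch that assumes 60000 training samples (the standard MNIST set of 70000 minus the 10000 held out for validation); the thread does not state the actual size of numbers.mat.

```matlab
% Assumption: 60000 training samples; the real size of numbers.mat is not given.
numTraining = 60000;
for batch_size = [10 100]
    % Same bounds as the main script: for i=1:batch_size:length(training)-batch_size
    numCalls = numel(1:batch_size:numTraining - batch_size);
    fprintf('batch_size = %3d: %d backprop calls per epoch\n', batch_size, numCalls);
end
```

Under that assumption, a tenfold larger batch means roughly a tenth as many calls to backprop, while the number of samples processed per epoch stays about the same.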


Accepted Answer

Joss Knight on 31 Mar 2021

Direct link to this answer: https://de.mathworks.com/matlabcentral/answers/788519-why-is-my-code-running-slower-on-the-gpu#answer_663974

Increasing the batch size alone cannot improve convergence in a simple MLP; you need to match it with an increase to the learning rate.

But more to the point of your question, does increasing the batch size improve the GPU performance relative to the CPU?

10 Comments

Joss Knight on 2 Apr 2021

Yes, you're right in one sense - it doesn't look as though you're normalizing the gradients by the batch size, so their magnitude should be scaling in proportion to the batch size, thus increasing the effective learning rate. But your two comments seem contradictory. Does it converge in fewer iterations or more? It should take proportionally fewer iterations to converge with a larger batch size, because you're taking bigger steps. The reason you can safely increase the step size is that a larger batch gives a better estimate of the true gradient direction. So a rule of thumb is that it's going to take a fixed number of observations (images) to get to a certain accuracy, but with a large batch size your throughput is higher and so you get to the answer quicker. This has its limits of course. At a certain point the batch size is too large.

The reason why the GPU speed is unchanged with larger batch sizes is to do with utilization. Until you're fully utilizing the GPU, it just has a pretty much fixed execution time. But it's processing more data in the same time. You're seeing similar behaviour from the CPU I see, which might indicate an issue with your code, or could be the equivalent happening with your multicore CPU.
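To illustrate the earlier point about normalizing gradients by the batch size, here is a small sketch that is not taken from the thread. It uses a toy least-squares model (an assumption, chosen only because its gradient is a one-liner) and compares the gradient summed over a batch with the gradient averaged over it. All names are hypothetical.

```matlab
% Toy model: loss = 0.5 * sum((X*w - y).^2); not the MLP from the question.
rng(0);
w = randn(5, 1);
for n = [10 100]
    X = randn(n, 5);
    y = randn(n, 1);
    residual = X * w - y;        % n-by-1 prediction errors
    gradSum  = X' * residual;    % gradient summed over the batch
    gradMean = gradSum / n;      % gradient averaged over the batch
    fprintf('n = %3d: norm(sum) = %.2f, norm(mean) = %.2f\n', ...
        n, norm(gradSum), norm(gradMean));
end
```

The summed gradient is exactly n times the averaged one, so with a fixed learning rate an unnormalized update takes steps that grow with the batch size, while a normalized update does not.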

AlexRD on 5 Apr 2021


Yeah, that's what I overhauled in my backprop code, and it improved the timing significantly.

This is the new code I've written:

function [net, performance] = train_network(net, tdata, tlabel, vdata, vlabel, numEpochs, batch_size)
emptyLayer = struct('a', num2cell(net(1).b(1) * (zeros(1, length(net)))));
emptyNet = struct('b', 0, 'w', num2cell((zeros(1, length(net)))));
rateFunc = interp1(0.5 ./ (1:10), linspace(1, 10, numEpochs));
performance = [];
%% Backpropagation
for epoch=1:numEpochs
    tic;
    randIndex = randperm(size(tdata, 2));
    tdata = tdata(:, randIndex);
    tlabel = tlabel(:, randIndex);
    
    mainIndex = 1:batch_size:size(tdata, 2);
    mainIndex(end) = size(tdata, 2);
    
    for i=1:length(mainIndex)-1
        delta = emptyNet;
        layer = feed_forward(net, tdata(:, mainIndex(i):mainIndex(i+1)-1), emptyLayer);
        
        delta(length(net)).b = layer(length(net)).a - tlabel(:, mainIndex(i):mainIndex(i+1)-1);
        delta(length(net)).w = delta(length(net)).b * layer(length(net)-1).a';
        
        for L=length(net)-1:-1:2
            delta(L).b = (net(L+1).w' * delta(L+1).b) .* sigma_prime(layer(L).a);
            delta(L).w = delta(L).b * layer(L-1).a';
        end
        
        delta(1).b = (net(2).w' * delta(2).b) .* sigma_prime(layer(1).a);
        delta(1).w = delta(1).b * tdata(:, mainIndex(i):mainIndex(i+1)-1)';
        
        for L=1:length(net)
            net(L).b = net(L).b - rateFunc(epoch)/length(mainIndex(i):mainIndex(i+1)) * sum(delta(L).b, 2);
            net(L).w = net(L).w - rateFunc(epoch)/length(mainIndex(i):mainIndex(i+1)) * delta(L).w;
        end
    end
    %% Validate Net
    performance(end+1, 1) = validate_net(net, vdata, vlabel);
    if mod(epoch, 1) == 0
        train_error = validate_net(net, tdata, tlabel);
        fprintf ('Epoch(%d): %fs', epoch, toc);
        fprintf ('\nError(Training): %f', train_error);
        fprintf ('\nError: %f\n---------------\n', performance(end, 1));
%         if train_error >= 0.995, break; end
    else
        fprintf ('Epoch(%d): %fs', epoch, toc);
        fprintf ('\nError: %f\n---------------\n', performance(end, 1));
    end
end
end

It works great as the functions now all have a much bigger matrix to deal with. The single precision change gives a nice little 10% performance boost.

One interesting thing I noticed: I expected the performance to be roughly proportional to the neuron count * batch size, but as you can see here:

The neuron count appears to have a much bigger role.
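The new training code above calls feed_forward on a whole batch and uses a sigma_prime helper, but the thread shows neither. The sketch below is a hypothetical reconstruction, not the author's code: it assumes weights stored as outputs-by-inputs, biases as column vectors, one sample per column, and a sigma_prime that takes the activation rather than z. The names feed_forward_batch and sigma_prime_from_a are invented for this example, and the emptyLayer argument of the original call is omitted.

```matlab
% Hypothetical reconstruction; the thread does not show these helpers.
numInput = 784; numHidden = 100; numOutput = 10; batchSize = 256;
net = struct('b', [], 'w', cell(1, 2));
net(1).w = randn(numHidden, numInput, 'single') / sqrt(numInput);
net(1).b = randn(numHidden, 1, 'single');
net(2).w = randn(numOutput, numHidden, 'single') / sqrt(numHidden);
net(2).b = randn(numOutput, 1, 'single');

X = rand(numInput, batchSize, 'single');   % one column per sample
layer = feed_forward_batch(net, X);
d = sigma_prime_from_a(layer(1).a);        % numHidden-by-batchSize
disp(size(layer(end).a));                  % numOutput-by-batchSize

function layer = feed_forward_batch(net, X)
% Forward pass for a whole batch in one matrix product per layer.
layer = struct('a', cell(1, numel(net)));
a = X;
for L = 1:numel(net)
    % The column-vector bias expands across all samples (columns).
    a = 1 ./ (1 + exp(-(net(L).w * a + net(L).b)));
    layer(L).a = a;
end
end

function d = sigma_prime_from_a(a)
% Sigmoid derivative written in terms of the activation a = sigma(z).
d = a .* (1 - a);
end
```

Wrapping X and the weights in gpuArray should leave this code unchanged, since matrix multiplication, exp, and element-wise arithmetic all accept gpuArray inputs.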

Joss Knight on 5 Apr 2021

Surely your numbers are saying that the size of your weight matrix has almost no effect on performance except on the CPU when it gets large enough. On the GPU you can see that despite performing much larger operations, the performance is unchanged with neuron count, which must mean you haven't fully utilized the GPU at these sizes. You should pick your neuron size based on getting good results and not overfitting, while the batch size should just be as large a number as possible that still gives good convergence.

AlexRD on 5 Apr 2021

Thank you very much!
