# Improving the performance of matrix multiplication and division with nested for loops

2 views (last 30 days)
Matthew Kehoe on 25 Jul 2021
Commented: Matthew Kehoe on 31 Jul 2021
My Matlab code has a section of code which repeatedly performs matrix multiplication and division (through the backslash operator =A\b). When testing large datasets, a certain function named solvebvp_colloc.m is called millions of times. As this function contains multiple matrix multiplications and divisions, it tends to take up a lot of runtime. I'm curious if I could rewrite solvebvp_collect.m (or the code before solvebvp_collect.m) in order to improve performance. I'm not sure if matrix decomposition will help or if I could remove one of the for loops referencing j=1:Nx completely.
My original implementation is shown below. The function solvebvp_colloc.m is harmful to the overall performance of my Matlab code.
% Test Data
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150; % I'm creating a ntests variable to test running the code below multiple times. The
% ntests variable is not part of my production code (and is only used for testing performance).
tic
% Original Implementation
for ii=1:ntests
for n=0:N
for m=0:M
for j=1:Nx
Fnmhat_p = Fnmhat(j,:).';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p(j)^2 - 2*alpha*p(j);
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(j,n+1,m+1);
d_max = -1i*gammap(j);
n_max = 1.0;
r_max = Jnmhat(j);
% This subroutine is expensive for large N,M. The problem is that
% solvebvp_collec.m needs to run against three for loops from
% 1:N,1:M,1:Nx and can be called millions of times.
uhat_p = solvebvp_colloc(Fnmhat_p,alphaalpha,betabeta,gammagamma,...
d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2);
Uhat(j,:) = uhat_p.';
end
end
end
end
toc
where the function solvebvp_colloc is of the form
function [utilde] = solvebvp_colloc(ftilde,alpha,beta,gamma,d_min,n_min,...
r_min,d_max,n_max,r_max,identy,D,D2)
A = alpha*D2 + beta*D + gamma*identy;
b = ftilde;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end) = r_min;
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max;
b(1) = r_max;
utilde = A\b;
return;
This code is slow for large N,M (large being anything over N=M=10. I would consider N=M=30 as a resonable test case for testing large datasets.) I think that my code design has three major performance flaws including:
1. The function solvebvp_colloc.m is processed a large number of times (as it needs to run for all values of the for loops 1:N, 1:M, and 1:Nx). When N and M are large (say N=M=20), the profiler shows that this function is called 20,829,312 times and takes a total runtime of 1042 seconds or 17.3 minutes. The operation utilde = A\b is expensive and calling this millions of times takes up a lot of runtime. Calling the matrix multiplication operations in solvebvp_colloc.m is also expensive.
2. Running an extra for loop with j=1:Nx creates larger runtime.
3. I don't use matrix decomposition. Since the matrix A changes in every iteration of j (because of the value of gammagamma is different for different values of j), I'm not sure if I can use matrix decomposition to speed up this calculation.
My initial thought was to avoid writing the additional for loop from j=1:Nx. I implemented this in a second impementation below. I wasn't able to figure out how to move the matrix A out of the for loop because the size of gammagamma is (Nx,1) and the size of the identity matrix (Nx+1,Nx+1) are different.
% Test Data
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150;
% New Implementation
tic
for ii=1:ntests
for n=0:N
for m=0:M
% Moved outside of the for loop
Fnmhat_p = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
% Moved b outside of the for loop
b = Fnmhat_p;
% I don't know how to move A outside of the for loop as gammagamma
% is of size(Nx,1) and identy is of size(Ny+1,Ny+1) so writing
% A = alphaalpha*D2 + betabeta*D + gammagamma.*identy
% wouldn't work since the two matrices are of different sizes.
for j=1:Nx
uhat_p_new = solvebvp_colloc_new(b,alphaalpha,betabeta,gammagamma,...
d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2,j);
Uhat_new(j,:) = uhat_p_new.';
end
end
end
end
toc
diff = norm(Uhat - Uhat_new,inf); % is zero
where the new function solvebvp_colloc_new.m is written as
function [utilde] = solvebvp_colloc_new(b,alpha,beta,gamma,d_min,n_min,...
r_min,d_max,n_max,r_max,identy,D,D2,j)
A = alpha*D2 + beta*D + gamma(j)*identy;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end,j) = r_min(j);
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max(j);
b(1,j) = r_max(j);
utilde = A\b(:,j);
return;
A local test shows that the second implementation is "slightly faster."
Elapsed time of original implementation is 4.087457 seconds.
Elapsed time of new implementation 3.934619 seconds.
For large M,N (say N=M=20 or N=M=30) this difference would be larger. However, I wasn't able to significantly speed up the calculation because I still do a large amount of matrix multiplications and divisions inside the new function solvebvp_colloc_new.m. I'm interested in analyzing how I can speed up this calculation (I will need to test N=M=20 and N=M=30) and the profiler shows that this calculation takes up about 1/3 of the total runtime. I'm interested in analyzing the following performance improvements:
1. Can Matlab matrix decomposition help here even though A changes for every value of j (since gammagamma is different for every value of j)?
2. Is it possible to avoid writing the for loop from j=1:Nx in order to perform utilde = A\b less and matrix multiplication less?
3. Are there any other performance improvements that would speed up this calculation (the function solvebvp_colloc.m is called millions of times for large N,M)? I think that bsxfun may help where the backslash operator becomes utilde = bsxfun(@rdivide, A, b); although this doesn't match the dimensions of the inner for loop through j.

Chunru on 25 Jul 2021
Edited: Chunru on 25 Jul 2021
Analyse your program to see if the matrix A is independent of which loop variables (it seems A is independent to some loop variables at the first look of the program). Compute A outside the loops of the independent variables. In inner-loop contactate the right-hand b into a matrix B. Then solve multiple linear systems with the same A in one-go: X = A \ B. This way you will only do one matrix "division" for all the independent loops (if any) and hopefully the performance can be improved.
If A is changing over all loop variable, then you may not be able to improve the performance drastically.
Matthew Kehoe on 31 Jul 2021
I've optimized some of the code and have asked a different question. I am going to close this question by accepting your answer.

R2020a

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!