Improving the performance of vectorized code possible?
13 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
Stephan
am 3 Mai 2018
Kommentiert: Nadeem Ahmed
am 29 Nov. 2022
Hello,
I rewrote a calculation script with nested loops in vectorized code, which greatly improved computation time. Now I have done several test runs of different size mxn and am a bit surprised by the results:
for m = 16 and n = 68 the improvement is about 40 ... 45%
for m = 160 and n = 680 the improvement is about 45 ... 55%
and for m = 1600 and n = 6800 the improvement is "only" about 30%.
I would have expected that the improvement by vectorization with small problems is rather small and increases with increasing problem size or at least remains at a certain level. This behavior is unclear to me and I would like to understand it to maybe make it better.
The calculation itself uses 8 matrices / vectors that are traversed and linked in different ways. These are the basic arithmetic and sin / cos functions. In addition, the quantities m and n play a role as well as the matrices generated by ndgrid. The result is a vector of size m x 1. To achieve the result, the Matlab functions diag and sum are also used.
I have used the following techniques to speed up the calculation:
- I used ndgrid instead of the nested loop
- all nested loop calculations were vectorized
- previously doubly calculated terms are calculated only once and used again
- Instead of working in a script, I've written a function that executes the code and returns the result.
My question on this topic is divided into two parts - a more theoretical and a practical one:
1.) How is it that the improvement of the calculation speed - after it first increases - decreases again with increasing problem size? I would like to understand technically what could be or is the cause.
2.) Are there any other or additional techniques that can be used, provided that the applied techniques are implemented correctly? Which could that be? Are the Matlab functions diag and sum used for such a thing or are there better alternatives?
Thank you for your efforts and kind regards
Stephan
1 Kommentar
Julian Hapke
am 3 Mai 2018
it's hard to tell if there are better ways than
diag
and
sum
if you do not share your code.
The decreasing performance increase has propably to do with memory access, but I'm no expert there.
Akzeptierte Antwort
John D'Errico
am 3 Mai 2018
Bearbeitet: John D'Errico
am 3 Mai 2018
I looked quickly at the code you were playing with in the original post.
The biggest problem was arguably that no arrays were preallocated. ALWAYS preallocate arrays that will be grown in size dynamically. This is perhaps the biggest reason for code (written by novices) running slowly.
Failing to preallocate an array that is then grown iteratively causes a quadratic time penalty. As the array size grows, the time required to dynamically re-allocate memory, then copying the entire array contents grows quadratically.
Of course, if you never grow any arrays or vectors, then there is no need to preallocate. That is often a consequence of "vectorization". But there are many ways to vectorize code. No single way exists. Often, vectorization just means re-thinking the code flow, re-thinking the basic algorithm. Very often vectorization means you need to trade off memory for external explicit loops. But the use of great gobs of memory can be expensive. And creating those large arrays, then working with them requires internal, implicitly generated loops. You will always have loops. But implicitly generated loops in compiled code tend to be much faster than explicit MATLAB loops.
Where possible in vectorized code, use either bsxfun or implicit scalar expansion. They can reduce the memory load.
You point out that the vectorization gain is not always uniform, or even consistent. That sometimes can be explained by too heavy use of memory. Thus if your temporary arrays grow so large that MATLAB has trouble finding the room, your computer will start swapping, using slow memory in exchange of fast memory.
Next, don't forget that different parts of your algorithm will have different time scalings as the problem size grows. So if one piece of the code is O(n^3) in time complexity, another is O(n), and another O(exp(n)), then the O(exp(n)) part will begin to dominate the computation as the problem grows in size, even if the constant out front was very small. So small computations may be dominated by the O(n) piece, but eventually, that O(exp(n)) term will grow until the problem becomes computationally impossible to solve.
Even vectorized code can be terribly slow, if the vectorization is terribly done. For example, suppose you want to scale every column of a matrix, using a vector of constants?
N = 10000;
A = rand(N);
K = 1:N;
So I want to multiply the j'th column by K(j).
You might use a double loop, thus effectively crap code written as if it was written in some low level language. At least loops are pretty well optimized in MATLAB these days. But still that will be slow as hell.
tic
B = A; % preallocate B!
for I = 1:N
for J = 1:N
B(I,J) = A(I,J)*K(J);
end
end
toc
Elapsed time is 4.396439 seconds.
So not very fast. Had I not preallocated B, I would still be here days later waiting though.
We could do this using a matrix multiply, with a diagonal matrix. No need to preallocate. but way slower!
tic
B = A*diag(K);
toc
Elapsed time is 48.600078 seconds.
That is because a matrix*matrix multiply os an O(N^3) operation. But we only had to do N*2 multiplies! Most of those multiplies were multiplies by zero, then adding up a lot of mainly zeros.
Making K into a sparse diagonal matrix makes that part much faster.
tic
B = A*spdiags(K(:),0,N,N);
toc
Elapsed time is 0.641262 seconds.
But even sparse matrix multiplies are not perfect. A simple repmat is better here.
tic
B = A.*repmat(K,N,1);
toc
Elapsed time is 0.501364 seconds.
Of course, this is why bsxfun was invented, to replace those repmats.
tic
B = bsxfun(@times,A,K);
toc
Elapsed time is 0.289946 seconds.
If you have a new release of MATLAB, then implicit array expansion helps, because bsxfun brings some overhead.
tic
B = A.*K;
toc
Elapsed time is 0.207587 seconds.
And, of course, had N been larger or smaller, then all of these basic variations might change their times, some getting relatively larger or slower.
In the end, there is no single best way to vectorize ANY block of code. The best way will depend on the problem size. It will sometimes depend on the data itself. It will depend on your skill at writing code in MATLAB, and how well you understand how MATLAB uses memory. It will depend on your knowledge of what functions are available to you. (For example, have you blinked, and missed that some new capability was recently introduced into MATLAB?)
5 Kommentare
Nadeem Ahmed
am 29 Nov. 2022
Hello @John D'Errico, because you explain the reasons of vectorization in a very efficient manner, I also would like to know how can we use the technique of vectorization efficiently if we have this kind of problem
clc
clear
close all
n=1000;
C1=zeros(n,n);
C2=zeros(n,n);
A=rand(n,n);
B=rand(n,n);
tic
for i=2:n-1
for j=2:n-1
C1(i,j) = (A(i,j)*B(i,j-1) + A(i-1,j)*B(i+1,j-1))/(A(i,j+1)*B(i+1,j));
end
end
toc
%VECTORIZATION
tic
C2(2:n-1,2:n-1)=((A(2:n-1,2:n-1).*B(2:n-1,1:n-2) + A(1:n-2,2:n-1).*B(3:n,1:n-2)))./(A(2:n-1,3:n).*B(3:n,2:n-1));
toc;
norm(C1-C2)
This is a very basic example, although it is showing the improvement after vectorization but not that enough. If I make more divison and multiplication in the same function, "vectorization" will become even worse than "for loop". It would be very helpful for me if you provide some suggestions on how I should improve my code.
Weitere Antworten (0)
Siehe auch
Kategorien
Mehr zu Matrix Indexing finden Sie in Help Center und File Exchange
Produkte
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!