Can A = A + B'*B be sped up somehow? It is seriously bottlenecking my for-loop
I have a script where a very large matrix A (square, up to about 10000 x 10000 values) is initialized outside of a for-loop and is then overwritten many times within the loop like this:
A = zeros(6588,6588);
for i =1:1000
% B changes to a new value here and is a 27x6588 matrix
A = A + B' * B;
end
I wanted to remove the loop altogether but I think it’s impossible for me to calculate all instances of transpose(B)*B outside of the loop beforehand since I run out of memory.
Is there anything I could do to this code segment to speed it up? Computationally it's just a few simple operations, but they still take over 70% of my script's runtime and I can't figure out a way to improve this. Is it possible?
1 comment
Alfonso Nieto-Castanon
on 14 Aug 2015
There may be ways to reduce the computation time, depending on whether the B{i} show some dependencies between iterations (e.g. B{i} = C{i}*B0, with B0 having fewer rows than columns), or depending on what you later use A for (e.g. eigenvectors of A may be more easily computed). Could you give a few more details on what these computations are meant to produce?
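To illustrate the kind of saving Alfonso is hinting at, here is a minimal sketch assuming the hypothetical factorization B{i} = C{i}*B0 that he mentions; the names C, B0, and the rank r are made up for illustration and would have to come from your actual problem:

```matlab
% Hypothetical sketch: if every B{i} factors as C{i}*B0 with a shared
% B0 (r x n, r << n), then B{i}'*B{i} = B0' * (C{i}'*C{i}) * B0, so
% only a small r x r accumulation needs to run inside the loop.
m = 27; n = 6588; r = 50; k_max = 1000;   % r is an assumed small rank
B0 = rand(r, n);                % shared factor (assumed to exist)
S  = zeros(r, r);
for k = 1:k_max
    C = rand(m, r);             % stand-in for the per-iteration factor
    S = S + C'*C;               % cheap r x r update
end
A = B0' * S * B0;               % one large n x n product at the end
```

If such a factorization exists, the expensive n-by-n work happens once instead of k_max times.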
Accepted Answer
More Answers (4)
Echoing Walter's response, you need more memory if you want it to go faster. One thing you might consider is using single-precision variables instead of double, which effectively doubles the number of values you can hold in memory, at the obvious cost of numerical precision.
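If reduced precision is acceptable for your application (that is an assumption only you can check), a minimal single-precision sketch of the accumulation might look like this:

```matlab
% Single-precision variant: halves memory use compared to double.
% Accuracy drops to roughly 1e-7 relative, which may or may not be
% acceptable depending on what A is used for.
m = 27; n = 6588;
A = zeros(n, n, 'single');      % allocate the accumulator in single
for k = 1:10
    B = rand(m, n, 'single');   % generate B directly in single
    A = A + B.'*B;              % the accumulation stays in single
end
```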
If you're willing to get more memory or reduce your precision, you can get a 20x speed-up by pre-computing all of your B matrices ahead of time and then reshaping so that the computation becomes a single (albeit very large) matrix multiplication. You might also get a modest speed-up using James Tursa's venerable mtimesx package from the File Exchange.
Feel free to play with the function below to get some ideas. Note that I have an obscenely large quantity of RAM to play with, so you might want to reduce k_max to something more reasonable than 1000 before running it:
function symmat_add(k_max)
m = 27;
n = 6588;
%k_max = 1000; % use input variable...
% baseline
rng(0);
tic;
A = zeros(n, n);
for k = 1:k_max
% B changes to a new value here and is a 27x6588 matrix
B = rand(m, n);
A = A + B.'*B;
end
toc;
% mtimesx basic
rng(0);
tic;
B_mtx = rand(m, n, k_max);
A_mtx = sum(mtimesx(B_mtx, 'T', B_mtx), 3);
toc;
% mtimesx BLAS (i.e. SSYRK/DSYRK)
rng(0);
tic;
B_blas = rand(m, n, k_max);
A_blas = sum(mtimesx(B_blas, 'T', B_blas, 'blas'), 3);
toc;
% mtimesx speed
rng(0);
tic;
B_speed = rand(m, n, k_max);
A_speed = mtimesx(B_speed, 'T', B_speed, 'speed');
A_speed = sum(A_speed, 3);
toc;
% permute
rng(0);
tic;
B_perm = rand(m, n, k_max);
B_perm = reshape(permute(B_perm, [1, 3, 2]), m*k_max, n);
A_perm = B_perm.'*B_perm;
toc;
% verify results
norm(A - A_mtx, 'fro')
norm(A - A_blas, 'fro')
norm(A - A_speed, 'fro')
norm(A - A_perm, 'fro')
end % function symmat_add
For k_max = 10, I get:
>> symmat_add(10)
Elapsed time is 1.937278 seconds.
Elapsed time is 1.710109 seconds.
Elapsed time is 1.664435 seconds.
Elapsed time is 1.532890 seconds.
Elapsed time is 0.207303 seconds.
ans = 0
ans = 0
ans = 0
ans = 1.1347e-10
For k_max = 100, I get:
>> symmat_add(100)
Elapsed time is 17.950650 seconds.
Elapsed time is 17.012750 seconds.
Elapsed time is 17.257742 seconds.
Elapsed time is 15.000076 seconds.
Elapsed time is 0.855950 seconds.
ans = 0
ans = 0
ans = 0
ans = 1.4813e-09
For k_max = 1000, there seems to be an issue with mtimesx as it starts using TBs of memory, so I killed it. However, comparing just the baseline and permute portions of the code, I get:
>> symmat_add(1000)
Elapsed time is 171.627770 seconds.
Elapsed time is 6.422042 seconds.
ans = 3.9958e-08
Note that while the permute trick is by far the fastest, it also gives the largest apparent error, and the difference grows with the size of k_max; it may not be wise to mix this with single precision...
In summary, try to get more RAM, pre-compute B, then consider using permute to accelerate the computation. Your 27-by-6588-by-1000 B matrix should only require about 1.4 GB in double precision, which is large but reasonable, especially if you can pre-compute it and just load the resulting A matrix when you need to compute other large variables in your workspace.
I hope this helps.
Walter Roberson
on 13 Aug 2015
Are the values of B independent of the previous iteration? Could you calculate any one B by knowing an initial value and the number of the iteration that is being done ("i" in your sample code)?
If so and if you have the Parallel Computing Toolbox, you could use parfor for this. Your "A" would be a "reduction variable":
parfor i = 1 : 1000
    B = ...; % some calculation that does not depend on the previous B but may depend on i
    A = A + B' * B;
end
2 comments
Peta
on 14 Aug 2015
If your memory gets low, install more memory. This is trivial, but efficient. Much more efficient than any smart tricks. Except for the trick to avoid repeated calculations. Are you able to create B'*B directly using the algorithm to create B?
Note that Matlab's JIT acceleration suffers significantly from variables, which change their type and similar things. So please post the relevant part of the code, which reproduces the problem. Perhaps we find an easy way to accelerate the code without changing the method.
Any parallel processing requires more RAM for physical reasons.
I wanted to remove the loop altogether but I think it’s impossible for me to calculate all instances of transpose(B)*B outside of the loop beforehand since I run out of memory.
Maybe you don't have enough memory to hold all instances of B a priori, but you should have enough memory to hold batches of, say, 200 of them at a time. Concatenating 200 generations of B.' column-wise would give you a 6588x5400 matrix, still smaller than A. This would allow you to do the update of A in bigger, vectorized chunks, as follows:
A = zeros(6588,6588);
for n = 1:5
    for i = 1:200
        BTcell{1,i} = ...; % put transpose(B) in here
    end
    Bt = cell2mat(BTcell);
    A = A + Bt * Bt';
end
So, now, the update A = A + B' * B effectively only gets executed 5 times instead of 1000. This is important, because no matter how many rows B has, the computation of B'*B always results in a 6588x6588 memory allocation. So, pre-concatenating more B row data, like above, ensures that the memory allocation is repeated only 5 times versus 1000.
3 comments
Peta
on 22 Aug 2015
B is right now a 1073x1 cell array where each cell is 27x6588.
Basically what I'm saying is: instead of partitioning your data into 27x6588 chunks called B, partition it into bigger N x 6588 chunks called Z, where N >> 27 is as large as your computer can handle. Then update your A matrix in the same way, but with Z:
A = A + Z'*Z;
This way, the total number of updates of A and the accompanying 6588 x 6588 memory allocations will be reduced.
Now, the naive way to create the Z chunks from your current B chunks is to concatenate some of your B data row-wise. But row-wise concatenation is slow, so if there's a way you can either
(a) generate Z chunks directly, or
(b) generate B from scratch in 6588 x 27 form, i.e., with rows and columns interchanged, and then concatenate column-wise, or
(c) transpose the 27 x 6588 data before storing them to your cell and similarly concatenate column-wise,
then I think you will be better off. However, if you build the Z matrices by column-wise concatenation, then the way you update A must get modified to
A=A+Z*Z.';
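A minimal sketch of option (c) above, with a made-up batch size, might look like this:

```matlab
% Sketch of option (c): transpose each chunk before storing it, then
% concatenate column-wise (fast in MATLAB, since matrices are stored
% column-major) and do one big update of A.
m = 27; n = 6588; batch = 20;   % batch size is illustrative only
Zt = cell(1, batch);
for i = 1:batch
    B = rand(m, n);             % stand-in for the real per-iteration B
    Zt{i} = B.';                % store as n x m before concatenating
end
Z = [Zt{:}];                    % n x (m*batch), column-wise concatenation
A = Z * Z.';                    % equivalent to summing the B.'*B terms
```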
Matt J
on 22 Aug 2015
Compare:
B=rand(27,6588);
Z=repmat(B,20,1);
A1=zeros(6588);
tic;
for i=1:40
A1=A1+B.'*B;
end
toc
%Elapsed time is 21.234253 seconds.
A2=zeros(6588);
tic
for i=1:2
A2=A2+Z'*Z;
end
toc
%Elapsed time is 1.343334 seconds.
>> Error=max(abs(A1(:)-A2(:)))
Error =
1.1937e-12
Josh Lee
on 3 Oct 2015
Is your data comprised only of real values? If so, have you tried using .' instead of '? The first performs only a transposition, while the second performs a complex conjugation followed by a transposition. I'm sure there are greater performance gains to be had by implementing the rather well-reasoned solutions mentioned earlier, but every little bit helps, right?
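A quick way to see the difference between the two operators on real versus complex input:

```matlab
% For real-valued B the two transposes give identical results,
% since there is nothing to conjugate:
B = rand(27, 6588);
isequal(B', B.')    % true: ctranspose and transpose agree on real data
% For complex data they differ:
C = B + 1i*B;
isequal(C', C.')    % false: ' conjugates the entries, .' does not
```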