Multithreaded FILTER?

I've found several references which tell that Matlab's filter is multi-threaded since R2007a, e.g. MathWorks:Solution 1-4PG4AN.
The test:
function myFilterTest
x = rand(1e6, 2);
x1 = x(:,1);
x2 = x(:,2);
% [B, A] = butter(3, 0.2, 'low'); % Butterworth 3rd order lowpass filter
B = [0.01809893300751, 0.05429679902254, 0.05429679902254, 0.01809893300751];
A = [1, -1.760041880343, 1.182893262037, -0.2780599176345];
tic;
for i=1:100
y = filter(B, A, x); % Matrix
% clear('y'); % Avoid smart JIT interferences => same effects!
end
toc
tic;
for i=1:100
y1 = filter(B, A, x1); % Two vectors
y2 = filter(B, A, x2);
% clear('y1', 'y2'); % No qualitative changes
end
toc
[EDITED, 12-Dec-2012 22:38 UTC]: Explicite A and B instead of calling butter of SPT
Results on a Windows7/64 Core2Duo:
Matlab 2009b/64:
5.34 sec (matrix)
5.22 sec (two vectors)
Matlab 2009b/64 started with -singleCompThread:
5.23 sec (matrix)
5.24 sec (two vectors)
Matlab 2011b:
4.75 sec (matrix)
4.99 sec (two vectors)
My expectations: 1. The value of a filtered signal to a specific time depends on the complete history for an IIR filter like the Butterworth. Therefore the filtering of a vector cannot take advantage from multi-threading (is this correct?). 2. In opposite to this, filtering a [n x 2] signal should occupy two cores, such that a multi-threaded filter should need approximately the same time as for a [n x 1] signal (is this correct?).
But my double-core processor has a load of 57% during the calculations and the filtering needs nearly the same time, when I start Matlab with the -singleCompThread flag.
My conclusion: It looks like filter is not multi-threaded. Can somebody confirm this impression for 4 or 8 cores? Then with "x = rand(1e6, 8)" and "x1" to "x8". I get equivalent results for FIR filter parameters with A=1. For a 12th order Butterworth the matrix method gets an advantage of 10%.
Thanks.

5 Kommentare

Jan
Jan am 6 Dez. 2011
It does not concern the question, but I'm still surprised, that my naive *single threaded* C-Mex implementation of the algorithm showed in "doc filter" needs only 2.1 seconds for the above problem.
Daniel Shub
Daniel Shub am 6 Dez. 2011
I wonder if the "for" loop is screwing something up... Can you turn the JIT accelerator off without turning multi threading off? My concern is that maybe the JIT accelerator realizes that it can use a parfor or some sort of multi threadingy to speed things up. Even better would be if the JIT accelerator realized that y is overwritten on every loop and therefore only the last iteration needs to be run.
Jan
Jan am 6 Dez. 2011
@Daniel: You can insert a "clear('y')" and "clear('y1'); clear('y2')" in the loops to avoid effects of the smart JIT acceleration. But this does not change the quality of the speed. I've replaced the FOR-loop by creating an M-file, which contains 100 calls of FILTER - I still see the same timings.
The JIT cannot enable PARFOR magically, because I do not have PARFOR installed on my computer.
The processing time *and* the 57% processor load seems to show consistently, that the multi-threading of FILTER does not work properly. Therefore I'm interested in the processor load for "filter(rand(1e6, 8))" on a 8-core CPU.
Walter Roberson
Walter Roberson am 12 Dez. 2011
Sorry, we do not have butter() here so I cannot run the test.
Jan
Jan am 12 Dez. 2011
I'm sorry, Walter, that I forgot to add the Signal Processing Toolbox to the "Add Products" list. Now B and A are defined explicitely.

Melden Sie sich an, um zu kommentieren.

 Akzeptierte Antwort

Dr. Seis
Dr. Seis am 12 Dez. 2011

1 Stimme

I ran this on my 8-core machine using R2009a:
function myFilterTest
x = rand(1e6, 8);
x1 = x(:,1);
x2 = x(:,2);
x3 = x(:,3);
x4 = x(:,4);
x5 = x(:,5);
x6 = x(:,6);
x7 = x(:,7);
x8 = x(:,8);
[B, A] = butter(3, 0.2, 'low');
tic;
for i=1:100
y = filter(B, A, x); % Matrix
% clear('y'); % Avoid smart JIT interferences => same effects!
end
toc
tic;
for i=1:100
y1 = filter(B, A, x1); % Eight vectors
y2 = filter(B, A, x2);
y3 = filter(B, A, x3);
y4 = filter(B, A, x4);
y5 = filter(B, A, x5);
y6 = filter(B, A, x6);
y7 = filter(B, A, x7);
y8 = filter(B, A, x8);
% clear('y1', 'y2'); % No qualitative changes
end
toc
clear all;
And got this:
Elapsed time is 16.865596 seconds.
Elapsed time is 16.117599 seconds.
Only one core was active during each test.
I ran this on my 8-core machine using R2011a and got:
Elapsed time is 12.542615 seconds.
Elapsed time is 16.268821 seconds.
All eight cores were active for the first test (on the matrix) and only a single core for the seconds test (on individual vectors).
I added this to the bottom of the test:
y_par = zeros(size(x));
matlabpool(8);tic;
parfor j = 1:8
for i=1:100
y_par(:,j) = filter(B, A, x(:,j));
end
% clear('y_par'); % No qualitative changes
end
toc; matlabpool close;
And got this when using R2011a:
Elapsed time is 13.305009 seconds.
Elapsed time is 16.398203 seconds.
Starting matlabpool using the 'local' configuration ... connected to 8 labs.
Elapsed time is 3.542021 seconds.
Sending a stop signal to all the labs ... stopped.

8 Kommentare

Daniel Shub
Daniel Shub am 12 Dez. 2011
Thanks for this. Could you please start MATLAB with the -singleCompThread flag and see what happens.
I think the fact that the two versions take the same amount of time means either filter is multi-threaded for both Nx1 arrays and NxM arrays or is not multi-threaded for NxM arrays. It would really surprise me if filter is not multi-threaded for NxM arrays.
Dr. Seis
Dr. Seis am 12 Dez. 2011
I re-ran using R2011a (before was using R2009a) and got different results.
Dr. Seis
Dr. Seis am 12 Dez. 2011
I re-ran using "parfor" in R2011a. One word - sweet! From this one question I discovered (1) I have R2011a on my company machine and (2) I know kung-... I mean I know how to use "parfor".
Jan
Jan am 12 Dez. 2011
I do not see a runtime difference between filtering one RAND(1e6,1) and two RAND(0.5e6,1) signals:
x=rand(1e6,1); x1=x(1:0.5e6);x2=x(0.5e6+1:end);
tic; for i=1:100; y=filter(B,A,x); end; toc
tic; for i=1:100; y1=filter(B,A,x1); y2=filter(B,A,x2); end; toc
But looking at the algorithm a "vertical" multi-threading cannot be applied directly (if A is not [1]) - in opposite to e.g. SUM, where the i.th thread can sum the elements iThread:nThread:end. But the "horizontal" (column by column) multi-threading is easy and efficient.
@Elige: I confirm that the timings in 2011b are better - but still slower than a naive single-threaded C-mex.
Jan
Jan am 12 Dez. 2011
@Elige: Timings in 2011b and the naive single-threaded C-mex:
Elapsed time is 11.220632 seconds. (Matrix)
Elapsed time is 18.033120 seconds. (Vector by vector)
Elapsed time is 8.642208 seconds. (C-Mex FilterM, see FEX)
What timings do you get with the PARFOR loop?
Jan
Jan am 12 Dez. 2011
"FEX" means FileExchange, see link on top of this page. "FilterM" is http://www.mathworks.com/matlabcentral/fileexchange/32261-filterm .
Dr. Seis
Dr. Seis am 12 Dez. 2011
Oh, did you want the the elapsed time for the "parfor" I added? I was thinking you wanted me to use that FilterM inside the "parfor" to see how fast that was since I already had listed the elapsed time for that test. See the last section above for the elapsed time using the "parfor" - it was around 3.5 seconds (compared to 13.3 seconds and 16.4 seconds for the matrix and individual vector implementation, respectively).
Jan
Jan am 12 Dez. 2011
I did not see the PARFOR timings when I've written the comment. Thanks for measuring PARFOR - really impressing.
A comparison with FilterM will be more interesting, if the multithreaded version is published in the FEX.

Melden Sie sich an, um zu kommentieren.

Weitere Antworten (5)

Titus Edelhofer
Titus Edelhofer am 12 Dez. 2011

1 Stimme

Hi Jan,
I gave it a try on my quadcore laptop. Using your test code using MATLAB singlethreaded or multithread indeed nearly makes no difference (in the multithreaded case the code runs with about 15% CPU in contrast to the usual 12.5% (because of hyperthreading) I see for singlethreaded code.
But: if I increase the number of columns I do see a benefit (I changed x to be rand(5e5, 20) and added a loop for the call to filter on the columns. The comparison probably isn't that fair anymore, but at least the CPU runs at about 50% ...
I contacted our development for clarification, my personal impression so far: yes, filter is multithreaded, but does not benefit as strongly as other functions do ...
Titus

1 Kommentar

Jan
Jan am 12 Dez. 2011
Thanks Titus! Now I'm feeling less alone ;-)

Melden Sie sich an, um zu kommentieren.

Daniel Shub
Daniel Shub am 12 Dez. 2011

1 Stimme

As a summary answer:
The MATLAB documentation says FILTER is multi-threaded. As of r2011b, it is neither multi-threaded for Nx1 arrays nor NxM arrays. For NxM arrays a parfor loop allows for considerable speedups.

1 Kommentar

Jan
Jan am 12 Dez. 2011
And the multi-threading in 2007a to 2009b does not make FILTER faster.

Melden Sie sich an, um zu kommentieren.

Walter Roberson
Walter Roberson am 12 Dez. 2011

1 Stimme

Testing without butter()
R208b on Linux 64, 8 Xeon E5410 processors.
Default (maxNumCompThread is 8)
Elapsed time is 9.219355 seconds.
Elapsed time is 9.215591 seconds.
maxNumCompThread = 1
Elapsed time is 9.215270 seconds.
Elapsed time is 9.226053 seconds.
Really though the differences in timing are within the margin of error: my various runs had more variance than the difference between the figures I post above.

1 Kommentar

Mark Shore
Mark Shore am 13 Dez. 2011
2011b, 64-bit Windows 7, quad core i7, 2.80 GHz.
As revised with explicit A, B.
Elapsed time is 3.507754 seconds.
Elapsed time is 3.356668 seconds.
CPU usage ~12% (one thread of eight).

Melden Sie sich an, um zu kommentieren.

Ken Atwell
Ken Atwell am 13 Dez. 2011

1 Stimme

For me, running 11b on a dual-core MacBook Pro (i5), multi-threading kicks in only if variable x is at least 8 columns wide. Like most other reports in this post, my machine takes ~2.5 seconds per column when the column count is low. If I kick the column count up to 16, the time gets down to under 1.5 seconds per column.

1 Kommentar

Jan
Jan am 13 Dez. 2011
Good point, Ken! It seems like you've revealed the mechanism: FILTER is multi-threaded for >= 8 columns in 2011b.
for i=1:32; x=rand(1e6,i); tic; for j=1:10, y=filter(B,A,x); end; disp([i, toc/i]); end
2011b/64, 2 cores/Win7:
1 0.23365 Processor load: 57-64%
2 0.22441
3 0.22309
4 0.22450
5 0.22344
6 0.22479
7 0.22434
8 0.14223 <-- Processor load 95-100% !!!
9 0.14491
10 0.13942 etc, time per column stays at 0.14 sec.
For 2009a I constantly get 0.26 seconds. For 2011b -singleCompThread the times are 0.22 sec for all numbers of columns. Equivalent behaviour for RAND(1000, i).

Melden Sie sich an, um zu kommentieren.

Daniel Shub
Daniel Shub am 6 Dez. 2011

0 Stimmen

I believe that some functions require large enough sizes (and possibly lengths equal to powers of 2) before they can benefit from multi-threading. As for the complete history part, you can implement filtering with convolution. Convolution is multiplication in the frequency domain. This means you can implement filtering based on the FFT. There are multi-threaded versions of the FFT. I believe these implementations are quite complicated and require message passing and substantial overhead. I believe that the FFT of a column vector is also multi-threaded in MATLAB.

5 Kommentare

Jan
Jan am 6 Dez. 2011
@Daniel: It is correct that the mutli-threading of function is useful above a certain input size only and Matlab considers this e.g. in SUM:
x = rand(1, 88999); tic; for i=1:2000, y=sum(x); end; toc
Elapsed time is 0.240737 seconds.
x = rand(1, 89000); tic; for i=1:2000, y=sum(x); end; toc
Elapsed time is 0.133147 seconds. (unfortunately 0.366305 in 2009a! It gets much slower with multi-threading! If Matlab is started with -singleCompThread the time gets even 0.70 seconds. Cruel. So, please, TMW, hit the programmer, who was responsible for the 2009a "multi-threading" on the fingers!)
For SUM the limit for multi-threading the opration for a vector is at 89000.
But for FILTER I could not find this limit. It should be lower than 1e7. The signal length must not be a power of 2 for FILTER and using the FFT is not the point.
An older discussion about SUM:
http://www.mathworks.com/matlabcentral/newsreader/view_thread/274419
Walter Roberson
Walter Roberson am 6 Dez. 2011
The filter() doc page describes the algorithm, but there is no reference to FFT there.
I do not understand the described algorithm well enough to see any opportunities for multithreading in it.
Jan
Jan am 6 Dez. 2011
@Walter: Here you find the Matlab version of FILTER:
http://www.mathworks.com/matlabcentral/answers/9900-use-filter-constants-to-hard-code-filter .
The effect I've observed does *not* concern FFT.
It would be helpful, if anyone could run my code on 4, 6 or 8 cores and observe the processor load.
Daniel Shub
Daniel Shub am 8 Dez. 2011
I am not convinced that MATLAB actually does what the documentation says it does. It wouldn't surprise me if MATLAB changed the internal workings of filter to an algorithm that could be multithreaded and didn't bother telling us. One such algorithm could be based on convolution via the FFT. As for performance on a 4+ core machine, wish I had one.
Jan
Jan am 8 Dez. 2011
Thanks for your answer, Daniel. I thought the (missing) speed of FILTER would be of greater interest.
After I've upgraded to a dual core, I'm dissapointed, that e.g. the "multi-threaded" SUM is *slower* in Matlab 2009a (above the magic limit of 88999). Using -singleCompThread let SUM even be slower again. In 2011b the speed of SUM scales with the number of cores. But upgrading Matlab is expensive.
Mutli-threading the FILTER operation for a [N x 2] matrix should be trivial, because both columns can/could/should be processed separately.
The posted M-version and the C-implementation (published in the FEX) create identical results as FILTER without rounding differences. For a FFT method I'd expect at least some differences in the least signifcant bit.

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Performance and Memory finden Sie in Hilfe-Center und File Exchange

Gefragt:

Jan
am 6 Dez. 2011

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by