## Signal Processing Algorithm Acceleration in MATLAB

Note

The benchmarks in this example have been measured on a machine with four physical cores.

This example shows how to accelerate a signal processing algorithm in MATLAB® using the codegen (MATLAB Coder) and dspunfold functions. You can generate a MATLAB executable (MEX function) from an entire MATLAB function or specific parts of the MATLAB function. When you run the MEX function instead of the original MATLAB code, simulation speed can increase significantly. To generate the MEX equivalent, the algorithm must support code generation.

To use codegen (MATLAB Coder), you must have MATLAB Coder™ installed. To use dspunfold, you must have MATLAB Coder and DSP System Toolbox™ installed.

To use dspunfold on Windows and Linux, you must use a compiler that supports the Open Multi-Processing (OpenMP) application interface. See Supported Compilers.

### FIR Filter Algorithm

Consider a simple FIR filter algorithm to accelerate. Copy the firfilter function code into the firfilter.m file.

function [y,z1] = firfilter(b,x)
% Inputs:
%   b - 1xNTaps row vector of coefficients
%   x - A frame of  noisy input

% States:
%   z, z1 - NTapsx1 column vector of states

% Output:
%   y - A frame of filtered output

persistent z;

if (isempty(z))
z = zeros(length(b),1);
end
Lx = size(x,1);
y = zeros(size(x),'like',x);

z1 = z;
for m = 1:Lx
% Load next input sample
z1(1,:) = x(m,:);

% Compute output
y(m,:) = b*z1;

% Update states
z1(2:end,:) = z1(1:end-1,:);
z = z1;
end

The firfilter function accepts a vector of filter coefficients, b, a noisy input signal, x, as inputs. Generate the filter coefficients using the fir1 function.

NTaps = 250;
Fp = 4e3/(44.1e3/2);
b = fir1(NTaps-1,Fp);

Filter a stream of a noisy sine wave signal by using the firfilter function. The sine wave has a frame size of 4000 samples and a sample rate of 192 kHz. Generate the sine wave using the dsp.SineWave System object™. The noise is a white Gaussian with a mean of 0 and a variance of 0.02. Name this function firfilter_sim.  The firfilter_sim function calls the firfilter function on the noisy input.

function totVal = firfilter_sim(b)
% Create the signal source
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200);
totVal = zeros(4000,500);
R  = 0.02;

clear firfilter;

% Iteration loop. Each iteration filters a frame of the noisy signal.
for i = 1 : 500
trueVal = Sig();                        % Original sine wave
noisyVal = trueVal + sqrt(R)*randn;     % Noisy sine wave
filteredVal = firfilter(b,noisyVal);    % Filtered sine wave
totVal(:,i) = filteredVal;              % Store the entire sine wave
end

Run firfilter_sim and measure the speed of execution. The execution speed varies depending on your machine.

tic;totVal = firfilter_sim(b);t1 = toc;
fprintf('Original Algorithm Simulation Time: %4.1f seconds\n',t1);
Original Algorithm Simulation Time:  7.8 seconds

### Accelerate the FIR Filter Using codegen

Call codegen on firfilter, and generate its MEX equivalent, firfilter_mex. Generate and pass the filter coefficients and the sine wave signal as inputs to the firfilter function.

Ntaps = 250;
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); % Create the Signal Source
R = 0.02;
trueVal = Sig();                    % Original sine wave
noisyVal = trueVal + sqrt(R)*randn; % Noisy sine wave
Fp = 4e3/(44.1e3/2);
b = fir1(Ntaps-1,Fp);               % Filter coefficients

codegen firfilter -args {b,noisyVal}

In the firfilter_sim function, replace firfilter(b,noisyVal) function call with firfilter_mex(b,noisyVal). Name this function firfilter_codegen.

function totVal = firfilter_codegen(b)
% Create the signal source
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200);
totVal = zeros(4000,500);
R  = 0.02;

clear firfilter_mex;

% Iteration loop. Each iteration filters a frame of the noisy signal.
for i = 1 : 500
trueVal = Sig();                           % Original sine wave
noisyVal = trueVal + sqrt(R)*randn;        % Noisy sine wave
filteredVal = firfilter_mex(b,noisyVal);   % Filtered sine wave
totVal(:,i) = filteredVal;                 % Store the entire sine wave
end

Run firfilter_codegen and measure the speed of execution. The execution speed varies depending on your machine.

tic;totValcodegen = firfilter_codegen(b);t2 = toc;
fprintf('Algorithm Simulation Time with codegen: %5f seconds\n',t2);
fprintf('Speedup factor with codegen: %5f\n',(t1/t2));
Algorithm Simulation Time with codegen: 0.923683 seconds
Speedup factor with codegen: 8.5531

The speedup gain is approximately 8.5.

### Accelerate the FIR Filter Using dspunfold

The dspunfold function generates a multithreaded MEX file which can improve the speedup gain even further.

dspunfold also generates a single-threaded MEX file and a self-diagnostic analyzer function. The multithreaded MEX file leverages the multicore CPU architecture of the host computer. The single-threaded MEX file is similar to the MEX file that the codegen function generates. The analyzer function measures the speedup gain of the multithreaded MEX file over the single-threaded MEX file.

Call dspunfold on firfilter and generate its multithreaded MEX equivalent, firfilter_mt. Detect the state length in samples by using the -f option, which can improve the speedup gain further. -s auto triggers the automatic state length detection. For more information on using the -f and -s options, see dspunfold.

dspunfold firfilter -args {b,noisyVal} -s auto -f [false,true]
State length: [autodetect] samples, Repetition: 1, Output latency: 8 frames, Threads: 4
Analyzing: firfilter.m
Creating single-threaded MEX file: firfilter_st.mexw64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 4000 samples ... Sufficient
Checking 2000 samples ... Sufficient
Checking 1000 samples ... Sufficient
Checking 500 samples ... Sufficient
Checking 250 samples ... Sufficient
Checking 125 samples ... Insufficient
Checking 187 samples ... Insufficient
Checking 218 samples ... Insufficient
Checking 234 samples ... Insufficient
Checking 242 samples ... Insufficient
Checking 246 samples ... Insufficient
Checking 248 samples ... Insufficient
Checking 249 samples ... Sufficient
Minimal state length is 249 samples
Creating multi-threaded MEX file: firfilter_mt.mexw64
Creating analyzer file: firfilter_analyzer.p

The automatic state length detection tool detects an exact state length of 259 samples.

Call the analyzer function and measure the speedup gain of the multithreaded MEX file with respect to the single-threaded MEX file. Provide at least two different frames for each input argument of the analyzer. The frames are appended along the first dimension. The analyzer alternates between these frames while verifying that the outputs match. Failure to provide multiple frames for each input can decrease the effectiveness of the analyzer and can lead to false positive verification results.

firfilter_analyzer([b;0.5*b;0.6*b],[noisyVal;0.5*noisyVal;0.6*noisyVal]);
Analyzing multi-threaded MEX file firfilter_mt.mexw64. For best results,
please refrain from interacting with the computer and stop other processes until the
analyzer is done.
Latency = 8 frames
Speedup = 3.2x

firfilter_mt has a speedup gain factor of 3.2 when compared to the single-threaded MEX file, firfilter_st. To increase the speedup further, increase the repetition factor using the -r option. The tradeoff is that the output latency increases. Use a repetition factor of 3. Specify the exact state length to reduce the overhead and increase the speedup further.

dspunfold firfilter -args {b,noisyVal} -s 249 -f [false,true] -r 3
State length: 249 samples, Repetition: 3, Output latency: 24 frames, Threads: 4
Analyzing: firfilter.m
Creating single-threaded MEX file: firfilter_st.mexw64
Creating multi-threaded MEX file: firfilter_mt.mexw64
Creating analyzer file: firfilter_analyzer.p

Call the analyzer function.

firfilter_analyzer([b;0.5*b;0.6*b],[noisyVal;0.5*noisyVal;0.6*noisyVal]);
Analyzing multi-threaded MEX file firfilter_mt.mexw64. For best results,
please refrain from interacting with the computer and stop other processes
until the analyzer is done.
Latency = 24 frames
Speedup = 3.8x

The speedup gain factor is 3.8, or approximately 32 times the speed of execution of the original simulation.

For this particular algorithm, you can see that dspunfold is generating a highly optimized code, without having to write any C or C++ code. The speedup gain scales with the number of cores on your host machine.

The FIR filter function in this example is only an illustrative algorithm that is easy to understand. You can apply this workflow on any of your custom algorithms. If you want to use an FIR filter, it is recommended that you use the dsp.FIRFilter System object in DSP System Toolbox. This object runs much faster than the benchmark numbers presented in this example, without the need for code generation.

### Kalman Filter Algorithm

Consider a Kalman filter algorithm, which estimates the sine wave signal from a noisy input. This example shows the performance of Kalman filter with codegen and dspunfold.

The noisy sine wave input has a frame size of 4000 samples and a sample rate of 192 kHz. The noise is a white Gaussian with mean of 0 and a variance of 0.02.

The function filterNoisySignal calls the kalmanfilter function on the noisy input.

type filterNoisySignal
function totVal = filterNoisySignal
% Create the signal source
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200);
totVal = zeros(4000,500);
R  = 0.02;
clear kalmanfilter;
% Iteration loop to estimate the sine wave signal
for i = 1 : 500
trueVal = Sig();                    % Actual values
noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements
estVal = kalmanfilter(noisyVal);    % Sine wave estimated by Kalman filter
totVal(:,i) = estVal;               % Store the entire sine wave
end
type kalmanfilter
function [estVal,estState] = kalmanfilter(noisyVal)
% This function tracks a noisy sinusoid signal using a Kalman filter
%
% State Transition Matrix
A = 1;
stateSpaceDim = size(A,1);

% Measurement Matrix
H = 1;
measurementSpaceDim = size(H,1);
numTsteps = size(noisyVal,1)/measurementSpaceDim;

% Containers to store prediction and estimates for all time steps
zEstContainer = noisyVal;
xEstContainer = zeros(size(noisyVal));

Q = 0.0001; % Process noise covariance
R = 0.02;   % Measurement noise covariance
persistent xhat P xPrior PPrior;

% Local copies of discrete states
if isempty(xhat)
xhat = 5; % Initial state estimate
end

if isempty(P)
P = 1; % Error covariance estimate
end

if isempty(xPrior)
xPrior = 0;
end

if isempty(PPrior)
PPrior = 0;
end

% Loop over all time steps
for n=1:numTsteps

% Gather chunks for current time step
zRowIndexChunk = (n-1)*measurementSpaceDim + (1:measurementSpaceDim);
stateEstsRowIndexChunk = (n-1)*stateSpaceDim + (1:stateSpaceDim);

% Prediction step
xPrior = A * xhat;
PPrior =  A * P * A' + Q;

% Correction step. Compute Kalman gain.
PpriorH   = PPrior * H';
HPpriorHR  = H * PpriorH + R;
KalmanGain = (HPpriorHR \ PpriorH')';
KH  = KalmanGain * H;

% States and error covariance are updated in the
% correction step
xhat = xPrior + KalmanGain * noisyVal(zRowIndexChunk,:) - ...
KH * xPrior;
P = PPrior - KH * PPrior;

% Append estimates
xEstContainer(stateEstsRowIndexChunk, :) = xhat;
zEstContainer(zRowIndexChunk,:) = H*xhat;

end

% Populate the outputs
estVal = zEstContainer;
estState = xEstContainer;

end

Run filterNoisySignal.m and measure the speed of execution.

tic;totVal = filterNoisySignal;t1 = toc;
fprintf('Original Algorithm Simulation Time: %4.1f seconds\n',t1);
Original Algorithm Simulation Time: 21.7 seconds

### Accelerate the Kalman Filter Using codegen

Call the codegen function on kalmanfilter, and generate its MEX equivalent, kalmanfilter_mex.

The kalmanfilter function requires the noisy sine wave as the input.

Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); % Create the Signal Source
R = 0.02;                           % Measurement noise covariance
trueVal = step(Sig);                % Actual values
noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements
codegen -args {noisyVal} kalmanfilter.m

Replace kalmanfilter(noisyVal) in filterNoisySignal function with kalmanfilter_mex(noisyVal). Name this function as filterNoisySignal_codegen

function totVal = filterNoisySignal_codegen
% Create the signal source
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200);
totVal = zeros(4000,500);
R  = 0.02;
clear kalmanfilter_mex;
% Iteration loop to estimate the sine wave signal
for i = 1 : 500
trueVal = Sig();                        % Actual values
noisyVal = trueVal + sqrt(R)*randn;     % Noisy measurements
estVal = kalmanfilter_mex(noisyVal);    % Sine wave estimated by Kalman filter
totVal(:,i) = estVal; % Store the entire sine wave
end

Run filterNoisySignal_codegen and measure the speed of execution.

tic; totValcodegen = filterNoisySignal_codegen; t2 = toc;
fprintf('Algorithm Simulation Time with codegen: %5f seconds\n',t2);
fprintf('Speedup with codegen is %0.1f',t1/t2);
Algorithm Simulation Time with codegen: 0.095480 seconds
Speedup with codegen is 227.0

The Kalman filter algorithm implements several matrix multiplications. codegen uses the Basic Linear Algebra Subroutines (BLAS) libraries to perform these multiplications. These libraries generate a highly optimized code, hence giving a speedup gain of 227.

### Accelerate the Kalman Filter Using dspunfold

Generate a multithreaded MEX file using dspunfold and compare its performance with codegen.

Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200);
% Create the signal source
R = 0.02;                           % Measurement noise covariance
trueVal = step(Sig);                % Actual values
noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements
dspunfold kalmanfilter -args {noisyVal} -s auto
State length: [autodetect] frames, Repetition: 1, Output latency: 8 frames, Threads: 4
Analyzing: kalmanfilter.m
Creating single-threaded MEX file: kalmanfilter_st.mexw64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 1 frames ... Sufficient
Minimal state length is 1 frames
Creating multi-threaded MEX file: kalmanfilter_mt.mexw64
Creating analyzer file: kalmanfilter_analyzer.p

Call the analyzer function.

kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results,
please refrain from interacting with the computer and stop other processes until
the analyzer is done.
Latency = 8 frames
Speedup = 0.7x

kalmanfilter_mt has a speedup factor of 0.7, which is a performance loss of 30% when compared to the single-threaded MEX file, kalmanfilter_st. Increase the repetition factor to 3 to see if the performance increases. Also, detect the state length in samples.

dspunfold kalmanfilter -args {noisyVal} -s auto -f true -r 3
State length: [autodetect] samples, Repetition: 3, Output latency: 24 frames, Threads: 4
Analyzing: kalmanfilter.m
Creating single-threaded MEX file: kalmanfilter_st.mexw64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 4000 samples ... Sufficient
Checking 2000 samples ... Sufficient
Checking 1000 samples ... Sufficient
Checking 500 samples ... Sufficient
Checking 250 samples ... Insufficient
Checking 375 samples ... Sufficient
Checking 312 samples ... Sufficient
Checking 281 samples ... Sufficient
Checking 265 samples ... Sufficient
Checking 257 samples ... Insufficient
Checking 261 samples ... Sufficient
Checking 259 samples ... Sufficient
Checking 258 samples ... Insufficient
Minimal state length is 259 samples
Creating multi-threaded MEX file: kalmanfilter_mt.mexw64
Creating analyzer file: kalmanfilter_analyzer.p

Call the analyzer function.

kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results,
please refrain from interacting with the computer and stop other processes until the
analyzer is done.
Latency = 24 frames
Speedup = 1.4x

dspunfold gives a speedup gain of 40% when compared to the highly optimized single-threaded MEX file. Specify the exact state length and increase the repetition factor to 4.

dspunfold kalmanfilter -args {noisyVal} -s 259 -f true -r 4
State length: 259 samples, Repetition: 4, Output latency: 32 frames, Threads: 4
Analyzing: kalmanfilter.m
Creating single-threaded MEX file: kalmanfilter_st.mexw64
Creating multi-threaded MEX file: kalmanfilter_mt.mexw64
Creating analyzer file: kalmanfilter_analyzer.p

Invoke the analyzer function to see the speedup gain.

kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results, please refrain
from interacting with the computer and stop other processes until the analyzer is done.
Latency = 32 frames
Speedup = 1.5x

The speedup gain factor is 50% when compared to the single-threaded MEX file.

The performance gain factors codegen and dspunfold give depend on your algorithm. codegen provides sufficient acceleration for some MATLAB constructs. dspunfold can provide additional performance gains using the cores available on your machine to distribute your algorithm via unfolding. As shown in this example, the amount of speedup that dspunfold provides depends on the particular algorithm to accelerate. Use dspunfold in addition to codegen if your algorithm is well-suited for distributing via unfolding, and if the resulting latency cost is in line with the constraints of your application.

Some MATLAB constructs are highly optimized with MATLAB interpreted execution. The fft function, for example, runs much faster in interpreted simulation than with code generation.