Reproducibility in Parallel Statistical Computations

Issues and Considerations in Reproducing Parallel Computations

A reproducible computation is one that gives the same results every time it runs. Reproducibility is important for:

Debugging — To correct an anomalous result, you need to reproduce the result.
Confidence — When you can reproduce results, you can investigate and understand them.
Modifying existing code — When you change existing code, you want to ensure that you do not break anything.

Generally, you do not need to ensure reproducibility for your computation. Often, when you want reproducibility, the simplest technique is to run in serial instead of in parallel. In serial computation you can simply call the rng function as follows:

s = rng % Obtain the current state of the random stream
% run the statistical function
rng(s) % Reset the stream to the previous state
% run the statistical function again, obtain identical results

This section addresses the case when your function uses random numbers, and you want reproducible results in parallel. This section also addresses the case when you want the same results in parallel as in serial.

Running Reproducible Parallel Computations

To run a Statistics and Machine Learning Toolbox™ function reproducibly:

Set the UseSubstreams option to true using statset.
Set the Streams option to a type that supports substreams: 'mlfg6331_64' or 'mrg32k3a'. For information on these streams, see RandStream.list.
To compute in parallel, set the UseParallel option to true.
To fit an ensemble in parallel using fitcensemble or fitrensemble, create a tree template with the 'Reproducible' name-value pair set to true:
```
t = templateTree('Reproducible',true);
ens = fitcensemble(X,Y,'Method','bag','Learners',t,...
    'Options',options);
```
Call the function with the options structure.
To reproduce the computation, reset the stream, then call the function again.

To understand why this technique gives reproducibility, see How Substreams Enable Reproducible Parallel Computations.

For example, to use the 'mlfg6331_64' stream for reproducible computation:

Create an appropriate options structure:

s = RandStream('mlfg6331_64');
options = statset('UseParallel',true, ...
    'Streams',s,'UseSubstreams',true);

Run your parallel computation. For instructions, see Quick Start Parallel Computing for Statistics and Machine Learning Toolbox.
Reset the random stream:
```
reset(s);
```
Rerun your parallel computation. You obtain identical results.

For examples of parallel computation run this reproducible way, see Reproducible Parallel Bootstrap and Train Classification Ensemble in Parallel.

Parallel Statistical Computation Using Random Numbers

What Are Substreams?

A substream is a portion of a random stream that RandStream can access quickly. There is a number M such that for any positive integer k, RandStream can go to the kMth pseudorandom number in the stream. From that point, RandStream can generate the subsequent entries in the stream. Currently, RandStream has M = 2⁷², about 5e21, or more.

Timeline showing the first three substreams of a sequence. Each substream has length M.

The entries in different substreams have good statistical properties, similar to the properties of entries in a single stream: independence, and lack of k-way correlation at various lags. The substreams are so long that you can view the substreams as being independent streams, as in the following picture.

Three boxes illustrating three consecutive substreams. Inside each box is a sequence of random numbers.

Two RandStream stream types support substreams: 'mlfg6331_64' and 'mrg32k3a'.

How Substreams Enable Reproducible Parallel Computations

When MATLAB^® performs computations in parallel with parfor, each worker receives loop iterations in an unpredictable order. Therefore, you cannot predict which worker gets which iteration, so cannot determine the random numbers associated with each iteration.

Substreams allow MATLAB to tie each iteration to a particular sequence of random numbers. parfor gives each iteration an index. The iteration uses the index as the substream number. Since the random numbers are associated with the iterations, not with the workers, the entire computation is reproducible.

To obtain reproducible results, simply reset the stream, and all the substreams generate identical random numbers when called again. This method succeeds when all the workers use the same stream, and the stream supports substreams. This concludes the discussion of how the procedure in Running Reproducible Parallel Computations gives reproducible parallel results.

Random Numbers on the Client or Workers

A few functions generate random numbers on the client before distributing them to parallel workers. The workers do not use random numbers, so operate purely deterministically. For these functions, you can run a parallel computation reproducibly using any random stream type.

The functions that operate this way include:

To obtain identical results, reset the random stream on the client, or the random stream you pass to the client. For example:

s = rng % Obtain the current state of the random stream
% run the statistical function
rng(s) % Reset the stream to the previous state
% run the statistical function again, obtain identical results

While this method enables you to run reproducibly in parallel, the results can differ from a serial computation. The reason for the difference is parfor loops run in reverse order from for loops. Therefore, a serial computation can generate random numbers in a different order than a parallel computation. For unequivocal reproducibility, use the technique in Running Reproducible Parallel Computations.

Distributing Streams Explicitly

For testing or comparison using particular random number algorithms, you must set the random number generators. How do you set these generators in parallel, or initialize streams on each worker in a particular way? Or you might want to run a computation using a different sequence of random numbers than any other you have run. How can you ensure the sequence you use is statistically independent?

Parallel Statistics and Machine Learning Toolbox functions allow you to set random streams on each worker explicitly. For information on creating multiple streams, enter help RandStream/create at the command line. To create four independent streams using the 'mrg32k3a' generator:

s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true);

Pass these streams to a statistical function using the Streams option. For example:

parpool(4) % if you have at least 4 cores
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true); % create 4 independent streams
paroptions = statset('UseParallel',true,...
    'Streams',s); % set the 4 different streams
x = [randn(700,1); 4 + 2*randn(300,1)];
latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt); 
pdfestimate = myfun(x);
B = bootstrp(200,myfun,x,'Options',paroptions);

This method of distributing streams gives each worker a different stream for the computation. However, it does not allow for a reproducible computation, because the workers perform the 200 bootstraps in an unpredictable order. If you want to perform a reproducible computation, use substreams as described in Running Reproducible Parallel Computations.

If you set the UseSubstreams option to true, then set the Streams option to a single random stream of the type that supports substreams ('mlfg6331_64' or 'mrg32k3a'). This setting gives reproducible computations.