Why is batch() so slow?

Question

1 vote

I'm trying to use batch() to load some data from a slow disk in the background, but it is extremely slow. See code example with timings below. I think it is slower than what can be explained by the overhead of communicating with the worker (consider that I am not even transferring the loaded data from the worker to the client in the example).

>> a = rand(512, 512, 1000);
>> save('a');
>> tic; load('a'); toc
Elapsed time is 5.574926 seconds.
>> tic; b = batch(@load, 1, {'a'}); toc; tic; wait(b); toc;
Elapsed time is 0.444297 seconds.
Elapsed time is 41.229590 seconds.

You can see that the time until the batch job is done is more than 35 s longer than the same operation on the client. This is not because a new Matlab worker has to be started -- in my example, a worker was already running (if no worker were running, the batch(...) command itself would take longer, not the wait(b)).

Where does this overhead come from? How can I avoid it? (I also tried parfeval, but parfeval is plagued by a memory leak that makes it unusable -- confirmed as a known bug by MathWorks).

Thanks, Matthias

2 comments

Matthias on 16 Dec 2014

Edited: Matthias on 16 Dec 2014

Even more bizarrely, if I right-click on the finished job in the Job Monitor and select Show Details, the displayed report indicates that the running duration of the job is 6 seconds. That's the same as the time it took on the client session. What happens in those 35 remaining seconds?

(I got this result on two different machines. Both running 2014b, however.)

Matthias on 16 Dec 2014

Some more data:

>> disp(datestr(now, 'HH:MM:SS:FFF')); ...
b = batch(@batchTest, 1); ...
disp(datestr(now, 'HH:MM:SS:FFF')); ...
wait(b); ...
disp(datestr(now, 'HH:MM:SS:FFF'));
21:18:35:124
21:18:35:934
21:19:17:319
>> diary(b)
--- Start Diary ---
21:18:40:762
21:18:46:237
--- End Diary ---

Function batchTest:

function a = batchTest
disp(datestr(now, 'HH:MM:SS:FFF'));
load('a');
disp(datestr(now, 'HH:MM:SS:FFF'));

This shows that after executing the batch(...) command, ~5 s pass before the worker starts executing batchTest(). The worker is done executing batchTest() after another ~6 s, and hence executes that function just as fast as the client. Then, another >30 s pass before wait(...) returns.

What happens in this time? Maybe the initial 5 s have to do with setting up the environment on the worker. But the 30 s after the job is done?
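
The phase durations quoted above can be recomputed directly from the five timestamps in the transcript. The following is only a sketch of that arithmetic; the variable names and phase labels are my own, and it assumes the client and worker clocks agree (which should hold when both run on the same machine).

```matlab
% Timestamps copied from the transcript above (format HH:MM:SS:FFF).
stamps = {'21:18:35:124', ...   % client: just before batch()
          '21:18:35:934', ...   % client: batch() returned
          '21:18:40:762', ...   % worker diary: batchTest started
          '21:18:46:237', ...   % worker diary: batchTest finished
          '21:19:17:319'};      % client: wait(b) returned

% Convert each stamp to seconds since midnight.
secs = zeros(1, numel(stamps));
for k = 1:numel(stamps)
    p = sscanf(stamps{k}, '%d:%d:%d:%d');   % [hours; minutes; seconds; milliseconds]
    secs(k) = p(1)*3600 + p(2)*60 + p(3) + p(4)/1000;
end

% Differences between consecutive stamps give the duration of each phase.
phases = diff(secs);
labels = {'batch() call', 'delay before task starts', ...
          'task runtime', 'delay after task ends'};   % labels are illustrative
for k = 1:numel(phases)
    fprintf('%-26s %7.3f s\n', labels{k}, phases(k));
end
```

This gives roughly 0.8 s for the batch() call, 4.8 s before the task starts, 5.5 s of task runtime, and 31.1 s between the end of the task and the return of wait(b).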


Accepted Answer

Edric Ellis on 16 Dec 2014

3 votes

Firstly, if you're using the local cluster type, then the batch command absolutely does need to launch the worker MATLAB process - it is not already running - you can verify this using Task Manager or similar. (Clusters of type MJS keep the workers running). The time for the batch command is simply the time needed to create the parallel.Job and parallel.Task objects needed for running the batch job, and saving those to disk.

Roughly speaking, the total time to submit the job and wait for its results can be broken down like this:

1. Time taken to create and submit the batch job to the scheduler
2. Time taken to launch the worker process (unless you're using MJS)
3. Time taken for the worker to load the job and task information
4. Time for the worker to actually run the task
5. Time for the worker to save the task results to disk (or database for MJS)

I suspect that the "missing" time is probably largely related to item 5 in the list above - as you've written it, the 512x512x1000 array is returned by your task function @load, and this result gets saved to disk.

How long does your save('a') command take? I suspect item 5 would take at least that long.

Note that there are several additional properties on the job object that can help you work out what's going on - see the reference page. In particular, note CreateTime, SubmitTime, StartTime, and FinishTime. The underlying task object has the same properties (except SubmitTime).
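
As a rough sketch of how those properties could be inspected: the snippet below submits a small stand-in task (@rand, chosen only so it finishes quickly) to the default cluster profile and prints the timestamps named above. It assumes Parallel Computing Toolbox is available and uses the property names exactly as cited in this answer; newer releases may expose these under different names or types, so check the reference page for your version.

```matlab
% Submit a small stand-in task to the default cluster profile and wait for it.
b = batch(@rand, 1, {100});   % @rand is only a placeholder workload
wait(b);

% Job-level timestamps (property names as cited in the answer above).
disp('Job:');
disp(b.CreateTime);
disp(b.SubmitTime);
disp(b.StartTime);
disp(b.FinishTime);

% The underlying task has the same properties, except SubmitTime.
t = b.Tasks(1);
disp('Task:');
disp(t.CreateTime);
disp(t.StartTime);
disp(t.FinishTime);

% Remove the job's data from the cluster's storage once inspected.
delete(b);
```

Comparing the job's FinishTime with the task's FinishTime and with the moment wait(b) returns on the client may help narrow down which phase accounts for the extra time.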

10 comments (8 older comments not shown)

Edric Ellis on 16 Dec 2014

The function running within a batch job can invoke parfeval, parfor, etc., provided you start your batch job with an appropriate 'Pool' argument. parfeval, parfor, etc. never use the disk for communication.

I suspect there is some confusion here between having an open parallel pool of workers available, and running a batch job. When you open a parallel pool (either manually, or using parpool), then workers are launched and remain idle until you issue parfeval, parfor etc. When you launch a batch job, new workers are launched - these will always be new MATLAB worker processes (unless you're using MJS, in which case the worker processes might be recycled). The lead worker running a batch job has a pool available to it if you specify the 'Pool' argument.

To be perfectly honest, using batch with the local cluster type is of limited benefit since the workers are only able to run while the desktop MATLAB is running. You'd almost certainly be better off using parfeval. Here are some timings using that:

>> tic, f = parfeval(@load, 1, 'a'); toc
Elapsed time is 0.005802 seconds.
>> tic, wait(f); toc
Elapsed time is 10.865625 seconds.
>> tic, a = f.OutputArguments{1}; toc
Elapsed time is 1.872169 seconds.

(On my machine, loading a.mat takes about 10 seconds.) Note that it still takes ~2 seconds to read the outputs: the result of the load command is held in memory, but in a transferable form (because it has been transferred from the workers), so those 2 seconds are overhead that you cannot avoid.

It would be much better if you could load 'a.mat' on a worker and operate on it there too. Here's a slightly contrived example using getfield so that I can write everything in one expression:

>> tic; f= parfeval(@() mean(getfield(load('a'), 'a')), 1); toc
Elapsed time is 0.005312 seconds.
>> tic; wait(f); toc
Elapsed time is 10.863416 seconds.
>> tic; f.OutputArguments; toc
Elapsed time is 0.006360 seconds.

Matthias on 16 Dec 2014

Edited: Matthias on 16 Dec 2014

The bugfix removes the memory leak! Thanks a lot!

However, loading in the background with parfeval still doesn't work as intended: Parfeval may not block the client Matlab instance, but it apparently does block other parallel functions. See this example:

fprintf('Start: %s\n', datestr(now, 'HH:MM:SS:FFF'));
f = parfeval(@pause, 0, 10);
fprintf('Outside parfor: %s\n', datestr(now, 'HH:MM:SS:FFF'));
parfor i = 1:10
    fprintf('Inside parfor: %s\n', datestr(now, 'HH:MM:SS:FFF'));
end
wait(f);
fprintf('End: %s\n', datestr(now, 'HH:MM:SS:FFF'));

Output:

Start: 14:50:45:204
Outside parfor: 14:50:45:219
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
End: 14:50:55:328

The timings suggest that the execution works like this: 1. Parfeval sends jobs to one worker. 2. Parfor waits until all workers are available. 3. Parfor executes.

I had hoped that it would be more like this: 1. Parfeval sends job to one worker; then continues execution in main Matlab instance. 2. Parfor runs on whichever workers are available; parfeval continues to run on one worker until done.

Is the behavior I'm observing intended? Maybe I just didn't properly understand the way the parallel toolbox worked...right now, it seems frustratingly inflexible.

Edric Ellis on 17 Dec 2014

Unfortunately, as you observe, PARFOR cannot proceed while there are outstanding PARFEVAL requests (the same applies for SPMD). Your best option in this case is to recast your PARFOR loop as a series of PARFEVAL requests.
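
A minimal sketch of that recasting, applied to the example from the previous comment: each PARFOR iteration becomes its own PARFEVAL request, and fetchNext collects results as they complete. It assumes a parallel pool with at least two workers is already open; the variable names are illustrative, and the exact scheduling you see may differ between releases.

```matlab
% Assumes a parallel pool with at least two workers is already open.
f = parfeval(@pause, 0, 10);          % long-running background request

n = 10;
g(1:n) = parallel.FevalFuture;        % preallocate the array of futures
for i = 1:n
    % Each former parfor iteration is submitted as a separate request.
    g(i) = parfeval(@() datestr(now, 'HH:MM:SS:FFF'), 1);
end

% Collect the results in completion order.
stamps = cell(1, n);
for k = 1:n
    [idx, stamp] = fetchNext(g);      % blocks until some unread future finishes
    stamps{idx} = stamp;
    fprintf('Iteration %d ran at: %s\n', idx, stamp);
end

wait(f);                              % finally wait for the background request
```

Because the iterations are ordinary queued requests rather than a PARFOR loop, they should be able to run on idle workers while the pause request is still occupying one worker.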
