Potential bug with parfeval; cumulative slowing down after several hours of operation. Can even exceed 10x the initial compute time.


Pavel Sinha on 15 Jul 2018


Commented: Pavel Sinha on 12 Sep 2018

There seems to be a bug when using parfeval in the Parallel Computing Toolbox. After hours of running, the compute time of the parallel tasks starts increasing.

I tried running everything in serial mode, and the compute time stays about the same even after hours of operation.

I have monitored memory usage, and it doesn't increase with time.

I tried monitoring the parallel compute time, keeping track of the lowest and highest compute times. Once the difference between them exceeds a certain threshold (20%), I manually perform the following:

delete(gcp('nocreate'));    
POOL=parpool('local', NO_PAR_POOLS);

Resetting the parallel pool seems to bring the parallel compute time back to the expected value.

Here is the pseudocode:

%%%%%%%%%%%%%%%%%%%%%%%%
tic_sum1_high = 0;  % slowest mini-batch time since the last pool reset (seconds)
tic_sum1_low  = 0;  % fastest mini-batch time since the last pool reset (0 = not set yet)
for mini_batch_no = 1:NO_OF_MINI_BATCHES
    t_start = tic;  % use a timer handle so tic/toc calls inside the tasks cannot interfere
    % Launch (N-1) parallel asynchronous jobs
    job{1} = parfeval(POOL, @read_dataset_from_hdd, 1, mini_batch_no, CONST_DATA, 1);
    job{2} = parfeval(POOL, @compute_cpu_task, 1, BUFF_DATA_TRAIN.batch_file_read(:,:,:,:,set_cpu_num), CONST_DATA);
    job{3} = parfeval(POOL, @compute_cpu_dwt_task, 1, BUFF_DATA_TRAIN.batch_file_process_1(:,:,:,:,set_cpu_dwt_num), CONST_DATA.IM_RESIZE, CONST_DATA);
    % Perform the Nth parallel job on the host
    compute_gpu_task({BUFF_DATA_TRAIN.batch_file_process_2(:,:,:,:,set_gpu_num), BUFF_DATA_TRAIN.batch_file_label_read(:,:,:,:,set_gpu_num)});
    % Collect results from the parallel jobs (fetchOutputs blocks until each job finishes)
    result{1} = fetchOutputs(job{1});
    result{2} = fetchOutputs(job{2});
    result{3} = fetchOutputs(job{3});
    tic_sum1 = toc(t_start);  % wall-clock time of this mini-batch
    % Track the slowest and fastest mini-batch times
    tic_sum1_high = max(tic_sum1_high, tic_sum1);
    if tic_sum1_low == 0 || tic_sum1 < tic_sum1_low
        tic_sum1_low = tic_sum1;
    end
    % Reset the parallel pool if the high-low difference exceeds the threshold percentage
    if (100*(tic_sum1_high - tic_sum1_low)/tic_sum1_low) > CPU_ALLOWABLE_COMPUTE_TIME_LOW_VS_HIGH_DIFF_PERCENTAGE
        delete(gcp('nocreate'));
        POOL = parpool('local', NO_PAR_POOLS);
        tic_sum1_high = 0;
        tic_sum1_low = 0;
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%

The iterations run for 5-6 days, and after the first day the time per iteration exceeds 10x the initial time.
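The loop above measures only the total wall-clock time per mini-batch, so it cannot tell whether a slow iteration comes from a task waiting in the queue or from the task itself running longer. A minimal sketch of how the two could be separated is below. It assumes a MATLAB release in which futures expose the `CreateDateTime`, `StartDateTime` and `FinishDateTime` properties; `future_timings` is a hypothetical helper name, not something from the original post.

```matlab
function [queue_delay, run_time] = future_timings(f)
% FUTURE_TIMINGS  Split the lifetime of a finished future into two parts.
%   Hypothetical helper for illustration only.
%   queue_delay: seconds between creating the future and a worker starting it
%   run_time:    seconds the worker spent executing the function
%   Assumes f has already finished (e.g. after wait(f) or fetchOutputs(f)).
queue_delay = seconds(f.StartDateTime  - f.CreateDateTime);
run_time    = seconds(f.FinishDateTime - f.StartDateTime);
end
```

Called as `[q, r] = future_timings(job{1});` after the corresponding `fetchOutputs`, and logged per mini-batch, this would show whether `q` or `r` is the part that grows over time.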

6 Comments (4 older comments hidden)

Pavel Sinha on 23 Jul 2018

Hi,

Thank you for your reply. Here are my thoughts:

1) It cannot be a race condition, because each of the async tasks operates on independent buffers. All reads and writes happen within those independent buffers during execution of the tasks. There is no communication or dependency between the tasks; they are completely independent. Each time, all async tasks are synced and the data is collected before the next launch of the parallel threads.

2) The above point also ensures that there is no deadlock: there is a sync before each launch of the parallel jobs, and no interaction between the async threads during execution.
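The launch, sync, collect pattern described in these two points can be sketched as follows. This is only an illustration: `@rand` and `@magic` are stand-ins for the real tasks, and the explicit `wait` call is an assumption about how the sync could be written (in the posted loop, the blocking `fetchOutputs` calls play that role).

```matlab
% Stand-in tasks; the real code would launch its own functions here.
pool = gcp;                                   % current pool, or start a default one
futures(1:2) = parallel.FevalFuture;          % preallocate the array of futures
futures(1) = parfeval(pool, @rand,  1, 3);    % independent task 1
futures(2) = parfeval(pool, @magic, 1, 4);    % independent task 2

wait(futures);                                % barrier: returns once every task has finished

% Collect the results only after the barrier, before the next launch.
results = cell(1, numel(futures));
for k = 1:numel(futures)
    results{k} = fetchOutputs(futures(k));
end
```

Because nothing new is submitted until `wait` returns, no task from one iteration can overlap with a task from the next.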

Pavel Sinha on 12 Sep 2018

I found that initializing MATLAB with just 1 parallel worker in the settings and then activating the required number of parallel workers from code fixes this issue to a great extent. That is, it now takes much longer before the processing loops take significantly longer to compute than the initial ones. But eventually, once the processing loop time exceeds the initial compute time by 20%, I still have to reset the parallel pool and re-assign the required number of parallel workers.
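A minimal sketch of what activating the required number of workers from code might look like is below. `ensure_pool` is a hypothetical helper name, and the `'local'` profile is taken from the calls in the question; newer releases may use a different name for the default local profile.

```matlab
function pool = ensure_pool(num_workers)
% ENSURE_POOL  Return a local pool with exactly num_workers workers.
%   Hypothetical helper for illustration only.
pool = gcp('nocreate');                 % current pool, or empty if none is open
if ~isempty(pool) && pool.NumWorkers ~= num_workers
    delete(pool);                       % shut down a pool of the wrong size
    pool = [];
end
if isempty(pool)
    pool = parpool('local', num_workers);   % 'local' profile as used in the question
end
end
```

With this helper, the code would call `POOL = ensure_pool(NO_PAR_POOLS);` once before the loop. A forced reset would still be `delete(gcp('nocreate'))` followed by the same call.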
