Parallel optimization hanging on getCompleteIntervals

26 views (last 30 days)
Samuel Nathan
Samuel Nathan on 30 Mar 2020
Commented: 宇龙 on 19 Nov 2022
I'm using a Cloud Center cluster with parpool, and the optimization runs until it suddenly hangs. The code does not always hang, but it does about 9 times out of 10. I suspected a deadlock, but I have made sure each worker has the files it requires. After it hangs I can exit with Ctrl-C, but I have to restart the server to get the optimization running again, otherwise it hangs waiting for the pool to be ready.
Init code
c = parpool('AttachedFiles',{'OptimiseModel.m','decreasing_amplitude_01.mat','ArmModelV2.slx','MapData.m','sim_model_test.m','slprj'});
mpiSettings('DeadlockDetection','on')
mpiSettings('MessageLogging','on')
mpiSettings('MessageLoggingDestination','CommandWindow')
My objective function OptimiseModel runs a Simulink model with values passed in from the particle swarm algorithm:
if init == true
    simIn = MapData;
    init = false;
end
simOut = sim(simIn);
RMSE = simOut.get('rmse');
Each worker has its own copy of simIn, and the init logic is a hack to allow the function to be evaluated by the client instance, which happens once at the beginning of the particleswarm algorithm. (I don't know why.)
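For context, a stripped-down skeleton of the objective function (simplified; here a persistent variable stands in for how each worker holds on to its simIn):
function RMSE = OptimiseModel(init, varargin)
    persistent simIn                  % one copy per MATLAB process (client or worker)
    if init || isempty(simIn)
        simIn = MapData();            % build the simulation input once
    end
    % ... map the particle swarm parameters in varargin onto simIn here ...
    simOut = sim(simIn);              % run the Simulink model
    RMSE = simOut.get('rmse');        % logged RMSE used as the objective value
end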
spmd
    model = load_system('ArmModelV2');
    set_param(model, 'SimulationCommand', 'stop')
    set_param(model, 'FastRestart', 'on');
    set_param(model, 'SimulationMode', 'Accelerator');
    set_param(model, 'AccelVerboseBuild', 'on')
    simIn = MapData();
end
~~~~~~~~~~~~
fun = @(x)OptimiseModel(init,MCV_B,x(1),x(2),x(3),x(4),x(5),VMO_B,x(6),x(7),x(8),MCV_T,x(9),x(10),x(11), ...
x(12),x(13),VMO_T,x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23),x(24),x(25),x(26),x(27));
options = optimoptions('particleswarm','UseParallel',true,'UseVectorized',false,'PlotFcn','pswplotbestf');
[x,rmse_best] = particleswarm(fun,27,lb,ub,options);
All looks good until, out of nowhere, the workers stop running the objective function and the code hangs here, which is part of the source for remoteparfor:
while isempty(r)
assert(obj.NumIntervalsInController > 0, ...
'Internal error in PARFOR - no intervals to retrieve.');
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
WHY? Can anybody help me? I can provide more of the code if required (I didn't include all of it as most is irrelevant - at least I thought so). Any suggestions on further debugging strategies would also be great.
Thanks a lot!
EDIT: The code works in serial.
  1 Comment
Samuel Nathan
Samuel Nathan on 1 Apr 2020
Further investigation is showing that a number of workers are crashing even when modifying the particleswarm parfor with the instructions from https://uk.mathworks.com/help/simulink/ug/not-recommended-using-sim-function-within-parfor.html#brsk7nj. I'm now looking at a way to restart workers / cancel and restart jobs on workers.
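Roughly what I have in mind for restarting (a sketch, assuming the default profile; the attached-files list is abbreviated):
pool = gcp('nocreate');
if ~isempty(pool) && ~pool.Connected
    delete(pool);   % remove the crashed/disconnected pool
end
if isempty(gcp('nocreate'))
    parpool('AttachedFiles', {'OptimiseModel.m', 'ArmModelV2.slx', 'MapData.m'});
end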


Answers (1)

Edric Ellis
Edric Ellis on 31 Mar 2020
A few notes:
  1. The deadlock detection is for labSend and labReceive. Your parallel code is using parfor. There is no way that parfor can encounter a cyclic deadlock because the workers operating on the body of the loop do not communicate with each other (except possibly via the file system). (When writing labSend and labReceive code inside spmd, you can write a cyclic deadlock, and that's what the deadlock detection setting can help you discover).
  2. Your mpiSettings calls should be run on the workers - i.e. inside an spmd block (see the short sketch after this list). But see point (1) - I don't think they're relevant here.
  3. The method getCompleteIntervals is a completely normal part of parfor operation - this is where the client waits for the workers to return their results. The only thing you can deduce from the client waiting at that point is that the workers haven't finished their parfor loop iterations yet.
  4. I am suspicious of your use of accelerated simulation mode. I'm not an expert, but I think that this might possibly cause the workers to interfere with one another via the filesystem.
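For point 2, that would look something like this - the same three calls from the question, just wrapped in spmd so they run on each worker:
spmd
    mpiSettings('DeadlockDetection', 'on')
    mpiSettings('MessageLogging', 'on')
    mpiSettings('MessageLoggingDestination', 'CommandWindow')
end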
Here's what I would try: try running with a parallel pool of size 1. If that fixes things, then perhaps the workers are interfering with one another via the file system.
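For example:
delete(gcp('nocreate'));   % shut down any existing pool (no-op if there isn't one)
parpool(1);                % a single worker rules out worker-to-worker interference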
You could force the workers to temporarily change to a unique directory prior to running the simulations by doing something like this:
% force the workers into a unique directory
spmd
    myTempDir = tempname(); % tempname returns a globally unique name
    oldWd = pwd();
    mkdir(myTempDir);
    cd(myTempDir);
end
% ... run stuff in parfor
particleswarm();
% Put the workers back into the original working directory
spmd
    cd(oldWd);
end
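As an optional follow-up, you could also remove the scratch directories again once the workers are back in their original folder, for example:
spmd
    rmdir(myTempDir, 's'); % delete each worker's temporary directory and its contents
end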
But that's a complete stab in the dark without having reproduction steps that I can try out.
  5 Comments
Jinsu Kim
Jinsu Kim on 31 May 2021
I also encountered the same problem. GA optimization with parallel computing (using parfor) got stuck at the lines below:
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
Is your problem resolved now?
宇龙
宇龙 on 19 Nov 2022
I think you could try using fewer logical processors.
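For example (a rough sketch; the worker count is only an illustration):
c = parcluster('local');
nWorkers = min(c.NumWorkers, 4);   % 4 is just an example value
parpool(c, nWorkers);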

