Further investigation shows that a number of workers crash even after modifying the particleswarm parfor per the instructions at https://uk.mathworks.com/help/simulink/ug/not-recommended-using-sim-function-within-parfor.html#brsk7nj. I am now looking for a way to restart workers, or to cancel and restart jobs on workers.
Parallel optimization hanging on getCompleteIntervals
I'm using a Cloud Center cluster with parpool, and the optimization runs until it suddenly hangs. The code does not always hang, but it does about 9 times out of 10. I suspected a deadlock, but I have made sure each worker has the files it requires. After it hangs I can exit with Ctrl-C, but I have to restart the server to get the optimization running again; otherwise it hangs waiting for the pool to be ready.
Initialization code:
c = parpool('AttachedFiles',{'OptimiseModel.m','decreasing_amplitude_01.mat','ArmModelV2.slx','MapData.m','sim_model_test.m','slprj'});
mpiSettings('DeadlockDetection','on')
mpiSettings('MessageLogging','on')
mpiSettings('MessageLoggingDestination','CommandWindow')
My objective function, OptimiseModel, runs a Simulink model with values passed in from the particleswarm algorithm:
if init == true
    simIn = MapData;           % build the simulation input once
    init = false;
end
simOut = sim(simIn);           % run the Simulink model
RMSE = simOut.get('rmse');     % objective value returned to particleswarm
Each worker has its own copy of simIn, and the init logic is a hack to allow the function to be evaluated once by the client instance at the beginning of the particleswarm algorithm. (I don't know why that happens.)
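For reference, a per-worker copy can also be kept with a persistent variable instead of threading an init flag through the call. This is only a sketch of mine, not the code from the question: it assumes MapData returns a Simulink.SimulationInput and that the tuned values reach the model through a placeholder variable name.

function RMSE = OptimiseModelSketch(params)
% Hypothetical variant: each MATLAB process (client or worker) builds its
% SimulationInput once and reuses it on later calls.
persistent simIn
if isempty(simIn)
    simIn = MapData();                      % one-time setup per process
end
simIn = simIn.setVariable('p', params);     % 'p' is a placeholder variable name
simOut = sim(simIn);
RMSE = simOut.get('rmse');
end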
spmd
    % configure the model once on every worker
    model = load_system('ArmModelV2');
    set_param(model, 'SimulationCommand', 'stop')
    set_param(model, 'FastRestart', 'on');
    set_param(model, 'SimulationMode', 'Accelerator');
    set_param(model, 'AccelVerboseBuild', 'on')
    simIn = MapData();
end
~~~~~~~~~~~~
fun = @(x)OptimiseModel(init,MCV_B,x(1),x(2),x(3),x(4),x(5),VMO_B,x(6),x(7),x(8),MCV_T,x(9),x(10),x(11), ...
x(12),x(13),VMO_T,x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23),x(24),x(25),x(26),x(27));
options = optimoptions('particleswarm','UseParallel',true,'UseVectorized',false,'PlotFcn','pswplotbestf');
[x,rmse_best] = particleswarm(fun,27,lb,ub,options);
Everything looks good until, out of nowhere, the workers stop running the objective function and the code hangs here, in the source of remoteparfor:
while isempty(r)
assert(obj.NumIntervalsInController > 0, ...
'Internal error in PARFOR - no intervals to retrieve.');
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
WHY? Can anybody help me? I can provide more of the code if required (I didn't include all of it, as most of it is irrelevant, at least I thought so). Any suggestions on further debugging strategies would also be great.
Thanks a lot!
EDIT: The code works in serial.
Answers (1)
Edric Ellis
on 31 Mar 2020
A few notes:
- The deadlock detection is for labSend and labReceive. Your parallel code is using parfor. There is no way that parfor can encounter a cyclic deadlock, because the workers operating on the body of the loop do not communicate with each other (except possibly via the file system). (When writing labSend and labReceive code inside spmd, you can create a cyclic deadlock, and that is what the deadlock detection setting helps you discover.)
- Your mpiSettings calls should be run on the workers, i.e. inside an spmd block; see the sketch after this list. (But see the first point: I don't think they're relevant here.)
- The method getCompleteIntervals is a completely normal part of parfor operation: this is where the client waits for the workers to return their results. The only thing you can deduce from the client waiting at that point is that the workers haven't finished their parfor loop iterations yet.
- I am suspicious of your use of Accelerator simulation mode. I'm not an expert, but I think this might cause the workers to interfere with one another via the file system.
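For the second point, running the settings on the workers would look like this (a sketch, assuming the pool is already open):

spmd
    % run on every worker rather than on the client
    mpiSettings('DeadlockDetection', 'on')
    mpiSettings('MessageLogging', 'on')
    mpiSettings('MessageLoggingDestination', 'CommandWindow')
end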
Here's what I would try: run with a parallel pool of size 1. If that fixes things, then the workers are probably interfering with one another via the file system.
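For example, deleting any existing pool first:

delete(gcp('nocreate'));   % shut down the current pool, if one exists
parpool(1);                % a single worker cannot interfere with other workers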
You could force the workers to temporarily change into a unique directory before running the simulations by doing something like this:
% force the workers into a unique directory
spmd
    myTempDir = tempname(); % tempname returns a globally unique name
    oldWd = pwd();
    mkdir(myTempDir);
    cd(myTempDir);
end
% ... run stuff in parfor
particleswarm();
% Put the workers back into the original working directory
spmd
    cd(oldWd);
end
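A related idea, which is my addition rather than part of the original answer: the artifacts that Accelerator mode writes (the slprj directory and the generated MEX target) go into the Simulink cache folder, so redirecting each worker's cache to its own directory might give the same isolation without changing the working directory. A sketch, assuming the shared cache folder is the source of the contention:

spmd
    workerCache = tempname();    % globally unique per worker
    mkdir(workerCache);
    % point this worker's simulation cache and code-gen output elsewhere
    Simulink.fileGenControl('set', ...
        'CacheFolder', workerCache, ...
        'CodeGenFolder', workerCache);
end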
But that's a complete stab in the dark without having reproduction steps that I can try out.
5 Comments
Jinsu Kim
on 31 May 2021
I also encountered the same problem. GA optimization with parallel computing (using parfor) got stuck at the lines below:
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
Is your problem resolved now?