Why do workers keep aborting during parallel computation on a cluster?

Muh Alam

7 Dec. 2020

Updated 8 Feb. 2021

I keep getting the warning

Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)

when I run a simulation that has a parfor loop on the cluster. I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC.

I would like to know the reason for this issue. Is there a way to avoid it?

Thanks.

19 comments (17 older comments not shown)

Kojiro Saito on 7 Feb. 2021

A heterogeneous cluster could be the cause. The linked page lists the system requirements for MATLAB Parallel Server rather than Parallel Computing Toolbox, but it makes an important point:

"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."

Muh Alam on 8 Feb. 2021

Interesting point! I think this is the reason in my case. Thank you @Kojiro Saito

Accepted Answer

Kojiro Saito on 8 Feb. 2021

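A heterogeneous environment could be the cause of this issue. The linked page lists the system requirements for MATLAB Parallel Server rather than Parallel Computing Toolbox, but it makes an important point: constructs that run on a parpool, such as parfor, cannot be used on a heterogeneous cluster configuration.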

2 comments

Muh Alam on 8 Feb. 2021

I forgot to ask: are there ways to work around this issue? Would choosing compute nodes on the cluster that have the same architecture suffice? If so, how can I do that using Slurm?

Kojiro Saito on 8 Feb. 2021

If you know the names of the nodes that are homogeneous, you can specify them with sbatch. For example, if node0 to node4 run the same OS, you can use the --nodelist option (or -w).

sbatch --nodelist node[0-4] yourscript.sh
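
If you are not sure which nodes match, something like the following might help you find out. This is only a sketch: it assumes sinfo is available on your login node and that the cluster administrators have set node features that reflect the hardware or OS, which is not guaranteed. group_nodes_by_feature is a hypothetical helper name, and the node and feature names in the example are made up.

```shell
#!/bin/sh
# Sketch: group node names by their Slurm feature string.
# Expects lines of "<nodename> <features>" on stdin, the shape produced by
#   sinfo -N -h -o "%N %f" | sort -u
# Which features exist, and what they mean, depends on the cluster's configuration.
group_nodes_by_feature() {
  awk '{
         # Append each node name to the comma-separated list for its feature string.
         nodes[$2] = (nodes[$2] == "" ? $1 : nodes[$2] "," $1)
       }
       END { for (f in nodes) print f ": " nodes[f] }' | sort
}

# Example with made-up node names and features (not real cluster output):
printf 'node0 skylake\nnode1 skylake\nnode2 epyc\n' | group_nodes_by_feature
```

You could then pass one group's comma-separated list to sbatch with --nodelist. If a suitable feature is defined on your cluster, sbatch --constraint=<feature> yourscript.sh may be a simpler way to request matching nodes without naming them.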
