Why do I get MPI_Abort errors when trying to submit a parallel job?

Paul Zhang on 23 May 2014
Commented: Edric Ellis on 23 May 2014
The core of my job submission code is below:
jopt.email_notif = 0;
jopt.toggleleft = left_list(j);
jopt.toggleCausalDir = dir_list(k);
jopt.toggleChoice = choice(l);
jopt.od_number = od_list(i);
jopt.connectivity = 1;
sched = findResource('scheduler', 'configuration', 'NeuroEcon.local')
set(sched,'SubmitArguments', '-l walltime=0:20:00')
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'multiDCMset1.m'})
set(pjob, 'MaximumNumberOfWorkers', 1)
set(pjob, 'MinimumNumberOfWorkers', 1)
t = createTask(pjob, @multiDCMset1, 1, {jopt})
t_all{1,jj}=t; jj=jj+1;
submit(pjob);
---------------------------------------
The following is the error message I get in the job submission log, after the job finishes running. I don't understand the error or what could cause it. I do know that the same script runs fine on another person's computer. Do I need some specific settings to submit parallel jobs?
------------------
Node file: /opt/torque/aux//2075983.neuroecon.caltech.edu
Starting SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -s -phrase MATLAB -port 25983
All SMPDs launched
"/opt/matlab//bin/mw_mpiexec" -phrase MATLAB -port 25983 -l -n 1
-machinefile /opt/torque/aux//2075983.neuroecon.caltech.edu -genvlist
MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG
"/opt/matlab/bin/worker" -parallel
[0]which: no shopt in
(/opt/matlab/bin:/usr/kerberos/bin:/usr/java/latest/bin:/opt/intel/itac/7.1/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/opt/intel/cce/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin)
[0] < M A T L A B (R) >
[0] Copyright 1984-2009 The MathWorks, Inc.
[0] Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
[0] February 12, 2009
[0]
[0] To get started, type one of these: helpwin, helpdesk, or demo.
[0] For product information, visit www.mathworks.com.
[0]
job aborted:
rank: node: exit code[: error message]
0: compute-1-30: -2: application called MPI_Abort(MPI_COMM_WORLD, 42) -
process 0
Stopping SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -shutdown -phrase MATLAB -port
25983
Exiting with code: 42
  1 Comment
Edric Ellis on 23 May 2014
Is there any error in the task of the job? Check using:
pjob.Tasks(1).Error
or even
getReport(pjob.Tasks(1).Error)
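A slightly fuller sketch of the same check, assuming the pjob object from the question is still in the workspace and has already been submitted. It waits for the job to finish and prints the error fields of every task. ErrorIdentifier and ErrorMessage are task properties in the job API of that release; if they come back empty even though the log shows MPI_Abort, that would suggest the worker failed before it reached multiDCMset1, but that is only a guess without seeing the output.

```matlab
% Sketch only: assumes pjob is the parallel job created in the question.
waitForState(pjob, 'finished');          % block until the job reports it has finished

tasks = get(pjob, 'Tasks');              % a parallel job may hold one task per worker
for n = 1:numel(tasks)
    msg = get(tasks(n), 'ErrorMessage'); % empty if the task recorded no error
    if isempty(msg)
        fprintf('Task %d: no error recorded\n', n);
    else
        fprintf('Task %d failed (%s):\n%s\n', n, ...
                get(tasks(n), 'ErrorIdentifier'), msg);
    end
end
```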

Answers (0)