Why do I get MPI_Abort errors when trying to submit a parallel job?
The core of my job submission code is below:
jopt.email_notif = 0;
jopt.toggleleft = left_list(j);
jopt.toggleCausalDir = dir_list(k);
jopt.toggleChoice = choice(l);
jopt.od_number = od_list(i);
jopt.connectivity = 1;
sched = findResource('scheduler', 'configuration', 'NeuroEcon.local')
set(sched,'SubmitArguments', '-l walltime=0:20:00')
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'multiDCMset1.m'})
set(pjob, 'MaximumNumberOfWorkers', 1)
set(pjob, 'MinimumNumberOfWorkers', 1)
t = createTask(pjob, @multiDCMset1, 1, {jopt})
t_all{1,jj}=t; jj=jj+1;
submit(pjob);
---------------------------------------
The following is the error message I get in the job submission log, after the job finishes running. I don't understand the error or what could cause it. I do know that the same script runs fine on another person's computer. Do I need some specific settings to submit parallel jobs?
------------------
Node file: /opt/torque/aux//2075983.neuroecon.caltech.edu
Starting SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -s -phrase MATLAB -port 25983
All SMPDs launched
"/opt/matlab//bin/mw_mpiexec" -phrase MATLAB -port 25983 -l -n 1
-machinefile /opt/torque/aux//2075983.neuroecon.caltech.edu -genvlist
MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG
"/opt/matlab/bin/worker" -parallel
[0]which: no shopt in
(/opt/matlab/bin:/usr/kerberos/bin:/usr/java/latest/bin:/opt/intel/itac/7.1/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/opt/intel/cce/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin)
[0] < M A T L A B (R) >
[0] Copyright 1984-2009 The MathWorks, Inc.
[0] Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
[0] February 12, 2009
[0]
[0] To get started, type one of these: helpwin, helpdesk, or demo.
[0] For product information, visit www.mathworks.com.
[0]
job aborted:
rank: node: exit code[: error message]
0: compute-1-30: -2: application called MPI_Abort(MPI_COMM_WORLD, 42) - process 0
Stopping SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -shutdown -phrase MATLAB -port 25983
Exiting with code: 42
1 Comment
Edric Ellis
on 23 May 2014
Is there any error in the task of the job? Check using:
pjob.Tasks(1).Error
or even
getReport(pjob.Tasks(1).Error)
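To make that check concrete, here is a minimal sketch. It assumes pjob is still the job object from the submission code above and that the job has already been submitted. It waits for the job to finish and then prints whatever error each task recorded. The property names ErrorIdentifier and ErrorMessage are assumptions based on the older task API; if your release provides the Error property instead, the getReport call above is the equivalent check.

```matlab
% Sketch only: pjob is assumed to be the job object created and submitted above.
% ErrorIdentifier / ErrorMessage are assumed to be available on task objects
% in this release; verify against your release's documentation.
waitForState(pjob, 'finished');      % block until the scheduler reports the job done

tasks = get(pjob, 'Tasks');          % all tasks belonging to the job
for idx = 1:numel(tasks)
    msg = get(tasks(idx), 'ErrorMessage');
    if isempty(msg)
        fprintf('Task %d: no error recorded\n', idx);
    else
        errId = get(tasks(idx), 'ErrorIdentifier');
        fprintf('Task %d failed (%s): %s\n', idx, errId, msg);
    end
end
```

If a task reports an error here, that message is usually more informative than the MPI_Abort line in the scheduler log, since it comes from the task itself rather than from the MPI launcher.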
Answers (0)