
Why does my job on the cluster stop producing output?

Patrick Laux on 12 Apr 2016
Commented: Simone Stünzi on 9 Jul 2021
Hey, I am using the Parallel Computing Toolbox on a Linux cluster (istan nodes and SLURM scheduler). The main routine (the parfor loop section) looks as follows. A 2-D array (MASKE) is used to select the time series that contain values, and the function core_eQM is applied to these time series: ...
cd $WORKDIR
pc = parcluster('local')
pc.JobStorageLocation = strcat('$WORKDIR/',getenv('SLURM_JOB_ID'))
% start the matlabpool with maximum available workers
% control how many workers by setting ntasks in your sbatch script
matlabpool(pc, getenv('SLURM_CPUS_ON_NODE'))
...
pardim=size(MASKE,2);
XX1=NaN(7305,pardim);
parfor ii=1:pardim
    if ~isnan(MASKE(1,ii))
        fprintf('%i \t \n', ii); %shows me progress of job (creates files, which are empty)
        if meth == 1;
            xx1 = core_eQM(squeeze(VARIABLE_BSE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,3654:7305)))
        elseif meth == 2
            ...
        end
    end
end
I use the following script to submit it to the cluster:
#!/bin/bash
#SBATCH --job-name=imat_par_test
#SBATCH --output=matlab_parfor.out
#SBATCH --error=matlab_parfor.err
#SBATCH --partition=ivy
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=20
source /etc/profile.d/00-modules.sh
module load app/matlab2014b
cd $WORKDIR
# Create a local work directory
mkdir -p $WORKDIR/$SLURM_JOB_ID
#cd $WORKSDIR/$SLURM_JOB_ID
# Kick off matlab
matlab -nodesktop < script_apply_BC.m &
#wait
# Cleanup local work directory
rm -rf $WORKSDIR/$SLURM_JOB_ID
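A note on the script as posted: MATLAB is started in the background with `&` while `wait` is commented out, so the cleanup line is reached immediately, and that line spells the variable `$WORKSDIR` rather than `$WORKDIR`. Whether either point is related to the stall is not established. The following is only a minimal sketch of one way to make sure cleanup runs after the command has finished; `run_with_scratch` is a hypothetical helper name, not part of SLURM or MATLAB.

```shell
#!/bin/bash
# run_with_scratch is a hypothetical helper (not part of SLURM or MATLAB).
# It creates a scratch directory, runs the given command in the foreground,
# and removes the directory only after the command has returned.
run_with_scratch() {
    local scratch="$1"
    shift
    mkdir -p "$scratch" || return 1
    local status=0
    "$@" || status=$?       # foreground: blocks until the command ends
    rm -rf -- "$scratch"    # cleanup happens after the command, never during it
    return "$status"        # pass the command's exit status on to the caller
}
```

In the submission script this could be called as `run_with_scratch "$WORKDIR/$SLURM_JOB_ID" matlab -nodesktop < script_apply_BC.m`, assuming `$WORKDIR` is set in the job environment.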
At the beginning (the first few hours) the job runs fine. The size of pardim is 420. After about 250 of these have been processed, the procedure slows down and finally stops making progress, i.e. the job is still running but no longer produces output files. No problems are reported in the matlab_parfor.err file. I do not know how to analyse the problem in this case.
Any ideas?
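Since the job keeps running without reporting errors, one low-tech way to see when it stalls is to check how long ago the output file was last written. The sketch below assumes GNU `stat` and `date`, as found on typical Linux clusters; `seconds_since_modified` is a hypothetical helper name, and `matlab_parfor.out` is the output file named in the submission script above.

```shell
#!/bin/bash
# seconds_since_modified is a hypothetical helper: it prints how many seconds
# have passed since the given file was last modified (GNU stat assumed).
seconds_since_modified() {
    local file="$1"
    local now mtime
    now=$(date +%s) || return 1                  # current time, seconds since epoch
    mtime=$(stat -c %Y -- "$file") || return 1   # last modification time of the file
    echo $(( now - mtime ))
}
```

For example, `seconds_since_modified matlab_parfor.out` could be run periodically from a login node; a steadily growing value suggests that the job has stopped writing output, although buffered output can make the file look older than the last real progress.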
5 Comments
Patrick Laux on 9 Jul 2021
Unfortunately not, Simone. I just gave up.
If you find out more, I would be happy if you let me know.
Patrick
Simone Stünzi on 9 Jul 2021
I've increased idleTimeout to Inf and will let you know if that solves my issue.
Best, Simone


Answers (0)
