Avoid repetition in job diary when running code in parallel

L. Borealis on 17 Aug 2020
Edited: L. Borealis on 19 Aug 2020
Hi,
I am using the Parallel Computing Toolbox to run code that was developed on a Mac and runs on a Unix cluster. I am on Windows 10 and want to set up the code on my machine so that it runs on the Unix cluster.
If I use the diary function after submitting a job, every 'someText' in
disp('someText')
shows up 9 or 10 times, depending on how many CPUs the job runs on. Please let me know if this is not the expected behavior. For example, I get:
count
starting the function parallel run XXX start2end!
count
count
count
count
starting the function parallel run XXX start2end!
starting the function parallel run XXX start2end!
starting the function parallel run XXX start2end!
starting the function parallel run XXX start2end!
count
starting the function parallel run XXX start2end!
count
starting the function parallel run XXX start2end!
count
starting the function parallel run XXX start2end!
Warning: Table variable names were modified to make them valid MATLAB identifiers. The original names are saved in the VariableDescriptions property.
The output above is from running a job over 10 CPUs.
From coding in Fortran I know that it's possible to only display the output from a CPU of a certain rank rather than all (e.g. saying "if rank = 0" then print). Is there such a setting in MATLAB to enhance readability?
I am grateful for any advice on how to deal with this as I could not find a solution online.
Thank you and best wishes,
Linnéa

Answers (2)

Raymond Norris on 18 Aug 2020
Hi Linnéa,
The batch command with a Pool argument is a wrapper around createCommunicatingJob of type Pool (not SPMD). That simply means that parS needs to contain a call to spmd somewhere in order to reference labindex (i.e. rank).
If you're able to post parS, we might be able to provide more guidance.
Raymond
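As a rough illustration of that point, the sketch below prints from a single worker inside an spmd block. It is not the poster's parS, which was not shared; the variable names and the placeholder work are assumptions, and it presumes that the Parallel Computing Toolbox is installed and a pool can be opened. It uses labindex and numlabs, the names available in R2019a.

```matlab
% Sketch only: assumes Parallel Computing Toolbox and an available pool.
spmd
    if labindex == 1
        % Only the worker with rank 1 prints (MATLAB ranks start at 1, not 0).
        fprintf('starting the parallel run on %d workers\n', numlabs);
    end
    partialSum = labindex;        % placeholder for the real per-worker work
    total = gplus(partialSum);    % sum across workers, returned on every worker
    nWorkers = numlabs;
end
totalOnClient = total{1};         % Composite: take the value held by worker 1
n = nWorkers{1};
```

Outside the spmd block, total and nWorkers are Composite objects, which is why they are indexed with braces.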
1 comment
L. Borealis on 18 Aug 2020
Actually, it is parfor and not spmd that we use. Sorry for the mixup! I am new to parallel computing and this toolbox. Thanks a lot! However, the code seems to now be submitted to a worker from the beginning rather than after parfor as I described above... This was not the case when the code was last run just under a year ago.
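Since workers in a parfor loop do not expose distinct ranks, one possible workaround is to return the messages from the loop and print them once afterwards. This is only a sketch under that assumption: the loop body, the message text, and the variable names are hypothetical and not taken from the poster's code.

```matlab
% Sketch only: collect per-iteration messages instead of calling disp
% inside the loop, then print them once after the loop has finished.
n = 4;                         % hypothetical number of iterations
msgs = cell(1, n);             % sliced output variable, one entry per iteration
parfor k = 1:n
    msgs{k} = sprintf('iteration %d done', k);
end
fprintf('%s\n', msgs{:});      % printed once, in iteration order
```

This does not change where the code before the parfor loop runs; it only keeps the loop's own output from being interleaved.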



Raymond Norris on 17 Aug 2020
Hi Linnéa,
It would help to see more of your code. A parfor loop runs on a pool of workers that all think they are rank=0. An spmd block, however, uses MPI to communicate between the workers, so each one is assigned a different rank. Let me give you two quick examples (again, without knowing what your code looks like) of calculating pi.
Submitting a pool job that calls spmd. The top-level task then spawns a pool of workers (one fewer than requested) to run the block.
c = parcluster('local');
j = c.createCommunicatingJob('NumWorkersRange',3, 'Type','pool');
j.createTask(@calcpi_spmd_block,0,{},'CaptureDiary',true);
j.submit
j.wait
j.Tasks(1).Diary

function calcpi_spmd_block
    spmd
        a = (labindex - 1)/numlabs;
        b = labindex/numlabs;
        fprintf('Subinterval: [%-4g, %-4g]\n', a, b);
        myIntegral = integral(@iQuadPi, a, b);
        fprintf('Subinterval: [%-4g, %-4g] Integral: %4g\n', ...
            a, b, myIntegral);
        piApprox = gplus(myIntegral);
    end
    approx1 = piApprox{1}; % 1st element holds value on worker 1.
    fprintf('pi : %.18f\n', pi);
    fprintf('Approximation: %.18f\n', approx1);
    fprintf('Error : %g\n', abs(pi - approx1))
end

function y = iQuadPi(x)
    y = 4./(1 + x.^2);
end
Submitting an spmd job, where the task is run on each worker.
c = parcluster('local');
j = c.createCommunicatingJob('NumWorkersRange',3, 'Type','spmd');
j.createTask(@calcpi_spmd_task,0,{},'CaptureDiary',true);
j.submit
j.wait
j.Tasks(1).Diary
j.Tasks(2).Diary
j.Tasks(3).Diary

function calcpi_spmd_task
    a = (labindex - 1)/numlabs;
    b = labindex/numlabs;
    fprintf('Subinterval: [%-4g, %-4g]\n', a, b);
    myIntegral = integral(@iQuadPi, a, b);
    fprintf('Subinterval: [%-4g, %-4g] Integral: %4g\n', ...
        a, b, myIntegral);
    piApprox = gplus(myIntegral);
    approx1 = piApprox; % Plain value here (no Composite): every worker holds the sum.
    fprintf('pi : %.18f\n', pi);
    fprintf('Approximation: %.18f\n', approx1);
    fprintf('Error : %g\n', abs(pi - approx1))
end

function y = iQuadPi(x)
    y = 4./(1 + x.^2);
end
Notice the subtle differences between the two task functions. You'll see in both cases that we can make use of labindex (i.e. rank).
Raymond
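To connect this back to the readability question, here is a minimal sketch of reading only one task's diary from an spmd-type job, so that each message appears once. It assumes a working 'local' profile as in the examples above; the anonymous task function and its message are made up for illustration.

```matlab
% Sketch only: assumes Parallel Computing Toolbox and a working 'local' profile.
c = parcluster('local');
j = createCommunicatingJob(c, 'NumWorkersRange', 2, 'Type', 'spmd');
% Hypothetical task: every worker reports its own labindex.
createTask(j, @() fprintf('hello from worker %d of %d\n', labindex, numlabs), ...
    0, {}, 'CaptureDiary', true);
submit(j);
wait(j);
firstDiary = j.Tasks(1).Diary;   % read one task's diary instead of all of them
disp(firstDiary);
delete(j);                       % remove the finished job from the cluster
```

Each task keeps its own diary in this job type, so picking a single task avoids the repeated lines without changing the task function itself.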
3 comments
Raymond Norris on 18 Aug 2020
Hi Linnéa,
Before I provide an answer, I'm perplexed by something you wrote in a previous post. I believe you are submitting from Windows to a Linux cluster. genpath will generate a list of subfolders from a given starting point. Take the following example:
genpath(strcat('/Users/',username,'/Documents/MATLAB/xxx/yyy'));
For starters, you might consider
genpath(fullfile('/Users',username,'Documents','MATLAB','xxx','yyy'));
unless the file separator matters (calling strcat for CurrentFolder makes sense); however, you run strrep afterwards anyway. Secondly, on a Windows machine, I would expect this to return an empty string, since /Users won't exist (correct?).
Because genpath works recursively, you don't need to call genpath more than once under /Users/<username>/Documents/MATLAB. xxx and zzz will automatically get picked up.
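The recursion can be checked with a small sketch. The folder layout below only mirrors the xxx/yyy and xxx/zzz/www placeholders from the post and is created in a scratch location from tempname, which is an assumption made purely for the demonstration.

```matlab
% Sketch only: build a throwaway folder tree and list it with one genpath call.
root = tempname;                                   % unique scratch folder name
[~, ~] = mkdir(fullfile(root, 'xxx', 'yyy'));      % creates parent folders too
[~, ~] = mkdir(fullfile(root, 'xxx', 'zzz', 'www'));
p = genpath(root);                                 % one call covers all subfolders
folders = strsplit(p, pathsep);                    % pathsep is ';' on Windows, ':' on Unix
folders = folders(~cellfun(@isempty, folders));    % drop the empty trailing entry
disp(folders');
```

Using fullfile and pathsep keeps the same lines valid on both Windows and Unix, which is the point of the suggestion above.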
You can replace
clear Name; clear j; clear jj;
with
clear Name j jj
I don't see where
cluster = 'clusterName';
is being used.
L. Borealis on 18 Aug 2020
Edited: L. Borealis on 19 Aug 2020
Dear Raymond,
Thanks so much for all your advice and for connecting my 2 entries! I will definitely consider fullfile - that will be great for being able to use it both on a Linux and a Windows system, which is what we are planning to do.
A big problem I have with adjusting this code to be runnable on Windows is that I have not worked on a Windows machine since high school/never coded on it. So thank you very much for pointing me to this. It saved me a ton of confusion. I am running everything again and will let you know about the results.
%% Set Cluster
% Get a handle to the cluster
c = parcluster;
% Define username
username = 'f';
% Create a list of all the paths and supplement it so that the
% files are on the path and can be read on the cluster.
% Once it is running, modify to struct, i.e. franssel.OS = Windows
if strcmp(username,'f')
    % Windows version
    localPath1 = genpath(fullfile('c:\','Users',username,'Documents','MATLAB','xxx','yyy'));
    localPath2 = genpath(fullfile('c:\','Users',username,'Documents','MATLAB','xxx','zzz','www'));
    localPath = [localPath1,localPath2];
    cellLocalPath = split(localPath,';');
    for i = 1:length(cellLocalPath)
        cellLocalPath{i} = strrep(cellLocalPath{i}, '\', '/');
        cellLocalPath{i} = strrep(cellLocalPath{i},strcat('c:/Users/',username,'/Documents/MATLAB'),strcat('/remoteStorageLoc/home/',username));
    end
else
    % Unix version
    localPath1 = genpath(strcat('/Users/',username,'/Documents/MATLAB/xxx/yyy'));
    localPath2 = genpath(strcat('/Users/',username,'/Documents/MATLAB/xxx/zzz/www'));
    localPath = [localPath1,localPath2];
    cellLocalPath = split(localPath,':');
    for i = 1:length(cellLocalPath)
        cellLocalPath{i} = strrep(cellLocalPath{i},strcat('/Users/',username,'/Documents/MATLAB/'),strcat('/remoteStorageLoc/home/',username,'/'));
    end
end
% Drop empty entries (genpath output ends with a separator)
cellLocalPath = cellLocalPath(~cellfun(@isempty,cellLocalPath));
cellLocalPath = cellLocalPath'
%
clear Name j jj;
Name{1} = 'A';
Name{2} = 'B';
Name{3} = 'E';
Name{4} = 'G';
Name{5} = 'T';
Name{6} = 'At';
numCores = 0;
c.AdditionalProperties.QueueName = 'defq';
Does this look better to you? The problem is that, while the paths look good now, it does not run anymore (i.e. the diary is empty and the jobs reach 'failed' status quickly when running the "schedule jobs" section above). On a Windows machine, as the output of
cellLocalPath = cellLocalPath'
I get a 1×90 cell array with entries like:
/remoteStorageLoc/home/f/xxx/yyy
This is what I needed and the Windows paths look good too now. So this puzzles me! Could it have to do with the integration script being written for unix use only?
(By the way, I tried running it with numCores = 0 because, after chatting with my colleague, we realised that something must have changed between when he last ran the code about 9 months ago and now: he did not get the repetitive output from the parallelisation back then, but he does now. We run it in MATLAB R2019a (I believe he may have run it on R2018a or R2018b previously) and call Python, which has been updated since. If you know whether any of this could cause parallelisation from the beginning rather than only after parfor, please let me know.) Thanks a lot!
The cluster is defined again by the next function that is called, so it is redundant where it is.
Once again, thanks very much!!
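The local-to-remote path translation in the script above could also be written without separate Windows and Unix branches. The following is only a sketch under assumptions: the local root is a stand-in built from tempdir, the remote root reuses the /remoteStorageLoc/home/f placeholder from the post, and the folder names are the xxx/yyy/zzz/www placeholders.

```matlab
% Sketch only: map local folders to their cluster counterparts on any platform.
localRoot  = fullfile(tempdir, 'MATLAB');        % hypothetical local root folder
remoteRoot = '/remoteStorageLoc/home/f';         % placeholder remote root from the post
localDirs  = {fullfile(localRoot, 'xxx', 'yyy'), ...
              fullfile(localRoot, 'xxx', 'zzz', 'www')};
remoteDirs = strrep(localDirs, localRoot, remoteRoot);  % swap the root prefix
remoteDirs = strrep(remoteDirs, filesep, '/');          % cluster uses forward slashes
disp(remoteDirs');
```

Because fullfile and filesep follow the platform the script runs on, the same lines would produce forward-slash cluster paths from either a Windows or a Unix client.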


Release: R2019a
