Can't use as many cores as available

Daniel on 22 Mar 2021
Commented: Daniel on 24 Mar 2021
Apologies first because I probably don't know enough about this to adequately describe the problem, but please ask questions (and maybe help me figure out how to answer them).
I'm using my company's computing resources to run MATLAB scripts with parallel computing. There's some setup, but as best I understand, once I'm in MATLAB I'm essentially remoted into a computer with a lot of cores, though those are a shared resource. To run a script, I start with parpool('local',<number of cores>). This only works if I request 64 or fewer cores.
Prior to this, I set up the cluster by validating it with 5 cores and then resetting the number of workers to 512, which is the maximum we're allowed. Before setting up the parpool, I have checked the number of cores available with feature('numCores') to ensure I'm not requesting more than available and/or checked the number of idle cores by running cee-lan-status -c in the terminal (I assume this is a standard command, but I don't know bash).
When I request more than 64 cores, I always get this error:
Error using parpool (line 149)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 678)
Failed to initialize the interactive session.
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job failed with no message.
What else can I try?

Answers (1)

Raymond Norris on 22 Mar 2021
I believe what you're saying is that from your desktop machine you connect to some server. From there, you run MATLAB on a machine that has 512 cores.
You validate your local profile by changing the worker count to 5, then set it back to 512. Note: not sure which version of MATLAB, but there's a field in the validation to state the number of workers to use so that you don't have to toggle this.
You then run the following
nc = feature('numcores');
p = parpool('local',nc);
The caveat is that you can't be sure you actually have access to all nc cores. That's just what MATLAB sees. Are there other applications/users running on the same machine?
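As a rough guard, the pool size can be capped at whatever the profile and the machine report before calling parpool. This is only a sketch: `requested` is a hypothetical target, `feature('numcores')` is undocumented, and neither number says anything about cores already in use by other users on a shared machine.

```matlab
% Sketch: never ask for more workers than the profile or the machine reports.
requested = 128;                    % hypothetical target pool size
c  = parcluster('local');           % load the 'local' cluster profile
nc = feature('numcores');           % cores MATLAB detects (undocumented call)

% Take the smallest of the three limits.
poolSize = min([requested, c.NumWorkers, nc]);
fprintf('Requesting %d workers (profile limit %d, detected cores %d)\n', ...
        poolSize, c.NumWorkers, nc);

p = parpool(c, poolSize);
```

This does not address the 64-worker failure itself; it only rules out the case where the request exceeds what MATLAB believes is available.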
I've heard of issues crossing 64 local workers, but I think that was more on Windows and not Linux (which I'm guessing is what you're running on?). To capture the error, try the following:
c = parcluster('local');
p = c.parpool(nc);
% If parpool errors out, look at the log file of the failed job.
c.getDebugLog(c.Jobs(end))
% After you have looked at the error, delete the job.
c.Jobs(end).delete
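The same steps can be wrapped in a try/catch so the log is printed automatically when the pool fails to start. A minimal sketch, assuming the failed pool job is the last entry in c.Jobs (as in the snippet above); `nWorkers` is a hypothetical value chosen to be above 64.

```matlab
c = parcluster('local');
nWorkers = 128;                      % hypothetical size that triggers the failure

try
    p = parpool(c, nWorkers);
catch err
    % Show the full error report, including the "Caused by" chain.
    disp(getReport(err));

    % Assumption: the job that just failed is the most recent one.
    if ~isempty(c.Jobs)
        failedJob = c.Jobs(end);
        disp(getDebugLog(c, failedJob));   % worker-side log for the pool job
        delete(failedJob);                 % clean up once the log is captured
    end
end
```

If other jobs are queued on the same profile, check failedJob before deleting it, since c.Jobs(end) may not be the pool job in that case.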
7 Comments
Raymond Norris on 23 Mar 2021
My only suggestion is to set "max user processes" and "open files" to "unlimited". It's best to ask your administrator how to make the change permanent (and not just in a temporary shell).
If that doesn't fix it, contact Technical Support (support@mathworks.com) to see if they can help.
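To see which limits the MATLAB session is currently running under, they can be queried from inside MATLAB. This sketch assumes Linux with bash on the PATH; the shell started by system inherits its limits from the MATLAB process, so the values reflect what the workers would get.

```matlab
% Query the soft limits inherited by this MATLAB session (Linux + bash assumed).
[statusProcs, maxProcs] = system('bash -c "ulimit -u"');   % max user processes
[statusFiles, maxFiles] = system('bash -c "ulimit -n"');   % open files

if statusProcs == 0 && statusFiles == 0
    fprintf('max user processes: %s\n', strtrim(maxProcs));
    fprintf('open files:         %s\n', strtrim(maxFiles));
else
    warning('Could not query ulimit; is bash available on this machine?');
end
```

Each local worker is a separate process that opens a number of files, so low values here are a plausible reason for a large pool failing to start.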
Daniel on 24 Mar 2021
OK, I've passed that on to my administrator. If they can (or are willing to) change anything, I'll update here on how it goes. Thanks for your help!

Version

R2020b
