Why am I unable to validate my LSF configuration in the Parallel Computing Toolbox?

I have MATLAB Parallel Server set up on a cluster running LSF. When I attempt to validate the cluster configuration it fails.

Accepted Answer

MathWorks Support Team on 6 Feb 2023
Edited: MathWorks Support Team on 6 Feb 2023
Several issues can prevent the cluster from validating. Run the tests below to make sure that your configuration is set up properly. If you receive an error message at any point, you can submit a request to Installation support using the link at the bottom of the page. When submitting a request, be sure to include the following:
- Your license number
- The release of MATLAB on the client and the cluster
- The output of your validation (click details to get the full information)
- The results of the tests below
1) Test the licensing of MATLAB Parallel Server
The first step is to ensure that the licensing for MATLAB Parallel Server works on your cluster. This also tests whether MATLAB is crashing on startup on your cluster. To test this, log in to one of the cluster nodes and open a terminal. (The commands below use Linux syntax; on a Windows node, open a Command Prompt from the Start Menu under All Programs, Accessories, and adapt the paths accordingly.) In the terminal, run the following commands:
cd $MATLAB/bin (where $MATLAB is the installation folder for MATLAB on the cluster)
./matlab -dmlworker -nodisplay -logfile /var/tmp/output.txt -r "ver;exit"
This generates an output.txt file in /var/tmp that contains the output of the ver command on the cluster node. If the log file contains a network license manager error, licensing is the cause of the validation failure. In that case, look up the license manager error number on the support site and resolve the license error before proceeding.
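Scanning the log by hand works, but the check can also be scripted. The following is a minimal sketch, assuming that license errors appear in the log as "License Manager Error" followed by a number. The sample log file it writes is a hypothetical stand-in so that the sketch runs on its own; point logFile at your real log instead.

```matlab
% Sketch: scan a worker startup log for license manager error codes.
% Assumption: errors are reported as "License Manager Error -<number>".
logFile = fullfile(tempdir, 'output_example.txt');  % stand-in for /var/tmp/output.txt

% Write a sample log so the sketch is self-contained.
% Skip this step when logFile points at a real log.
fid = fopen(logFile, 'w');
fprintf(fid, 'License checkout failed.\nLicense Manager Error -15\n');
fclose(fid);

% Read the whole log and collect every error number that follows the marker.
logText = fileread(logFile);
tokens = regexp(logText, 'License Manager Error (-?\d+)', 'tokens');
errorCodes = cellfun(@(t) str2double(t{1}), tokens);

if isempty(errorCodes)
    disp('No license manager error found in the log.');
else
    fprintf('License manager error code(s): %s\n', mat2str(errorCodes));
end
```

If the list of codes is empty and the log contains the ver output, licensing is probably not the cause and you can move on to the next test.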
2) Check the releases of MATLAB on the cluster and the client where you validated
If the log file contains the output of the "ver" command, check the release of every product in the list. All products should report the same release. That release should also match the release installed on the client where you ran the validation. To check the release on the client, run the ver command in the MATLAB Command Window. If the release of MATLAB and Parallel Computing Toolbox on the client does not match the release of MATLAB and MATLAB Parallel Server on the cluster, you will not be able to use this configuration until the installations are at the same release.
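The comparison can be sketched in a few lines. This example uses stand-in data shaped like the struct array that ver returns (Name and Release fields); the cluster release strings are hypothetical values that you would copy from the log file.

```matlab
% Sketch: check that client and cluster products all report one release.
% Stand-in data; in practice use clientProducts = ver; on the client.
clientProducts = struct('Name',    {'MATLAB', 'Parallel Computing Toolbox'}, ...
                        'Release', {'(R2009b)', '(R2009b)'});

% Hypothetical release strings copied from the ver output in the cluster log.
clusterReleases = {'(R2009b)', '(R2009a)'};

% Collect the distinct release strings from both sides as row cell arrays.
clientReleases = reshape(unique({clientProducts.Release}), 1, []);
allReleases    = reshape(unique([clientReleases, clusterReleases]), 1, []);

releasesMatch = numel(allReleases) == 1;
if releasesMatch
    fprintf('All products report release %s.\n', allReleases{1});
else
    fprintf('Release mismatch, found: %s\n', sprintf('%s ', allReleases{:}));
end
```

With the sample values above, the script reports a mismatch because one cluster product is at a different release.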
3) Check to make sure that your configuration meets the scheduler requirements
In order to use MATLAB Parallel Server with LSF, there are some additional requirements in the setup. Check the scheduler requirements page here for more details:
Additionally, this configuration requires the following:
- The LSF binaries need to be accessible from the MATLAB client that runs Parallel Computing Toolbox. If the client does not have the binaries, it is recommended to remotely access one of the cluster nodes to run the MATLAB client.
- Your cluster should be completely homogeneous. Mixing different platforms or distributions is not recommended, especially for parallel computation.
- This configuration requires that job data be stored in a file space shared between the clients and the cluster nodes. When creating the configuration, set the "DataLocation" variable to a path that is accessible to all computers.
- Because the "DataLocation" variable must be accessible by the same path from all computers, you cannot use a client machine on a different platform from the cluster (for example, a Windows client accessing a Linux cluster).
If the four requirements above are not met, the default LSF configuration is not supported. It is still possible to submit jobs to the cluster in that case. For that setup, see the related solution: 1-34TP79 - "How can I use the MATLAB Distributing Computing products with the LSF scheduler in a nonshared file system?"
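One quick way to test the shared-storage requirement is to check that the DataLocation folder exists and is writable. The sketch below is an assumption-laden example: dataLocation is set to the system temporary folder only so that the code runs anywhere, and the probe file name is made up for illustration.

```matlab
% Sketch: check that a candidate DataLocation exists and is writable.
% Assumption: replace tempdir with your shared DataLocation path.
dataLocation = tempdir;

folderExists = exist(dataLocation, 'dir') == 7;

canWrite = false;
if folderExists
    % Hypothetical probe file name; any unused name works.
    probeFile = fullfile(dataLocation, 'datalocation_probe.txt');
    fid = fopen(probeFile, 'w');
    if fid ~= -1
        fprintf(fid, 'probe');
        fclose(fid);
        canWrite = exist(probeFile, 'file') == 2;
        delete(probeFile);  % clean up the probe file
    end
end

fprintf('Folder exists: %d, writable: %d\n', folderExists, canWrite);
```

Run the same check with the same path on the client and on a cluster node. Both should report that the folder exists and is writable.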
4) Check to ensure you have correctly configured the client configuration
In your client MATLAB, go to the Parallel menu and select Manage Configurations. Right-click your LSF configuration and select Properties. You must set appropriate values for ClusterMatlabRoot (the directory where MATLAB is installed on the cluster), DataLocation (where the job data will be stored; NOTE: this must be accessible by the same path from all computers), ClusterOsType (Unix-based or PC), and HasSharedFilesystem (should be set to True).
If you have a cluster that has a mixture of different operating systems, you must use the "SubmitArguments" field to target only one type of operating system. For example:
'-R "type==NTX86"' - Targets 32-bit Windows
'-R "type==NTX64"' - Targets 64-bit Windows
'-R "type==LINUX86"' - Targets 32-bit Linux
'-R "type==LINUX64"' - Targets 64-bit Linux
NOTE: Run the "lshosts" command on the cluster and use the "type" column to find the type that you should target.
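As a small illustration, the string for the "SubmitArguments" field can be built from the host types that lshosts reports. The host types listed here are hypothetical; substitute the values from the "type" column on your cluster.

```matlab
% Sketch: build a SubmitArguments string that targets one host type.
% Hypothetical values from the "type" column of lshosts.
hostTypes = {'LINUX64', 'LINUX64', 'NTX64'};

% A cluster is mixed if more than one distinct type is present.
uniqueTypes = unique(hostTypes);
isMixed = numel(uniqueTypes) > 1;

% Pick the single type that jobs should run on.
targetType = 'LINUX64';
submitArguments = sprintf('-R "type==%s"', targetType);

if isMixed
    fprintf('Mixed cluster detected; use SubmitArguments: %s\n', submitArguments);
else
    disp('Only one host type found; no type restriction is needed.');
end
```

The resulting string is what you would paste into the "SubmitArguments" field of the configuration.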
If you have confirmed all of the settings above, check whether all stages fail during validation, or only the parallel job and MATLAB pool stages. If the Distributed Job stage passes, the validation may be reporting false errors. To confirm, you can validate your cluster manually as follows:

1. Distributed job:

To run a simple distributed job, run the following:
lsf = findResource('scheduler','configuration','<ConfigurationName>')
Where <ConfigurationName> is the name of the configuration you created
job = createJob(lsf);
createTask(job, @sum, 1, {[1 1]});
createTask(job, @sum, 1, {[2 2]});
createTask(job, @sum, 1, {[3 3]});
submit(job)
waitForState(job, 'finished', 60)
To confirm the job completed, run the following:
results = getAllOutputArguments(job)
If you get the following output, your cluster is configured and operating correctly.
results =
[2]
[4]
[6]
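Instead of comparing the output by eye, you can compare it in code. This is a minimal sketch in which results is a stand-in for the cell array returned by getAllOutputArguments(job), with one row per task. Treating an empty row as a failed or unfinished task is an assumption you should confirm against the task's error message.

```matlab
% Sketch: compare job results against the expected values.
% Stand-in for: results = getAllOutputArguments(job)
results  = {2; 4; 6};
expected = {2; 4; 6};   % sum of [1 1], [2 2], and [3 3]

% Rows with no output usually mean the task failed or did not finish.
emptyRows = cellfun(@isempty, results(:, 1));

resultsMatch = isequal(results, expected);
if resultsMatch
    disp('Distributed job returned the expected results.');
elseif any(emptyRows)
    fprintf('Task(s) with no output: %s\n', mat2str(find(emptyRows)'));
else
    disp('Job finished, but the results differ from the expected values.');
end
```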

2. Parallel job:

After completing the distributed job, run the following:
pj = createParallelJob(lsf);
createTask(pj, @labindex, 1, {});
set(pj, 'MaximumNumberOfWorkers', 3);
set(pj, 'MinimumNumberOfWorkers', 3);
submit(pj)
waitForState(pj, 'finished', 60)
To confirm the job completed, run the following:
results = getAllOutputArguments(pj)
If you get the following output, your cluster is configured and operating correctly.
results =
[1]
[2]
[3]

3. MATLAB pool job:

To test MATLAB pool or pmode, run the command:
matlabpool open <ConfigName> <#ofLabs>
Where <ConfigName> is the name of the configuration and <#ofLabs> is the number of labs (workers) to start on the cluster.
If your prompt is returned, your configuration is working. To close the MATLAB pool, type "matlabpool close".
If the MATLAB pool did not start and you did not receive an error message, try running:
setSchedulerMessageHandler(@disp)
and then try the MATLAB pool commands above. This should capture the error messages and forward them to the MATLAB command window.
If the manual tests passed, your configuration is working and you should be able to submit jobs.
If you are still having an issue, contact Installation support here:
NOTE: Starting in R2019a, the following name changes occurred:
  • MATLAB Distributed Computing Server was renamed to MATLAB Parallel Server 
  • mdce_def was renamed to mjs_def
  • mdce binary was renamed to mjs


Release

R2009b
