MATLAB® supports training a single deep neural network using multiple GPUs in parallel. This can be achieved using multiple GPUs on your local machine, or on a cluster or cloud using parallel workers with GPUs. Using multiple GPUs can speed up training significantly. To decide if you expect multi-GPU training to deliver a performance gain, consider the following factors:
How long is the iteration on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger batch size.
Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication.
To train a single network using multiple GPUs on your local machine, you can simply specify the ExecutionEnvironment option as "multi-gpu" without changing the rest of your code. trainNetwork automatically uses your available GPUs for training.
When you train on a remote cluster, specify the ExecutionEnvironment option as "parallel". If the cluster has access to one or more GPUs, then trainNetwork only uses the GPUs for training. Workers without a unique GPU are never used for training computation.
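The two settings described above can be sketched as follows. This is a minimal illustration, not a complete training script: the flag trainLocally is a hypothetical variable introduced only for this example, and the solver and epoch count are arbitrary choices.

```matlab
% Hypothetical flag for illustration: true when the GPUs are in this machine,
% false when training should run on workers of a remote cluster.
trainLocally = true;

if trainLocally
    env = "multi-gpu";   % use the available GPUs in the local machine
else
    env = "parallel";    % use the workers of the default cluster profile
end

% Only the ExecutionEnvironment option changes; the rest of the training
% code stays the same.
options = trainingOptions("sgdm", ...
    "ExecutionEnvironment", env, ...
    "MaxEpochs", 2, ...
    "Verbose", false);
```

You would then pass options to trainNetwork together with your own data and layers.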
If you want to use more resources, you can scale up deep learning training to clusters or the cloud. To learn more about parallel options, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud. To try an example, see Train Network in the Cloud Using Automatic Parallel Support.
Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Support by Release (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.
If you run MATLAB on a single machine in the cloud that you connect to via ssh or remote desktop protocol (RDP), then network execution and training uses the same code as if you were running on your local machine.
If you have access to a machine with multiple GPUs, you can simply specify the ExecutionEnvironment option as "multi-gpu". The "multi-gpu" option allows you to use multiple GPUs in a local parallel pool. If there is no current parallel pool, trainNetwork, predict, and classify automatically start a local parallel pool using your default cluster profile settings. The pool has as many workers as the number of available GPUs.
For information on how to perform custom training using multiple GPUs in your local machine, see Run Custom Training Loops on a GPU and in Parallel.
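As a rough sketch of local multi-GPU training followed by multi-GPU inference, the example below uses the digit data set that ships with Deep Learning Toolbox. The small network, the solver, and the epoch count are assumptions chosen only to keep the example short, and it assumes at least one supported GPU is present.

```matlab
% Example data shipped with Deep Learning Toolbox (28-by-28 grayscale digits)
[XTrain, YTrain] = digitTrain4DArrayData;
[XTest, YTest] = digitTest4DArrayData;

% Toy network, chosen only for illustration
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 8, "Padding", "same")
    reluLayer
    maxPooling2dLayer(2, "Stride", 2)
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];

% Train on the local GPUs; a local pool starts if none is running
options = trainingOptions("sgdm", ...
    "ExecutionEnvironment", "multi-gpu", ...
    "MaxEpochs", 2, ...
    "Verbose", false);
net = trainNetwork(XTrain, YTrain, layers, options);

% Run inference on the same pool of GPU workers
YPred = classify(net, XTest, "ExecutionEnvironment", "multi-gpu");
accuracy = mean(YPred == YTest);
fprintf("Test accuracy: %.3f\n", accuracy);

% Inspect the pool that was started (empty if no pool exists)
pool = gcp("nocreate");
```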
For training and inference with multiple GPUs in a remote cluster, use the "parallel" option. If there is no current parallel pool, trainNetwork, predict, and classify automatically start a parallel pool using your default cluster profile settings. If the pool has access to GPUs, then only workers with a unique GPU perform training computation. If the pool does not have GPUs, then training takes place on all available CPU workers instead.
For information on how to perform custom training using multiple GPUs in a remote cluster, see Run Custom Training Loops on a GPU and in Parallel.
Convolutional neural networks are typically trained iteratively using mini-batches of images. This is because the whole dataset is usually too large to fit into GPU memory. For optimum performance, you can experiment with the mini-batch size by changing the MiniBatchSize name-value option using the trainingOptions function.
The optimal mini-batch size depends on your exact network, dataset, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs, in order to keep the workload on each GPU constant. For example, if you are training on a single GPU using a mini-batch size of 64, and you want to scale up to training with four GPUs of the same type, you can increase the mini-batch size to 256 so that each GPU processes 64 observations per iteration.
Because increasing the mini-batch size improves the significance of each iteration, you can increase the learning rate. A good general guideline is to increase the learning rate proportionally to the increase in mini-batch size. Depending on your application, a larger mini-batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
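The linear scaling guideline above can be written out as a short calculation. The base learning rate of 0.01 and the GPU count of four are assumptions for illustration; in practice you might obtain the count from gpuDeviceCount("available"), and you should still validate the scaled values on your own problem.

```matlab
% Settings that work well on a single GPU (values assumed for illustration)
baseMiniBatchSize = 64;
baseLearnRate = 0.01;

% Number of identical GPUs to train on (assumed here)
numGPUs = 4;

% Scale both quantities linearly so that each GPU still processes
% baseMiniBatchSize observations per iteration
miniBatchSize = baseMiniBatchSize * numGPUs;   % 256
learnRate = baseLearnRate * numGPUs;           % 0.04

options = trainingOptions("sgdm", ...
    "MiniBatchSize", miniBatchSize, ...
    "InitialLearnRate", learnRate, ...
    "ExecutionEnvironment", "multi-gpu");
```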
If you do not want to use all of your GPUs, you can select the GPUs that you want to use for training and inference directly. Doing so can be useful to avoid training on a poor-performance GPU, for example, your display GPU.
If your GPUs are in your local machine, you can use the gpuDeviceTable (Parallel Computing Toolbox) and gpuDeviceCount (Parallel Computing Toolbox) functions to examine your GPU resources and determine the index of the GPUs you want to use.
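A minimal sketch of inspecting the local GPUs with these two functions follows. It assumes the default table columns Index and DeviceAvailable, as returned by gpuDeviceTable in recent releases; check the column names in your release.

```matlab
% Number of GPUs that MATLAB can currently use on this machine
numGPUs = gpuDeviceCount("available");
fprintf("Available GPUs: %d\n", numGPUs);

if numGPUs > 0
    % One row per detected GPU, including its index and availability
    D = gpuDeviceTable;
    disp(D)

    % Indices of the devices that are available for computation
    % (column names assumed; see the note above)
    usableIdx = D.Index(D.DeviceAvailable);
    disp(usableIdx)
end
```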
For single GPU training with the "auto" or "gpu" options, by default, MATLAB uses the GPU device with index 1. You can use a different GPU by selecting the device before you start training. Use gpuDevice (Parallel Computing Toolbox) to select the desired GPU using its index, for example, gpuDevice(index). trainNetwork, predict, and classify automatically use the selected GPU when you set the ExecutionEnvironment option to "auto" or "gpu".
For multiple GPU training with the "multi-gpu" option, by default, MATLAB uses all available GPUs in your local machine. If you want to exclude GPUs, you can start the parallel pool in advance and select the devices manually. For example, suppose you have three GPUs but you only want to use the devices with indices 1 and 3. You can use the following code to start a parallel pool with two workers and select one GPU on each worker.
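    % Indices of the GPUs to use
    useGPUs = [1 3];
    % Start one worker per selected GPU
    parpool('local', numel(useGPUs));
    % On each worker, select the GPU that matches the worker index
    spmd
        gpuDevice(useGPUs(labindex));
    end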
trainNetwork, predict, and classify automatically use the current parallel pool when you set the ExecutionEnvironment option to "multi-gpu" (or "parallel" for the same result).
Another option is to select workers using the WorkerLoad name-value argument in trainingOptions. For example:
    parpool('local', 5);
    opts = trainingOptions('sgdm', 'WorkerLoad', [1 1 1 0 1], ...)
If you want to train multiple models in parallel with one GPU each, start a parallel pool with one worker per available GPU, and train each network on a different worker. Use parfor or parfeval to simultaneously execute a network on each worker. Use the trainingOptions function to set the ExecutionEnvironment name-value option to "gpu" on each worker.
For example, use code of the following form to train multiple networks in parallel on all available GPUs:
    options = trainingOptions("sgdm","ExecutionEnvironment","gpu");
    parfor i=1:gpuDeviceCount("available")
        trainNetwork(…,options);
    end
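A fuller version of the loop above might look like the following sketch. The digit data set, the toy network, and the sweep over initial learning rates are assumptions added for illustration, and the code assumes at least one available GPU.

```matlab
% Example data shipped with Deep Learning Toolbox
[XTrain, YTrain] = digitTrain4DArrayData;

% Toy network, chosen only for illustration
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 8, "Padding", "same")
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];

% One worker per available GPU (assumes numGPUs >= 1)
numGPUs = gpuDeviceCount("available");
if isempty(gcp("nocreate"))
    parpool("local", numGPUs);
end

% Hypothetical sweep: a different initial learning rate for each network
learnRates = 0.01 * (1:numGPUs);
nets = cell(1, numGPUs);

parfor i = 1:numGPUs
    % Each iteration trains on the single GPU of the worker that runs it
    options = trainingOptions("sgdm", ...
        "ExecutionEnvironment", "gpu", ...
        "InitialLearnRate", learnRates(i), ...
        "MaxEpochs", 2, ...
        "Verbose", false);
    nets{i} = trainNetwork(XTrain, YTrain, layers, options);
end
```

Collecting the trained networks in a cell array lets you compare them on the client after the loop finishes.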
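To run in the background without blocking your local MATLAB, use parfeval. For examples showing how to train multiple networks using parfor and parfeval, see Use parfor to Train Multiple Deep Learning Networks and Use parfeval to Train Multiple Deep Learning Networks.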
Some multi-GPU features in MATLAB, including trainNetwork, are optimized for direct communication via fast interconnects for improved performance.
If you have appropriate hardware connections, then data transfer between multiple GPUs uses fast peer-to-peer communication, including NVLink, if available.
If you are using a Linux compute cluster with fast interconnects between machines, such as InfiniBand, or fast interconnects between GPUs on different machines, such as GPUDirect RDMA, you might be able to take advantage of fast multi-node support in MATLAB. Enable this support on all the workers in your pool by setting the environment variable PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION to 1. Set this environment variable in the Cluster Profile Manager.
This feature is part of the NVIDIA NCCL library for GPU communication. To configure it, you must set additional environment variables to define the network interface protocol, especially NCCL_SOCKET_IFNAME. For more information, see the NCCL documentation and in particular the section on NCCL Environment Variables.