Datastores for Deep Learning

Datastores in MATLAB® are a convenient way of working with and representing collections of data that are too large to fit in memory at one time. Because deep learning often requires large amounts of data, datastores are an important part of the deep learning workflow in MATLAB.

Select Datastore

For many applications, the easiest approach is to start with a built-in datastore. For more information about the available built-in datastores, see Select Datastore for File Format or Application (MATLAB). However, only some types of built-in datastores, such as ImageDatastore, can be used directly as input for network training, validation, and inference.

Other built-in datastores can be used as input for deep learning, but the data read from these datastores must be preprocessed into a format required by a deep learning network. For more information on the required format of read data, see Input Datastore for Training, Validation, and Inference. For more information on how to preprocess data read from datastores, see Transform and Combine Datastores.

For some applications, there may not be a built-in datastore type that fits your data well. For these problems, you can create a custom datastore. For more information, see Develop Custom Datastore (MATLAB). All custom datastores are valid inputs to deep learning interfaces as long as the read function of the custom datastore returns data in the required two-column form.

Input Datastore for Training, Validation, and Inference

Datastores are valid inputs in Deep Learning Toolbox™ for training, validation, and inference.

Training and Validation

To use an image datastore as a source of training data, use the imds argument of trainNetwork. To use all other types of datastore as a source of training data, use the ds argument of trainNetwork. To use a datastore for validation, use the 'ValidationData' name-value pair argument in trainingOptions.

To be a valid input for training or validation, the read function of a datastore (with the exception of ImageDatastore) must return data as either a two-column cell array or a two-column table. The first column contains the inputs to the network and the second column contains the responses. Each row represents a separate observation. For ImageDatastore only, trainNetwork and trainingOptions also support data returned as integer arrays or as a single-column cell array of integer arrays.

The table shows sample output of calling the read function for datastore ds.

Format of Read Data: Two-column cell array

Sample Output:

data = read(ds)
data =

  4×2 cell array

    {28×28 double}    {[7]}
    {28×28 double}    {[7]}
    {28×28 double}    {[9]}
    {28×28 double}    {[9]}

Format of Read Data: Two-column table

Sample Output:

data = read(ds)
data =

  4×2 table

        input         response
    ______________    ________

    {28×28 double}       7    
    {28×28 double}       7    
    {28×28 double}       9    
    {28×28 double}       9    
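
A quick way to check whether read output has the two-column shape described above is a small predicate like the one below. This is only a sketch: the name isTwoColumnRead is hypothetical and not part of Deep Learning Toolbox, the mock variables stand in for the result of read(ds), and the check covers only the container type and column count, not the contents.

```matlab
% Hypothetical helper (not a toolbox function): true when the output of
% read is a cell array or table with exactly two columns.
isTwoColumnRead = @(data) (iscell(data) || istable(data)) && size(data,2) == 2;

% Mock read outputs standing in for data = read(ds).
cellData  = {zeros(28), 7; zeros(28), 9};
tableData = table({zeros(28); zeros(28)}, [7; 9], ...
    'VariableNames', {'input','response'});

okCell   = isTwoColumnRead(cellData);      % two-column cell array
okTable  = isTwoColumnRead(tableData);     % two-column table
okSingle = isTwoColumnRead({zeros(28)});   % one column only
```

A single-column result such as the last one would still be acceptable for inference, which reads only the first column.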


For inference using predict, classify, and activations, a datastore is only required to yield one column. The inference functions ignore additional columns of data beyond the first.

Specify Read Size and Mini-Batch Size

A datastore may return any number of rows (observations) for each call to read. Functions such as trainNetwork, predict, classify, and activations that accept datastores and support specifying a 'MiniBatchSize' call read as many times as necessary to form complete mini-batches of data. As these functions form mini-batches, they use internal queues in memory to store read data. For example, if a datastore consistently returns 64 rows per call to read and MiniBatchSize is 128, then forming each mini-batch requires two calls to read.
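
The arithmetic in this example can be written out as a short calculation. The variable names are illustrative only, and the sketch assumes that every call to read returns the same number of rows.

```matlab
rowsPerRead   = 64;    % observations returned by each call to read
miniBatchSize = 128;   % value of the 'MiniBatchSize' option

% Calls to read needed to fill one mini-batch, ignoring any rows left
% over in the internal queue from earlier reads.
readsPerBatch = ceil(miniBatchSize / rowsPerRead);
fprintf('%d calls to read per mini-batch\n', readsPerBatch);
```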

For best runtime performance, configure the datastore so that the number of observations returned by read equals the 'MiniBatchSize'. For datastores that have a 'ReadSize' property, set 'ReadSize' to change the number of observations that the datastore returns for each call to read.
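
As a minimal sketch of matching the read size to the mini-batch size, the code below uses arrayDatastore as a stand-in for any datastore with a 'ReadSize' property. This assumes a MATLAB release that provides arrayDatastore; the data is random mock data.

```matlab
miniBatchSize = 128;

% Mock data: 256 observations with three features each.
X = rand(256,3);

% Assumption: arrayDatastore stands in for any datastore with a ReadSize
% property. Each call to read returns ReadSize observations (rows of X).
ads = arrayDatastore(X,'ReadSize',miniBatchSize);

data = read(ads);
rowsPerRead = size(data,1);   % number of observations in one read
```

With this setting, one call to read supplies exactly one mini-batch.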

Transform and Combine Datastores

Deep learning frequently requires data to be preprocessed and augmented before it is in an appropriate form to input to a network. The transform and combine functions of datastores are useful for preparing data to be fed into a network.

Transform Datastores

The transform function creates an altered form of an existing datastore, called the underlying datastore, by applying a transformation to the data that the underlying datastore reads.

  • For complex transformations involving several preprocessing operations, define the complete set of transformations in your own function. Then, specify a handle to your function as the @fcn argument of transform. For more information, see Create Functions in Files (MATLAB).

  • For simple transformations that can be expressed in one line of code, you can specify a handle to an anonymous function as the @fcn argument of transform. For more information, see Anonymous Functions (MATLAB).

The function handle provided to transform must accept input data in the same format as returned by the read function of the underlying datastore.
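
The anonymous-function case can be sketched as follows. The example assumes a MATLAB release that provides arrayDatastore, which is used here only to supply mock image data in memory; the rescaling to the range [0,1] is an illustrative choice of transformation.

```matlab
% Mock image data: eight 28-by-28 grayscale images stored as uint8.
X = uint8(randi([0 255],28,28,8));

% In-memory datastore that reads four images per call as a 4-by-1 cell array.
ads = arrayDatastore(X,'IterationDimension',3,'ReadSize',4);

% Anonymous function that accepts the cell array returned by read and
% rescales every image to single precision in the range [0,1].
rescaleFcn = @(data) cellfun(@(img) single(img)/255, data, ...
    'UniformOutput',false);
tds = transform(ads,rescaleFcn);

out = read(tds);   % cell array of rescaled images
```

The function handle receives data in exactly the form that read(ads) returns, here a cell array, so it must unpack and repack the cells itself.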

Example: Transform Image Datastore to Train Digit Classification Network

This example uses the transform function to create a training set in which randomized 90 degree rotation is added to each image within an image datastore. Pass the resulting TransformedDatastore to trainNetwork to train a simple digit classification network.

Create an image datastore containing digit images.

digitDatasetPath = fullfile(matlabroot,'toolbox','nnet', ...
imds = imageDatastore(digitDatasetPath, ...
    'IncludeSubfolders',true, ...

Set the mini-batch size equal to the ReadSize of the image datastore.

miniBatchSize = 128;
imds.ReadSize = miniBatchSize;

Transform images in the image datastore by adding randomized 90 degree rotation. The transformation function, preprocessForTraining, is defined at the end of this example.

dsTrain = transform(imds,@preprocessForTraining,'IncludeInfo',true)
dsTrain = 

  TransformedDatastore with properties:

    UnderlyingDatastore: [1×1]
             Transforms: {@preprocessForTraining}
            IncludeInfo: 1

Specify layers of the network and training options, then train the network using the transformed datastore dsTrain as a source of data.

layers = [ ...
    imageInputLayer([28 28 1],'Normalization','none')
options = trainingOptions('adam', ...
    'Plots','training-progress', ...
net = trainNetwork(dsTrain,layers,options);

Define a function that performs the desired transformations on the variable data, which holds the data read from the underlying datastore. The function loops through each read image, applies a randomized rotation, and returns the transformed image and its corresponding label as a two-column cell array, as expected by trainNetwork.

function [dataOut,info] = preprocessForTraining(data,info)
% Rotate each image in data by a random multiple of 90 degrees and pair
% it with its label from the info struct.
numRows = size(data,1);
dataOut = cell(numRows,2);
for idx = 1:numRows
    % Randomized 90 degree rotation
    imgOut = rot90(data{idx,1},randi(4)-1);
    % Return the label from info struct as the 
    % second column in dataOut.
    dataOut(idx,:) = {imgOut,info.Label(idx)};
end
end

Combine Datastores

The combine function associates two datastores of the same length to create the two-column format expected of training and validation data. Combining datastores maintains the parity between the datastores. Each call to the read function of the resulting CombinedDatastore returns data from corresponding parts of the underlying datastores.

For example, if you are training an image-to-image regression network, then you can create the training data set by combining two image datastores. This sample code demonstrates combining two image datastores named imdsX and imdsY. Image datastores return data as a cell array, so the combined datastore imdsTrain returns data as a two-column cell array.

imdsX = imageDatastore(___);
imdsY = imageDatastore(___);
imdsTrain = combine(imdsX,imdsY)
imdsTrain = 

  CombinedDatastore with properties:

    UnderlyingDatastores: {1×2 cell}
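
The same pairing can be tried without image files. The following is a sketch under stated assumptions: it uses arrayDatastore (assumed to be available in the installed release) with random mock arrays in place of the image datastores, and the variable names are illustrative.

```matlab
X = rand(28,28,8);   % mock inputs: eight 28-by-28 images
Y = rand(28,28,8);   % mock responses, one per input

% Two in-memory datastores of the same length, four observations per read.
dsX = arrayDatastore(X,'IterationDimension',3,'ReadSize',4);
dsY = arrayDatastore(Y,'IterationDimension',3,'ReadSize',4);

% Each read of the combined datastore concatenates the corresponding
% reads of dsX and dsY side by side.
dsTrain = combine(dsX,dsY);
data = read(dsTrain);   % first column from dsX, second column from dsY
```

Row k of data holds the k-th input in the first column and the k-th response in the second column.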

Use Datastore for Parallel Training and Prefetch Read Optimization

Datastores used for parallel training or multi-GPU training must be partitionable. Specify parallel or multi-GPU training using the 'ExecutionEnvironment' name-value pair argument of trainingOptions.

Many built-in datastores are already partitionable because they support the partition function. Transformed datastores are partitionable if their underlying datastore is partitionable. Using the transform function with built-in datastores frequently maintains support for parallel and multi-GPU training.
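
One way to confirm that a transformed datastore remains partitionable is sketched below. It assumes a release that provides both arrayDatastore and the isPartitionable function; the doubling transformation and the split into two parts are arbitrary choices for illustration.

```matlab
% Mock data: eight scalar observations, read one at a time as cells.
ads = arrayDatastore((1:8)');

% Transformed datastore that doubles every observation.
tds = transform(ads,@(c) cellfun(@(x) 2*x, c,'UniformOutput',false));

% Check partitionability, then split the datastore into two parts.
canPartition = isPartitionable(tds);
part1 = partition(tds,2,1);
part2 = partition(tds,2,2);

% Together the two parts cover every observation.
n1 = numel(readall(part1));
n2 = numel(readall(part2));
```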

If you need to create a custom datastore that supports parallel or multi-GPU training, then your datastore must also be partitionable. For more information, see Develop Custom Datastore (MATLAB).

Some partitionable datastores support reading training data using asynchronous prefetch reading, which queues data in memory while the GPU is working. Specify prefetch reading using the 'DispatchInBackground' name-value pair argument of trainingOptions. Prefetch reading requires Parallel Computing Toolbox™.

There are some limitations when using datastores with parallel training, multi-GPU training, and prefetch read optimization.

  • Datastores do not support specifying the 'Shuffle' name-value pair argument of trainingOptions as 'never'.

  • Combined datastores are not partitionable and therefore do not support parallel training, multi-GPU training, or prefetch reading.
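
A training-options configuration consistent with this section might look like the sketch below. The solver, mini-batch size, and shuffle setting are illustrative assumptions; only the 'ExecutionEnvironment' and 'DispatchInBackground' arguments come from the text above.

```matlab
% Sketch of options for multi-GPU training with prefetch reading.
% Requires Parallel Computing Toolbox when training starts.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','multi-gpu', ...  % or 'parallel'
    'DispatchInBackground',true, ...         % asynchronous prefetch reading
    'Shuffle','every-epoch', ...             % shuffling stays enabled
    'MiniBatchSize',128);
```

The datastore passed to trainNetwork with these options must be partitionable, so a combined datastore would not be accepted here.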
