Matlab trainNetwork CNN training pauses iterating intermittently at random then continues
I'm trying to train a DnCNN network on a grayscale image patch dataset that I've collected and aggregated into training and validation imageDatastore objects, using trainNetwork to run the training. With 50,000 training files and 5,000 validation files, every iteration takes about the same amount of time: each mini-batch of 128 completes in under 1 second before training moves on to the next one.
However, when I increase the number of training and validation files passed to trainNetwork to 350,000 and 35,000, respectively, some iterations appear to hang: a "paused" iteration takes 20-30 seconds longer than the normal ~1 second. The pauses are intermittent but frequent, and they significantly increase my training time. I don't understand why. Both system RAM and GPU memory have plenty of headroom during training, and changing the batch size, learning rate, or optimizer (ADAM, SGDM) does not eliminate the pauses. The problem appears to be directly related to the number of files in the imageDatastore objects used for training.
Has anyone dealt with this before? Is trainNetwork performing some kind of data cleanup that causes iterations to pause at random when the imageDatastore objects contain a large number of files?
Any insight would be greatly appreciated! Thanks
Joss Knight on 11 Aug 2022
Is the pause associated with a validation measurement being added to the training plot? With 7 times as much validation data it will take 7 times longer to take a validation measurement.
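One way to check this is to work out how much a single validation pass costs and then change how often it runs. The sketch below uses the file counts and mini-batch size from the question. It assumes that validation data is evaluated in mini-batches of the same size as training and that ValidationFrequency was left at its default of 50 iterations; neither is stated in the question. The trainingOptions call needs Deep Learning Toolbox, and you would add your own 'ValidationData' argument to it.

```matlab
% Sizes quoted in the question.
numTrain      = 350000;
numValSmall   = 5000;
numValLarge   = 35000;
miniBatchSize = 128;

% Mini-batches evaluated in one validation pass
% (assumption: validation uses the same mini-batch size as training).
valBatchesSmall = ceil(numValSmall / miniBatchSize);   % 40
valBatchesLarge = ceil(numValLarge / miniBatchSize);   % 274

% Training iterations per epoch. trainNetwork discards the final
% partial mini-batch of each epoch, hence floor.
iterationsPerEpoch = floor(numTrain / miniBatchSize);  % 2734

fprintf('Validation pass: %d vs %d mini-batches\n', ...
    valBatchesSmall, valBatchesLarge);

% Validate once per epoch instead of every 50 iterations (the default).
% If the pauses become correspondingly rarer, they are validation passes.
options = trainingOptions('adam', ...
    'MiniBatchSize', miniBatchSize, ...
    'ValidationFrequency', iterationsPerEpoch, ...
    'Plots', 'training-progress');
```

If the pauses do line up with validation, the other options are to validate on a smaller subset of the 35,000 files or to accept the cost and simply validate less often.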