Problems with multiple GPUs

3 views (last 30 days)
Andres Ramirez on 4 Nov 2017
Edited: Joss Knight on 20 Nov 2017
I am using this function to train a CNN:
function [trainedNet,trainingSet,testSet] = OurNetCBIR
outputFolder = fullfile('Database');
rootFolder = fullfile(outputFolder, 'Oliva');
imds = imageDatastore(fullfile(rootFolder), 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
imds.ReadFcn = @(filename)readAndPreprocessImage(filename);

    function Iout = readAndPreprocessImage(filename)
        I = imread(filename);
        if ismatrix(I)
            I = cat(3, I, I, I);
        end
        Iout = imresize(I, [227 227]);
    end

[trainingSet,testSet] = splitEachLabel(imds, 0.7, 'randomize');

layers = [
    imageInputLayer([227 227 3],'DataAugmentation','none')                              % (1)
    convolution2dLayer(7,50,'Stride',2,'Padding',0,'Name','Conv1')                      % (2)  111x111x50
    reluLayer('Name','ReLu1')                                                           % (3)  111x111x50
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling1')                    % (4)  55x55x50
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm1')  % (5)  55x55x50
    convolution2dLayer(5,100,'Stride',1,'Padding',2,'Name','Conv2')                     % (6)  55x55x100
    reluLayer('Name','ReLu2')                                                           % (7)  55x55x100
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling2')                    % (8)  27x27x100
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm2')  % (9)  27x27x100
    convolution2dLayer(3,256,'Stride',1,'Padding',2,'Name','Conv3')                     % (10) 27x27x256
    reluLayer('Name','ReLu3')                                                           % (11) 27x27x256
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling3')                    % (12) 13x13x256
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm3')  % (13) 13x13x256
    convolution2dLayer(3,400,'Stride',1,'Padding',1,'Name','Conv4')                     % (14) 13x13x400
    reluLayer('Name','ReLu4')                                                           % (15) 13x13x400
    convolution2dLayer(3,400,'Stride',1,'Padding',1,'Name','Conv5')                     % (16) 13x13x400
    reluLayer('Name','ReLu5')                                                           % (17) 13x13x400
    convolution2dLayer(3,256,'Stride',1,'Padding',1,'Name','Conv6')                     % (18) 13x13x256
    reluLayer('Name','ReLu6')                                                           % (19) 13x13x256
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling4')                    % (20) 6x6x256
    fullyConnectedLayer(4800,'Name','fc1')                                              % (21) 1x1x4800
    reluLayer('Name','ReLu7')                                                           % (22) 1x1x4800
    dropoutLayer(0.5,'Name','dropout1')                                                 % (23) 1x1x4800
    fullyConnectedLayer(2400,'Name','fc2')                                              % (24) 1x1x2400
    reluLayer('Name','ReLu8')                                                           % (25) 1x1x2400
    dropoutLayer(0.5,'Name','dropout2')                                                 % (26) 1x1x2400
    fullyConnectedLayer(8,'Name','fc3')                                                 % (27)
    softmaxLayer()
    classificationLayer()];

options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.001, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',30, ...
    'MaxEpochs',10, ...
    'Momentum',0.9, ...
    'L2Regularization',0.0005, ...
    'MiniBatchSize',25, ...
    'ExecutionEnvironment','gpu');

trainedNet = trainNetwork(trainingSet,layers,options);
end
I have no problem training with a single GPU, but when I try to train with multiple GPUs, MATLAB generates the following error:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Error using trainNetwork (line 140)
An invalid indexing request was made.
Error in OurNetCBIR (line 110)
trainedNet = trainNetwork(trainingSet,layers,options);
Caused by:
Error using Composite/subsasgn (line 103)
An invalid indexing request was made.
Struct contents reference from a non-struct array object.
The client lost connection to worker 1. This might be due to network problems, or the interactive communicating job might have
errored.
Can someone help me please?

Answers (3)

Joss Knight on 5 Nov 2017
I can reproduce this. The problem seems to be your use of an anonymous function to call a nested function for your datastore ReadFcn. Something about that causes a crash when the datastore is deserialised on your pool worker (i.e. copied to it). This is a bug which we will investigate - thanks very much for bringing it to our attention.
Still, it is easily fixed. Reference your nested function directly rather than via an anonymous function:
imds.ReadFcn = @readAndPreprocessImage;
However, in R2017b you should be using augmentedImageSource to resize your images, since use of a ReadFcn cripples performance. This doesn't give you a way to convert grayscale images to RGB, but the best solution is to do that offline and save new files.
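For reference, here is a minimal R2017b sketch of that suggestion, reusing the folder names from the question (and, as noted above, any grayscale images would still need to be converted to RGB and re-saved beforehand):
% Build the datastore as before, then let augmentedImageSource do the resizing
% instead of a custom ReadFcn.
imds = imageDatastore(fullfile('Database','Oliva'), ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[trainingSet,testSet] = splitEachLabel(imds, 0.7, 'randomize');
src = augmentedImageSource([227 227 3], trainingSet);   % resizes every image to 227x227
trainedNet = trainNetwork(src, layers, options);        % layers/options as in the question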

Andres Ramirez on 19 Nov 2017
Edited: Andres Ramirez on 19 Nov 2017
Hello Joss:
I modified my program to use the augmentedImageSource function (not the AugmentedImageSource function) to resize the images instead of a ReadFcn. The new script is listed below:
outputFolder = fullfile('Database');
rootFolder = fullfile(outputFolder, 'Oliva');
imds = imageDatastore(fullfile(rootFolder), 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[trainingSet,testSet] = splitEachLabel(imds, 0.7, 'randomize');

imageSize = [227 227 3];
datasourcetraining = augmentedImageSource(imageSize,trainingSet,'BackgroundExecution',false);

%% Define the CNN layers
layers = [
    imageInputLayer([227 227 3])                                                        % (1)
    convolution2dLayer(7,50,'Stride',2,'Padding',0,'Name','Conv1')                      % (2)  111x111x50
    reluLayer('Name','ReLu1')                                                           % (3)  111x111x50
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling1')                    % (4)  55x55x50
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm1')  % (5)  55x55x50
    convolution2dLayer(5,100,'Stride',1,'Padding',2,'Name','Conv2')                     % (6)  55x55x100
    reluLayer('Name','ReLu2')                                                           % (7)  55x55x100
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling2')                    % (8)  27x27x100
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm2')  % (9)  27x27x100
    convolution2dLayer(3,256,'Stride',1,'Padding',1,'Name','Conv3')                     % (10) 27x27x256
    reluLayer('Name','ReLu3')                                                           % (11) 27x27x256
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling3')                    % (12) 13x13x256
    crossChannelNormalizationLayer(5,'Alpha',0.00002,'Beta',0.75,'K',1,'Name','Norm3')  % (13) 13x13x256
    convolution2dLayer(3,400,'Stride',1,'Padding',1,'Name','Conv4')                     % (14) 13x13x400
    reluLayer('Name','ReLu4')                                                           % (15) 13x13x400
    convolution2dLayer(3,400,'Stride',1,'Padding',1,'Name','Conv5')                     % (16) 13x13x400
    reluLayer('Name','ReLu5')                                                           % (17) 13x13x400
    convolution2dLayer(3,256,'Stride',1,'Padding',1,'Name','Conv6')                     % (18) 13x13x256
    reluLayer('Name','ReLu6')                                                           % (19) 13x13x256
    maxPooling2dLayer(3,'Stride',2,'Padding',0,'Name','maxPooling4')                    % (20) 6x6x256
    fullyConnectedLayer(4800,'Name','fc1')                                              % (21) 1x1x4800
    reluLayer('Name','ReLu7')                                                           % (22) 1x1x4800
    dropoutLayer(0.5,'Name','dropout1')                                                 % (23) 1x1x4800
    fullyConnectedLayer(2400,'Name','fc2')                                              % (24) 1x1x2400
    reluLayer('Name','ReLu8')                                                           % (25) 1x1x2400
    dropoutLayer(0.5,'Name','dropout2')                                                 % (26) 1x1x2400
    fullyConnectedLayer(8,'Name','fc3')                                                 % (27)
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.001, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',30, ...
    'MaxEpochs',100, ...
    'Momentum',0.9, ...
    'L2Regularization',0.0005, ...
    'MiniBatchSize',128, ...
    'Verbose',true, ...
    'ExecutionEnvironment','multi-gpu');

%% Training
trainedNet = trainNetwork(datasourcetraining,layers,options);
When 'BackgroundExecution' is true and 'ExecutionEnvironment' is 'auto', the network trains without problems using the machine's 8 CPUs. When 'BackgroundExecution' is false and 'ExecutionEnvironment' is 'gpu', it also trains without problems on a single GPU. But when I set 'ExecutionEnvironment' to 'multi-gpu', the network starts training on the 4 GPUs and, after a while, training is interrupted and MATLAB prints the following messages:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Initializing image normalization.
|=========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning|
| | | (seconds) | Loss | Accuracy | Rate |
|=========================================================================================|
| 1 | 1 | 4.98 | 2.0800 | 14.06% | 0.0010 |
| 4 | 50 | 196.97 | 2.0736 | 14.84% | 0.0010 |
| 8 | 100 | 393.91 | 2.0451 | 11.72% | 0.0010 |
| 11 | 150 | 591.06 | 1.8704 | 26.56% | 0.0010 |
| 15 | 200 | 788.22 | 1.5337 | 47.66% | 0.0010 |
| 18 | 250 | 984.79 | 1.4195 | 48.44% | 0.0010 |
Lab 2:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In nnet.internal.cnn.ParallelTrainer/trainLocal (line 165)
In spmdlang.remoteBlockExecution (line 50)
[The same CUDA_ERROR_LAUNCH_TIMEOUT warning is repeated 14 more times by the pool workers.]
Error using trainNetwork (line 140)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
Error in afr_prueba (line 75)
trainedNet = trainNetwork(datasourcetraining,layers,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 69)
Error detected on worker 2.
Error using parallel.internal.mpi.gopReduce (line 44)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
  1 Comment
Joss Knight on 19 Nov 2017
Edited: Joss Knight on 19 Nov 2017
Well, strictly speaking this is a different question, but okay. Timeouts are a consequence of using graphics cards in WDDM mode on Windows. A quick search will turn up the details.
You can turn off timeouts, or reduce the amount of work your GPUs are doing so they don't occur.
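As a rough sketch, you can check from MATLAB whether a card is subject to the watchdog, and then adjust the standard Windows TDR registry settings (these are Windows settings, not MATLAB options, so treat the values below as illustrative):
d = gpuDevice;
if d.KernelExecutionTimeout
    % 1 means the WDDM watchdog can kill kernels that run too long.
    disp('This GPU is subject to the display-driver timeout (TDR).');
end
% The timeout is controlled by registry values under
% HKLM\System\CurrentControlSet\Control\GraphicsDrivers, e.g.
%   TdrLevel = 0    (disable TDR entirely), or
%   TdrDelay = 60   (raise the timeout, in seconds)
% A reboot is required for the change to take effect.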
I don't know why you're getting timeouts in multi-gpu mode but not on a single GPU. Are your other GPUs much lower powered than your main one, or are they all the same?



Andres Ramirez on 20 Nov 2017
Edited: Andres Ramirez on 20 Nov 2017
Hello, thanks for answering.
The 4 GPUs I have are identical. Could it have something to do with the fact that one of the GPUs also drives the machine's display?
CUDADevice with properties:
Name: 'GeForce GTX 1080'
Index: 1
ComputeCapability: '6.1'
SupportsDouble: 1
DriverVersion: 9
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 7.0066e+09
MultiprocessorCount: 20
ClockRateKHz: 1733500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
I tried adjusting the TdrLevel and reduced the WDDM TDR delay to 0; the results with multi-gpu are the following:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Initializing image normalization.
|=========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning|
| | | (seconds) | Loss | Accuracy | Rate |
|=========================================================================================|
| 1 | 1 | 5.17 | 2.0800 | 14.06% | 0.0010 |
| 4 | 50 | 205.09 | 2.0736 | 14.84% | 0.0010 |
| 8 | 100 | 408.72 | 2.0451 | 11.72% | 0.0010 |
Lab 2:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In nnet.internal.cnn.ParallelTrainer/trainLocal (line 165)
In spmdlang.remoteBlockExecution (line 50)
[The same CUDA_ERROR_LAUNCH_TIMEOUT warning is repeated 14 more times by the pool workers.]
Error using trainNetwork (line 140)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
Error in afr_prueba (line 75)
trainedNet = trainNetwork(datasourcetraining,layers,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 69)
Error detected on worker 2.
Error using parallel.internal.mpi.gopReduce (line 44)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
I have disabled the WDDM TDR and the network now runs with multiple GPUs. I ran two tests: one with a single GPU, which gave an average time of 0.4119 seconds per decade, and one with the four GPUs, which gave an average of 0.9690 seconds per decade.
So with the 4 GPUs the average time per decade is more than twice what I get with a single GPU. I don't understand what is happening; I assumed that using more GPUs should reduce the time considerably.
  1 Comment
Joss Knight on 20 Nov 2017
Edited: Joss Knight on 20 Nov 2017
Unfortunately on Windows the delay for communication between GPUs is significant. You can only manage this by increasing the MiniBatchSize as much as possible, trying to get it to the maximum achievable with your available memory - this improves the compute/communication ratio. It depends on the hardware but it's not always possible on Windows to get multi-gpu to go faster than single GPU. The general advice is to keep the MiniBatchSize per GPU the same. You can also scale up the learning rate commensurately because a large batch size lets you train faster (although sometimes you need to 'boot' your network with a smaller learn rate at first). Also, if running Linux is an option for you that will ameliorate this issue.
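A minimal sketch of that scaling, assuming 4 GPUs and the single-GPU settings from the question (25 images per mini-batch, initial learn rate 0.001); the numbers are illustrative rather than a recommendation:
numGPUs     = 4;
batchPerGPU = 25;                                % keep the per-GPU mini-batch the same
options = trainingOptions('sgdm', ...
    'MiniBatchSize', batchPerGPU*numGPUs, ...    % 100 in total across the pool
    'InitialLearnRate', 0.001*numGPUs, ...       % scale the learn rate with the batch size
    'ExecutionEnvironment', 'multi-gpu');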
The behaviour of TDR is often confusing, with timeouts not necessarily being related (it seems) to the execution time of a single kernel. I don't know why the timeouts still seem to be occurring even after you've disabled them - I've only seen this before when the user has not rebooted after changing the registry keys. Did you reboot?
The fact that one of your cards is running graphics will definitely be interfering. You could try removing it from the pool. One way to do that is to set CUDA_VISIBLE_DEVICES on MATLAB startup to ensure only the non-display cards are used:
setenv CUDA_VISIBLE_DEVICES 0,2,3
...or whatever the indexes of those cards are (noting that the indices for this environment variable are 1 less than the indices shown by gpuDevice).

