Multiple GPU setup slower than single GPU

For my research I have to perform many repetitions of the same optimization (for statistics). I have already found that my fitness function is much faster on the GPU, so I perform those calculations on the available GPUs. Fortunately, I have 3 GPUs at my disposal, and I worked out a scheme where I open a parallel pool and, using parfeval, assign each GPU to a different optimization.
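The scheme described above can be sketched roughly like this (runOptimization is a placeholder name for the actual optimization routine, which is not shown):

```matlab
% Open one worker per GPU and bind each worker to its own device.
pool = parpool(3);
spmd
    gpuDevice(labindex);   % worker k selects GPU k
end

% Launch one optimization per worker/GPU (runOptimization is hypothetical).
for k = 1:3
    f(k) = parfeval(pool, @runOptimization, 1, k);
end
results = fetchOutputs(f);
```

The device selected inside the spmd block persists on each worker, so later parfeval calls on that worker keep using the same GPU.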
When I checked the performance of this setup, I noticed that the speed of a single GPU drops a lot (by about half) when it is used in the multiple-GPU setup (3 workers) compared to a single-GPU setup (1 worker).
I rechecked the implementations and saw no sign that data has to be sent from one GPU to another, so they never have to be synchronized.
Solutions I have tried:
- Make a separate fitness-function M-file for each GPU (did not work)
- Open a MATLAB instance for each GPU separately (did not work)
Any suggestions on this problem are appreciated.

9 Comments

Joss Knight
Joss Knight on 28 Apr 2018
On the face of it, you are accidentally using the same GPU on all the workers. What index is reported by gpuDevice on each of your workers?
You should be able to get this working by running three MATLAB sessions. You just have to manually select a different gpuDevice on each one.
Beyond that I think we'd have to see some example code that reproduces your problem.
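A quick way to check the first point, i.e. that each worker really has a different device selected:

```matlab
% Print the selected GPU on every worker in the pool.
spmd
    d = gpuDevice;
    fprintf('Worker %d: GPU %d (%s)\n', labindex, d.Index, d.Name);
end
```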
arvid Martens
arvid Martens on 9 May 2018
Edited: arvid Martens on 9 May 2018
I actually force them onto different GPUs with an spmd statement.
However, I've noticed that, for some reason, if I initialize a MATLAB instance (not running any code) on a single GPU, the memory of the others changes as well. Maybe it has something to do with the drivers?
Joss Knight
Joss Knight on 10 May 2018
Can you explain exactly what you mean by that? The memory on GPUs 2 and 3 changes when you select GPU 1?
arvid Martens
arvid Martens on 11 May 2018
Apparently, I was wrong about this. What happens is that when I run my code with a parpool of size 1 (so one GPU is used), some memory is used by the other GPUs. However, when I use a normal for loop (also one GPU), no additional memory is used by GPUs 2 and 3.
Also, to free the memory used by GPUs 2 and 3, I need to reset them manually.
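A minimal sketch of that manual reset (note that this destroys any gpuArray data still on those devices):

```matlab
% Select and reset GPUs 2 and 3 in turn, then return to the default device.
for idx = 2:3
    reset(gpuDevice(idx));
end
gpuDevice(1);
```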
arvid Martens
arvid Martens on 11 May 2018
Edited: arvid Martens on 11 May 2018
I have found out when the problem occurs. It happens whenever the GPUs (when they are doing the same task in parallel) have to perform an ifft on a large dataset, say 6000x20x4096, along the third dimension. I also tested whether the problem is due to the fft being along the third dimension by doing it along the first dimension, but the problem is still there.
Maybe you can use this as sample code:
for ii = 1:1000
    T = gpuArray(rand(20,751,10,150));           % move random data onto the GPU
    signals = ifft(T,4096,4,'symmetric').*4096;  % zero-padded ifft along dimension 4
end
I've noticed that if the code is run on GPU 1 (Titan V), it has a steady utilization percentage (not going below 50%). However, when a second instance of MATLAB is booted and runs this code at the same time on GPU 2, the utilization of the first GPU wobbles around more and goes to zero quite often (slowing down the calculation).
My system contains:
- Titan V (default GPU)
- Tesla K40
- Quadro K6000
Joss Knight
Joss Knight on 14 May 2018
Edited: Joss Knight on 14 May 2018
It doesn't surprise me that your code runs slower on 3 GPUs than on one, because the Quadro K6000 will be hundreds of times slower at double-precision computation than the other two cards; your whole computation is sitting waiting for the Quadro card to finish.
As for what is going on with memory usage, can you explain more? When you run the above code as written, do you see memory being used on the unselected cards? If so, how much? How are you measuring that? And what is your operating system?
I do see impact on a Quadro card from loading and running MATLAB but of course that card is doing graphics so it's not particularly surprising.
I ran this on a machine with 4 Titan XP GPUs in TCC mode. I found that there was a very small impact on unselected devices for each worker (8MiB of memory). Loading the CUDA driver into a process incurs some memory costs for each device; and then when you create a CUDA context by selecting a device, a large chunk of memory is reserved, for each process.
I also ran this on a machine with three different GPUs, like yours, and saw much the same behaviour, with or without pools.
arvid Martens
arvid Martens on 14 May 2018
The DP precision of the Quadro is at the same level as the Tesla. It is only in the newer architectures (Maxwell, Pascal) that the DP performance of Quadros is low compared to Teslas.
My problem actually occurs when the GPUs are working independently, so three separate MATLAB sessions with the variable T loaded. If I perform the ifft on a single GPU, the utilization is at a stable 60% (Titan V); however, when a second operation is started in another MATLAB instance with a different GPU, the utilization of the first GPU drops (and fluctuates). The second GPU also fluctuates, and the performance of both GPUs drops.
In my current model I have circumvented the problem by limiting the amount of data in the variable T. I noticed that if the amount of data is below a threshold, the problem does not occur and GPU utilization is at a stable 95%-100% on all three of them. Above the threshold, the utilization starts to fluctuate and calculation time increases.
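That workaround might look something like the sketch below; the chunk size of 100 columns is an illustrative guess, not the measured threshold:

```matlab
% Process T in slices along dimension 2; the ifft runs along dimension 4,
% so the slices are independent and the combined result is unchanged.
T = gpuArray(rand(20,751,10,150));
chunk = 100;
signals = zeros(20,751,10,4096,'gpuArray');
for c = 1:chunk:size(T,2)
    idx = c:min(c+chunk-1, size(T,2));
    signals(:,idx,:,:) = ifft(T(:,idx,:,:),4096,4,'symmetric').*4096;
end
```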
Joss Knight
Joss Knight on 17 May 2018
You're right, sorry (about the double precision performance).
I wouldn't put too much stock in the utilization measure; it is only weakly linked to performance. Much better would be to look at how long it takes to run your code.
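For example, timing one pass of the sample code directly (wait ensures the GPU has actually finished before toc):

```matlab
d = gpuDevice;
tic
T = gpuArray(rand(20,751,10,150));
signals = ifft(T,4096,4,'symmetric').*4096;
wait(d)   % block until all queued GPU work is done
toc
```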
The only thing I can think of is that you are being limited by shared system resources. All three processes are sharing the PCI bus and system memory - perhaps there is a lot of data transfer. Or perhaps you are doing some large computations on the CPU that use all your cores? Even some GPU functions do that because they are hybrid algorithms (e.g. mldivide, eig, chol etc). Waiting for the CPU would slow the rate at which kernels are being launched on the GPU.
If you are running on Linux it would be interesting to see whether you can get any benefit out of using the Multi-Process Service.
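On Linux, starting MPS before launching the MATLAB sessions looks roughly like this (the directories are the documented defaults; adjust for your system):

```shell
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d          # start the MPS control daemon

# ... launch the MATLAB sessions ...

echo quit | nvidia-cuda-mps-control # shut the daemon down afterwards
```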


Answers (0)

Asked: 24 Apr 2018

Commented: 17 May 2018
