This is not uncommon. There is communication overhead with the GPU. The GPU is most effective when you do extensive computation with little data transfer (which does not necessarily mean that the matrices being computed with are small). If you do only a little computing on large matrices that have to be transferred, then although the computation itself might be very fast, you still have to wait for the data to transfer in both directions. If you are going to do further computation on the data, leave a copy of it on the GPU even if you also want a CPU copy, so that you do not need to transfer it up to the GPU again.
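To make the trade-off concrete, here is a rough back-of-the-envelope cost model in Python. It is only a sketch: the transfer rate and the CPU and GPU throughput figures are made-up assumptions for illustration, not measurements of any real hardware, and the function names are hypothetical. It compares a cheap element-wise add against a matrix multiply on the same data, and then compares re-uploading the data for every operation against leaving it on the GPU.

```python
# Rough cost model. All three rates are illustrative assumptions, not measurements.
TRANSFER_BYTES_PER_S = 12e9   # assumed host <-> GPU transfer rate
GPU_FLOPS = 5e12              # assumed sustained GPU throughput
CPU_FLOPS = 1e11              # assumed sustained CPU throughput


def gpu_time(bytes_up, bytes_down, flops):
    """Seconds to upload the inputs, compute on the GPU, and download the result."""
    transfer = (bytes_up + bytes_down) / TRANSFER_BYTES_PER_S
    compute = flops / GPU_FLOPS
    return transfer + compute


def cpu_time(flops):
    """Seconds to do the same work on the CPU, with no transfer at all."""
    return flops / CPU_FLOPS


n = 4096
matrix_bytes = n * n * 8          # one double-precision n-by-n matrix

# Little computing on large data: C = A + B costs about n^2 operations.
add_flops = n * n
add_gpu = gpu_time(2 * matrix_bytes, matrix_bytes, add_flops)
add_cpu = cpu_time(add_flops)

# Extensive computing on the same data: C = A * B costs about 2*n^3 operations.
mul_flops = 2 * n ** 3
mul_gpu = gpu_time(2 * matrix_bytes, matrix_bytes, mul_flops)
mul_cpu = cpu_time(mul_flops)

# Ten multiplies in a row: round trip every time versus data left on the GPU.
k = 10
round_trip_each_time = k * gpu_time(2 * matrix_bytes, matrix_bytes, mul_flops)
kept_on_gpu = gpu_time(2 * matrix_bytes, matrix_bytes, k * mul_flops)

print(f"add:      GPU {add_gpu:.6f} s   CPU {add_cpu:.6f} s")
print(f"multiply: GPU {mul_gpu:.6f} s   CPU {mul_cpu:.6f} s")
print(f"{k} multiplies: round trips {round_trip_each_time:.4f} s, "
      f"resident {kept_on_gpu:.4f} s")
```

Under these assumed rates the add is dominated by the transfer and comes out slower on the GPU than on the CPU, the multiply is still much faster on the GPU despite the same transfer, and keeping the data on the GPU across repeated operations removes most of the transfer cost. The actual crossover point depends on your hardware, so time your own code rather than relying on these numbers.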