SPMD vs PARFOR and Memory Usage

Question

1 vote

Dear all,

I am trying to migrate some of my code from parfor to spmd so that I can use codistributed arrays and save a considerable amount of memory (parfor copies the huge matrix var2 to every worker). My hope is to distribute var2 across the workers and save some memory that way. I am running a toy example on our Linux servers using the following function:

function accum_results=try_memory_spmd()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results_temp=zeros(100,600000);
upper_bound=100;
lower_bound=1;
spmd
    D = codistributed(var1,codistributor('1d', 2));
    temp=getLocalPart(D);
    globalInd=globalIndices(D, 2);
    local_lower_bound = find(globalInd == lower_bound, 1);
    if ~isempty(local_lower_bound)
        fprintf('The lower bound found in Lab %d, indice %d\n',labindex,local_lower_bound);
    end
    if isempty(local_lower_bound) && min(globalInd)>lower_bound
        local_lower_bound=1;
    end
    local_upper_bound = find(globalInd == upper_bound, 1);
    if ~isempty(local_upper_bound)
        fprintf('The upper bound found in Lab %d, indice %d\n',labindex,local_upper_bound);
    end
    if isempty(local_upper_bound) && max(globalInd)<upper_bound
        local_upper_bound=size(temp,2);
    end
    if ~(isempty(local_upper_bound) || isempty(local_lower_bound))
        for j = local_lower_bound:local_upper_bound
            accum_results_temp = accum_results_temp+bsxfun(@times,var2,temp(:,j));
        end
        fprintf('Lab %d works between indice %d and %d \n',labindex,local_lower_bound,local_upper_bound);
    else
        fprintf('No work for Lab %d!!\n',labindex);
    end
    D=[];
    temp=[];
    var2=[];
end
accum_results=zeros(100,600000);
for cell_ind=1:length(accum_results_temp)
    accum_results=accum_results+accum_results_temp{cell_ind};
end
toc
end

Note that the sizes are fairly large, so you may need to reduce the matrix sizes. When I profile the code, the bottleneck appears to be the final for loop, which adds up all the entries of the Composite object returned by the spmd block (the resulting matrix is 100x600000). I also note that the PARFOR implementation of the same code finishes in about half the time. The additional logic in the spmd code (the if-checks etc.) has no visible impact on performance. Using something like cell2mat would defeat the purpose of the spmd implementation, since it would create a copy of the data already stored on the workers. I'd be very grateful for any idea or inspiration that lets me use parallel code without excessive memory usage. Thanks in advance.

Cem

P.S. Here's the PARFOR implementation:

function accum_results=try_memory()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results=zeros(100,600000);
parfor i=1:100
    accum_results=accum_results+bsxfun(@times,var2,var1(:,i));
end
toc
end

Answer 1

Edric Ellis on 30 Sep 2014

3 votes

Here's a version that I've reworked quite a bit to use codistributed arrays hopefully a little more effectively (it runs about twice as quickly on my machine here).

tic
N = 100;
M = 600000;
spmd
    % Build 'var1' on the workers directly to avoid communication
    var1=linspace(1,N,N);
    % Build 'var2' directly in codistributed form to save memory
    var2=2 * codistributed.linspace(0.001,0.02,M);
    %
    % We will operate on the local part of var2
    var2_lp = getLocalPart(var2);
    var2_cod = getCodistributor(var2);
    %
    % Build accum_results as a codistributed array directly, and ensure
    % it uses the same codistributor as var2 so that we can operate
    % on the local parts of the arrays together.
    accum_results = codistributed.zeros(N, M, var2_cod);
    %
    % Get the local part out of accum_results so that we can operate on it directly.
    ar_local = getLocalPart(accum_results);
    ar_codist = getCodistributor(accum_results);
    %
    % Loop 1:N applying the BSXFUN to the relevant local parts
    for idx = 1:N
        ar_local = ar_local + bsxfun(@times, var2_lp, repmat(var1(idx), N, 1));
    end
    % Put accum_results back together
    accum_results = codistributed.build(ar_local, ar_codist);
end
% Gather the results back to the host
accum_results=gather(accum_results);
toc
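
A quick way to sanity-check either implementation, offered only as a sketch: every row of var1 in the question is 1:N, so the accumulated sum should collapse to sum(1:N) * var2 in every row. The snippet below checks that against a plain serial loop on a reduced problem. The size M_small and the variable names serial_result, closed_form and max_abs_diff are my own illustrative choices, not from the thread, and no parallel pool is needed.

```matlab
% Serial reference on a reduced problem (M_small is an illustrative size,
% not the one used in the thread).
N = 100;
M_small = 1000;
var1 = repmat(linspace(1, N, N), N, 1);
var2 = 2 * linspace(0.001, 0.02, M_small);

% Same accumulation as the PARFOR version, run serially.
serial_result = zeros(N, M_small);
for i = 1:N
    serial_result = serial_result + bsxfun(@times, var2, var1(:, i));
end

% Closed form: each row equals (1 + 2 + ... + N) * var2.
closed_form = repmat(sum(1:N) * var2, N, 1);

max_abs_diff = max(abs(serial_result(:) - closed_form(:)));
fprintf('Largest deviation from closed form: %g\n', max_abs_diff);
```

The same comparison should apply to the gathered accum_results from the spmd code above, with M in place of M_small, provided the client has room for a second full-size array.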

Comments

Edric Ellis on 1 Oct 2014

Actually, the different labs are working on different slices of var2 and putting their results into their own slice of accum_results (via the local part ar_local). The codistributed.build takes the local part slices and reconstitutes a whole array.
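
To make the slice-and-rebuild idea concrete, here is a minimal sketch on a tiny array. It assumes Parallel Computing Toolbox and a parallel pool are available. The 2-by-8 size and the names A, A_local, cols and A_full are illustrative and do not come from the thread.

```matlab
% Requires Parallel Computing Toolbox and a parallel pool.
spmd
    % A small 2-by-8 array distributed by columns (sizes are illustrative).
    A = codistributed.zeros(2, 8, codistributor1d(2));
    codist = getCodistributor(A);

    % Global column indices of the slice held by this worker.
    cols = globalIndices(A, 2);

    % Build this worker's slice: each local column holds its global
    % column number.
    A_local = repmat(cols, 2, 1);

    % Reassemble the per-worker slices into one codistributed array.
    A = codistributed.build(A_local, codist);
end

% Collect the whole array on the client.
A_full = gather(A);
```

Each worker only ever touches its own columns, and codistributed.build stitches them back together using the codistributor taken from the original array. After the gather, each row of A_full should be 1:8.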

Cem on 1 Oct 2014

Yes, that makes so much sense. By doing so we don't need to create the huge accum_array at full size on each of the workers! Thanks once more.
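
For a rough sense of the saving, here is a back-of-the-envelope estimate for the sizes used in this thread. It is only a sketch: num_workers is a hypothetical pool size, and the even split ignores any overhead MATLAB adds on top of the raw data.

```matlab
% Back-of-the-envelope memory estimate (a double takes 8 bytes).
N = 100;
M = 600000;
num_workers = 4;                     % hypothetical pool size, not from the thread

full_MB  = N * M * 8 / 1e6;          % one full N-by-M double array
slice_MB = full_MB / num_workers;    % per-worker share when split by columns

fprintf('Full array: %.0f MB\n', full_MB);
fprintf('Per-worker slice with %d workers: %.0f MB\n', num_workers, slice_MB);
```

With these numbers a full copy is 480 MB, while a column slice on one of four workers is 120 MB.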
