Can I eliminate loop from this function?

Question

Noushin Farnoud am 9 Mär. 2016

0
Verknüpfen

Direkter Link zu dieser Frage

https://de.mathworks.com/matlabcentral/answers/272385-can-i-eliminate-loop-from-this-function

Geschlossen: MATLAB Answer Bot am 20 Aug. 2021

In MATLAB Online öffnen

Hello Matlab Community,

I have a table T with couple of million lines and 3 fields: gene1, gene2, sample. Here is what I want to do:

1) Find rows that have the same gene1-gene2 value (both string): This can be easily done by unique function.

2) For each unique pair of gene1-gene2, find their unique "sample", count them and generate a string from those sample-ids. This is where I have problem with. Below is what I have done which works but looping over a large table takes a long time. I was thinking maybe passing a function handle and doing a "varfun" improves it, but would love to hear if you have a better solution?

 [T_uniq]= unique(T(:,1:2),'rows'); 
 T_uniq_annot= myFun(T_uniq, T);
 function [x]= myFun(x, A)
    for i=1:size(x,1)
        ix_A=(strcmp(x.gene1(i),A.gene1) & strcmp(x.gene2(i),A.gene2)); 
        temp=unique(A.sample(ix_A));
        x.uniq_samples(i)=numel(temp);
        if numel(temp)>1
            temp=temp.';
            x.sample(i) = cellstr(strjoin(temp,', '));
        else 
            x.sample(i)= cellstr(temp);
        end
        clear ix_A temp;
    end

Thank You,

Nfarnoud

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Jan am 9 Mär. 2016

In MATLAB Online öffnen

"size(x, 1)" looks like x is an array. "x.sample(i)" looks like x is a scalar. Is this a typo in pseudo code?

What is the type of "temp"? Is this a cell string already? If so:

x.sample{i} = strjoin(strjoin(temp.', ', ');

is faster. Is "x.sample" pre-allocated?

Noushin Farnoud am 9 Mär. 2016

Both x and A are tables.

x is in fact a subset of the original table T with unique gene1-gene2 columns. In line 1 of the example above, I had called it "T_uniq". So inputs to myFun are:

x : T_uniq

A: T

In the function size(x,1) will give the total number of rows in table x (uniq_T). Then I loop over each line using 'i' to search for the line in table A (original T) to access it's "sample" field. This loop is what I like to avoid!

I have not done pre-allocation for x.sample. I will test it. Thanks.

Diese Frage ist geschlossen.

Answer 1

Jan am 9 Mär. 2016

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/272385-can-i-eliminate-loop-from-this-function#answer_212985

In MATLAB Online öffnen

Is x.sample pre-allocated?

function [x]= myFun(x, A)
n = size(x.gene1, 1);          % Is "size(x, 1)" a typo?!
x.sample = cell(1, n);         % Pre-allocate!!!
x.uniq_samples = zeros(1, n);  % Pre-allocate!!!
for i = 1:n
   ix_A = (strcmp(x.gene1(i),A.gene1) & strcmp(x.gene2(i),A.gene2)); 
   temp = unique(A.sample(ix_A));
   x.uniq_samples(i) = numel(temp);
   x.sample{i} = strjoin(strjoin(temp.', ', ');
   % Don't clear inside the loop, this wastes time only
end

Fex: CStrCatStr might be faster than strjoin.

Is A.gene1 and A.gene2 large? Then sorting them might be useful, because this is done in each call of unique again.

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Ced am 9 Mär. 2016

In MATLAB Online öffnen

All columns in the table have the same length, so I believe

size(x,1)

should be fine, no? Maybe even a tiny bit faster than size(x.gene1,1).

Answer 2

Ced am 9 Mär. 2016

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/272385-can-i-eliminate-loop-from-this-function#answer_212989

Bearbeitet: Ced am 9 Mär. 2016

In MATLAB Online öffnen

Do your genes have upper and lower case letters? Otherwise, use strcmpi for a minor speed-up.

instead of strcmp.

Small dummy example gave the following:

base = 'gAuC';
N = 10000000; % we have a lot of genes
Tt = table(base(randi(4,N,4)),base(randi(4,N,4))); % create dummy genes
tic
Tt_unique = unique(Tt,'rows');
toc
M = 1000; % repeat tests 1000 times
% Using strcmp
tic
bla1 = zeros(M,1);
for i = 1:1000
    ix1 = strcmp(Tt_unique(:,1),'gauc');
    bla1(i) = numel(Tt_unique.Var1(ix1));
end
toc
% Using strcmpi
tic
bla2 = zeros(M,1);
for i = 1:1000
    ix2 = strcmpi(Tt_unique(:,1),'gauc');
    bla2(i) = numel(Tt_unique.Var1(ix2));
end
toc
% passing logical directly
tic
bla3 = zeros(M,1);
for i = 1:1000
    bla3(i) = numel(Tt_unique.Var1(strcmpi(Tt_unique(:,1),'gauc')));
end
toc
% Speed-up by sorting first?
tic
[ ~, ind_sort ] = sort(Tt.Var1);
Tt = Tt(ind_sort,:);
Tt_unique = unique(Tt,'rows');
toc

Result:

Elapsed time is 17.418828 seconds. % just applying unique
Elapsed time is 0.197103 seconds.  % using strcmp
Elapsed time is 0.178446 seconds.  % using strcmpi
Elapsed time is 0.177748 seconds.  % not saving the logical
Elapsed time is 73.118320 seconds.

I have no idea was real genes look like, so your results might be different. :) As we can see, the main burden comes from "unique" though, unless your samples are also huge. Sorting your table as @Jan suggested sounds like a good idea, but seems to be extremely slow here. Maybe someone knows what the mistake is?

EDIT: There is actually no need for

ix_A = (strcmp(x.gene1(i),A.gene1) & strcmp(x.gene2(i),A.gene2));

Using

[ T_uniq, index, rev_index ] = unique(T(:,1:2),'rows');

You can now extract how often each element of T_uniq appears from rev_index, i.e. just check how often each rev_index is repeated. This integer count should be a lot faster than checking two strings.

2 Kommentare
Keine anzeigenKeine ausblenden

Noushin Farnoud am 9 Mär. 2016

Bearbeitet: Noushin Farnoud am 9 Mär. 2016

Thanks @Jan and @Ced for great tips. The x table has about 676K lines and thus sorting based on x.gene1 field was very fast.

I will test the function with sorted x as well as using strcmpi (my data (gene names) is case-insensitive).

I will post an update. I guess there is no way I can completely get ride of this loop!

Ced am 9 Mär. 2016

Bearbeitet: Ced am 9 Mär. 2016

There might be, but I don't think cellfun & co would be the solution. They can be elegant because they allow to write more compact code. However, they are not always faster, since they generally have to loop internally as well, and on top of that may generate a significant overhead.

Can I eliminate loop from this function?

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Antworten (2)

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

2 Kommentare
Keine anzeigenKeine ausblenden

Siehe auch

Tags

Community Treasure Hunt

Can I eliminate loop from this function?

3 Kommentare 1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Antworten (2)

1 Kommentar -1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

2 Kommentare Keine anzeigenKeine ausblenden

Siehe auch

Tags

Community Treasure Hunt

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

2 Kommentare
Keine anzeigenKeine ausblenden