Generate neighbors of a string

Question

Paolo Binetti am 9 Apr. 2017

0
Verknüpfen

Direkter Link zu dieser Frage

https://de.mathworks.com/matlabcentral/answers/334445-generate-neighbors-of-a-string

Kommentiert: Paolo Binetti am 3 Jun. 2017

I need to generate a cell array (or char array) containing all d-neighbors of a string of lenght k. This means all the k-long strings which differ from the input at most d characters.

My specific need concerns an alphabet of four letters, A, C, G, T. Here is an example: with string='CTA' and d=2 as inputs, the output array should have 37 elements, like 'AGA', 'CAA', 'GGA', ... and 'CTA'.

My code, attached, is the fastest of several I have tried. The main function "neighbors" calls auxiliary function "immediate_neighbors". The expected inputs of these function are provided as comments in the script.

I am convinced there are ways to achieve this much faster, but could not find any.

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Stephen23 am 9 Apr. 2017

"I'm not aware of a "j++" operation in Matlab"

There isn't in MATLAB, but there is in Octave.

Paolo Binetti am 9 Apr. 2017

Bearbeitet: Paolo Binetti am 9 Apr. 2017

Thank you for the feedback. I have documented the inputs in the function script, corrected the code, and tested on Matlab.

Melden Sie sich an, um zu kommentieren.

Melden Sie sich an, um diese Frage zu beantworten.

Answer 1

Guillaume am 9 Apr. 2017

1
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/334445-generate-neighbors-of-a-string#answer_262403

Bearbeitet: Guillaume am 9 Apr. 2017

In MATLAB Online öffnen

For short strings and small d, your code is slightly faster, but for longer strings with larger d (tested with 18 chars and d = 3) the following is much faster:

function patterns = neighbours(pattern, distance, charset)
    if nargin < 3
       charset = 'ACGT';
    end
    %build replacement list for each character:
    charsetreps = arrayfun(@(cidx) charset([1:cidx-1, cidx+1:end]), 1:numel(charset), 'UniformOutput', false);
    charsetreps = vertcat(charsetreps{:});
    %fill patterns for distances from 0 to distance
    patterns = {pattern};  %0 distance
    for d = 1:distance
        %get all combinations of d indices of pattern to replace at once
        charidx = nchoosek(1:numel(pattern), d);
        %storage for all replacement at d:
        allreps = cell(size(charidx, 1), (numel(charset)-1)^d);
        %get cartesion product of replacement column in charsetreps
        repcols = cell(1, d);
        [repcols{:}] = ndgrid(1:numel(charset)-1);
        repcols = reshape(cat(d+1, repcols{:}), [], d);
        repcols = numel(charset) * (repcols-1);  %conversion to linear indexing of the column
        %iterate over the charidx combinations
        for ci = 1:size(charidx, 1)
            [~, reprows] = ismember(pattern(charidx(ci, :)), charset);
            %iterate over the replacement characters
            for rep = 1:size(repcols, 1)
                replacement = pattern;
                replacement(charidx(ci, :)) = charsetreps(reprows + repcols(rep, :));  %reprow + repcols creates linear indices
                allreps{ci, rep} = replacement;
            end
        end
        patterns = [patterns; allreps(:)];  %#ok<AGROW>
    end
end

It is also a lot more generic as it works with any character set, not just 'ACGT'.

However, note that your own code could be optimised by taking the unique and assignment to f out of the loop.

2 Kommentare
Keine anzeigenKeine ausblenden

Paolo Binetti am 16 Apr. 2017

Bearbeitet: Paolo Binetti am 16 Apr. 2017

Guillaume, thank you for your help. Your code is indeed faster: with d = 3, I start seeing a difference when patterns are longer than 6 chars. Still, it takes quite some time for a task that seemed so simple at first sight.

As for my code, I have tried taking "unique" out of the loop, but it's actually much slower. Perhaps I did not understand your suggestion.

Paolo Binetti am 3 Jun. 2017

In MATLAB Online öffnen

Guillaume, after spending a while thinking of better ways to do it, I decided to take a closer look at your code. I was able to vectorize most of the inner loop, so now the code is significantly faster even with shorter strings:

function patterns = neighbors_4(pattern, distance, charset)
      if nargin < 3
       charset = 'ACGT';
    end
    %build replacement list for each character:
    charsetreps = arrayfun(@(cidx) charset([1:cidx-1, cidx+1:end]), 1:numel(charset), 'UniformOutput', false);
    charsetreps = vertcat(charsetreps{:});
    %fill patterns for distances from 0 to distance
    patterns = {pattern};  %0 distance
    for d = 1:distance
        %get all combinations of d indices of pattern to replace at once
        charidx = nchoosek(1:numel(pattern), d);
        %storage for all replacement at d:
        allreps = cell(size(charidx, 1), (numel(charset)-1)^d);
        %get cartesion product of replacement column in charsetreps
        repcols = cell(1, d);
        [repcols{:}] = ndgrid(1:numel(charset)-1);
        repcols = reshape(cat(d+1, repcols{:}), [], d);
        repcols = numel(charset) * (repcols-1);  %conversion to linear indexing of the column
        rep_mat = pattern(ones(size(repcols, 1), 1), :);
        if d == 1
          [~, reprows] = ismember(pattern(charidx)', charset);
        else
          [~, reprows] = ismember(pattern(charidx),  charset);
        end
        %iterate over the charidx combinations
        for ci = 1:size(charidx, 1)
            ccc = charidx(ci, :); %%%
            replacement_mat = rep_mat;
            replacement_mat(:, ccc) = charsetreps(reprows(ci, :) + repcols);
            allreps(ci, :) = cellstr(replacement_mat)';
        end
        patterns = [patterns; allreps(:)];  %#ok<AGROW>
    end
patterns = patterns';
end

In addition, I noticed that the computation of charsetreps is independent of the specific input sequence, and it is relatively time-consuming. Therefore, if one needs to call the function multiple times (for example to compute the neighbours of hundreds of different sequences), it is better to compute charsetreps just once outside of the function and provide it as an input.

So, ultimately, your function with these two changes, helped increade the speed of the main code using this function by about 10x.

Thank you again and congrats for providing the function so quickly!

Melden Sie sich an, um zu kommentieren.

Generate neighbors of a string

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Akzeptierte Antwort

2 Kommentare
Keine anzeigenKeine ausblenden

Weitere Antworten (0)

Siehe auch

Kategorien

Tags

Community Treasure Hunt

Generate neighbors of a string

3 Kommentare 1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

Akzeptierte Antwort

2 Kommentare Keine anzeigenKeine ausblenden

Weitere Antworten (0)

Siehe auch

Kategorien

Tags

Community Treasure Hunt

3 Kommentare
1 älteren Kommentar anzeigen1 älteren Kommentar ausblenden

2 Kommentare
Keine anzeigenKeine ausblenden