Generate neighbors of a string
2 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
Paolo Binetti
am 9 Apr. 2017
Kommentiert: Paolo Binetti
am 3 Jun. 2017
I need to generate a cell array (or char array) containing all d-neighbors of a string of lenght k. This means all the k-long strings which differ from the input at most d characters.
My specific need concerns an alphabet of four letters, A, C, G, T. Here is an example: with string='CTA' and d=2 as inputs, the output array should have 37 elements, like 'AGA', 'CAA', 'GGA', ... and 'CTA'.
My code, attached, is the fastest of several I have tried. The main function "neighbors" calls auxiliary function "immediate_neighbors". The expected inputs of these function are provided as comments in the script.
I am convinced there are ways to achieve this much faster, but could not find any.
3 Kommentare
Stephen23
am 9 Apr. 2017
"I'm not aware of a "j++" operation in Matlab"
There isn't in MATLAB, but there is in Octave.
Akzeptierte Antwort
Guillaume
am 9 Apr. 2017
Bearbeitet: Guillaume
am 9 Apr. 2017
For short strings and small d, your code is slightly faster, but for longer strings with larger d (tested with 18 chars and d = 3) the following is much faster:
function patterns = neighbours(pattern, distance, charset)
if nargin < 3
charset = 'ACGT';
end
%build replacement list for each character:
charsetreps = arrayfun(@(cidx) charset([1:cidx-1, cidx+1:end]), 1:numel(charset), 'UniformOutput', false);
charsetreps = vertcat(charsetreps{:});
%fill patterns for distances from 0 to distance
patterns = {pattern}; %0 distance
for d = 1:distance
%get all combinations of d indices of pattern to replace at once
charidx = nchoosek(1:numel(pattern), d);
%storage for all replacement at d:
allreps = cell(size(charidx, 1), (numel(charset)-1)^d);
%get cartesion product of replacement column in charsetreps
repcols = cell(1, d);
[repcols{:}] = ndgrid(1:numel(charset)-1);
repcols = reshape(cat(d+1, repcols{:}), [], d);
repcols = numel(charset) * (repcols-1); %conversion to linear indexing of the column
%iterate over the charidx combinations
for ci = 1:size(charidx, 1)
[~, reprows] = ismember(pattern(charidx(ci, :)), charset);
%iterate over the replacement characters
for rep = 1:size(repcols, 1)
replacement = pattern;
replacement(charidx(ci, :)) = charsetreps(reprows + repcols(rep, :)); %reprow + repcols creates linear indices
allreps{ci, rep} = replacement;
end
end
patterns = [patterns; allreps(:)]; %#ok<AGROW>
end
end
It is also a lot more generic as it works with any character set, not just 'ACGT'.
However, note that your own code could be optimised by taking the unique and assignment to f out of the loop.
2 Kommentare
Weitere Antworten (0)
Siehe auch
Kategorien
Mehr zu Genomics and Next Generation Sequencing finden Sie in Help Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!