Finding the repeated substrings
Reshma Ravi
on 1 Jun 2017
Answered: Steven Lord
on 14 Aug 2019
I have a DNA sequence, AAGTCAAGTCAATCG, which I split into substrings such as AAGT, AGTC, GTCA, TCAA, CAAG, AAGT, and so on. I then need to find the repeated substrings and their frequency counts. Here, for example, AAGT occurs twice, so I want to get AAGT - 2. How can I do this?
Accepted Answer
KSSV
on 1 Jun 2017
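% Substrings to check for repeats
str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'};
% For each unique substring, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false);
L = cellfun(@length, idx);   % number of occurrences of each unique substring
Ridx = find(L > 1);          % unique substrings that occur more than once
for i = 1:length(Ridx)
    st = str(idx{Ridx(i)});  % index with Ridx(i) so that several repeats are handled
    fprintf('%s string repeated %d times\n', st{1}, length(idx{Ridx(i)}))
end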
More Answers (2)
Andrei Bobrov
on 1 Jun 2017
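A = 'AAGTCAAGTCAATCG';
% Each row of B is one length-4 window of A, sliding by one character
B = hankel(A(1:end-3),A(end-3:end));
% Unique windows in order of first appearance; c maps each row of B to its unique window
[a,~,c] = unique(B,'rows','stable');
% Count the occurrences of each unique window and tabulate them
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});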
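The table above lists every window, including those that occur only once. Since the question asks only for the repeated substrings, a minimal sketch of one way to filter them follows. The variable names `k`, `n`, `counts`, and `repeated` are illustrative assumptions, and the windows are built by plain indexing instead of `hankel`.

```matlab
A = 'AAGTCAAGTCAATCG';
k = 4;                                   % window length (assumed to be 4, as in the question)
n = numel(A) - k + 1;                    % number of sliding windows

% n-by-k char matrix: row i holds A(i:i+k-1)
B = A(bsxfun(@plus, (1:n)', 0:k-1));

% Unique windows in order of first appearance, with their counts
[a, ~, c] = unique(B, 'rows', 'stable');
counts = accumarray(c, 1);

% Keep only the windows that occur more than once
isRep = counts > 1;
repeated = table(a(isRep, :), counts(isRep), 'VariableNames', {'DNA','counts'});
disp(repeated)
```

For this sequence the sketch should keep AAGT, AGTC, GTCA, and TCAA, each with a count of 2.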
5 Comments
Ivan Savelyev
on 14 Aug 2019
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which is counted (with frame size 3) as 6 AAAA sequences, which is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or an algorithm to filter out nested repeats?
Thanks a lot.
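One possible way to handle such nested repeats, offered as a sketch and not as an answer given in this thread, is to count only non-overlapping occurrences of a motif: take every start position from `strfind`, then keep a hit only if it begins after the previously kept hit has ended. The names `motif`, `kept`, and `nextFree` are illustrative assumptions.

```matlab
seq   = 'AAAAAAAAA';
motif = 'AAAA';

% strfind reports every start position, including overlapping ones
allStarts = strfind(seq, motif);

% Greedy left-to-right filter: accept a hit only if it starts at or after
% the first position not covered by the previously accepted hit
kept     = [];
nextFree = 1;
for s = allStarts
    if s >= nextFree
        kept(end+1) = s; %#ok<AGROW>
        nextFree = s + numel(motif);
    end
end

fprintf('%s: %d overlapping, %d non-overlapping\n', ...
    motif, numel(allStarts), numel(kept));
```

For the nine-letter example this keeps the hits starting at positions 1 and 5, so the six overlapping windows reduce to two. Whether a greedy left-to-right choice is the right definition of "not nested" depends on the application; it is only one reasonable convention.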
Steven Lord
on 14 Aug 2019
For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
  1×6 categorical array
     AAGT      AGTC      GTCA      TCAA      CAAG      AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
     2     1     1     1     1
uniquevalues =
  1×5 cell array
    {'AAGT'}    {'AGTC'}    {'CAAG'}    {'GTCA'}    {'TCAA'}
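To apply the same idea to the full sequence instead of a hand-typed list, the windows can be generated first and then passed to `categorical`. The following is a sketch under the assumption that the window length is 4; the names `windows`, `isRepeated`, `repeatedNames`, and `repeatedCounts` are illustrative.

```matlab
seq = 'AAGTCAAGTCAATCG';
k = 4;                                   % window length (assumed, as in the question)
n = numel(seq) - k + 1;                  % number of sliding windows

% Cell array of all length-k windows, sliding by one character
windows = arrayfun(@(i) seq(i:i+k-1), 1:n, 'UniformOutput', false);

% Count each distinct window via a categorical array
C = categorical(windows);
[counts, names] = histcounts(C);

% Report only the windows that occur more than once
isRepeated     = counts > 1;
repeatedNames  = names(isRepeated);
repeatedCounts = counts(isRepeated);
for j = 1:numel(repeatedNames)
    fprintf('%s - %d\n', repeatedNames{j}, repeatedCounts(j));
end
```

The loop prints each repeated substring in the "AAGT - 2" form requested in the question.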