Read text file lines and analyze

Lmm3 on 24 Jul 2017
Answered: OCDER on 9 Sep 2017
I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:
>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT
GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC
AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG
TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA
CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC
CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA
I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.
I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.
fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
    templine = fgetl(fid);
    a = strcmp(templine, '>');
    if a == 0
        G = length(strfind(templine,'G'));
        C = length(strfind(templine,'C'));
        z = length(templine);
        %Per = (G+C)*100/z
    end
end
Per = (G+C)*100/z

Accepted Answer

Lmm3 on 9 Sep 2017
The following code is what I used to read from the data file and determine %GC:
fid = fopen('rosalind_gc.txt');
G = 0;   % running count of G in the current data set
C = 0;   % running count of C in the current data set
z = 0;   % running length of the current data set
while ~feof(fid)
    templine = fgetl(fid);
    if isempty(strfind(templine, '>'))
        % Sequence line: add its counts to the running totals
        G = G + length(strfind(templine, 'G'));
        C = C + length(strfind(templine, 'C'));
        z = z + length(templine);
    else
        % Header line: report the data set that just ended, then reset
        if z > 0
            Per = (G + C)*100/z
        end
        disp(templine)
        G = 0;
        C = 0;
        z = 0;
    end
end
fclose(fid);
% Report the last data set, which has no header after it
Per = (G + C)*100/z

More Answers (2)

KSSV on 24 Jul 2017
Edited: KSSV on 24 Jul 2017
Let data.txt be your text file. You can count the number of G characters in the file as follows:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
    N = N+length(strfind(S{i}, 'G'));
end
Without a loop:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
1 Comment
Lmm3 on 25 Jul 2017
KSSV, thank you for your response. Could you explain what the line S = S{1} is doing? The code returns the total number of "G" occurrences in the data file, but do you have a suggestion for how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values: the number of G occurrences between “>Rosalind_4949” and “>Rosalind_7490”, between “>Rosalind_7490” and “>Rosalind_8337”, and below “>Rosalind_8337”.
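On the two points raised in this comment: textscan returns its result wrapped in a 1-by-1 cell array, with one cell per format specifier, so S = S{1} simply unwraps the N-by-1 cell array of lines. Per-header counts can then be obtained by numbering the lines by record and summing within each record. The following is only a minimal sketch, assuming every sequence line comes after a header line. The in-memory text, the record names, and the variables isHeader, recordId, nG and GperRecord are illustrative and do not come from the thread.

```matlab
% Made-up example text so the sketch runs without a file on disk.
txt = sprintf('>rec1\nGGAT\nCG\n>rec2\nATAT\n>rec3\nGCGG');

S = textscan(txt, '%s', 'delimiter', '\n');  % S is a 1-by-1 cell array
S = S{1};                                    % unwrap: N-by-1 cell array of lines

isHeader = strncmp(S, '>', 1);               % true on lines that start with '>'
recordId = cumsum(isHeader);                 % record number of every line

nG = cellfun(@(s) sum(s == 'G'), S);         % number of G characters per line
nG(isHeader) = 0;                            % do not count a G inside a header

GperRecord = accumarray(recordId, nG);       % one total per record
disp(GperRecord.')                           % [3 0 3] for the text above
```

If the file had sequence lines before the first header, recordId would contain zeros and accumarray would raise an error, so such lines would need to be dropped first.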



OCDER on 9 Sep 2017
If you deal with a lot of FASTA files, look into fastaread (MATLAB Bioinformatics Toolbox) or readFasta (code I wrote for another project).
cellfun and regexp are also handy tools.
To get GC %:
[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
48.5876
45.0000
55.1724
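For readers without the Bioinformatics Toolbox, the same idea can be sketched with base functions only. This is not OCDER's readFasta, whose source is not shown in the thread. It is a rough stand-in that assumes the text starts with a header line and that each record's sequence may span several lines. The example text and the names txtLines, isHeader and recordId are made up for illustration.

```matlab
% Made-up FASTA text; sequences may span several lines.
txt = sprintf('>rec1\nGGAT\nCG\n>rec2\nATAT\n>rec3\nGCGGA');

txtLines = regexp(txt, '\r?\n', 'split');          % 1-by-N cell array of lines
txtLines = txtLines(~cellfun(@isempty, txtLines)); % drop blank lines

isHeader = strncmp(txtLines, '>', 1);              % header lines start with '>'
recordId = cumsum(isHeader);                       % record number of every line
nRec     = recordId(end);

Header = txtLines(isHeader);
Seq    = cell(1, nRec);
for k = 1:nRec
    % Concatenate all sequence lines that belong to record k
    Seq{k} = [txtLines{recordId == k & ~isHeader}];
end

% Percentage of G and C per record (case-insensitive)
PercGC = cellfun(@(s) 100 * sum(upper(s) == 'G' | upper(s) == 'C') / length(s), Seq);
```

A header with no sequence lines after it gives an empty sequence and therefore NaN for that record. To run this on a real file, the text could be read with fileread instead of being built with sprintf.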
