Remove intermittent text when reading in a table from a .dat file

1 Ansicht (letzte 30 Tage)
Hi,
I am trying to use readtable for read in a .dat file. The file looks like this, where there could be 1 to very many entries in the columns that start with a "1'" here.
# NetMHCIIpan version 4.0
# Input is in PEPTIDE format
# Prediction Mode: EL+BA
# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank) 10%
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10101 AAAAAAAAAAAAAAA 3 AAAAAAAAA 0.380 Sequence 0.020745 81.44 NA 0.366182 951.24 32.45
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10201
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10201 BBBBBBBBBBBBBBBB 2 BBBBBBBBB 0.960 Sequence 0.491911 1.02 NA 0.712020 22.55 0.27 <=SB
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10202
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1[.......]
These columns would then start 2,3,4,[...]. I successfully use
opts = detectImportOptions('filename.dat');
opts.DataLines = [16 Inf];
opts.VariableNamesLine = 14;
readtable(fullfile('path','filename.dat',opts,'ReadVariableNames', true);
for files with a large number of columns between the ----, i.e. e.g.
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10101 AAAAAAAAAAAAAAA 3 AAAAAAAAA 0.380 Sequence 0.020745 81.44 NA 0.366182 951.24 32.45
2 HLA-....
3 ....
....
....
50 HLA....
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
However, this does not work for short "fillings" and my code very much depends on being robust in either scenario.
I tried playing with the opts but did not get it to work. I would be very grateful for any advice! Maybe a method other than readtable (readtext?) is needed and then a conversion to a table? In the end I will need a table like this:
Thank you very much for your advice! I have spent a long time deleoping the code around this and this is the final part that keeps breaking...

Akzeptierte Antwort

Vimal Rathod
Vimal Rathod am 22 Feb. 2021
  1 Kommentar
L. Borealis
L. Borealis am 25 Feb. 2021
Thanks, Vimal!
I had actually seen that question but even the question description was not particularly clear to me. So I had left it. Thanks for pointing me back to it. I did use part of it in the end to come up with a working (yet not elegant) solution. Maybe it is useful to someone in the future:
S = regexp(fileread('S:\scratch\cdr1pool2pep1\out_00.dat'), '\r?\n', 'split');
S = S(~cellfun('isempty',S));
if isempty(S{end}); S(end) = []; end %regexp split leaves empty at bottom if file ended in \n which is common
nonheader = cellfun(@isempty, regexp(S, '^\s*#|^\s*-|^\s*P|^\s*N' )); %permit space before #
starts = strfind([false nonheader], [false true]);
stops = strfind([nonheader false], [true false]);
num_blocks = length(starts);
lenRows = length(starts(1):stops(1));
S_temp = cell(num_blocks,lenRows);
for K = 1 : num_blocks
S_temp(K,:) = S(starts(K):stops(K));
end
S = reshape(S_temp,[num_blocks*lenRows,1]);
writecell(S,'data.dat')
opts = detectImportOptions('data.dat');
tbl=readtable(('data.dat'),opts);

Melden Sie sich an, um zu kommentieren.

Weitere Antworten (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by