read ASCII data with varying lengths of space as delimiters

18 Ansichten (letzte 30 Tage)
Pq Yang
Pq Yang am 3 Apr. 2020
Bearbeitet: dpb am 4 Apr. 2020
I am having trouble in reading a structured dataset. The columns are separeted by spaces which have different lengths. Besides the varying lengths of space for delimiters, the column widths are varying as well.
I have tried all kinds of appraoches and functions, and searached online, but couldnot find a solution.
Can someone help me? Please.
A subset from a huge dataset is attached.
  1 Kommentar
dpb
dpb am 3 Apr. 2020
Need clear definition of the format -- does look like fixed-width but not absolutely positive about whether the apparent missing fields are defined or skipped.
One presumes somewhere whoever created these files also built a toolset to read them and published a definitive description.
I'd look for the first first rather than reinvent the wheel and lacking that w/o the second it's going to be a lot of detective work.

Melden Sie sich an, um zu kommentieren.

Antworten (1)

per isakson
per isakson am 3 Apr. 2020
Bearbeitet: per isakson am 4 Apr. 2020
subset.txt is a fixed-width text file. No doubt about that.
Since R2018b (R2017a?) it is straight forward to read fixed-width text file with readtable(). Before that it was a bit tricky to read fixed-width text files. There are a number of threads here in Answers on the topic.
The documentation includes a good description on how to read fixed-width text files. See fixedWidthImportOptions, Import options object for fixed-width text files.
IMO: The error prone part is to determine the widths of the columns. Here is a start.
  • copy&paste the column header line to the command window
  • use regexp() to determine the starting position of each column. The column headers start with a letter (the first one is special).
>> chr = '# DTG LOCATION NAME ...'
>> cac = regexp( chr, '\<\w', 'start' );
>> cac
cac =
3 22 42 90 110 130 150 170 190 210 230 250 270 290 310 330 350
  1 Kommentar
dpb
dpb am 4 Apr. 2020
Bearbeitet: dpb am 4 Apr. 2020
Is there only one header line? Maybe the browser is folding the long line that confused the issue...ah! Indeed it is. That makes the file look much more regular than with folded lines. Indeed with the comment character to skip header lines presuming don't need them (altho then the column header record will be skipped too, probably; I've never tested that).
Well, now have downloaded, let's see just how smart TMW was... :)
>> opt=detectImportOptions('subset.txt','FileType','fixedwidth','CommentStyle','#',"NumHeaderLines",20,'ReadVariableNames',1)
opt =
FixedWidthImportOptions with properties:
Format Properties:
Whitespace: '\b\t '
LineEnding: {'\n' '\r' '\r\n'}
CommentStyle: {'#'}
EmptyLineRule: 'skip'
Encoding: 'windows-1252'
Replacement Properties:
MissingRule: 'fill'
ImportErrorRule: 'fill'
ExtraColumnsRule: 'addvars'
PartialFieldRule: 'keep'
Variable Import Properties: Set types by name using setvartype
VariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
VariableTypes: {'datetime', 'duration', 'char' ... and 12 more}
VariableWidths: [10.00 9.00 9.00 21.00 4.00 8.00 2.00 35.00 19.00 16.00 99.00 20.00 20.00 40.00 55.00]
SelectedVariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
VariableOptions: Show all 15 VariableOptions
Access VariableOptions sub-properties using setvaropts/getvaropts
PreserveVariableNames: false
Location Properties:
DataLines: [22.00 Inf]
VariableNamesLine: 21.00
RowNamesColumn: 0
VariableUnitsLine: 0
VariableDescriptionsLine: 0
To display a preview of the table, use preview
>> opt.VariableNames
ans =
1×15 cell array
Columns 1 through 10
{'x_DTG'} {'Var2'} {'LOCATIO'} {'NNAME'} {'Var5'} {'Var6'} {'Var7'} {'LATITUDE'} {'LONGITUD'} {'EALTI'}
Columns 11 through 15
{'TUDEU_BOOL_10T_…'} {'EWP_SEA_10T_D'} {'RYB_SEA_10TN_'} {'DRYB_10T_WETB_1…'} {'DRYB_10U_10U_SE…'}
>>
Well, kinda' sorta' made a stab at it...will have to physically set the column widths and number variables -- it finds the time string separate from the date so inserts Var2 for it, then thinks the LOCATION column ends at the end of the shorter string data instead of the longer so it truncates the name as LOCATIO and includes the last N in the next column. But one gets an object to play with for starters.
Good trick for automagically guessing column widths, Per, will have to remember that going forward.
OP will still need to either account for the time column independently or to set the width to encompass both and read as string 'cuz can't handle the blank field on input parsing in readtable to convert to datetime.

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Cell Arrays finden Sie in Help Center und File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by