Smart import of random files
Good morning,
I can't find a way to properly set up a script that imports data as a table.
Suppose you have many folders, each containing many csv, xlsx, or txt files, and you want to produce a single output file ('pathOut/database.xlsx'). Within a single folder, appending one file to the next is easy.
The problem is that I'd like to write a script (actually an app built with App Designer) that can load files structured in different ways: for example, some files have 30 rows of meaningless text at the top that I want to skip, while the files in another folder have 3 meaningless rows.
In App Designer I've created numeric Edit Fields that let the user specify the data start row, the column delimiter, and the header-line row. However, detectImportOptions does not assign the right variable types; in addition, in some cases, although I specify delimiter=',', readtable honors it for the data but not for the header line: the resulting table is mostly string columns, with a confused header in which the comma (the delimiter) has been converted to '_'.
I've attached two different examples (Input_type_1.xlsx was converted to xlsx because this page does not accept .dat files). I don't need to load both of them in the same run, but I'd like an app that can read either of them given as little input as possible (e.g. data start row, column delimiter, and header-line row).
So, how can I configure detectImportOptions to do that? (I'm on MATLAB R2018b.)
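To make the attempt concrete, this is roughly the pattern I've been trying (a sketch only: fileIn, dataRow, headerRow, and delim are placeholders for the file name and the values typed into the app's edit fields; note that in some releases the DataLines property is named DataLine):

```matlab
% Sketch only -- placeholders stand in for the App Designer edit-field values
fileIn    = 'Input_type_1.csv';  % file chosen in the app
dataRow   = 31;                  % first row of real data
headerRow = 28;                  % row holding the column names
delim     = ',';                 % column delimiter

opts = detectImportOptions(fileIn, 'Delimiter', delim);
opts.VariableNamesLine = headerRow;   % where readtable finds the names
opts.DataLines = [dataRow Inf];       % skip the junk rows above the data
T = readtable(fileIn, opts);
```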
Thank you for your suggestions and time!
Luca
8 comments
Ive J
on 12 Jan 2021
I've read your question a couple of times, but I couldn't fully follow it. So you have different delimited files with different numbers of comment rows (junk rows you don't want to read). Can you share part of the script you've tried for handling these differences? What similarities are there between these files?
Also, share chunks of your actual data so we can see exactly what you're dealing with (instead of converting to XLSX, just compress the originals).
Luca D'Angelo
on 12 Jan 2021
The second file should be no problem; as shown, it has no header lines and seems regular. There are no variable names in it, however, so those will either have to be supplied or you can use the default 'VarN' names.
The first is an ill-formed spreadsheet file, however: it is all text in the first column, not a spreadsheet at all. That is undoubtedly a significant factor in parsing it; you've seriously misled detectImportOptions.
If these two are representative of the population, the thing to do for the first is to read it line by line to find the header line with its distinctive string. You can then use that to set the VariableNamesLine position in the opts object.
However, the extraneous line between it and the real data is going to be an issue: if its position is not constant, then you also need to keep scanning to find where the real data begin so you can set the starting data line in the opts object as well; I'm not confident detectImportOptions can do that on its own.
If you do that, then you should have a reasonable chance of reading the files; the better solution, of course, would be to have the files created with some discipline in the first place, with the thought of wanting to read them later.
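A rough sketch of that scan, assuming (from the attached sample) that the header line can be recognized by the 'DataOrder=' marker and that the real data lines are the first comma-delimited lines after it:

```matlab
% Scan once to locate the header line and the first real data line
fid = fopen('Input_type_1.csv','r');
hdrLine = 0; n = 0;
while true
    s = fgetl(fid);
    if ~ischar(s), break, end            % end of file
    n = n + 1;
    if contains(s,'DataOrder=')          % the distinctive header string
        hdrLine = n;
    elseif hdrLine>0 && count(s,',')>1   % first delimited line after header
        dataLine = n;
        break
    end
end
fclose(fid);

% Feed the positions found into the options object
opts = detectImportOptions('Input_type_1.csv','Delimiter',',');
opts.VariableNamesLine = hdrLine;
opts.DataLines = [dataLine Inf];
T = readtable('Input_type_1.csv',opts);
```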
dpb
on 12 Jan 2021
>> optType1=detectImportOptions('Input_type_1.xlsx')
optType1 =
SpreadsheetImportOptions with properties:
Sheet Properties:
Sheet: ''
Replacement Properties:
MissingRule: 'fill'
ImportErrorRule: 'fill'
Variable Import Properties: Set types by name using setvartype
VariableNames: {'No_important_text'}
VariableTypes: {'char'}
SelectedVariableNames: {'No_important_text'}
VariableOptions: Show all 1 VariableOptions
Access VariableOptions sub-properties using setvaropts/getvaropts
PreserveVariableNames: false
Range Properties:
DataRange: 'A2' (Start Cell)
VariableNamesRange: 'A1'
RowNamesRange: ''
VariableUnitsRange: ''
VariableDescriptionsRange: ''
To display a preview of the table, use preview
>>
As shown, the ill-formed spreadsheet file is total gibberish to detectImportOptions by default; with everything being just text in the first column, it can't be expected to know any better.
Luca D'Angelo
on 12 Jan 2021
"What should I change to allow a correct reading?"
The file, maybe?
The biggest problem is one you are creating for yourself by building a spreadsheet file with all the data in the first column as text, not in separate columns as any respectable spreadsheet would be.
It's unreasonable to expect that file to be parsed as is... if you can't cause the files to be written correctly in the first place, then I would suggest writing a macro to do the column split inside Excel first, resaving the file, and then importing that into MATLAB.
I'm no VBA/COM guru, but it's probable you could write that code as COM that you could call from MATLAB.
Luca D'Angelo
on 12 Jan 2021
dpb
on 13 Jan 2021
Did the instrument name that file as .xlsx on creation, or was that done later?
It doesn't make sense to have done it this way; who's the manufacturer of the instrument, and what is the instrument itself?
I've used a lot of test gear over 40 years and never saw this symptom from any of it...
Accepted Answer
More Answers (2)
dpb
on 13 Jan 2021
Save the file as a .csv instead of .xlsx. You may have to coerce Excel not to double-quote the strings; I presume there's a way, but I use Excel as little as possible, so that's an "exercise for the student".
Following that, it's pretty straightforward...
optType1CSV=detectImportOptions('Input_type_1.csv','Delimiter',',');
optType1CSV.VariableNamesLine=optType1CSV.DataLines(1)-3; % if this is variable, takes more work
tType1=readtable('Input_type_1.csv',optType1CSV,'ReadVariableNames',1);
tType1.Date=datetime(tType1.Date,'InputFormat','yyyy/MM/dd HH:mm:ss.S'); % HH = 24-hour clock
The result of this for your sample file is:
tType1 =
8×58 table
Date Status Stage1 Stage2 Stage3 Stage4 Stage5 Stage6 Stage7 Stage8 Stage9 Stage10 Stage11 Stage12 Stage13 Stage14 MISC ChargerI ChargerU TrapU ImpactorLowPressure AmbientPressure AuxChannel1 AuxChannel2 AuxChannel3 AuxChannel4 AuxChannel5 AuxChannel6 T1 T2 T3 T4 Reserved CAL Stage1_1 Stage2_1 Stage3_1 Stage4_1 Stage5_1 Stage6_1 Stage7_1 Stage8_1 Stage9_1 Stage10_1 Stage11_1 Stage12_1 Stage13_1 Stage14_1 CON ConcentrationValue COM UserComment PM10 PM2_5 PM1 PN10 PN2_5 PN1
____________________ _________________ ______ ______ ______ ______ ______ ______ ______ ______ ______ _______ _______ ________ _______ _______ ________ ________ ________ _____ ___________________ _______________ ___________ ___________ ___________ ___________ ___________ ___________ ______ ______ _____ _____ ________ _______ _________ ________ ________ ________ ________ ________ ________ ________ ________ _________ _________ _________ _________ _________ _______ __________________ _______ ___________ _______ _______ _______ _____ _____ _____
26-Feb-2019 00:30:21 {'CbTlsfpAvzij0'} 106.2 72.35 121.2 155.9 259.9 379.2 340.7 231.8 68.89 24.7 10.26 7.802 19.51 12.76 {'MISC'} 0.9982 3437 19.99 39.96 1021 0.01602 0.01564 0.01564 0.01556 0.01556 0.01564 101.5 -50.8 24.83 29.85 0 {'CAL'} 1.019e+05 18730 14200 8973 8909 6805 4143 1391 163.6 1.231 0 0.0107 0.8538 0.02063 {'CON'} 58360 {'COM'} {0×0 char} 0.05119 0.04368 0.04368 58360 58360 58360
26-Feb-2019 00:31:21 {'CbTlsfpAvzij0'} 100.7 66.72 118.6 152.5 252.8 372.3 329.9 226.5 69.89 24.14 12.06 4.145 20.63 8.241 {'MISC'} 0.9981 3551 19.99 40.05 1020 0.01602 0.01564 0.01572 0.01549 0.01556 0.01572 -423.7 -28.2 24.8 30.67 0 {'CAL'} 96650 17200 13960 8805 8681 6693 4015 1362 174.8 6.267 0.1556 0.6002 1.995 0.1764 {'CON'} 55490 {'COM'} {0×0 char} 0.06779 0.04344 0.04332 55490 55490 55490
26-Feb-2019 00:32:21 {'CbTlsfpAvzij0'} 106.9 78.78 123.5 154.4 249.9 365.3 327.2 223.4 70.82 24.62 12.62 7.961 20.35 8.58 {'MISC'} 0.9982 3507 19.99 40.47 1020 0.01595 0.01556 0.01579 0.01556 0.01549 0.01556 189.9 -244.5 24.79 31.08 0 {'CAL'} 1.025e+05 20630 14440 8850 8525 6532 3961 1333 171.8 2.635 0 0.006366 0.8741 0 {'CON'} 59010 {'COM'} {0×0 char} 0.04922 0.04255 0.04255 59010 59010 59010
26-Feb-2019 00:33:21 {'CbTlsfpAvzij0'} 107.5 79.25 123.9 152.4 247.5 359 320.1 221.6 67.33 26 10.5 4.261 22.98 20.01 {'MISC'} 0.9975 3547 19.99 40.27 1020 0.01579 0.01556 0.01579 0.01549 0.01556 0.01564 -177.6 -111.7 24.79 30.67 0 {'CAL'} 1.032e+05 20750 14490 8712 8436 6410 3865 1322 158.1 3.178 0 0 1.659 0.3021 {'CON'} 59220 {'COM'} {0×0 char} 0.0704 0.0414 0.0414 59220 59220 59220
26-Feb-2019 00:34:21 {'CbTlsfpAvzij0'} 96.54 59.88 111.2 144.4 246.7 362 323.7 223.2 63.95 23.99 7.791 -0.06385 23.19 10.1 {'MISC'} 0.9967 3551 19.99 40.19 1020 0.01595 0.01564 0.01572 0.01564 0.01564 0.01564 -420.4 -402.7 24.79 31.49 0 {'CAL'} 92620 15270 13080 8344 8493 6520 3952 1349 152.6 2.434 0 0 2.366 0 {'CON'} 52840 {'COM'} {0×0 char} 0.05984 0.04178 0.04178 52840 52840 52840
26-Feb-2019 00:35:21 {'CbTlsfpAvzij0'} 99.33 63.98 112.1 144.5 251.1 369 328.9 227.8 61.62 25.13 6.76 -3.935 27.56 3.491 {'MISC'} 0.9967 3464 19.99 39.86 1020 0.01595 0.01572 0.01579 0.01564 0.01572 0.01564 -405.6 -22.27 24.81 31.89 0 {'CAL'} 95300 16420 13130 8316 8639 6643 4012 1376 140.8 3.313 0 0 3.761 0 {'CON'} 54390 {'COM'} {0×0 char} 0.07067 0.04189 0.04189 54390 54390 54390
26-Feb-2019 00:36:21 {'CbTlsfpAvzij0'} 92.43 58.33 104.8 135.6 238.9 351.7 312 220.1 58.7 24.7 5.016 -6.357 28.83 -0.1046 {'MISC'} 0.9969 3450 19.99 40.11 1020 0.0161 0.01564 0.01572 0.01549 0.01556 0.01564 10.44 -369.9 24.83 32.3 0 {'CAL'} 88670 14920 12300 7818 8235 6341 3811 1336 135.5 4.209 0 0 4.619 0 {'CON'} 50630 {'COM'} {0×0 char} 0.07547 0.04022 0.04022 50630 50630 50630
26-Feb-2019 00:37:21 {'CbTlsfpAvzij0'} 93.91 53.83 104 134.8 235.2 343.8 306 216.3 57.03 24.13 4.745 -7.297 28.45 -0.4211 {'MISC'} 0.9969 3627 19.99 40.47 1019 0.01595 0.01556 0.01572 0.01556 0.01549 0.01549 -312.5 -352.6 24.85 32.3 0 {'CAL'} 90080 13540 12200 7768 8099 6191 3733 1310 129.5 3.907 0 0 4.487 0 {'CON'} 50720 {'COM'} {0×0 char} 0.07352 0.03927 0.03927 50720 50720 50720
>>
Moral: quit trying to pound a square peg into a round hole; fix the underlying problem of the badly constructed file.
dpb
on 13 Jan 2021
c=readcell('Input_type_1.xlsx'); % read the whole file as a cell array
c=c(~cellfun(@(c)all(ismissing(c)),c,'UniformOutput',1)); % get rid of missing lines
c=c(find(contains(c,'DataOrder=')):end) % and everything before the data header line
c=c(contains(c,',')); % and anything else not comma delimited
writecell(c,'tempType1.csv',"QuoteStrings",0) % write to a temporary file
tType1=readtable('tempType1.csv','delimiter',',','ReadVariableNames',1)
This is about 99% there with only minor cleanup left...
>> readtable('tempType1.csv','delimiter',',','ReadVariableNames',1)
ans =
8×58 table
DataOrder_DateTime_yyyy_mm_ddHh_mm_ss_ Status Stage1 Stage2 Stage3 Stage4 Stage5 Stage6 Stage7 Stage8 Stage9 Stage10 Stage11 Stage12 Stage13 Stage14 MISC ChargerI ChargerU TrapU ImpactorLowPressure AmbientPressure AuxChannel1 AuxChannel2 AuxChannel3 AuxChannel4 AuxChannel5 AuxChannel6 T1 T2 T3 T4 Reserved CAL Stage1_1 Stage2_1 Stage3_1 Stage4_1 Stage5_1 Stage6_1 Stage7_1 Stage8_1 Stage9_1 Stage10_1 Stage11_1 Stage12_1 Stage13_1 Stage14_1 CON ConcentrationValue COM UserComment PM10 PM2_5 PM1 PN10 PN2_5 PN1
______________________________________ _________________ ______ ______ ______ ______ ______ ______ ______ ______ ______ _______ _______ ________ _______ _______ ________ ________ ________ _____ ___________________ _______________ ___________ ___________ ___________ ___________ ___________ ___________ ______ ______ _____ _____ ________ _______ _________ ________ ________ ________ ________ ________ ________ ________ ________ _________ _________ _________ _________ _________ _______ __________________ _______ ___________ _______ _______ _______ _____ _____ _____
{'2019/02/26 12:30:21.8'} {'CbTlsfpAvzij0'} 106.2 72.35 121.2 155.9 259.9 379.2 340.7 231.8 68.89 24.7 10.26 7.802 19.51 12.76 {'MISC'} 0.9982 3437 19.99 39.96 1021 0.01602 0.01564 0.01564 0.01556 0.01556 0.01564 101.5 -50.8 24.83 29.85 0 {'CAL'} 1.019e+05 18730 14200 8973 8909 6805 4143 1391 163.6 1.231 0 0.0107 0.8538 0.02063 {'CON'} 58360 {'COM'} NaN 0.05119 0.04368 0.04368 58360 58360 58360
{'2019/02/26 12:31:21.8'} {'CbTlsfpAvzij0'} 100.7 66.72 118.6 152.5 252.8 372.3 329.9 226.5 69.89 24.14 12.06 4.145 20.63 8.241 {'MISC'} 0.9981 3551 19.99 40.05 1020 0.01602 0.01564 0.01572 0.01549 0.01556 0.01572 -423.7 -28.2 24.8 30.67 0 {'CAL'} 96650 17200 13960 8805 8681 6693 4015 1362 174.8 6.267 0.1556 0.6002 1.995 0.1764 {'CON'} 55490 {'COM'} NaN 0.06779 0.04344 0.04332 55490 55490 55490
{'2019/02/26 12:32:21.8'} {'CbTlsfpAvzij0'} 106.9 78.78 123.5 154.4 249.9 365.3 327.2 223.4 70.82 24.62 12.62 7.961 20.35 8.58 {'MISC'} 0.9982 3507 19.99 40.47 1020 0.01595 0.01556 0.01579 0.01556 0.01549 0.01556 189.9 -244.5 24.79 31.08 0 {'CAL'} 1.025e+05 20630 14440 8850 8525 6532 3961 1333 171.8 2.635 0 0.006366 0.8741 0 {'CON'} 59010 {'COM'} NaN 0.04922 0.04255 0.04255 59010 59010 59010
...
All that remains is to convert the first column to datetime and set its name.
More elegant would be to then use detectImportOptions on the cleaned-up file; you can set the time import format, etc., in the resulting options object.
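A sketch of that, assuming the cleaned file tempType1.csv from above and timestamps of the form '2019/02/26 12:30:21.8':

```matlab
% Re-detect on the cleaned-up file, then fix the first column's name and
% import it directly as datetime instead of text
opts = detectImportOptions('tempType1.csv','Delimiter',',');
opts.VariableNames{1} = 'Date';
opts = setvartype(opts,'Date','datetime');
opts = setvaropts(opts,'Date','InputFormat','yyyy/MM/dd HH:mm:ss.S');
tType1 = readtable('tempType1.csv',opts);
```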
NB: if this format is consistent, you don't need to call detectImportOptions every time; create a library of import-options objects, one per file type, and save them as .mat files to reload and use where needed.
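That library could be as simple as one .mat file per file type (the file names here are hypothetical):

```matlab
% Build the library once, offline
opts = detectImportOptions('Input_type_1.csv','Delimiter',',');
save('opts_type1.mat','opts');

% Later, in the app: reload the saved options instead of re-detecting
S = load('opts_type1.mat');
T = readtable('some_new_type1_file.csv', S.opts);
```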
Any of the three ways gets around the problem your instrument vendor has created... you can't use their files in Excel as is, either. Something seems wrong in the setup; it doesn't make sense that they would have done this deliberately.