Is there a way to efficiently read a .csv file into a dataset in MATLAB?
Ok, so here is the deal.
I have a 2.5 GB CSV file. I'd like to load it as a dataset so that I can use the dataset indexing functionality, for example grabbing the rows that match a certain value.
Here are some sample lines:
rs180759811,1,83977,0.0078454,0.99052,0.512,'0000','1010',0.45,.,.,F,.,.,.,.,.,.,imputed,
rs188652299,1,84156,0.0012772,0.99851,0.50381,'0000','1100',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs192830046,1,86282,0.00080435,0.99911,0.59506,'0000','1111',0,.,.,R,.,1,.,.,.,.,imputed,
rs146027550,1,88429,0.018998,0.97847,0.53261,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
rs187571096,1,114699,0.010444,0.98884,0.5583,'0000','1000',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs191891026,1,171529,0.011039,0.98724,0.51818,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
But, as I see it, there is not a good way to go from csv --> dataset.
Here are the options I've been considering:
fgetl --> regexp --> cell array --> cell2dataset
I know I can get that to work, but it can't be the most efficient way.
textscan --> textscan lets me specify commas as the delimiter, which is useful, but I am not even sure whether I can read one line at a time with textscan.
csvread --> will not work because most of the values are not numeric.
Is there another option that will turn a csv directly into an array or dataset without having to treat it as strings, run regexp on it, the whole 9 yards?
Thanks very much.
Answers (1)
Walter Roberson
on 11 Sep 2013
You can read a line at a time with textscan() by specifying a count of 1 right after the format. But why not read it all with textscan() and then pass the result to cell2dataset(), possibly after a horzcat()?
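cellinput = textscan(fid, '%s%f%f%f%f%f%s%s%f%s%s%s%s%s%s%s%s%s%s%s', 'delimiter', ',');
isnum = cellfun(@isnumeric, cellinput);   % the %f columns come back as numeric arrays
cellinput(isnum) = cellfun(@num2cell, cellinput(isnum), 'UniformOutput', false);   % wrap them element by element
cell2dataset( horzcat(cellinput{:}), 'ReadVarNames', false )
The horzcat() takes the result from a cell row vector whose members are column vectors to a row-and-column cell array. For that to work every member has to be a cell column vector, so the numeric columns are converted with num2cell() first. The 'ReadVarNames', false option keeps cell2dataset() from treating the first data row as variable names.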
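The count argument mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the temporary file, the variable names, and the trimmed-down four-column rows are made up to keep the snippet short and self-contained, and they are not the full layout from the question.

```matlab
% Two trimmed-down rows (4 of the original columns) written to a temporary
% file; the file name and the reduced column set are illustrative assumptions.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'rs180759811,1,83977,imputed');
fprintf(fid, '%s\n', 'rs188652299,1,84156,imputed');
fclose(fid);

fmt = '%s%f%f%s';              % one specifier per column
fid = fopen(fname, 'r');
% The 1 right after the format tells textscan to apply the format once,
% i.e. to read a single record and stop.
rec1 = textscan(fid, fmt, 1, 'Delimiter', ',');
% A second call resumes from where the first one stopped.
rec2 = textscan(fid, fmt, 1, 'Delimiter', ',');
fclose(fid);
delete(fname);

id1  = rec1{1}{1};             % identifier of the first record
pos2 = rec2{3};                % third column of the second record
```

For a file of several gigabytes, a larger count (say, a few hundred thousand records per call, repeated until the file is exhausted) is probably a better trade-off than a count of 1, since each textscan() call carries some overhead.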
Lacking further information about the columns, the format above reads each column after the last consistently numeric column as a separate string. If you know that a certain column will always hold a useless ".", switch the corresponding %s to %*s so that the column is skipped. But for the column that is either 1 or ".", do not switch to %g, because %g will not gracefully match a "." in that column.
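A minimal sketch of that advice, assuming a shortened six-column layout held in memory (the rows and variable names are hypothetical and only loosely modelled on the sample lines): the always-"." column is skipped with %*s, and the "1 or ." column is read as a string and converted afterwards.

```matlab
% Two trimmed-down rows held in memory; the six-column layout is an
% illustrative assumption, not the full layout from the question.
txt = sprintf('%s\n%s\n', 'rs180759811,83977,.,.,F,imputed', ...
                          'rs188652299,84156,.,1,R,imputed');

% %*s skips the third column (always "."), so it is not returned at all.
% The fourth column ("1" or ".") is still read as a string with %s.
cols = textscan(txt, '%s%f%*s%s%s%s', 'Delimiter', ',');

% Convert the "1 or ." strings after reading: str2double returns NaN for
% text that is not a number, such as ".".
flag = str2double(cols{3});
```

Because the skipped column is never stored, cols has five members rather than six, which also saves memory on a large file. The converted flag column can then replace cols{3} before the data is assembled into a dataset.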