Import too large csv data file with strings

Question

Christos Antonakopoulos on 16 Nov 2015

Direct link to this question:

https://de.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings

Commented: Jenny Smith on 19 Jul 2018

My file is about 72 MB, with almost 850000 rows and 7 columns on average; the number of columns sometimes changes. The data consists mostly of strings, so I used csvimport from the File Exchange:

http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport

as follows:

name= 'etch.csv';
[C1, C2, C3, C4, C5, C6, C7] = csvimport(name, 'columns', [1:7], 'noHeader', true, 'delimiter', ';' );

(I am only interested in the first 7 columns, even in cases where a row has more data.) This works perfectly for small data sets. In my case it took almost 30 minutes or even more. Any ideas for something better? Thank you.

PS: My data looks like this:

1: Device Name,Category,Date,Time,Source,Message,Condition,Name,Act

2: string1,string2,mm/dd/yyyy,hh:mm:ss.sss,string,string,string,1 or 0

.....

850000: and it goes on as line 2

The last column is empty most of the time, but it does not interest me anyway.

2 Comments

Mohammad Abouali on 16 Nov 2015

Have you tried readtable?

Christos Antonakopoulos on 17 Nov 2015

It does not work in my case, since the number of delimiters is not always the same.


Answer 1

Guillaume on 17 Nov 2015

Direct link to this answer:

https://de.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#answer_200123

No matter what, you're bound by the reading speed of MATLAB. Probably the fastest way to read the file is to read it all at once with fileread. You can then split it into lines with strsplit. After that, it is a choice of applying textscan, strsplit or regexp to each line. You would have to see which is fastest.

Here is how I would do it using regexp:

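filecontent = fileread('etch.csv'); % read the whole file into memory in one go
filelines = strsplit(filecontent, {'\r', '\n'}); % split at line endings; copes with Linux and Windows termination
% Keep only the first seven fields; lines with fewer than seven ';' do not match and are dropped by vertcat.
fields = regexp(filelines, '^([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);', 'tokens', 'once');
fields = vertcat(fields{:}); % N-by-7 cell array of strings; semicolon added to suppress the console dump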

The above takes about 3 seconds on my machine to read 85000 rows (only 8 MB of text though).
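To make the pattern itself easier to follow, here is a minimal sketch on a single row taken from the sample data posted further down in this thread. `[^;]*` means "any run of characters that are not a semicolon", the surrounding parentheses capture that run as a token, and the `;` after each group consumes the delimiter. The variable names (`row`, `n`, `pattern`, `tok`, `short`) are only for illustration, and building the pattern with repmat is my shortcut, not part of the code above. For a comma-separated file the same idea should work with `,` in place of `;`.

```matlab
% One row copied from the data excerpt posted in this thread.
row = 'CCT AC800 PEC Local;Event;08/26/2010;16:47:09.9550;PEC_MSG_25_10;SerialCommFault;Active;1;1;1';

% Build '^([^;]*);([^;]*);...' for n fields instead of typing it out.
n = 7;
pattern = ['^' repmat('([^;]*);', 1, n)];

% 'tokens' returns the captured groups, 'once' stops after the first match.
tok = regexp(row, pattern, 'tokens', 'once');   % 1-by-7 cell of char vectors

% A row with fewer than n semicolons does not match at all.
short = regexp('a;b;c', pattern, 'tokens', 'once');   % empty result
```

Note that the pattern requires a `;` after the seventh field, so a row with exactly seven fields and no trailing delimiter would not match either.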

One thing it hasn't done is parse the date. This is fairly trivial to do with datetime if needed and takes no time at all.
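As a rough sketch of that date parsing, assuming the date and time sit in columns 3 and 4 as in the sample posted in this thread, and assuming four fractional digits in the seconds as in that sample (datetime needs R2014b or later). The two small cell arrays below stand in for `fields(:,3)` and `fields(:,4)`; the names `dates`, `times`, `stamps` and `t` are only for illustration.

```matlab
% Sample values copied from the data excerpt posted in this thread.
dates = {'08/26/2010'; '08/26/2010'};        % stands in for fields(:,3)
times = {'16:47:09.9550'; '16:46:50.2530'};  % stands in for fields(:,4)

% Join date and time with a space; with cell inputs strcat keeps the space.
stamps = strcat(dates, {' '}, times);

% Convert all rows in one vectorised call.
t = datetime(stamps, 'InputFormat', 'MM/dd/yyyy HH:mm:ss.SSSS');
```

If the real file has a different number of fractional digits, the number of `S` characters in the format would need to change accordingly.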

4 Comments (2 older comments hidden)

Christos Antonakopoulos on 18 Nov 2015

I see. Yes, you are right, my time was also reduced, but I still need a better PC. Thank you again.

Jenny Smith on 19 Jul 2018

Hello, I am trying to follow this thread and I'm reading through the regexp documentation, but I don't understand what the [^;]* in this expression is doing. I have a very similar problem: my text is separated by commas and has seven columns, and I am trying to understand how to use this function in a similar way.


Answer 2

dpb on 16 Nov 2015

Direct link to this answer:

https://de.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#answer_200062

Can't do much w/o at least a sample of the data file, with whatever warts there are as far as missing fields, but why not go to the root I/O routines directly? For larger files, "as near to the metal as you can get" is bound to be the ploy.

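fid = fopen('etch.csv'); % file name taken from the question
fmt='%s %s %d/%d/%d %d:%d:%f %s %s %s %*[^\n]'; % 4-digit year and fractional seconds, so no %2d widths
d = textscan(fid,fmt,'delimiter',',','headerlines',1); % use ';' instead if that is the file's delimiter
fclose(fid);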

The result above will be a 1xN cell array with one cell per conversion in the format string; if you want separate variables instead, try the same format string with textread.

Note there's a new %D format specifier in recent releases to parse dates on input directly; I don't have anything past R2012b, so the format above returns the m/d/y and h/m/s as numerics. If you want to retain the strings instead and do the conversion later (or perhaps don't need them any other way), it should be obvious where to replace the formatting to do so.

2 Comments

dpb on 16 Nov 2015

ADDENDUM: OBTW, it might turn out to be faster to use a looping construct and read a smaller subset of the file on each pass rather than the whole thing at once. With textscan you can pick up from the previous read automagically; textread, in this regard, always closes the file, so it would have to reopen it every time with an updated 'headerlines' argument, which is probably a losing proposition.

I don't know if this would help or not; you'd just have to 'spearmint to see if less memory requirements per read operation would outperform the alternate.
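A minimal sketch of that block-wise loop, under some loud assumptions: the temporary file and its rows are made up so the example is self-contained, the delimiter is ';', all seven kept fields are read as text, and `blockSize` is a hypothetical value that would be far larger in practice. Whether this actually beats a single read is something that would have to be timed.

```matlab
% Write a tiny temporary file so the sketch is self-contained (made-up rows).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Device Name;Category;Date;Time;Source;Message;Condition Name;Act;Ack;Ena\n');
for k = 1:5
    fprintf(fid, 'Device %d;Trip;08/26/2010;16:46:50.2530;SRC;Msg;Active;1;1;1\n', k);
end
fclose(fid);

fmt = '%s %s %s %s %s %s %s %*[^\n]';  % keep 7 text fields, skip the rest of the line
blockSize = 2;                          % hypothetical value; far larger in practice
rows = cell(0, 7);

fid = fopen(fname, 'r');
fgetl(fid);                             % skip the header line once
while ~feof(fid)
    % Each call resumes where the previous one stopped.
    C = textscan(fid, fmt, blockSize, 'Delimiter', ';');
    n = min(cellfun(@numel, C));        % number of complete rows in this block
    if n == 0
        break;
    end
    C = cellfun(@(c) c(1:n), C, 'UniformOutput', false);
    rows = [rows; [C{:}]];              %#ok<AGROW> append the block as n-by-7
end
fclose(fid);
delete(fname);
```

Growing `rows` inside the loop is only for brevity; for 850000 rows it would be better to preallocate or to process each block and discard it.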

Christos Antonakopoulos on 17 Nov 2015

Edited: Stephen23 on 17 Nov 2015

Device Name;Category;Date;Time;Source;Message;Condition Name;Act;Ack;Ena
CCT AC800 PEC Local;Event;08/26/2010;16:47:09.9550;PEC_MSG_25_10;SerialCommFault;Active;1;1;1
CCT AC800 PEC Local;Trip;08/26/2010;16:46:50.2530;PEC_MSG_1_08;LineUndervoltage;Active;1;1;1
CCT AC800 PEC Local;Trip;08/26/2010;16:46:50.2530;PEC_MSG_1_11;LineUnderfrequency;Active;1;1;1
CCT AC800 PEC Local;Trip;08/26/2010;16:47:09.9550;PEC_MSG_26_10;WaterPressure Fault;Active;1;0;1

Those are exactly the first 5 lines; I am not interested in the last 3 columns, though. As I said, there are cases in which my rows have fewer or more than 10 columns. That is why the csvimport function solved my problem: those cases were handled through padding or truncation.
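For reference, the same padding and truncation can be sketched with strsplit on each line. This is only an illustration and not how csvimport does it internally: the first two rows are copied from the excerpt above, the third short row is hypothetical, and `nKeep`, `parts` and `fields` are names chosen for the example. A per-line loop like this may well be slow on 850000 rows, so it would need timing.

```matlab
% Inline sample standing in for the file content.
filecontent = sprintf('%s\r\n', ...
    'CCT AC800 PEC Local;Event;08/26/2010;16:47:09.9550;PEC_MSG_25_10;SerialCommFault;Active;1;1;1', ...
    'CCT AC800 PEC Local;Trip;08/26/2010;16:46:50.2530;PEC_MSG_1_08;LineUndervoltage;Active;1;1;1', ...
    'short;row;08/26/2010');            % hypothetical row with only 3 fields

nKeep = 7;                              % number of columns to keep
filelines = strsplit(filecontent, {'\r', '\n'});
filelines = filelines(~cellfun(@isempty, filelines));   % drop the empty trailing entry

fields = repmat({''}, numel(filelines), nKeep);         % pre-filling with '' is the padding
for k = 1:numel(filelines)
    % Keep empty fields (;;) as empty strings instead of collapsing them.
    parts = strsplit(filelines{k}, ';', 'CollapseDelimiters', false);
    n = min(numel(parts), nKeep);       % truncate rows that are too long
    fields(k, 1:n) = parts(1:n);
end
```

Short rows end up padded with empty strings on the right, and anything past the seventh field is discarded.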
