Reading large number of csv files

Question

Sameer Gummuluru on 14 Aug 2020


Direct link to this question

https://de.mathworks.com/matlabcentral/answers/579768-reading-large-number-of-csv-files

Commented: Sameer Gummuluru on 18 Aug 2020

Accepted Answer: per isakson


I have a large number of csv files to process. The files exist on AWS S3.

Currently I have a for loop like this

fds = fileDatastore(fp,'IncludeSubfolders',true,'ReadFcn',@csvread);
for i = 1:numFiles
    data = read(fds); % I have tried csvread(fileName) as well
end

Each of these file reads takes 0.5 s on average. Considering that I have to process a large number of files, is there any way to speed this up?

P.S.: Parallel Computing Toolbox is currently not an option.

Thank you in advance!

4 Comments (2 older comments not shown)

Walter Roberson on 16 Aug 2020

Edited: Walter Roberson on 16 Aug 2020

I wonder if it would be productive to use the AWS Java interface https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html to download a batch of files to local storage?

I also wonder whether it would be practical to reorganize the storage so that groups of files are stored in .zip archives? This would reduce the size of most text files (because of compression) and should reduce the per-file overhead.
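To illustrate the .zip idea on the MATLAB side only, here is a minimal sketch. It assumes the archive has already been downloaded to local storage; the folder and file names are hypothetical, and the small test files are generated here just so the snippet runs on its own.

```matlab
% Sketch: pack a few CSV files into one archive, then unpack and read them as a batch.
% All folder and file names are made up for illustration.
workDir = tempname;                       % unique scratch folder
srcDir  = fullfile(workDir, 'src');
mkdir(srcDir);                            % also creates workDir
for k = 1:3
    csvwrite(fullfile(srcDir, sprintf('part%d.csv', k)), k * ones(4, 2));
end
zipFile = fullfile(workDir, 'batch.zip');
zip(zipFile, '*.csv', srcDir);            % one archive instead of three separate files

outDir = fullfile(workDir, 'unpacked');
names  = unzip(zipFile, outDir);          % cell array with the extracted file names
data   = cell(size(names));
for k = 1:numel(names)
    data{k} = csvread(names{k});          % parse each extracted file
end
```

Whether this helps depends on how much of the per-file time is request overhead rather than transfer or parsing, which this sketch does not measure.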

Sameer Gummuluru on 18 Aug 2020

Thank you Walter! I will give that a try.


Answer 1

per isakson on 16 Aug 2020


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/579768-reading-large-number-of-csv-files#answer_480456

Edited: per isakson on 16 Aug 2020


Caveats

My notion of your use case (workflow) is vague. Is your local free disk space a restriction? How many GB in total will you download in one working session? You say "large number"; how many is that?
I've never used fileDatastore and don't know how much overhead it brings.
It's a bit difficult to measure the elapsed time of reading files, because after the first run the file will reside in the cache memory. I don't know how to "clear" the cache.

I made a small performance test

I use R2018b, Win10, and a spinning HD.
Created one local csv-file with 20,000 rows and 10 columns (1.73 MB) and a few copies of it.
Ran a script to compare csvread, textscan, fileread and fileDatastore.

%% csvread: read and parse in one call
fprintf( '%-16s', 'csvread' )
tic, m = csvread( 'sameer.csv' ); toc
%% textscan: explicit format, ten numeric columns
fprintf( '%-16s', 'textscan' )
tic
fid = fopen( 'sameer.csv' );
cac = textscan( fid, '%f%f%f%f%f%f%f%f%f%f', 'Delimiter',',', 'CollectOutput',true );
fclose( fid );
toc
%% fileread: read the text without parsing
fprintf( '%-16s', 'fileread' )
tic
txt = fileread( 'sameer.csv' );
toc
%% fileDatastore: time the construction, then two successive reads
sad = dir( 'd:\m\cssm\sameer*.csv' );
fp  = fullfile( {sad.folder}, {sad.name} );
fprintf( '%-16s', 'fileDatastore' )
tic, fds = fileDatastore( fp, 'ReadFcn',@csvread ); toc
fprintf( '%-16s', 'read( fds )' )
tic, m1 = read( fds ); toc   % first file
fprintf( '%-16s', 'read( fds )' )
tic, m2 = read( fds ); toc   % second file

Result (with the files in memory cache)

>> cssm
csvread         Elapsed time is 0.049017 seconds.
textscan        Elapsed time is 0.024477 seconds.
fileread        Elapsed time is 0.007118 seconds.
fileDatastore   Elapsed time is 0.002117 seconds.
read( fds )     Elapsed time is 0.049034 seconds.
read( fds )     Elapsed time is 0.048331 seconds.

Discussion

The overhead of fileDatastore is small.
textscan is twice as fast as csvread. csvread calls textscan to read and parse the file. Here the overhead is significant.
fileread is included to get a time for reading without parsing. Reading (from cache) takes a third of textscan's elapsed time.
"Reading the same file from local HD takes around 0.21 s." Does that refer to the first time you read this file? That is four times the elapsed time I see reading from cache. That's a large difference, first time or not.
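The ratios quoted in this discussion can be checked directly from the timings in the result listing. The snippet below only redoes that arithmetic; the variable names are mine, and the 0.21 s figure is the one quoted from the question's comments.

```matlab
% Elapsed times copied from the result listing above (seconds)
tCsvread  = 0.049017;
tTextscan = 0.024477;
tFileread = 0.007118;
tReadFds  = 0.049034;                            % first read( fds )

ratioCsvreadTextscan  = tCsvread  / tTextscan;   % about 2.0: textscan is twice as fast
ratioFilereadTextscan = tFileread / tTextscan;   % about 0.29: roughly a third
fdsOverhead           = tReadFds - tCsvread;     % about 0.00002 s per read
ratioLocalVsCache     = 0.21 / tCsvread;         % about 4.3: the quoted 0.21 s vs. cache
```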

To me it seems as if the total time, 0.7 sec, is dominated by the download over the Internet. I've a free TB on my desktop and would start a download and go for lunch.

2 Comments

Sameer Gummuluru on 17 Aug 2020

Edited: Sameer Gummuluru on 17 Aug 2020

Thank you for the detailed answer!

To clarify, when I said reading from the local HD, what I did is very similar to what you have done using fileDatastore.

I used fileDatastore with 'ReadFcn' set to @csvread. I performed this operation on a folder containing 50 files, none of which I had opened before.
The time taken to loop through 50 "read(fds)" commands is 0.0847 s * 50 (hence an average of 0.0847 s per file). Sorry for the 0.21 s I mentioned; that was using load instead of csvread.

It is interesting to see from your analysis that textscan runs around twice as fast as csvread. So I will try replacing csvread with textscan and see the difference in performance.

Considering that my data is growing continuously, it might make sense for me to invest in the Parallel Computing Toolbox soon.
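One way the csvread-to-textscan swap might look inside a fileDatastore is sketched below. It is only an illustration under several assumptions: the files are local, purely numeric, have no header line, and all have the same known number of columns. The folder, the file names, and the choice to parse the output of fileread (rather than using fopen) are mine, and this variant has not been timed in this thread.

```matlab
% Sketch: fileDatastore whose ReadFcn parses with textscan instead of csvread.
% Folder name, file names, and the fixed column count are assumptions.
folder = tempname;
mkdir(folder);
nCols = 10;
for k = 1:3
    csvwrite(fullfile(folder, sprintf('f%d.csv', k)), k * ones(5, nCols));
end

fmt     = repmat('%f', 1, nCols);         % '%f%f...%f', one specifier per column
readFcn = @(f) textscan(fileread(f), fmt, 'Delimiter', ',', 'CollectOutput', true);
fds     = fileDatastore(folder, 'ReadFcn', readFcn, 'FileExtensions', '.csv');

total  = 0;
nFiles = 0;
while hasdata(fds)
    cac = read(fds);                      % 1-by-1 cell returned by textscan
    m   = cac{1};                         % numeric matrix, 5-by-10 here
    total  = total + sum(m(:));
    nFiles = nFiles + 1;
end
```

The hasdata loop also removes the need to know the number of files in advance.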

Thank you!

per isakson on 18 Aug 2020

Edited: per isakson on 18 Aug 2020


"I have never opened any of them" Yes but FileDatastore relies on @fcn to do that. ( In your case csvread does it.)

"The time taken to loop through 50 "read(fds)"" Did you repeat that a number of times?
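A repeated measurement could be set up roughly as follows. This is a sketch with made-up test files and an arbitrary repetition count; since the files are written just before they are read, even the first pass here is likely served from cache, so with real data only the first pass would show the cold-read cost.

```matlab
% Sketch: time every read(fds) over several passes through the same files.
% File names, file sizes, and nRep are arbitrary choices for illustration.
folder = tempname;
mkdir(folder);
for k = 1:5
    csvwrite(fullfile(folder, sprintf('f%d.csv', k)), rand(200, 10));
end

fds    = fileDatastore(folder, 'ReadFcn', @csvread);
nRep   = 3;
nFiles = numel(fds.Files);
t      = zeros(nRep, nFiles);             % t(r,k): pass r, file k
for r = 1:nRep
    reset(fds);                           % start again from the first file
    for k = 1:nFiles
        tic
        data = read(fds);
        t(r, k) = toc;
    end
end
meanPerPass = mean(t, 2);                 % average seconds per file, one value per pass
```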

"load instead of csvread" Yes, load is indeed slow with ASCII files.

>> tic, load('sameer.csv','-ascii'); toc  % first time after starting MATLAB
Elapsed time is 0.571249 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.201560 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.200004 seconds.

"replacing csvread with textscan" If not all csv-files have the same number of columns, the format specification might pose a problem. textscan has undocumented features that are used by several Matlab functions, e.g. csvread. In csvread that problem is solved. (I've forgotten how.)
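One possible workaround for a varying column count (not necessarily what csvread does internally) is to count the fields in the first line and build the format from that. The sketch assumes numeric files without a header line and without quoted fields containing commas; the test file is generated only so the snippet runs on its own.

```matlab
% Sketch: derive the textscan format from the first line of the file.
fname = [tempname '.csv'];                % hypothetical test file
csvwrite(fname, magic(4));                % four numeric columns

fid       = fopen(fname, 'r');
firstLine = fgetl(fid);
nCols     = numel(strsplit(firstLine, ','));   % number of comma-separated fields
frewind(fid);                             % go back so the first line is parsed too
fmt = repmat('%f', 1, nCols);
cac = textscan(fid, fmt, 'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
m = cac{1};                               % numeric matrix with nCols columns
delete(fname);
```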

"invest in the parallel computing toolbox soon" I'm not sure the main bottleneck is on your side. And I guess it's about moving data around rather than CPU cycles. Furthermore, how will the Amazon server handle several "simultaneous" requests from the same IP address? (I haven't a clue; Google helped me figure out what AWS S3 stands for.) See Walter's comment to your question.

My questions here do not require answers.
