Speeding up writing a very large text file

davidwriter on 17 Sep 2018
Edited: davidwriter on 21 Sep 2018
I have to write very large text files (1-4 million lines; 100-400 MB) to disk. Writing binary output is NOT an option. To solve this problem I broke the output into blocks of 10000 lines and then used sprintf to write the formatted lines (each ending with a '\n') to a block of 10000 strings (dataop=string(10000,1)).
I have already opened the output file with the 'W' option and then successively write each block of strings using the command: fprintf(fid,dataop(:));
It is still taking an inordinate amount of time - each 1 million lines takes 60 minutes!
The machine is high-end: Windows 10, 128 GB RAM, dual 10-core Xeon processors, and a WD Black 6 TB drive. There are 5 other hard drives, which are not active during the write process. I'm running R2017b.
So - why is it so slow? Am I doing something stupid? (I'm more used to C++ than MATLAB)
  2 Comments
Stephen23 on 17 Sep 2018
@davidwriter: are the data numeric, strings, or char vectors? Do you really need to call sprintf as an intermediate step, why not just call fprintf directly?
davidwriter on 17 Sep 2018
The data is numeric, but within any one line each element has a different format. I tried using fprintf directly, but it was very slow. After reading Yair Altman and Loren I switched to the sprintf-and-block approach to try to speed it up - in fact it made it worse.


Accepted Answer

Stephen23 on 17 Sep 2018
Edited: Stephen23 on 17 Sep 2018
"To solve this problem I broke the output into blocks ..."
What problem? MATLAB has no problem with fprintf used on millions of elements. I don't see any reason, based on your explanation so far, why you need to process this in "blocks".
The problem is likely to be how you have coded this, but note that you forgot to actually upload your code, thus making it impossible for us to know what you are actually doing.
I just wrote a short script to test how long MATLAB requires to print 1 million character vectors (each char vector is arbitrarily sized 1x128) to a file:
C = cellstr(char(randi([33,126],1e6,128)));
tic
fid = fopen('temp4A.txt','wt');
fprintf(fid,'%s\n',C{:});
fclose(fid);
toc
It created a 127 megabyte file in
Elapsed time is 3.06831 seconds.
Then I tried a script with 16 floating point numbers per row, for 1 million rows:
N = 16;
M = rand(1e6,N);
F = repmat(',%.6f',1,N);
F = [F(2:end),'\n'];
tic
fid = fopen('temp4B.txt','wt');
fprintf(fid,F,M.');
fclose(fid);
toc
It created a 141 megabyte file in
Elapsed time is 28.429 seconds.
I can't get anywhere close to "each 1 million lines takes 60 minutes!"
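For lines where each element needs its own format, the same single-call pattern should still apply: put one conversion per column into the format string and pass the transposed matrix. The following is only a minimal sketch under assumptions. The variable names, the three-column layout, and the temporary file name are made up for illustration and are not the asker's real data.

```matlab
% Hypothetical data: two real-valued columns and one integer-valued column.
n = 5;
x = linspace(0, 1, n).';       % to be written with 3 decimals
v = linspace(10, 20, n).';     % to be written with 1 decimal
frame = (1:n).';               % stored as double, written as an integer

M = [x, v, frame];             % one matrix row per output line
fmt = '%7.3f\t%8.1f\t%d\n';    % one conversion per column, newline at the end

outfile = [tempname, '.txt'];  % temporary file name (assumption)
fid = fopen(outfile, 'wt');
assert(fid ~= -1, 'Could not open %s', outfile);
fprintf(fid, fmt, M.');        % transpose, because fprintf consumes data column by column
fclose(fid);

% Read the file back and split it into lines to inspect the result.
txt = fileread(outfile);
lines = regexp(txt, '\r?\n', 'split');
lines = lines(~cellfun(@isempty, lines));   % drop the empty piece after the last newline
delete(outfile);
```

Note that %d only prints an integer when the value is integer-valued; for a non-integer double MATLAB falls back to an exponential format.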
  3 Comments
Jan on 19 Sep 2018
@davidwriter: In "fprintf(fid,%7.1f \t %8.2f \t,..." the "..." might be the most interesting part. If you post some running code, it would be much easier to suggest improvements. It is not clear to me what the inputs are. But converting numerical values to a string and then again to a cell string is an indirection. See sprintfc or compose to create the required output format directly.
Creating a C-Mex function might be an option. But without knowing exactly what the inputs are, writing some code would include too much guessing.
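As a small illustration of the compose suggestion: when compose is given a numeric matrix and a char format with one conversion per column, it returns one piece of formatted text per matrix row. This is a sketch with made-up values, not the asker's inputs.

```matlab
% Hypothetical values: column 1 is real-valued, column 2 is integer-valued.
A = [1.25, 7;
     2.50, 8];

% With a char format, compose returns a cell array of char vectors,
% here 2x1 because A has two rows.
rows = compose('%6.2f\t%d', A);

% Each cell holds the same text that sprintf would produce for that row.
first = rows{1};
expectedFirst = sprintf('%6.2f\t%d', A(1, 1), A(1, 2));
```

The resulting cell array can be written in one call with fprintf(fid,'%s\n',rows{:}).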
davidwriter on 21 Sep 2018
Edited: davidwriter on 21 Sep 2018
Once again Stephen, my thanks, you are teaching me a lot.
The problem is fairly simple - I have ten arrays of size (nd, 1), where nd is usually between 1,000,000 and 4,000,000.
These arrays are: X, Y, Z, Vx, Vy, Vz, A, R, Pratio and Frame. All are stored as double, but some, like Frame, hold integer values and have to be written to the file as integers. For the tests I created ten arrays of size (1000000,1), filled with random numbers having the same range of values as the real data.
I tried to use sprintfc, but it is very slow with complex formats, so I used compose instead.
tstart=tic;
% Build one formatted line of text per row from the ten column vectors
dopc=compose('%8.1f \t %8.1f \t %7.2f \t %7.1f \t %7.1f \t %7.1f \t %6.2f \t %d \t %7.3f \t %d',...
    X, Y, Z, Vx, Vy, Vz, A, R, Pratio, Frame);
tst=toc(tstart);
disp(['Time to create string store: ' num2str(tst)]);
avrfile = 'testcompose.avr';
disp(['Using compose : writing AVR file ' avrfile ' ...']);
fid_stm = fopen(avrfile,'Wt');      % 'W': text write without automatic flushing
fprintf(fid_stm,'%s\n',dopc{:});    % write all lines in a single call
twstr=toc(tstart);
fclose(fid_stm);                    % flushes the buffer and releases the file
disp(['Compose: Writing time = ' num2str(twstr-tst) ' - Total time = ' num2str(twstr)]);
which gave the result:
Time to create string store: 279.9028
Using compose : writing AVR file testcompose.avr ...
Compose: Writing time = 2.0855 - Total time = 281.9884
Not good - but then I replaced the compose call above with the following, which concatenates the ten column vectors into one matrix first:
dataf = [X Y Z Vx Vy Vz A R Pratio Frame];
dopc=compose('%8.1f \t %8.1f \t %7.2f \t %7.1f \t %7.1f \t %7.1f \t %6.2f \t %d \t %7.3f \t %d',...
dataf);
A dramatic improvement -
Time to create string store: 11.691
Using compose : writing AVR file testcompose.avr ...
Compose: Writing time = 2.0849 - Total time = 13.8387
So it seems that this latter approach is the fastest and the easiest - 5x faster than direct fprintf with the 'W' option.
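Timings like these depend on the machine, the MATLAB release, and the format, so the comparison is worth repeating locally. Here is a rough sketch of such a check, under assumptions: the test size n, the four-column random data, the shortened format, and the temporary file names are all made up for illustration. It times compose on a matrix followed by fprintf against a single direct fprintf on the transposed matrix, and confirms that both produce the same file.

```matlab
n = 2e4;                                         % assumed test size, far smaller than the real data
dataf = [rand(n, 3) * 1000, randi(100, n, 1)];   % 3 real-valued columns + 1 integer-valued column
fmtRow = '%8.1f \t %8.1f \t %7.2f \t %d';        % shortened stand-in for the real format

% Variant 1: compose on the matrix, then write the cell array of lines.
f1 = [tempname, '.txt'];
t1 = tic;
rows = compose(fmtRow, dataf);
fid = fopen(f1, 'Wt');
fprintf(fid, '%s\n', rows{:});
fclose(fid);
tCompose = toc(t1);

% Variant 2: one direct fprintf call on the transposed matrix.
f2 = [tempname, '.txt'];
t2 = tic;
fid = fopen(f2, 'Wt');
fprintf(fid, [fmtRow, '\n'], dataf.');
fclose(fid);
tDirect = toc(t2);

% Both variants should produce byte-identical files.
sameOutput = strcmp(fileread(f1), fileread(f2));
fprintf('compose: %.2f s, direct fprintf: %.2f s, identical files: %d\n', ...
    tCompose, tDirect, sameOutput);
delete(f1);
delete(f2);
```

Both timings include fclose, so buffered data that 'W' has not yet flushed is counted in the measurement.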


More Answers (1)

KSSV on 17 Sep 2018
How about this approach?
S = rand(10000,3) ;
S = cellstr(num2str(S)) ;
fid = fopen('data.txt','w') ;
fprintf(fid,'%s\n',S{:});
fclose(fid) ;
