Fastest Way to write data to a text file - fprintf

I am writing a lot of data to a text file one line at a time (1.7 million rows, 4 columns) composed of different data types. I'm wondering if there is a better way to do this than one line at a time that might be much faster.
Here is what I'm doing now.
ExpSymbols = Char Array
ExpDates = Numeric Array
MyFactor = Numeric Array
FctrName = Char Array
ftemp = fopen('FileName','w' );
for i = 1:length(MyFactor)
fprintf(ftemp, '%s,%i,%f,%s\r\n',ExpSymbols(i,:), ExpDates(i,1), MyFactor(i,1),[FctrName '_ML']);
end
fclose(ftemp);
Thanks in advance,
Brian

 Accepted Answer

Jan
Jan on 2 Aug 2013

2 Votes

You can try to suppress the flushing by opening the file with 'W' instead of 'w':
ftemp = fopen('FileName', 'W'); % uppercase W
Fmt = ['%s,%i,%f,', FctrName '_ML\r\n'];
for i = 1:length(MyFactor)
fprintf(ftemp, Fmt, ExpSymbols(i,:), ExpDates(i), MyFactor(i));
end
fclose(ftemp);

9 Comments

Brian
Brian on 3 Aug 2013
Edited: Brian on 3 Aug 2013
Jan, can you tell me why this would make a difference? I'm not really familiar with the difference between 'w' and 'W', or what it would do for my write speed.
Also, is it faster to define the format in a variable so that it isn't rebuilt on each iteration of fprintf?
Thanks.
Jan
Jan on 4 Aug 2013
Edited: Jan on 4 Aug 2013
It is faster to include the string FctrName in the format string once, because it does not depend on the loop.
You can find more information about the 'W' mode in the documentation, but unfortunately this is one of the few points where the docs are not clear enough. On the net you find http://undocumentedmatlab.com/blog/improving-fwrite-performance/ . This concerns fprintf also: 'W' does not flush the buffers after each operation. When several pieces of text are collected first, the number of slow I/O operations can be reduced.
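A minimal timing sketch of the two modes (the file names and array size here are made up for illustration; actual timings depend on the platform and disk):

```matlab
% Compare autoflushing 'w' with buffered 'W' for many small writes.
n = 1e5;
vals = rand(n, 1);

tic;
fid = fopen('flush_test.txt', 'w');   % lowercase: flush after every call
for k = 1:n
    fprintf(fid, '%f\r\n', vals(k));
end
fclose(fid);
toc

tic;
fid = fopen('buffer_test.txt', 'W');  % uppercase: let the OS buffer writes
for k = 1:n
    fprintf(fid, '%f\r\n', vals(k));
end
fclose(fid);
toc
```

The second block should typically be faster, since the OS can batch many fprintf calls into fewer physical disk operations.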
dpb
dpb on 4 Aug 2013
JIT won't do the same thing when it finds the constant string???
Jan
Jan on 4 Aug 2013
Edited: Jan on 4 Aug 2013
The JIT is not documented, and TMW does not want programmers to rely on specific features. But a short test can check this:
tic;
s = cell(1, 10000);
c = 'String';
for k = 1:numel(s)
    s{k} = sprintf(['%d %f ', c], k, k);   % format rebuilt each iteration
end
toc
tic;
s = cell(1, 10000);
c = 'String';
f = ['%d %f ', c];                         % format built once, outside the loop
for k = 1:numel(s)
    s{k} = sprintf(f, k, k);
end
toc
(copied to a function, 2009a/64/Win)
Elapsed time is 0.238085 seconds.
Elapsed time is 0.211209 seconds.
(copied to a function, 2011b/64/Win)
Elapsed time is 0.248099 seconds.
Elapsed time is 0.232466 seconds.
Interesting results! Perhaps this means the JIT got more powerful from 2009 to 2011, but the total performance degraded.
dpb
dpb on 4 Aug 2013
Constant removal from a loop is so basic an optimization that I'd expected it to be highly likely, documented or not -- that's what I was relying/presuming on.
It seems clear that gains in some areas have been more than made up for by the expansion of features and language enhancements over time. I hadn't upgraded since R12 (ca. '99, iirc) until 2012b. The latter brings my old machine of roughly the same era to its knees for useful work--it's all it can do to test snippets for cs-sm and the forum, and that is often painfully slow, since the base product now needs so much memory that it causes disk thrashing for even tiny cases.
Brian
Brian on 5 Aug 2013
Thanks Jan. Using fprintf with the lowercase 'w' parameter, my export took 86 seconds; just by suppressing the flushing with 'W', it is down to 53 seconds or so. This is a good improvement, but I wish there were something else that could be done to make it quicker. Nevertheless, thanks for your help.
-Brian
Jan
@Brian: As dpb has already suggested, when speed matters and visual inspection by a human is impossible due to the size of the file, binary output is a much faster method. A severe limitation of the currently applied method is the interleaving of the 3 input arrays. It would be more efficient to export one data set after the other, because this uses the processor's memory caches more efficiently. But then the disk transfer is the bottleneck. You are talking about a file of only approximately 40 MB, though, so 53 seconds seems slower than I'd expect.
Please try this:
tic;
save('TestFile.mat', 'ExpSymbols', 'ExpDates', 'MyFactor', 'FctrName');
toc
tic;
Data = load('TestFile.mat')
toc
Brian
Brian on 5 Aug 2013
Edited: Brian on 5 Aug 2013
You're right, saving the variables by themselves is much quicker than writing to a flat file. I changed my code to write to C:\Temp (as you suggested above); the save took 0.97 seconds and the load took 0.33 seconds. The formatted flat file is 62 MB, while the .mat file is only 15 MB or so. I do need a properly formatted file, though, since the other system can't read .mat files.
All fields need to be in one file, but it sounds like you're saying that writing mixed data types is what makes the write unnecessarily slow. Can I write one data type at a time to the same file, using a loop structure for each data type?
dpb
dpb on 5 Aug 2013
A) Can you offload the formatting from this code to a second one that processes the .mat files and writes the formatted ones? It won't save any time overall, but it moves the work to a place where the bottleneck might not be so evident. For example, you could have a background process doing that conversion while the primary analyses are done interactively. Whether that helps depends on the actual workflow, of course.
B) Can your target app read the data variables sequentially, one after the other, instead of a record at a time as you're currently writing them? If so, sure, you can write each one without any loop at all, and it will likely be faster by at least a measurable amount, as Jan suggests.
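As a sketch of option B, using the variable names from the question, each array could be written in one vectorized call, one block after another (this assumes the reading program accepts the columns in sequence rather than interleaved):

```matlab
% Write each array as a whole block instead of interleaving row by row.
fid = fopen('FileName', 'W');             % buffered mode, as suggested above
fprintf(fid, '%d\r\n', ExpDates);         % entire numeric column in one call
fprintf(fid, '%f\r\n', MyFactor);         % likewise for the factors
% Char matrices need a newline column and a transpose so that
% rows come out in their original order (MATLAB is column-major).
crlf = repmat(sprintf('\r\n'), size(ExpSymbols, 1), 1);   % n-by-2 char
fwrite(fid, [ExpSymbols, crlf].');
fclose(fid);
```

The two fprintf calls work because fprintf cycles its format over all elements of a numeric vector argument in a single call, with no explicit loop.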
C) You might just see what the text option of save does in comparison for speed -- don't know that it'll help, but what the hey...


More Answers (1)

dpb
dpb on 2 Aug 2013
Edited: dpb on 3 Aug 2013

0 Votes

It's a pain for mixed fields -- I don't know of any clean way to mix them in fprintf.
I generally build the string array in memory and then write the whole thing...
cma = repmat(',', length(dates), 1);   % the delimiter column
out = [symb cma num2str(dates) cma num2str(factor) cma names];
out(:, end+1) = sprintf('\n');         % newline column, so no loop is needed
fprintf(fid, '%s', out.');             % transpose: rows come out in order
fclose(fid);
names is a placeholder for FctrName, which I guess may be a constant? If so, it can be inserted into the format string as Jan assumed; if not, it needs to be built, like the column of commas, and concatenated in.

6 Comments

Brian
Brian on 3 Aug 2013
Thanks for the reply dpb. Because both my dates field and my factor field are double, the conversion to string takes so long that it's not much more efficient (if at all) to convert to string and concatenate. I will test again on Monday when back in the office and see if this helps.
dpb
dpb on 3 Aug 2013
Edited: dpb on 4 Aug 2013
I haven't done a test, but I'd be really surprised if converting full arrays in memory isn't quite a lot faster than a loop, record at a time. But it could happen, I suppose.
Certainly the faster way would be not to write such large files as formatted but as a stream -- who's going to be looking at such a large dataset anyway? And if they do need to, use a viewing helper app.
Since I'm here now, I'll comment on your ? on Jan's comment... :)
The 'W' will leave the buffering/flushing of the output buffer up to the OS rather than forcing it after each call -- it will probably help a little for a very large file.
I'm not sure about the JIT compiler, but I'd expect it to have parsed the constant format string, so I'd not expect any difference between those two constructions -- again, the proof is always in the timing, of course. I suspect Jan just did it for readability as much as anything.
Brian
Brian on 4 Aug 2013
Thanks again for the reply dpb. I will give the methods a test Monday and reply back with what I conclude.
The reason I need so many records is that I work in investment quantitative research. I'm calculating a data point on 15,000 securities monthly for 25 years, which becomes a fair amount of data to output. I want this file creation to be quicker because I'm calculating this for 100 different data elements or so, and may redo some of them multiple times if I do something incorrectly. Another system that reads this input file calculates many of the necessary statistics for me, etc.
dpb
dpb on 4 Aug 2013
Edited: dpb on 4 Aug 2013
Yeah, but why do they have to be formatted instead of stream?
Can't you change the input format for the other routine, or just do that step in Matlab too, without having to write the files in between?
I did a little test, but my machine is very old and memory-limited -- before I ran out of memory, it appeared to me that the in-memory process helped, but you can't use num2str except with a fixed format, because it will produce different numbers of significant digits otherwise.
ADDENDUM -- I reverted back to R12 to get a little more memory available for data without thrashing the disk.
Turns out that, at least there, num2str() lives up to its reputation as a performance dog -- the loop beat the in-memory conversion hands down for larger sizes. Off the top of my head I can't think of another builtin way to generate the columns without looping constructs -- sprintf() embeds a \n if you use it, which is OK for display purposes but not for output to a file. I guess I don't have any other answer than to see if you can use stream I/O instead, sorry. BTW, the other thing that will help if you still must write it formatted: once you do the conversion in memory, use fwrite to output the data.
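A sketch of that last suggestion, restricted to the two numeric columns (the char columns would still need separate handling, and the file name is illustrative): format everything in memory with a single sprintf, then push the result out with one fwrite.

```matlab
% sprintf cycles its format over the argument elements in column-major
% order, so transposing the n-by-2 matrix emits one row per format cycle.
txt = sprintf('%d,%f\r\n', [ExpDates, MyFactor].');

fid = fopen('FileName', 'W');   % buffered mode
fwrite(fid, txt);               % one big write instead of n small ones
fclose(fid);
```

This moves all format conversion into a single vectorized call and reduces the file I/O to a single operation, which is the combination both Jan and dpb point toward above.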
Brian
Brian on 5 Aug 2013
Just converting my two numeric arrays to string takes 55 seconds. This is slower than writing the file with the mixed data types using fprintf and the 'W' argument. I'm still not sure what you are referring to when you talk about "stream." I'm not familiar with that.
dpb
dpb on 5 Aug 2013
Also called "binary". It's unformatted I/O, which helps speed because it
a) stores float values at full precision in the minimum number of bytes per entry, and
b) eliminates the format-conversion overhead on both input and output.
doc fwrite % and friends
or, if you could stay in Matlab,
doc save % and load is only slightly higher-level
The possible disadvantage is, of course, that you can't just look at the file and read it; but who's going to be looking at such large files manually, anyway?
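A minimal sketch of such a binary round trip with the thread's numeric arrays (the file name is made up; the reader must know the sizes and types in advance, which is what replaces the formatting):

```matlab
% Write both numeric columns as raw doubles -- no format conversion at all.
fid = fopen('factors.bin', 'W');
fwrite(fid, ExpDates, 'double');
fwrite(fid, MyFactor, 'double');
fclose(fid);

% Read them back: the counts and precision must match what was written.
fid = fopen('factors.bin', 'r');
datesIn  = fread(fid, numel(ExpDates), 'double');
factorIn = fread(fid, numel(MyFactor), 'double');
fclose(fid);
```

At 8 bytes per value with no text conversion, this is typically far faster than any formatted write, at the cost of the file no longer being human-readable.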

