Array size of data from large file
Hi People, I have some very large text files (~3GB) that I need to process, and can't read the whole thing in at once. I know I can use csvread to read the data in one line at a time, but I would like to know before I begin what the dimensions of the data array are. The files are only numbers, no text or anything. Any ideas would be appreciated.
Answers (1)
Walter Roberson
on 30 Oct 2013
csvread() calls dlmread(), which in turn calls textscan(). You can call textscan() directly for increased efficiency. Note that you can tell textscan() how many times to re-use the format, which effectively sets how many lines it processes per call.
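As a minimal sketch of the repeat-count idea, the following reads a comma-separated file a fixed number of lines per call. The generated demo file, the three-column layout, and the names `linesPerChunk` and `totalRows` are assumptions made for illustration, not part of the original question.

```matlab
% Write a tiny demo file; a stand-in for the real multi-GB file.
demo = reshape(1:30, 3, 10).';            % 10 rows, 3 columns
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', demo.');       % one row of demo per line
fclose(fid);

numCols       = 3;                        % assumed known for this sketch
linesPerChunk = 4;                        % lines to read per textscan call
fmt = repmat('%f', 1, numCols);           % '%f%f%f'

fid = fopen(filename, 'r');
totalRows = 0;
while true
    % Third argument: how many times the format is re-used,
    % i.e. the maximum number of lines consumed by this call.
    C = textscan(fid, fmt, linesPerChunk, ...
                 'Delimiter', ',', 'CollectOutput', true);
    chunk = C{1};                         % up to linesPerChunk-by-numCols
    if isempty(chunk)
        break                             % nothing left to read
    end
    totalRows = totalRows + size(chunk, 1);
    % ... process chunk here ...
end
fclose(fid);
delete(filename);
```

Because textscan() resumes from the current file position, each call picks up where the previous one stopped.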
Knowing the number of items involved is good for pre-allocation. Unfortunately, there is no method in MATLAB or any of the supported operating systems to find out how many lines are in a file without reading through the file and counting the lines. That can be expensive for large files.
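If the dimensions are wanted up front anyway, one possible approach is to make a cheap counting pass: take the column count from the first line and count newline bytes in large binary blocks. This is only a sketch under the assumptions that the file is comma-separated, every line has the same number of fields, and lines end in a linefeed; it still reads the whole file once, it just avoids parsing the numbers. The demo file and variable names are hypothetical.

```matlab
% Write a tiny demo file; a stand-in for the real multi-GB file.
demo = reshape(1:30, 3, 10).';            % 10 rows, 3 columns
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', demo.');
fclose(fid);

fid = fopen(filename, 'r');

% Columns: number of commas on the first line, plus one.
firstLine = fgetl(fid);
numCols = sum(firstLine == ',') + 1;
frewind(fid);

% Rows: count linefeed bytes (value 10) block by block.
blockSize = 2^20;                         % bytes per read; tune as needed
numRows  = 0;
lastByte = uint8(10);                     % so an empty file counts as 0 rows
while true
    block = fread(fid, blockSize, '*uint8');
    if isempty(block)
        break
    end
    numRows  = numRows + sum(block == 10);
    lastByte = block(end);
end
if lastByte ~= 10
    numRows = numRows + 1;                % last line had no trailing newline
end
fclose(fid);
delete(filename);
```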
A strategy that can be used fairly effectively for variable-length datasets is to do allocations in chunks, fill in the chunks, allocate more if you need to, and so on until you read end of file, at which point you truncate the final chunk and include it in your data.
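The chunked-allocation strategy could look roughly like the following. The chunk size, the three-column demo file, and the names `data`, `used` and `chunkRows` are assumptions chosen for illustration.

```matlab
% Write a tiny demo file; a stand-in for the real multi-GB file.
demo = reshape(1:30, 3, 10).';            % 10 rows, 3 columns
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', demo.');
fclose(fid);

numCols   = 3;
chunkRows = 4;                            % rows allocated (and read) at a time
fmt  = repmat('%f', 1, numCols);
data = zeros(chunkRows, numCols);         % first chunk of storage
used = 0;                                 % rows of data actually filled

fid = fopen(filename, 'r');
while true
    C = textscan(fid, fmt, chunkRows, ...
                 'Delimiter', ',', 'CollectOutput', true);
    block = C{1};
    n = size(block, 1);
    if n == 0
        break                             % end of file
    end
    if used + n > size(data, 1)
        % Out of room: allocate one more chunk of storage.
        data = [data; zeros(chunkRows, numCols)]; %#ok<AGROW>
    end
    data(used+1:used+n, :) = block;
    used = used + n;
end
fclose(fid);
delete(filename);

data = data(1:used, :);                   % truncate the unused tail
```

A larger chunk size means fewer reallocations at the cost of more unused space before the final truncation.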
Depending on how you need to work with your data while you are reading it in, it is sometimes best to grow your existing array by writing at the new provisional endpoint. If you do that, remember that adding extra columns is more efficient than adding extra rows: MATLAB stores arrays column by column, so the internal copy of the old data into the new array is most efficient when the data does not need to be reorganized as it goes, and adding rows requires that reorganization.
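A small sketch of growing by columns: each record is stored as one column, and the array is extended by assigning at a new provisional end column. The loop below fabricates records in memory purely as a stand-in for reading them from a file; `growCols` and the record contents are hypothetical.

```matlab
numCols  = 3;                             % values per record
growCols = 4;                             % columns added per extension
data = zeros(numCols, growCols);          % one record per COLUMN
used = 0;

for k = 1:10                              % stand-in for "read next record"
    record = k * [1; 10; 100];
    if used == size(data, 2)
        % Writing at the new provisional endpoint extends the array;
        % the new columns are zero-filled.
        data(numCols, used + growCols) = 0;
    end
    used = used + 1;
    data(:, used) = record;
end

data = data(:, 1:used);                   % drop unused columns
rowsAsRecords = data.';                   % transpose once, if rows are wanted
```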
If you do not need to work with the data until the end, then a useful strategy can be to create a cell array, read in a chunk of data and store that chunk in a cell, and keep reading chunks and pushing them into further cell array slots. Then at the end, you can vertcat() or horzcat() or cell2mat() the cell array chunks into one real array. This strategy involves far less intermediate data movement than repeatedly extending a numeric array by writing past its end.
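A minimal sketch of the cell-array approach, again using a generated three-column demo file and hypothetical names (`chunks`, `linesPerChunk`):

```matlab
% Write a tiny demo file; a stand-in for the real multi-GB file.
demo = reshape(1:30, 3, 10).';            % 10 rows, 3 columns
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', demo.');
fclose(fid);

numCols       = 3;
linesPerChunk = 4;
fmt    = repmat('%f', 1, numCols);
chunks = {};                              % one cell per chunk read

fid = fopen(filename, 'r');
while true
    C = textscan(fid, fmt, linesPerChunk, ...
                 'Delimiter', ',', 'CollectOutput', true);
    if isempty(C{1})
        break                             % end of file
    end
    % Growing the cell array only copies cell headers, not the numeric
    % data stored in the existing cells.
    chunks{end+1} = C{1}; %#ok<AGROW>
end
fclose(fid);
delete(filename);

data = vertcat(chunks{:});                % single concatenation at the end
```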
You might find that your algorithm can meaningfully process the data in chunks. For example, smoothing over a chunk boundary only requires the previous chunk and the current chunk to be available, after which you can discard the previous chunk or refill it with new data. (Two-buffer algorithms used to be pretty common in the days of smaller memory.)
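To illustrate processing across a chunk boundary, here is a sketch of a 3-point moving average computed chunk by chunk. It is an assumption of this example that a short window is used, so only the last `w-1` samples of the previous chunk need to be carried over rather than the whole chunk. The signal is generated in memory as a stand-in for chunks read from a file, and all names are hypothetical.

```matlab
x        = (1:20).';                      % stand-in for the full data column
w        = 3;                             % moving-average window length
chunkLen = 6;                             % samples per chunk
kernel   = ones(w, 1) / w;

smoothed = zeros(numel(x) - w + 1, 1);    % preallocated output
nOut     = 0;                             % outputs written so far
prevTail = zeros(0, 1);                   % carried over from previous chunk

for start = 1:chunkLen:numel(x)
    cur = x(start:min(start + chunkLen - 1, numel(x)));  % "read" a chunk
    buf = [prevTail; cur];                % previous tail + current chunk
    if numel(buf) >= w
        out = conv(buf, kernel, 'valid'); % only fully covered windows
        smoothed(nOut+1:nOut+numel(out)) = out;
        nOut = nOut + numel(out);
    end
    % Keep just enough samples to complete windows spanning the boundary.
    firstKeep = max(1, numel(buf) - w + 2);
    prevTail  = buf(firstKeep:end);
end

reference = conv(x, kernel, 'valid');     % whole-array result, for comparison
maxDiff   = max(abs(smoothed - reference));
```

In this sketch `maxDiff` compares the chunked result with smoothing the whole array at once, which is only possible here because the demo data fits in memory.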