Find data from files that are too large to read in
I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. At present I read each line in turn and check the field, but scanning the file this way takes a long time (> 1 hr). The program Hex Fiend lets me do this manually in a small fraction of the time. Is there a way to read a file up to the point where some condition is met? If there is, I suspect it would speed up finding and extracting the lines I want.
2 Comments
Kevin Lehmann
on 20 Feb. 2024
Answers (2)
Walter Roberson
on 17 Feb. 2024
0 votes
Read the file in buffer-sized blocks for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. Truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data.
Repeat until you are at the end of the file. Be careful: the file might not end in a newline.
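The steps above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the demo file, the `id=101` search text, and the variable names (`blockSize`, `matches`) are made up for illustration, and it assumes single-byte text with linefeed (or CR+LF) line endings and no line longer than one block.

```matlab
% Build a small demo file so the sketch runs as-is (hypothetical data).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', 'id=101 value=7', 'id=202 value=9', 'id=101 value=3');
fprintf(fid, '%s', 'id=303 value=101');   % last line deliberately has no newline
fclose(fid);

target    = "id=101";      % text to look for (assumption for this demo)
blockSize = 32;            % tiny for the demo; use a far larger value for real files
matches   = strings(0, 1);

fid = fopen(filename, 'r');
while true
    block = fread(fid, blockSize, '*uint8').';   % row vector of raw bytes
    if isempty(block)
        break                                    % nothing left to read
    end
    if numel(block) == blockSize
        % A full block probably ends mid-line: cut it at the last newline
        % and move the file position back over the bytes that were cut off.
        lastNewline = find(block == 10, 1, 'last');
        if isempty(lastNewline)
            fclose(fid);
            error('A line is longer than blockSize; increase blockSize.');
        end
        extra = numel(block) - lastNewline;
        block = block(1:lastNewline);
        fseek(fid, -extra, 'cof');
    end
    % A short block is the tail of the file, so it is processed whole
    % even if it does not end in a newline.
    lines   = splitlines(string(char(block)));
    matches = [matches; lines(contains(lines, target))]; %#ok<AGROW>
end
fclose(fid);
delete(filename);
disp(matches)
```

Note that `contains` matches the text anywhere in the line, so checking one specific field would need a further parsing step on each block.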
10 Comments
Kevin Lehmann
on 17 Feb. 2024
Walter Roberson
on 18 Feb. 2024
A 1 gigabyte buffer is probably fine.
Kevin Lehmann
on 20 Feb. 2024
Walter Roberson
on 20 Feb. 2024
In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either linefeed or carriage-return followed by linefeed to signal the end of a line.
There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the (fixed-length) block might not happen to end in a newline. So you scan backwards from the end of the block to the first newline you meet (that is, the last newline in the block), truncate the block there, and fseek() backwards by the number of bytes you discarded.
The result will be a block of characters that has internal newlines (and possibly carriage-returns as well) marking the end of lines. You can process that block as text by any of several different methods, including textscan:
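fid = fopen('sample.txt');           % open the text file for reading
txt = fread(fid,[1 Inf],'*char');    % read the file as one row vector of characters
fclose(fid);
class(txt)                           % reports 'char'
disp(txt)                            % the newlines inside txt are preserved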
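As a rough illustration of processing such a block with textscan, the sketch below parses a character block into numeric columns and picks out the lines whose second field equals a given number. The three-column layout, the sample values, and the names `target` and `matchingLines` are assumptions made for the example; the real file format is not shown in the thread.

```matlab
% Stand-in for one truncated block returned by fread (hypothetical data).
txt = sprintf('%s\n', '1 4711 0.5', '2 1234 0.7', '3 4711 0.9');
target = 4711;                      % number to look for in the second field

% Parse the three whitespace-separated numeric columns in one call.
cols = textscan(txt, '%f %f %f');
hit  = cols{2} == target;           % logical mask, one entry per parsed line

% Recover the original text of the matching lines. This assumes the block
% has no blank lines, so that parsed rows and text lines stay aligned.
lines = splitlines(string(strtrim(txt)));
matchingLines = lines(hit);
disp(matchingLines)
```

Parsing a whole block in one textscan call avoids the per-line overhead of reading and checking each line separately, which is the point of working with large blocks.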
Kevin Lehmann
on 20 Feb. 2024
Walter Roberson
on 20 Feb. 2024
data = fread(FILEID, [37 25000], '*uchar').'; %about 1 megabyte (37*25000 bytes); use 25e6 columns for about 1 gigabyte
%break it up into groups
first_group = data(:,1:5, 'evaluation', 'restricted');
second_group = data(:,6:7, 'evaluation', 'restricted');
Les Beckham
on 20 Feb. 2024
@Walter Roberson, did you, perhaps, mean this?
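data = fread(FILEID, [37 25000], '*uchar').'; %about 1 megabyte (37*25000 bytes); one 37-byte record per row
%break it up into groups; str2num needs char input, so convert the uint8 columns first
first_group = str2num(char(data(:,1:5)), 'evaluation', 'restricted');
second_group = str2num(char(data(:,6:7)), 'evaluation', 'restricted');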
Kevin Lehmann
on 21 Feb. 2024
Walter Roberson
on 21 Feb. 2024
Ah, yes, I did mean that!
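For the original goal of finding records with a specific number in one field, a possible variation on the fixed-width approach is to compare the raw bytes of that field against the padded search text, and convert only the matching records to numbers. This is a sketch under assumptions: the 8-byte record layout (5-character ID, 2-character code, newline) is made up to keep the demo small, whereas the thread uses 37-byte records, and it uses str2double instead of the str2num call shown above.

```matlab
% Hypothetical fixed-width records: 5-char ID, 2-char code, newline = 8 bytes.
recLen = 8;
raw = uint8(sprintf('%5d%2d\n', [101 7; 202 9; 101 3].'));

% One record per row, the same shape that fread(fid, [recLen N], '*uchar').' gives.
data = reshape(raw, recLen, []).';

% Pad the search value exactly like the field, then compare byte by byte.
target = sprintf('%5d', 101);                    % '  101'
hit = all(data(:, 1:5) == uint8(target), 2);     % true for rows whose ID field matches

% Convert only the matching rows.
matchingRecords = char(data(hit, 1:recLen-1));   % drop the trailing newline column
codes = str2double(string(char(data(hit, 6:7))));
disp(matchingRecords)
disp(codes)
```

Comparing bytes skips the text-to-number conversion for the large majority of records that do not match, which may matter when each block holds millions of records.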