MATLAB Answers


How would I create a script to read files line-by-line to save memory

Asked by Eric Lundin on 20 Aug 2019
Latest activity Commented on by Adam Danz
on 21 Aug 2019 at 23:08
Hey guys,
I've done the MATLAB Onramp, but I still feel extremely confused about what the hell I'm doing and it's frustrating me. I don't even know how to google the right questions, and interpreting pages from this website is a task that alone is like learning another language. It feels like learning German was easier than this. So I'm sorry if I'm asking stupid questions, but I feel like I've been thrown into the deep end.
I have a .txt file that is 1,000,000,000 lines long, give or take a few 100,000,000 (no two files are the same length).
It consists of only numbers, with no headers that I'm aware of.
Because of the file size, I cannot load the whole file. It needs to be read in portions. I'd rather not split the file or
I'm looking to gather variance data every 100,000 data points, to be organized in a single column/multiple row format.
Ideally, I'd also like to have new columns generated every 360 variance data points; however, this isn't as important as generating the variance data first.
Thanks for the help!

  6 Comments

"It's essentially a single column"
Does that mean sometimes it's not a single column or that sometimes there's white space? Why isn't it "definitely" a single column?
Are there any empty rows? If yes, should they be skipped or treated as missing data?
Screenshot?
A tiny sample from the file?
Could you create a file that looks exactly like your data but only has ~20 rows?
I cut off a little section. This is the very top of a file I would use.
EDIT: Here's a script I'm currently using, and the errors I received:
%% Loading Files for Input
% Currently, this can only do a single file at a time. Future editions intend to
% have multiple files loaded at once to save time.
prompt = 'Enter the name of the .txt file to run (e.g. Organism_L/D_Media_Temp_mmddyyyy_Signal.txt).';
inputfile = input(prompt, 's');
%% Data Collection Rate
prompt = 'Enter the Data Collection rate(Hz). [20,000]';
Hz=input(prompt);
if isempty(Hz)
Hz=20000;
end
%% Variance (n)
% This designated the amount of data to use for each datapoint generated.
% The standard amount is 5 seconds (100,000 datapoints). If left empty,
% this is the value that will be used. Otherwise, this will be done in
% seconds.
% Variables
% vt = variance time. The time in seconds is the input, which is then
% multiplied by 20,000.
prompt = 'Enter the time length for variance calc in sec (20,000 points/sec) [5 seconds].';
vt=input(prompt);
if isempty(vt)
vt=5;
end
%% Designating file for export
% This is the name of the .txt file that will contain the variance data
prompt = 'Enter the name for the output file (e.g. Organism_L/D_Media_Temp_mmddyyy_VarianceTime).';
outputfile=input(prompt,'s');
%% Initianting the code
% This is intended to be read line-by-line, then generating a single column
% text file of the variance data.
infile=fopen(inputfile);
outfile=fopen(outputfile);
fline=fgetl(infile);
line_index=1;
variancewindow = Hz*vt;
data=zeros(1,variancewindow);
while ischar(dline);
data(line_index) = str2double(dline) ; % str2double = Convert string to double precision value. What does that mean......?
line_index=line_index+1;
if line_index > variancewindow;
line_index=1;
variance_value=variance_function(data);
fprintf(outfile,'%f\n',variance_value);
data=zeros(1,variancewindow);
end
dline=fgetl(infile);
end
fclose(infile);
data=data(data~=0);
variance_value=variance_function(data);
fprintf(outfile,'%f/n',variance_value);
fclose(outfile);s
EDIT 2: The errors:
Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 32)
[tline,lt] = fgets(fid);
Error in NMDIII_Data (line 59)
fline=fgetl(infile);
Just to be clear, this is something I was working on while asking this question. That's why I didn't post it in the original question.
The methods proposed by myself and Walter involve reading in chunks of data rather than reading in line-by-line (as you're doing with fgetl). I suggest you abandon that method and use textscan() instead.
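For reference, the "Invalid file identifier" error quoted above is what fgetl reports when fopen has returned -1, for example because the name typed at the prompt is not a file MATLAB can find. The following is only a minimal sketch of checking for that case, using the two-output form of fopen that also appears in Walter's answer. The file names are hypothetical and the small input file is created only so the sketch runs on its own.

```matlab
% Hypothetical file names, used only for illustration.
inputfile  = fullfile(tempdir, 'demo_signal.txt');
outputfile = fullfile(tempdir, 'demo_variance.txt');

% Create a tiny input file so the sketch is self-contained.
fid = fopen(inputfile, 'w');
fprintf(fid, '%f\n', [1 2 3 4]);
fclose(fid);

% Open the input for reading and stop with a clear message on failure.
[infile, msg] = fopen(inputfile, 'r');
if infile < 0
    error('Could not open "%s": %s', inputfile, msg);
end

% Open the output for writing. Without 'w', fopen opens the file for
% reading and returns -1 when the file does not exist yet.
[outfile, msg] = fopen(outputfile, 'w');
if outfile < 0
    fclose(infile);
    error('Could not open "%s": %s', outputfile, msg);
end

first_line = fgetl(infile);                        % one line, as text
fprintf(outfile, '%f\n', str2double(first_line));  % text -> number -> file

fclose(infile);
fclose(outfile);
```

Checking the identifier right after fopen moves the failure to the line that caused it, instead of a later fgetl or fprintf call.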


Release

R2018a

2 Answers

Answer by Adam Danz
on 21 Aug 2019
Edited by Adam Danz
on 21 Aug 2019 at 23:06
 Accepted Answer

Here's a demo that shows how to read a file in chunks of multiple lines. I included lots of comments that explain what's going on. There's a section at the bottom where you can perform whatever operations you want on the data that is being read in. Walter's answer includes the variance calculations you described.
% Set parameters
file = 'x0.txt'; % The file you're reading; it's better to use a full path such as 'C:\Users\name\Documents\x0.txt'
nrows = 5; % number of rows to read in at a time (you can change this to 100000 or whatever)
% Initialize the file for reading
fid = fopen(file);
% Set some loop variables
ignore = 0; % number of rows to ignore at the beginning (headers etc)
done = false; % flag that detects when file is complete
% Loop through until you've read all lines of file. When that
% happens, "done" will be switched to true and the while-loop
% will end.
while ~done
    % Read the next 'nrows'; C will be a cell array of strings.
    C = textscan(fid, '%s', nrows, 'delimiter', '\n', 'headerlines', ignore);
    % textscan continues from the current file position, so header
    % lines only need to be skipped on the first read.
    ignore = 0;
    % If C is completely empty, you've finished the file.
    if cellfun(@isempty, C)
        % C has no data so the file is finished.
        % Set the "done" flag to true so the while-loop ends
        done = true;
        % Skip the rest of this iteration.
        continue
    end
    % Convert C from a cell array of strings to a numeric vector
    % This assumes the content of the strings are numbers.
    nVec = str2double(C{:});
    % % % % % % % % % % % % % % % % % % % %
    %                                     %
    %  HERE IS WHERE YOU'LL DO WHATEVER   %
    %  OPERATIONS YOU NEED TO DO WITH     %
    %  THE VALUES YOU JUST READ IN.       %
    %                                     %
    % % % % % % % % % % % % % % % % % % % %
end
% Close file
fclose(fid);
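As one possible way to fill in the placeholder section of the demo above, the sketch below computes the sample variance of each chunk with var() and collects the results in a column vector. It is an illustration under stated assumptions rather than a finished solution: the file name x0_demo.txt and the variable chunk_var are made up for this example, and the ten-line file is generated only so the sketch can run by itself.

```matlab
% Build a small demo file (hypothetical name) with one number per line.
file = fullfile(tempdir, 'x0_demo.txt');
fid = fopen(file, 'w');
fprintf(fid, '%d\n', 1:10);
fclose(fid);

nrows = 4;         % rows per chunk; this would be 100000 for the real data
chunk_var = [];    % one variance per chunk, grows as chunks are read

fid = fopen(file, 'r');
while true
    % Read up to nrows lines as text, as in the demo above.
    C = textscan(fid, '%s', nrows, 'Delimiter', '\n');
    if isempty(C{1})
        break      % nothing left to read
    end
    nVec = str2double(C{1});           % text -> numeric column vector
    chunk_var(end+1, 1) = var(nVec);   %#ok<SAGROW> sample variance of this chunk
end
fclose(fid);

disp(chunk_var)
```

With ten values and nrows = 4, the last chunk holds only two values, so its variance is computed from fewer points than the others. Whether to keep or discard a short final chunk is a choice for the analysis.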

  2 Comments

I do not see the purpose of the frewind(). textscan() will continue from the current file position.
Nice catch, Walter. I originally copied similar code that uses fgetl() and adapted it to this, but I guess I overlooked the frewind. I edited and fixed it. Thanks.



Answer by Walter Roberson
on 20 Aug 2019

vary_every = 100000; % data points per variance value, as asked in the question
expected_buffers = 10000; % 1000000000 / 100000
group_every = 360;
variances = zeros(1, expected_buffers);
filename = 'YourFileNameHere';
[fid, msg] = fopen(filename, 'r');
if fid < 0
    error('Failed to open file "%s" because "%s"', filename, msg)
end
buffcount = 0;
while true
    this_buffer = cell2mat( textscan(fid, '%f', vary_every) );
    if isempty(this_buffer); break; end % end of file
    buffcount = buffcount + 1;
    variances(buffcount) = var(this_buffer); % var() is MATLAB's built-in sample variance
end
fclose(fid);
variances(buffcount+1:expected_buffers) = []; % trim off any extra
leftover = mod(buffcount, group_every);
if leftover ~= 0
    % pad with NaN so the count is a multiple of group_every
    variances(end+1:end+group_every-leftover) = nan;
end
variances = reshape(variances, group_every, []);
disp(variances)
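To make the padding and reshape step at the end of this answer easier to follow, here is a small worked sketch with made-up numbers: seven stand-in values grouped three per column instead of 360. It also writes the grouped values to a tab-separated text file with fprintf, since the question asks for text output. The output file name and the variable grouped are hypothetical.

```matlab
variances   = 1:7;   % stand-in for seven computed variances
group_every = 3;     % 360 in the real analysis

% Pad with NaN so the number of values is a multiple of group_every.
leftover = mod(numel(variances), group_every);
if leftover ~= 0
    variances(end+1 : end+group_every-leftover) = nan;
end

% Each column now holds one group of group_every values.
grouped = reshape(variances, group_every, []);
disp(grouped)

% Write one row of 'grouped' per line, columns separated by tabs.
% fprintf consumes its data column by column, hence the transpose.
outname = fullfile(tempdir, 'demo_grouped.txt');   % hypothetical name
fid = fopen(outname, 'w');
fmt = [repmat('%f\t', 1, size(grouped, 2) - 1) '%f\n'];
fprintf(fid, fmt, grouped.');
fclose(fid);
```

Here the seventh value starts a third column and the two missing entries are filled with NaN, which is what the leftover logic in the answer does for the last, incomplete group of 360.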
