How to write data to a binary file at a specific position?

Hello,
Let us say that my data looks like this -
data = [1,1,1,1,1;...
2,2,2,2,2; ...
3,3,3,3,3];
I would like to write this data to a binary file such that it looks like - [1;2;3;1;2;3;1;2;3;1;2;3 ... and so on].
Now for a small file, I can easily do this with fwrite(fp, data(:), 'int16'); However, for a very large data file (where the data size is 100*1e10 or more), it becomes extraordinarily slow. The raw data is stored as separate files for each row, so I can read the data row by row. So, is it possible to write data to a binary file in a specific position?
Thank you for help!

6 Comments

"is it possible to write data to a binary file in a specific position"
See fseek().
Is it possible to give an array of positions in the file? fseek only allows a single offset, whereas here I would like to use a large array of positions. :(
Writing a 100x1e10 array as UINT16 means 2 TB. Of course writing this takes time. But I'm also impressed by your computer, which is able to hold these 2 TB in RAM beforehand. Is this really the case?
Writing or importing data row-wise is time-consuming too, because MATLAB stores values in column-wise order. But according to your question, you have decided on this structure. What exactly does "write data to a binary file in a specific position" mean now? Of course this works with fseek and fwrite, as _ has mentioned already. But this is much slower than writing the data in contiguous blocks.
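Regarding the "array of positions": fseek takes a single offset, but the skip argument of fwrite gives regularly spaced positions, which is what an interleaved row needs. A minimal sketch, assuming a made-up scratch file and tiny dummy sizes (fname, nRow, nCol and the row content are hypothetical, not taken from the question). Note that fwrite skips the given number of bytes before each value, so the first element is written on its own:

```matlab
% Sketch only: file name and sizes are made up for illustration.
fname = [tempname, '.bin'];      % hypothetical scratch file
nRow  = 3;                       % rows interleaved in the file
nCol  = 5;                       % elements per row
width = 2;                       % bytes per int16 element

fid = fopen(fname, 'w+');        % create the file for writing and reading
assert(fid > 0, 'Cannot open file: %s', fname);
fwrite(fid, zeros(nRow * nCol, 1, 'int16'), 'int16');   % pre-size with zeros

k    = 2;                                   % row to insert
row  = int16(k) * ones(nCol, 1, 'int16');   % dummy content of row k
skip = (nRow - 1) * width;                  % bytes between two elements of one row

% fwrite applies the skip BEFORE each value, so the first element is
% written separately at its own offset.
fseek(fid, (k - 1) * width, 'bof');
fwrite(fid, row(1), 'int16');
fwrite(fid, row(2:end), 'int16', skip);

frewind(fid);
out = fread(fid, Inf, '*int16');   % read the whole file back for inspection
fclose(fid);
delete(fname);
```

Here out(k:nRow:end) should hold the inserted row, while the other slots keep their zeros. This only shows that the positioning works; it says nothing about speed.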
NeuronDB on 25 Mar 2022
Edited: NeuronDB on 25 Mar 2022
Here is some dummy data I made up; output_data is how the output should look -
data = double(ones(10,1e7) .* [1:10]');
output_data = data(:);
One (slow) way to solve this is to create an output array in memory, assign the data to the desired positions and then write it to the file. But this is very slow!
rowsize = 1e7;
nrows = 10;
% make an array for output
output_data = zeros(nrows*rowsize,1);
for i=1:nrows
    this_row = i*ones(1,rowsize); % dummy data for this row
    output_data(i:nrows:end) = this_row;
end
% create a file and write data to file
fname = 'data.dat';
fid = fopen(fname, 'Wb');
fwrite(fid, output_data(:), 'int16')
fclose(fid)
This is what I want to avoid. I want to read each row, write it to the file at the correct position, and then move on to the next row, so that I avoid keeping large variables in memory.
Jan on 25 Mar 2022
Edited: Jan on 25 Mar 2022
Why do you use dummy data, if you have already created some test data before?
What is the purpose of this code:
output_data = zeros(nrows*rowsize,1);
for i = 1:nrows
    this_row = data(i, :); % This is meant, isn't it?
    output_data(i:nrows:end) = this_row;
end
It is an expensive version of:
output_data = data(:);
But you have written this line already. Therefore I do not understand what the second piece of code is meant to demonstrate. Simply omit the expensive loop.
Let's start with some test data:
rowsize = 1e7;
nrows = 10;
data = randi([0, 32767], nrows, rowsize, 'int16');
What do you want to do now? How does the shown code relate to the question about writing data at specific positions in a file?
By the way, fopen has not had a 'b' format for over 20 years now. Simply use 'W'.
Hi Jan,
The raw data is stored in separate files for each row. So I need to loop through the files to read each row, append the data in the workspace with cat(1, data, new_row), then do data(:), then write to the binary file. But this requires storing the large arrays in the workspace before writing them to the data file. I would like to just read the first row, write it to the data file, then read the next row and so on, to save memory and speed things up!
Thank you in advance!


Accepted Answer

Walter Roberson on 25 Mar 2022

2 votes

First (and this is important!) write a block of zeros that is the same number of bytes as the final array size. The writing will not work properly if you omit this step. But you do not need to create an array that size: you could loop writing out a buffer of zeros until enough had accumulated. Do not write extra data: there is no way in MATLAB of getting rid of the extra data once it is written.
Now, repeat:
fseek to ((row number minus 1) times (bytes per element)) from beginning of file.
fwrite() the content of the row, making sure to use the precision argument to control how the data is written, and making sure to use the "skip" option. The value of the skip should be ((total rows minus 1) times (bytes per element))
Go on to the next row and repeat.
This will not be fast at all. Every page that is being updated will have to be read by MATLAB, and MATLAB will have to do the modification in its internal buffers and write the results out again.
It is not possible at the MATLAB level to "leave holes" that you gradually fill in. And even if it were, MATLAB would still need to do the continual read/modify/write cycle.
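The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the scratch file, the tiny sizes, the zero-buffer length and the dummy rows (standing in for reading row k from its own file) are all made up. One detail matters in practice: fwrite applies the skip before each value, so the first element of a row is written without a skip and the remaining elements with it.

```matlab
% Sketch only: sizes, file name and dummy rows are hypothetical.
nRow  = 4;                       % total number of rows
nCol  = 1000;                    % elements per row
width = 2;                       % bytes per int16 element
fname = [tempname, '.bin'];      % hypothetical output file

fid = fopen(fname, 'w+');
assert(fid > 0, 'Cannot open file: %s', fname);

% Step 1: pre-size the file by writing a small zero buffer repeatedly,
% so no array of the final size is ever created. Write no extra data.
buffer    = zeros(256, 1, 'int16');
remaining = nRow * nCol;
while remaining > 0
    n = min(remaining, numel(buffer));
    fwrite(fid, buffer(1:n), 'int16');
    remaining = remaining - n;
end

% Step 2: insert the rows one at a time.
skip = (nRow - 1) * width;       % bytes between two elements of one row
for k = 1:nRow
    row = int16(k) * ones(nCol, 1, 'int16');   % stand-in for reading row k
    fseek(fid, (k - 1) * width, 'bof');        % offset of the row's first element
    fwrite(fid, row(1), 'int16');              % first element without a skip
    fwrite(fid, row(2:end), 'int16', skip);    % the rest, skipping before each value
end

% Read everything back in column order to inspect the layout.
frewind(fid);
result = fread(fid, [nRow, nCol], '*int16');
fclose(fid);
delete(fname);
```

With these dummy rows every column of result should read 1, 2, 3, 4. As noted above, expect this access pattern to be slow on large files.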

2 Comments

Jan on 26 Mar 2022
Edited: Jan on 26 Mar 2022
I've written equivalent code. There was no problem when I omitted the step of writing zeros first. Iteratively expanding the file is not expensive either, because the existing data are not rewritten. In my tests it is even slower to pre-allocate the file.
To my surprise, there is no method to crop a file in MATLAB, as you say. See FileExchange: FileResize
In MATLAB, fseek beyond the end of a file does not work, at least historically.


More Answers (1)

Jan on 26 Mar 2022
Edited: Jan on 26 Mar 2022
% Some test data storing the rows in different files:
nRow = 10;
nCol = 1e6;
for k = 1:nRow
    [fid, msg] = fopen(sprintf('file%02d.bin', k), 'W');
    assert(fid > 0, msg);
    data = randi([0, 32767], nCol, 1, 'int16');
    fwrite(fid, data, 'int16');
    fclose(fid);
end
% *** Version 1: insert data in chunks into the file:
tic
% Create the output file:
[ofid, msg] = fopen(sprintf('matrix1.bin'), 'W');
assert(ofid > 0, msg);
% Pre-allocate the output file (not really needed):
width = 2; % Bytes per element
skip = (nRow - 1) * width;
fwrite(ofid, 0, 'int16', (nRow * nCol - 1) * width);
% Loop over input files:
for k = 1:nRow
    [ifid, msg] = fopen(sprintf('file%02d.bin', k), 'r');
    assert(ifid > 0, msg);
    data = fread(ifid, Inf, '*int16');
    fclose(ifid);
    % Insert in output file in chunks:
    fseek(ofid, (k-1) * width, 'bof');
    fwrite(ofid, data(1), 'int16');
    fseek(ofid, k * width, 'bof');
    fwrite(ofid, data(2:nCol), 'int16', skip);
end
fclose(ofid);
toc;
% *** Version 2: Join array in the memory:
tic
% Loop over input files:
data = zeros(nRow, nCol, 'int16');
for k = 1:nRow
    [ifid, msg] = fopen(sprintf('file%02d.bin', k), 'r');
    assert(ifid > 0, msg);
    data(k, :) = fread(ifid, Inf, '*int16');
    fclose(ifid);
end
% Write output file at once:
[ofid, msg] = fopen(sprintf('matrix2.bin'), 'W');
assert(ofid > 0, msg);
fwrite(ofid, data, 'int16');
fclose(ofid);
toc;
Timings on my i5, Matlab R2018b, SSD:
Elapsed time is 46.099363 seconds. % Insert on disk
Elapsed time is 0.060289 seconds. % Insert in memory
This means that joining the data in RAM is much faster than writing the data with skipping.
This might be different if you convert the imported data to doubles, which use 8 bytes per element instead of 2 bytes for int16. The available RAM may then be exhausted, and the computer would store the data in the much slower virtual memory.
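If the full matrix does not fit into RAM, a possible middle ground (not timed here, and not one of the two versions above) is to join the data block-wise: read a limited number of columns from every row file, assemble that block in memory, and append it to the output as one contiguous write. A minimal sketch under assumptions: the scratch folder, the tiny sizes, blockCols and the name matrix3.bin are hypothetical, and it assumes that all row files can be kept open at the same time.

```matlab
% Sketch only: folder, sizes and file names are made up for illustration.
nRow      = 4;                   % number of row files
nCol      = 1000;                % elements per row
blockCols = 300;                 % columns held in memory at once (assumption)
folder    = tempname;            % hypothetical scratch folder
mkdir(folder);

% Create small test row files, similar to the test data above.
expected = randi([0, 32767], nRow, nCol, 'int16');
for k = 1:nRow
    fid = fopen(fullfile(folder, sprintf('file%02d.bin', k)), 'w');
    assert(fid > 0, 'Cannot create row file %d', k);
    fwrite(fid, expected(k, :), 'int16');
    fclose(fid);
end

% Keep all row files open so each one is read sequentially.
ifid = zeros(1, nRow);
for k = 1:nRow
    ifid(k) = fopen(fullfile(folder, sprintf('file%02d.bin', k)), 'r');
    assert(ifid(k) > 0, 'Cannot open row file %d', k);
end
outName = fullfile(folder, 'matrix3.bin');
ofid = fopen(outName, 'w');
assert(ofid > 0, 'Cannot create output file');

for c0 = 1:blockCols:nCol
    n     = min(blockCols, nCol - c0 + 1);      % the last block may be shorter
    block = zeros(nRow, n, 'int16');
    for k = 1:nRow
        block(k, :) = fread(ifid(k), n, '*int16');   % next n values of row k
    end
    fwrite(ofid, block, 'int16');               % column order, contiguous append
end

fclose(ofid);
for k = 1:nRow
    fclose(ifid(k));
end

% Read the result back to compare it with the test data.
fid = fopen(outName, 'r');
result = fread(fid, [nRow, nCol], '*int16');
fclose(fid);
rmdir(folder, 's');                             % remove the scratch folder
```

The memory footprint is about nRow * blockCols elements instead of the full matrix, and the output file is only ever appended to, so no fseek or skip is needed.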
