How to operate with large arrays of structs

vthuongt on 29 Sep 2015
Commented: Jan on 30 Sep 2015
Hello, I am doing a Markov Chain Monte Carlo Simulation where I want to store many sampled states. I have the following data structure:
state(1) = struct('dim', 3, 'coords', rand(3,1), 'vals', rand(3,1));
state(10000) = struct('dim', [], 'coords', [], 'vals', []);
for i = 2:10000
    state(i) = generateNewState(state(i-1));
end
How can I store the generated state data, proceed with the next 10000 states, append them to the existing .mat file, and go on until I have generated, say, 1e10 states, and then use the data for calculations? My problem is that the dimension of each struct (up to 10000) is not fixed. The other problem is that I don't want to load the whole MAT file into memory, since it wouldn't fit; I would like to process the data in chunks. By processing I mean calculating the mean, variance, covariance, max and min, extracting every 100th sample, creating a histogram without knowing the domain in advance, etc.
I already tried the map-reduce formalism, but there I had to limit myself to a maximum dimension and pad every struct of smaller dimension with NaNs in order to store the structs as a table in a CSV file. That can't be the right way to do it, because maybe I will only need 10 dimensions even though 10000 are theoretically possible, so I would end up with a really sparse table... It just depends on the data, which I don't know in advance. Does anybody have a good idea how to solve this?
Thanks in advance!
  4 Comments
vthuongt on 30 Sep 2015
Edited: vthuongt on 30 Sep 2015
I did the following, more realistic comparison:
state(10000) = struct('x', [], 'y', [], 'z', []);
for i = 1:10000
    state(i) = struct('x', rand(1,10000), 'y', rand(1,10000), 'z', rand(1,10000));
end
vs.
coord(10000, 30000) = 0;
for i = 1:10000
    coord(i,:) = rand(1, 3*10000);
end
My result was a bit unexpected, because the matrix version was a lot slower! So I will just stick to my struct version. I also experienced some weird memory behaviour: when I preallocated my matrix with "coord(10000,30000) = 0", I would see a linear increase in memory during the for loop, but when I preallocate with "coord = zeros(10000,30000)", I don't see an instant increase in memory usage and it stays constant during the loop. Also, the first option takes longer than the second one. So what happens internally?
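For reference, a minimal timing harness along these lines might look as follows (sizes reduced, variable names are only examples; run it in a fresh workspace and expect the numbers to differ between machines and releases):
% Struct-of-vectors version:
tic
s(1000) = struct('x', [], 'y', [], 'z', []);
for i = 1:1000
    s(i) = struct('x', rand(1,1000), 'y', rand(1,1000), 'z', rand(1,1000));
end
tStruct = toc;
% Single-matrix version:
tic
c = zeros(1000, 3000);
for i = 1:1000
    c(i,:) = rand(1, 3000);
end
tMatrix = toc;
fprintf('struct: %.3f s, matrix: %.3f s\n', tStruct, tMatrix);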
Guillaume on 30 Sep 2015
The overhead has nothing to do with the cell. It's simply due to the fact that you allocate 15000x3 matrices for your structure, all of which need memory to track their size, type, etc.
With your example, the structure uses about 5 MB more (5,040,192 bytes exactly in 2015b) than the matrix.
But, yes, if the data you store takes over 3 GB, 5 MB becomes less significant.
You can of course store sparse matrices in a struct, but the overhead of the sparse matrices may be more than you save.
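To see where the roughly 5 MB of overhead comes from, a small sketch using whos can compare the two layouts directly (sizes reduced; the variable names are only examples):
% Same raw data stored once as a struct array of small vectors and once
% as one contiguous matrix; whos reports the bytes used by each variable.
s(100) = struct('x', [], 'y', [], 'z', []);
for i = 1:100
    s(i) = struct('x', rand(1,100), 'y', rand(1,100), 'z', rand(1,100));
end
m = rand(100, 300);
infoS = whos('s');
infoM = whos('m');
fprintf('struct: %d bytes, matrix: %d bytes, overhead: %d bytes\n', ...
        infoS.bytes, infoM.bytes, infoS.bytes - infoM.bytes);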

Answers (1)

Jan on 29 Sep 2015
I do not understand the question. How can you store the data? The shown code works, doesn't it? So is the first question solved already? You proceed with the next 10,000 by simply calling your code again. You can store the state variables in a cell array, and there are different methods to append this to an existing MAT file. But a binary file seems more efficient in this case, especially if you only want to read it partially.
A compact and efficient file format could be:
number of dimensions as uint64
coordinates as double vector
vals as double vector
This can be read by a simple loop. You can skip a record or read only as many records as fit into memory. Using the powerful MAT file format for this job is far too complicated.
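As an illustration, a minimal reading loop that skips records you do not need (for example keeping only every 100th sample, as asked in the question) might look like this, assuming the record layout listed above and reading until the end of the file; the file name is only a placeholder:
fid = fopen('states.bin', 'r');
if fid == -1, error('Cannot open file'); end
k = 0;
while true
    dim = fread(fid, 1, 'uint64');
    if isempty(dim), break; end           % end of file reached
    k = k + 1;
    if mod(k, 100) == 0                   % keep only every 100th record
        coords = fread(fid, dim, 'double');
        vals   = fread(fid, dim, 'double');
        % ... process coords and vals here ...
    else
        fseek(fid, 2 * dim * 8, 'cof');   % skip 2*dim doubles (8 bytes each)
    end
end
fclose(fid);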
  2 Comments
vthuongt on 30 Sep 2015
Could you please give me some details on this? Are there any routines for saving a struct in a binary file, and especially for reading the data from a binary file back into memory?
Jan on 30 Sep 2015
There are no standard functions for your specific job. But they are easy to write using fwrite and fread:
% For writing the array:
fid = fopen(FileName, 'w');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements:
fwrite(fid, numel(state), 'uint64');
for k = 1:numel(state)
    fwrite(fid, state(k).dim, 'uint64');
    fwrite(fid, state(k).coords, 'double');
    fwrite(fid, state(k).vals, 'double');
end
fclose(fid);
% For reading:
fid = fopen(FileName, 'r');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements:
num = fread(fid, 1, 'uint64');
% Pre-allocate:
state(num) = struct('dim', [], 'coords', [], 'vals', []);
for k = 1:num
    dim = fread(fid, 1, 'uint64');
    state(k).dim = dim;
    state(k).coords = fread(fid, dim, 'double');
    state(k).vals = fread(fid, dim, 'double');
end
fclose(fid);
I cannot debug this, because I cannot run Matlab currently. I think the strategy is clear, so please adjust this to your needs.
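Building on this, a minimal sketch of the chunk-wise processing asked for in the question (here a running mean and variance of the vals vectors, reading one record at a time so the whole file never has to fit into memory) could look like the following; it assumes exactly the layout written by the code above, and the file name is only a placeholder:
fid = fopen('states.bin', 'r');
if fid == -1, error('Cannot open file'); end
num = fread(fid, 1, 'uint64');            % total number of records (header)
n  = 0;                                   % number of values seen so far
s1 = 0;                                   % running sum
s2 = 0;                                   % running sum of squares
for k = 1:num
    dim  = fread(fid, 1, 'uint64');
    fseek(fid, dim * 8, 'cof');           % skip the coords block
    vals = fread(fid, dim, 'double');
    n  = n  + dim;
    s1 = s1 + sum(vals);
    s2 = s2 + sum(vals.^2);
end
fclose(fid);
valMean = s1 / n;
valVar  = s2 / n - valMean^2;             % population variance; adjust if needed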
