# How to operate on large arrays of structs

28 views (last 30 days)
vthuongt on 29 Sep 2015
Commented: Jan on 30 Sep 2015
Hello, I am doing a Markov Chain Monte Carlo simulation where I want to store many sampled states. I have the following data structure:
state(1) = struct('dim', 3, 'coords', rand(3,1), 'vals', rand(3,1));
state(10000) = struct('dim', [], 'coords', [], 'vals', []);
for i = 2:10000
    state(i) = generateNewState(state(i-1));
end
How can I store my generated state data and proceed with the next 10000 states? Then append them to the existing .mat file and go on until I have generated, say, 1e10 states, and then use the data to do calculations? My problem is that the dimension of the structs is not fixed (it can be up to 10000). The other problem is that I don't want to load the whole .mat file into memory, since it wouldn't fit. I would like to process the data in chunks. By processing I mean calculating the mean, variance, covariance, max, and min, extracting every 100th sample, creating a histogram without knowing the domain in advance, etc.
I already tried the map-reduce formalism, but there I had to limit myself to a maximum dimension and fill up every struct of smaller dimension with NaNs in order to store the structs as a table in a CSV file. But this can't be the right way to do it, because maybe I will only need 10 dimensions while 10000 are theoretically possible, so I would have a really sparse table. It just depends on the data, which I don't know in advance. So does anybody have a good idea how to solve this?
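One way to get chunk-wise statistics without fixing a maximum dimension is to keep running accumulators per dimension and update them as each chunk is processed; only dimensions that actually occur get an entry, so no NaN padding is needed. A minimal sketch (the loader `readChunk` and the chunk count `nChunks` are placeholders for your own chunked reader, not existing functions):

```matlab
% Running mean/variance over chunks of states with varying 'dim'.
% acc is a struct array indexed by dimension.
acc = struct('n', {}, 'sum', {}, 'sumsq', {});
for chunk = 1:nChunks              % nChunks: however many chunks you have
    states = readChunk(chunk);     % your own loader (e.g. fread-based)
    for k = 1:numel(states)
        d = states(k).dim;
        if d > numel(acc) || isempty(acc(d).n)   % first state of this dimension
            acc(d).n     = 0;
            acc(d).sum   = zeros(d, 1);
            acc(d).sumsq = zeros(d, 1);
        end
        v = states(k).vals;
        acc(d).n     = acc(d).n + 1;
        acc(d).sum   = acc(d).sum + v;
        acc(d).sumsq = acc(d).sumsq + v.^2;
    end
end
% Per-dimension mean and (biased) variance, e.g. for dimension 3:
d      = 3;
mu     = acc(d).sum / acc(d).n;
sigma2 = acc(d).sumsq / acc(d).n - mu.^2;
```

The same pattern extends to min/max (elementwise `min`/`max` against the accumulator) and to covariance (accumulate `v*v'` instead of `v.^2`).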
Guillaume on 30 Sep 2015
The overhead has nothing to do with the cell. It's simply due to the fact that you allocate 15000x3 matrices for your structure, all of which need memory to track their size, type, etc.
With your example, the structure uses about 5 MB more (5,040,192 bytes exactly in 2015b) than the matrix.
But, yes, if the data you store takes over 3 GB, 5 MB becomes less significant.
You can of course store sparse matrices in a struct, but the overhead of the sparse matrices may be more than you save.

Jan on 29 Sep 2015
I do not understand the question. How can you store the data? The shown code works, doesn't it? So is the first question solved already? You proceed with the next 10,000 by simply calling your code again. You can store the state variables in a cell array. There are different methods to append this to an existing MAT file, but a plain binary file seems more efficient in this case, especially if you only want to read it partially.
A compact and efficient file format could be:
number of dimensions as uint64
coordinates as double vector
vals as double vector
This can be read by a simple loop. You can skip a record, or read as many records as fit into memory. Using the powerful MAT-file machinery for this job is far too complicated.
Jan on 30 Sep 2015
There are no standard functions for your specific job. But they are easy to write using fwrite and fread:
% For writing the array:
fid = fopen(FileName, 'w');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: total number of elements:
fwrite(fid, numel(state), 'uint64');
for k = 1:numel(state)
    fwrite(fid, state(k).dim, 'uint64');
    fwrite(fid, state(k).coords, 'double');
    fwrite(fid, state(k).vals, 'double');
end
fclose(fid);
% For reading the array:
fid = fopen(FileName, 'r');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: total number of elements:
num = fread(fid, 1, 'uint64');
% Pre-allocate:
state(num) = struct('dim', [], 'coords', [], 'vals', []);
for k = 1:num
    dim = fread(fid, 1, 'uint64');
    state(k).dim = dim;
    state(k).coords = fread(fid, dim, 'double');
    state(k).vals = fread(fid, dim, 'double');
end
fclose(fid);
I cannot debug this, because I cannot run Matlab currently. I think the strategy is clear, so please adjust this to your needs.
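Because each record stores its own `dim`, a reader can also skip records it does not need (e.g. to extract every 100th sample) by seeking past the record body instead of reading it. A sketch building on the file format above, with the stride 100 chosen to match the question:

```matlab
% Extract every 100th state from the binary file without reading the rest.
% Record layout: dim (uint64, 8 bytes), coords (dim doubles), vals (dim doubles).
fid = fopen(FileName, 'r');
if fid == -1, error('Cannot open file: %s', FileName); end
num  = fread(fid, 1, 'uint64');
kept = struct('dim', {}, 'coords', {}, 'vals', {});
for k = 1:num
    dim = fread(fid, 1, 'uint64');
    if mod(k, 100) == 1                    % keep the 1st, 101st, 201st, ...
        kept(end+1).dim  = dim;            %#ok<AGROW>
        kept(end).coords = fread(fid, dim, 'double');
        kept(end).vals   = fread(fid, dim, 'double');
    else
        fseek(fid, 16 * dim, 'cof');       % skip 2*dim doubles (8 bytes each)
    end
end
fclose(fid);
```

Appending a further chunk of states later is just a matter of opening the file with `fopen(FileName, 'a')` and writing more records, though the element count in the header then needs to be updated (or dropped in favor of reading until end-of-file).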