How to operate with large arrays of structs

vthuongt on 29 Sep 2015
Commented: Jan on 30 Sep 2015
Hello, I am doing a Markov Chain Monte Carlo Simulation where I want to store many sampled states. I have the following data structure:
state(1) = struct('dim', 3, 'coords', rand(3,1), 'vals', rand(3,1));
state(10000) = struct('dim', [], 'coords', [], 'vals', []);
for i = 2:10000
    state(i) = generateNewState(state(i-1));
end
How can I store the generated state data, proceed with the next 10000 states, append them to the existing .mat file, and go on until I have generated, say, 1e10 states, and then use the data for calculations? My problem is that the dimension of each struct (up to 10000) is not fixed. The other problem is that I don't want to load the whole MAT file into memory, since it wouldn't fit; I would like to process the data in chunks. By processing I mean calculating the mean, variance, covariance, max and min, extracting every 100th sample, creating a histogram without knowing the domain in advance, etc.
I already tried the map-reduce formalism, but there I had to limit myself to a maximum dimension and pad every struct of smaller dimension with NaNs in order to store the structs as a table in a CSV file. That can't be the right way to do it, because maybe I will only need 10 dimensions even though 10000 are theoretically possible, so I would end up with a really sparse table... It just depends on the data, which I don't know in advance. Does anybody have a good idea how to solve this?
Thanks in advance!
  4 Comments
vthuongt on 30 Sep 2015
Edited: vthuongt on 30 Sep 2015
I did the following, more realistic comparison:
state(10000) = struct('x', [], 'y', [], 'z', []);
for i = 1:10000
    state(i) = struct('x', rand(1,10000), 'y', rand(1,10000), 'z', rand(1,10000));
end
vs.
coord(10000, 30000) = 0;
for i = 1:10000
    coord(i,:) = rand(1, 3*10000);
end
My result was a bit unexpected, because the matrix version was a lot slower! So I will just stick to my struct version. I also experienced some weird memory behaviour: when I preallocated my matrix with "coord(10000,30000) = 0", I would see a linear increase in memory during the for loop, but when I preallocate with "coord = zeros(10000,30000)", I don't see an instant increase in memory usage and it stays constant during the loop. Also, the first option takes longer than the second one. So what happens internally?
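For reference, a minimal timing harness along these lines might look as follows (sizes reduced, variable names are only examples; run it in a fresh workspace and expect the numbers to differ between machines and releases):
% Struct-of-vectors version:
tic
s(1000) = struct('x', [], 'y', [], 'z', []);
for i = 1:1000
    s(i) = struct('x', rand(1,1000), 'y', rand(1,1000), 'z', rand(1,1000));
end
tStruct = toc;
% Single-matrix version:
tic
c = zeros(1000, 3000);
for i = 1:1000
    c(i,:) = rand(1, 3000);
end
tMatrix = toc;
fprintf('struct: %.3f s, matrix: %.3f s\n', tStruct, tMatrix);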
Guillaume on 30 Sep 2015
The overhead has nothing to do with the cell. It's simply due to the fact that you allocate 15000x3 matrices for your structure, all of which need memory to track their size, type, etc.
With your example, the structure uses about 5 MB more (5,040,192 bytes exactly in 2015b) than the matrix.
But, yes, if the data you store takes over 3 GB, 5 MB becomes less significant.
You can of course store sparse matrices in a struct, but the overhead of the sparse matrices may be more than you save.
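To see where the roughly 5 MB of overhead comes from, a small sketch using whos can compare the two layouts directly (sizes reduced; the variable names are only examples):
% Same raw data stored once as a struct array of small vectors and once
% as one contiguous matrix; whos reports the bytes used by each variable.
s(100) = struct('x', [], 'y', [], 'z', []);
for i = 1:100
    s(i) = struct('x', rand(1,100), 'y', rand(1,100), 'z', rand(1,100));
end
m = rand(100, 300);
infoS = whos('s');
infoM = whos('m');
fprintf('struct: %d bytes, matrix: %d bytes, overhead: %d bytes\n', ...
        infoS.bytes, infoM.bytes, infoS.bytes - infoM.bytes);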

Answers (1)

Jan on 29 Sep 2015
I do not understand the question. How can you store the data? The shown code works, doesn't it? So is the first question solved already? You proceed with the next 10,000 by simply calling your code again. You can store the state variables in a cell array, and there are different methods to append this to an existing MAT file. But a binary file seems more efficient in this case, especially if you only want to read it partially.
A compact and efficient file format could be:
number of dimensions as uint64
coordinates as double vector
vals as double vector
This can be read by a simple loop. You can skip a record or read only as many records as fit into memory. Using the powerful MAT file format for this job is far too complicated.
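As an illustration, a minimal reading loop that skips records you do not need (for example keeping only every 100th sample, as asked in the question) might look like this, assuming the record layout listed above and reading until the end of the file; the file name is only a placeholder:
fid = fopen('states.bin', 'r');
if fid == -1, error('Cannot open file'); end
k = 0;
while true
    dim = fread(fid, 1, 'uint64');
    if isempty(dim), break; end           % end of file reached
    k = k + 1;
    if mod(k, 100) == 0                   % keep only every 100th record
        coords = fread(fid, dim, 'double');
        vals   = fread(fid, dim, 'double');
        % ... process coords and vals here ...
    else
        fseek(fid, 2 * dim * 8, 'cof');   % skip 2*dim doubles (8 bytes each)
    end
end
fclose(fid);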
  2 Comments
vthuongt on 30 Sep 2015
Could you please give me some details on this? Are there any routines for saving a struct in a binary file, and especially for reading the data from a binary file back into memory?
Jan on 30 Sep 2015
There are no standard functions for your specific job. But they are easy to write using fwrite and fread:
% For writing the array:
fid = fopen(FileName, 'w');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements:
fwrite(fid, numel(state), 'uint64');
for k = 1:numel(state)
    fwrite(fid, state(k).dim, 'uint64');
    fwrite(fid, state(k).coords, 'double');
    fwrite(fid, state(k).vals, 'double');
end
fclose(fid);
% For reading:
fid = fopen(FileName, 'r');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements:
num = fread(fid, 1, 'uint64');
% Pre-allocate:
state(num) = struct('dim', [], 'coords', [], 'vals', []);
for k = 1:num
    dim = fread(fid, 1, 'uint64');
    state(k).dim = dim;
    state(k).coords = fread(fid, dim, 'double');
    state(k).vals = fread(fid, dim, 'double');
end
fclose(fid);
I cannot debug this, because I cannot run Matlab currently. I think the strategy is clear, so please adjust this to your needs.
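Building on this, a minimal sketch of the chunk-wise processing asked for in the question (here a running mean and variance of the vals vectors, reading one record at a time so the whole file never has to fit into memory) could look like the following; it assumes exactly the layout written by the code above, and the file name is only a placeholder:
fid = fopen('states.bin', 'r');
if fid == -1, error('Cannot open file'); end
num = fread(fid, 1, 'uint64');            % total number of records (header)
n  = 0;                                   % number of values seen so far
s1 = 0;                                   % running sum
s2 = 0;                                   % running sum of squares
for k = 1:num
    dim  = fread(fid, 1, 'uint64');
    fseek(fid, dim * 8, 'cof');           % skip the coords block
    vals = fread(fid, dim, 'double');
    n  = n  + dim;
    s1 = s1 + sum(vals);
    s2 = s2 + sum(vals.^2);
end
fclose(fid);
valMean = s1 / n;
valVar  = s2 / n - valMean^2;             % population variance; adjust if needed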
