Fast access/concatenation of large array structure

Question

D. Plotnick on 13 Apr 2017

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/335272-fast-access-concatenation-of-large-array-structure

Commented: Matt J on 19 Apr 2017

I have a large structure array (500K+ items), and I wish to access certain fields of that array and concatenate the results. Below is a placeholder example.

 A{1}.time = 1500;
 A{1}.data.temp = 70;
 A{1}.data.humidity = 20;
 A{2}.time = 1501;
 A{2}.data.temp = 73;
 A{2}.data.humidity = 19;

etc., until we have 500,000 of these. (I have made it a cell array since the actual entries differ in my data, and I have other code that goes through and grabs just the cells we want.)

Now, I want to access e.g. all of the 'data' and concatenate it so that I have a simple vector I can plot. Currently this is done using a loop, but that is very slow. Is there a faster way to do this than some version of the below:

 fieldNames = fieldnames(A{1}.data);
 for ii = 1:numel(fieldNames)
     % collect one sub-field of 'data' from every cell of A
     out.(fieldNames{ii}) = ...
         cat(1,cellfun(@(x) getField(x,'data',fieldNames{ii}), A));
 end

where

function out = getField(in, fieldname1,fieldname2)
   out = in.(fieldname1).(fieldname2);
end

Again, this certainly works, but for extremely large datasets with lots of fields it becomes very slow. I bet there is a much more efficient way of gathering all of the data contained in the fields and subfields of a large data set like the one above. Any help is appreciated.

Thanks, -Dan

An additional discovery: MATLAB is somehow storing the field names for each sub-structure individually. In the above example, it has memory allocated for the field names data.temp and data.humidity twice (once for each copy). This is why it is so slow. A 50 MB set of data has grown to 3 GB because of this organization scheme. I am going to make a separate post about this: is the memory issue resolved if each entry is of a known class, so that the field names aren't stored once for each copy?
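A rough way to see this overhead is to compare what whos reports for the two layouts. The sketch below is illustrative only: N and the field values are made up, the struct-of-vectors layout S is an assumed alternative rather than the poster's code, and the byte counts from whos are only an approximation of real memory use.

```matlab
% Hypothetical small-scale comparison; N and the field values are made up.
N = 1000;

% Layout 1: cell array of scalar structs, as in the question.
A = cell(N, 1);
for k = 1:N
    A{k}.time = 1500 + k;
    A{k}.data.temp = 70;
    A{k}.data.humidity = 20;
end

% Layout 2 (assumed alternative): one struct whose fields are numeric vectors.
S.time = 1500 + (1:N).';
S.data.temp = 70 * ones(N, 1);
S.data.humidity = 20 * ones(N, 1);

% whos returns a struct with a 'bytes' field for each named variable.
infoA = whos('A');
infoS = whos('S');
fprintf('cell array of structs: %d bytes\n', infoA.bytes);
fprintf('struct of vectors:     %d bytes\n', infoS.bytes);
```

Both layouts hold the same 3*N double values, so any difference in the reported sizes is per-element bookkeeping rather than data.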

5 comments (3 older comments hidden)

D. Plotnick on 14 Apr 2017

Thank you. Yes, I am actually doing this as a struct array (mostly) now, as I realized I could use this method sometime after I posted.

However, while I absolutely agree with your general statement above, the data is output in that format by MATLAB's memmapfile function. I am reading in a byte stream of messages in the form [header | payload]. The header must be read in using memmapfile, as the originator has written it in mixed format (uint32, double, int16, etc.). memmapfile itself returns a structure array. A 50 MB file corresponds to an array of length 600K+, and the core issue is how to then quickly read in the headers, determine the payload format, parse the payload, and return the payload contents.

Walter Roberson on 14 Apr 2017

memmapfile is convenient but not mandatory: you can use a bunch of fread() instead.
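As a minimal sketch of the fread() route, the code below writes a tiny test file and reads each field back for all messages with one fread call per field, using the skip and machine-format arguments. The message layout (a uint16 and a double header in little-endian, followed by two big-endian uint16 payload values) is hypothetical and much simpler than the format described above; the variable names are made up as well.

```matlab
% Hypothetical message layout (not the poster's real format):
%   header1: uint16, little-endian   (2 bytes)
%   header2: double, little-endian   (8 bytes)
%   payload: 2 x uint16, big-endian  (4 bytes)
N = 3;
h1 = uint16([1; 2; 3]);
h2 = [0.5; 1.5; 2.5];
payload = uint16([10 20 30; 11 21 31]);   % one column per message

% Write a small test file in that layout.
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
for k = 1:N
    fwrite(fid, h1(k), 'uint16', 0, 'ieee-le');
    fwrite(fid, h2(k), 'double', 0, 'ieee-le');
    fwrite(fid, payload(:, k), 'uint16', 0, 'ieee-be');
end
fclose(fid);

% Read each field for all messages in one call. The skip argument is the
% number of bytes to jump over after each value (or each block of values).
fid = fopen(fname, 'r');
h1Read = fread(fid, N, 'uint16=>uint16', 12, 'ieee-le');
fseek(fid, 2, 'bof');                      % offset of header2 in message 1
h2Read = fread(fid, N, 'double', 6, 'ieee-le');
fseek(fid, 10, 'bof');                     % offset of payload in message 1
payloadRead = fread(fid, [2 N], '2*uint16=>uint16', 10, 'ieee-be');
fclose(fid);
delete(fname);
```

Because the byte order is given per fread call, each field arrives as a plain numeric array in native order, with no struct array in between.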

Answer 1

Matt J on 14 Apr 2017

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/335272-fast-access-concatenation-of-large-array-structure#answer_262987

Edited: Matt J on 14 Apr 2017

I don't quite understand why you are using a cell array and not a struct array. Your example will not work unless all A{i} have the same field names so a struct array should have been sufficient.

In any case, you should be able to create a struct array and do faster concatenations with it as follows,

structArray=[A{:}];
dataArray=[structArray.data];

Now you can concatenate individual fields of dataArray in a similar way, e.g.,

out.temp=[dataArray.temp];

You can loop over the field names of dataArray, similar to what you are doing now, to repeat this kind of concatenation for all fields.
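The loop over field names might look like the sketch below. The three-element cell array A is a made-up stand-in for the real data, and it assumes every A{k} has the same fields, each holding a scalar.

```matlab
% Small stand-in for the question's cell array A (values are made up).
A = cell(1, 3);
for k = 1:3
    A{k}.time = 1499 + k;
    A{k}.data.temp = 69 + k;
    A{k}.data.humidity = 21 - k;
end

structArray = [A{:}];             % 1x3 struct array
dataArray = [structArray.data];   % 1x3 struct array of the nested structs

names = fieldnames(dataArray);
out = struct();
for ii = 1:numel(names)
    % dataArray.(name) expands to a comma-separated list, one value per
    % element; the brackets concatenate that list into a row vector.
    out.(names{ii}) = [dataArray.(names{ii})];
end
```

The loop runs once per field name rather than once per element, so its cost does not grow with the number of loop iterations as the array gets longer.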

I feel obliged to mention, though, that your data organization looks like it is headed for trouble. It is generally inefficient to store large amounts of data scattered across structs and cell arrays. Each cell and each struct field is stored as its own separate array, so the data is not contiguous in memory and is neither compact nor fast to access. Nesting them in sub-structs makes the problem worse. You should really store all your temp, humidity, etc. data in their own vectors to begin with, or at least do so if the data is going to get large and speed is a priority.

6 comments (4 older comments hidden)

D. Plotnick on 18 Apr 2017

I think I need to be a bit more expansive on the exact problem I am tackling. As stated above, I am reading in a large byte stream using memmapfile for convenience; the byte stream consists of a series of messages in mixed format, and memmapfile returns an Nx1 structure array, where N is the number of messages. Each element of the structure array is of the form:

 example(ii).header1 = uint16(someNumber);
 example(ii).header2 = double(someOtherNumber);
 % etc.
 example(ii).payload = zeros(100,1,'uint8');  % placeholder for a 100x1 uint8 array

The real stumbling block is that someone has written the header in little-endian and the payload in big-endian (I have no power over this).

So, I want to run swapbytes on all N entries of example.header1 and example.header2. swapbytes only takes a numeric array. So, the question is: how do I run such a function on all of the desired subfields?

Right now, I have a crazy solution, which is to convert my structure to a cell array, use cellfun(@swapbytes) on each column of my cell array corresponding to a header field, and then convert back to a struct. It works, but it is slow, and it just screams to me that there is a more elegant way to do this.

D. Plotnick on 18 Apr 2017

As to Matt's comment, I have implemented your suggestion. I am still ending up re-inserting afterwards (into a custom class hierarchy now), but that is due to data-product requirements as opposed to my own desire for speed. Thanks.

Matt J on 19 Apr 2017

"If you really want these results re-inserted back into a struct array ... then you must use a for-loop"

My comment here wasn't really very precise. There are alternatives to for-loops, but they will not be faster, e.g.,

header1Cell=num2cell(  swapbytes([example.header1]) );
[example.header1]= deal(header1Cell{:});
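For comparison, a plain for-loop version of the same re-insertion could look like the sketch below. The three-element struct array is a made-up stand-in for what memmapfile returns, and only header1 is handled.

```matlab
% Stand-in for the struct array returned by memmapfile (values are made up).
example = struct('header1', {uint16(1), uint16(256), uint16(513)});

% Swap all values at once on a plain numeric vector ...
swapped = swapbytes([example.header1]);

% ... then write them back into the struct array one element at a time.
for k = 1:numel(example)
    example(k).header1 = swapped(k);
end
```

Either way, swapbytes itself is called only once, on the concatenated vector; the two versions differ only in how the results are written back.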
