histc with split intervals

Question

Par am 18 Dez. 2013

1
Verknüpfen

Direkter Link zu dieser Frage

https://de.mathworks.com/matlabcentral/answers/110125-histc-with-split-intervals

Bearbeitet: Oleg Komarov am 19 Dez. 2013

Akzeptierte Antwort: Par

Hello All,

I am stuck with this problem for the past few days! So, any help would be greatly appreciated.

THE PROBLEM:

---------------

I have a huge list of discontinuous intervals, about 80000 of them, spread over the range from 1 to 250 million. And I have a much larger set of numbers (a vector of about 200 million elements) spread over the same range, i.e., 1 to 250 million. As if this was not complex enough, some of these intervals may overlap, but most of them are expected to be discontinuous.

Now I want to do a histogram count of how many elements of the vector fall within each of the intervals. If these were simply continuous intervals (without gaps), then one could use histc. Even if the intervals were guaranteed to be always discontinuous (i.e., with gaps), I could have still used histc and thrown out the counts in the gaps. But since I do not have the guarantee that the intervals would always be discontinuous, I am stumped.

So far, I tried two ways of attacking this problem:

1) Simply loop through the intervals, and do sum(vector >= start & vector <= end) inside the for loop. This was hopelessly slow.

2) Use cellfun or arrayfun like this: cellfun(@(L,U) sum(histc(vector,[L U])), Strts, Ends); where Strts and Ends are cell arrays defining the intervals.

Although the second solution would take a few hours, since I need to do this operation for hundreds of large datasets, I cannot afford the time.

So, is there a better way?

Any help would be greatly appreciated!

Thanks!

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Sean de Wolski am 18 Dez. 2013

Could you provide a (small!) example set of data with expected results?

Melden Sie sich an, um zu kommentieren.

Melden Sie sich an, um diese Frage zu beantworten.

Answer 1

Par am 19 Dez. 2013

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/110125-histc-with-split-intervals#answer_118854

Well, I tried both your suggestions, and a few other ways using the bsxfun function, but because of the large sizes I am dealing with, bsxfun ran out of memory even on a large-scale cluster environment. I think it tries to create these ridiculously large matrices.

So, the simplest possible technique that actually worked was:

BINVec = zeros(1,Siz_Including_All_Intvls,'uint8'); BINVec(vector) = 1; SumVec = cellfun(@(L,U) sum(BINVec(L:U)), strts, ends);

where strts and ends define the intervals in cell arrays.

This code was very fast and I had the results in a few minutes! Lesson I learnt: Sometimes, the most underestimated solution is the one that works best. Although the above struck me earlier, I simply dismissed the thought assuming this would be very slow.

I did learn about the use of bsxfun from your answers though. So, thanks a lot for your inputs.

Good day.

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Oleg Komarov am 19 Dez. 2013

Bearbeitet: Oleg Komarov am 19 Dez. 2013

In MATLAB Online öffnen

Cellfun is a masked loop and a simple loop will be faster. Now I understand you question.

You problem is trivial and no need to loop:

    cumbin = cumsum(BINVec);
    counts = cumbin(ends) - cumbin(ends) + 1;

Melden Sie sich an, um zu kommentieren.

Answer 2

Kelly Kearney am 18 Dez. 2013

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/110125-histc-with-split-intervals#answer_118746

In MATLAB Online öffnen

I'd set up separate matrices that hold all the sub-bins (i.e. set of all endpoints) as well as which intervals are composed of which subbins. Here's a small example:

% Example data
ni = 5;
intervals = sort(rand(2,ni), 1);
% The subbins
x = unique(intervals);
% Which bins are part of each interval?
isin = zeros(length(x), ni);
for ii = 1:ni
    b1 = find(x == intervals(1,ii));
    b2 = find(x == intervals(2,ii))-1;
    isin(b1:b2,ii) = 1;
end
% Histogram counts
xpt = rand(100,1);
n = histc(xpt, x);
% And map back to intervals
nperint = sum(bsxfun(@times, isin, n));

I think this method should work with or without gaps, and with or without overlap.

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

Melden Sie sich an, um zu kommentieren.

Answer 3

Oleg Komarov am 18 Dez. 2013

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/110125-histc-with-split-intervals#answer_118747

Bearbeitet: Oleg Komarov am 18 Dez. 2013

In MATLAB Online öffnen

You could use bsxfun(). A simple example might clarify. Suppose you have sample inputs:

A = [3,  5
     10, 20
     21, 30];
B =  [1, 2
      2, 5
      7, 21
      9, 30];

Calculate for each interval in A (discontinuos) how many times the lower bound is >=, i.e. ge(), AND the upper bound is <=, i.e. le(). Finally sum per interval in B (overlapping):

lbFallsWithin = bsxfun(@ge, A(:,1)', B(:,1));
upFallsWithin = bsxfun(@le, A(:,2)', B(:,2));
sum(lbFallsWithin & upFallsWithin,2) 
ans =
     0
     1
     1
     2

WARNING:

The problem is that even for B with 1e6 intervals, you will have a logical lbFallsWithin of 80000*1e6bytes~74.5 GB! (keep in mind you need two of those).

The good news, is that you could block process, i.e. apply the snippet above for partitions of B, where the size of the parition is, loosely speaking, the maximum constrained by your amount of memory.

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

Melden Sie sich an, um zu kommentieren.

histc with split intervals

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Akzeptierte Antwort

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Weitere Antworten (2)

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

Siehe auch

Kategorien

Tags

Community Treasure Hunt

histc with split intervals

1 Kommentar -1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Akzeptierte Antwort

1 Kommentar -1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Weitere Antworten (2)

0 Kommentare -2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

0 Kommentare -2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

Siehe auch

Kategorien

Tags

Community Treasure Hunt

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden