Histograms - why does the smallest binsize always give the smallest mean integrated squared error?

Question

Neuropragmatist on 23 Jul 2020


Hi all,

I have a bit of a specialised question involving histograms and mean integrated squared error (MISE). I want to find the 'best' way to construct a histogram for some data using a quantifiable method. I thought that MISE would provide a good way to do this as I can simulate data and then compare different histograms to the underlying probability distribution.

However, surprisingly (to me at least) I keep finding that histograms made with tiny bins always have a smaller MISE than ones with larger bins, even though the latter seem to reflect the data more accurately.

For an example I have made some mock code (below) which just simulates some random numbers from a normal distribution, bins them into a histogram and compares this histogram to the actual underlying distribution.

If we look at MISE with respect to bin size we get this:

So the tiny bins have the smallest error. However, if we plot some examples:

The red is the actual probability distribution, the blue line is a histogram built using the smallest bin size (with the smallest MISE) and the green is a histogram built using a larger binsize (with a larger MISE but 'looks' closer to the real distribution).

So what's going on? Is this just a property of MISE or am I making a mistake? Some areas where I think I might have made a mistake:

Before calculating MISE I sum-normalise both distributions. This makes sense to me, as I should be comparing probability distributions, but maybe they should be normalised in a different manner?
Sometimes when MISE is expressed there is also an 'Expected Value' coefficient, which I have not been able to identify. This seems to be an average, but an average of what? I think this might fix the problem by scaling the MISE according to the average bin contents, but I'm not sure how to apply it.

Any help would be greatly appreciated,

NP.

% distribution mean
mu = 0;
% distribution standard deviation
std = 2;
% binsizes we want to test
bin_size = 0.1:0.1:3;
% random values from this distribution
vals = normrnd(mu,std,100,1);
    
% preallocate
mise_values = NaN(length(bin_size),1);
% run through every bin size
for bb = 1:length(bin_size)
    % values to evaluate distributions at
    xi = -10:bin_size(bb):10;
    
    % histogram of values
    kpdf = histcounts(vals,xi);
    
    % locations of bin centers (so PDF will match histogram)
    xi2 = movmean(xi,2,'Endpoints','discard');
    % underlying probability distribution
    updf = normpdf(xi2,mu,std);
    % mean integrated squared error between the histogram and the PDF
    % first normalise both
    updf = updf ./ nansum(updf);
    kpdf = kpdf ./ nansum(kpdf);
    % calculate MISE
    mise_values(bb) = sum( sum( (updf - kpdf).^2 ) ) .* bin_size(bb);
end
% plot MISE vs bin size
figure
scatter(bin_size,mise_values,'k');
refline
% plot different distributions
figure
xi = -10:0.1:10;
plot(xi,normpdf(xi,mu,std),'r'); hold on;
% plot 'best' binsize
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'b')
% plot a better one
xi = -10:0.8:10;
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'g')


Answer 1

John D'Errico on 23 Jul 2020


Your error is a subtle one, but it is important to understand why it happens.

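% sample from a standard normal and bin it with width 0.1
x = randn(1000,1);
xi = -5:0.1:5;
% histogram scaled so that the bar areas sum to 1
histogram(x,xi,'Normalization','pdf')
hold on
% true standard normal pdf
fplot(@(x) normpdf(x))
% bin centers
xi2 = movmean(xi,2,'Endpoints','discard');
% relative counts per bin (these sum to 1), as in the question
f = histcounts(x,xi);
plot(xi2,f./nansum(f),'r')
legend('histogram - pdf normalization','true pdf','histogram - relative counts')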

The red curve at the bottom (look carefully, it is hard to see there) is the one you plotted. It is a simple relative number of counts per bin, so normalized to sum to 1. However, a pdf is normalized to have unit area.

Instead, see the difference here:

figure
dx = 0.1;
plot(xi2,f./nansum(f)/dx,'r')
hold on
fplot(@(x) normpdf(x))
legend('histogram - pdf normalization','true pdf')

Do you see the difference? I used your same data, but now the histogram is properly normalized, in a way that is consistent with a pdf.

While you think it makes sense for the simple frequency histogram to sum to 1, it was NOT normalized to INTEGRATE to have an area of 1. That only happened when I scaled it by dividing by dx.
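A minimal sketch of that distinction, added here as an illustration rather than taken from the answer's own code. The seed, the variable names (edges, relfreq, density, density2) and the printed checks are assumptions made for this example.

```matlab
% Sketch: unit-sum versus unit-area normalisation of the same histogram.
rng(1);                          % fixed seed, only so the example is repeatable
x  = randn(1000,1);              % standard normal sample
dx = 0.1;                        % bin width
edges = -5:dx:5;                 % bin edges

counts  = histcounts(x, edges);
relfreq = counts ./ sum(counts); % relative counts: sum is 1, area is dx
density = relfreq ./ dx;         % density: area (sum times dx) is 1

% histcounts can also return the density scaling directly; it should agree
% with the manual version as long as no samples fall outside the edges
density2 = histcounts(x, edges, 'Normalization', 'pdf');

fprintf('sum of relative counts  : %.4f\n', sum(relfreq));
fprintf('area of relative counts : %.4f\n', sum(relfreq) * dx);
fprintf('area of density         : %.4f\n', sum(density) * dx);
```

With unit-sum normalisation, both the histogram and the reference curve shrink roughly in proportion to the bin width, so their squared differences shrink as well. That is one plausible reason the error in the question looks smallest for the tiniest bins.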

As for the smaller bin size being better, that should just reflect the idea that a smaller bin size can better approximate the shape of the true distribution, at least when there are enough samples to fill the bins.
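The question's loop could be adapted along these lines. This is a hedged sketch and not the answer's code: it scales the histogram to unit area, leaves the true pdf unnormalised, and evaluates the squared error on a fine grid. The names sigma (used instead of std, which shadows a built-in function), xg, dxg and ise, the fixed seed, and the explicit normal pdf formula (used instead of normpdf) are all assumptions made for this example. It computes the integrated squared error for a single sample only.

```matlab
% Sketch: integrated squared error (ISE) versus bin size, with the
% histogram scaled to unit area. Single simulated sample, as in the question.
rng(1);                                   % fixed seed, only for repeatability
mu = 0; sigma = 2; n = 100;
vals = mu + sigma .* randn(n,1);          % normal sample with mean mu, sd sigma
bin_size = 0.1:0.1:3;
ise = NaN(numel(bin_size),1);

% fine grid on which both curves are compared
xg  = linspace(-10, 10, 20001);
dxg = xg(2) - xg(1);
f   = exp(-(xg - mu).^2 ./ (2*sigma^2)) ./ (sigma*sqrt(2*pi));  % true pdf

for bb = 1:numel(bin_size)
    h = bin_size(bb);
    edges = -10:h:10;
    if edges(end) < 10                    % make sure the bins cover the grid
        edges(end+1) = edges(end) + h;
    end
    counts    = histcounts(vals, edges);
    fhat_bins = counts ./ (sum(counts) * h);   % bar heights with unit total area
    idx  = discretize(xg, edges);         % bin index of every grid point
    fhat = fhat_bins(idx);                % piecewise-constant histogram on the grid
    ise(bb) = sum((fhat - f).^2) * dxg;   % Riemann-sum approximation of the integral
end

figure
plot(bin_size, ise, 'ko-')
xlabel('bin size'); ylabel('ISE (single sample)')
```

Under these assumptions the error should be largest for the smallest bins, where each bar is based on very few samples, and it should fall as the bins widen. Where exactly the minimum sits depends on the sample and on n.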

1 Comment

Neuropragmatist on 23 Jul 2020

Of course, this is exactly the problem! This is extremely helpful, thank you. Now, when I normalise for integration the MISE is actually high for small bins and then decreases to a plateau of nice bin sizes.

Do you also have any insight on the 'expected value' in the MISE formula?

https://en.wikipedia.org/wiki/Mean_integrated_squared_error

I have seen a few papers where this was omitted, so I'm not sure what purpose it serves.

Thanks,

NP.
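The thread leaves the 'expected value' question open. The following is a sketch based on the standard definition (the one on the linked Wikipedia page), not something stated by either poster. The symbols \hat f_n (the histogram built from n samples), f (the true density), R (the number of simulated repetitions) and \hat f_n^{(r)} (the histogram from the r-th repetition) are notation chosen for this note.

```latex
% Integrated squared error of one histogram built from one sample
\mathrm{ISE}(\hat f_n) = \int \bigl(\hat f_n(x) - f(x)\bigr)^2 \, dx

% MISE: the expectation of the ISE over random samples of size n
\mathrm{MISE}(\hat f_n) = \mathbb{E}\!\left[\int \bigl(\hat f_n(x) - f(x)\bigr)^2 \, dx\right]

% Monte Carlo approximation from R independently simulated samples
\mathrm{MISE}(\hat f_n) \approx \frac{1}{R} \sum_{r=1}^{R}
    \int \bigl(\hat f_n^{(r)}(x) - f(x)\bigr)^2 \, dx
```

Read this way, the expectation is an average over data sets, not over bins. A single simulated sample gives one ISE value. Repeating the simulation many times for each bin size and averaging the resulting ISE values would approximate the MISE. Sources that drop the expectation are presumably reporting the ISE for one particular sample.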
