
Hi all,

I have a bit of a specialised question involving histograms and mean integrated squared error (MISE). I want to find the 'best' way to construct a histogram for some data using a quantifiable method. I thought that MISE would provide a good way to do this as I can simulate data and then compare different histograms to the underlying probability distribution.

However, surprisingly (to me at least) I keep finding that histograms made with tiny bins always have a smaller MISE than ones with larger bins, even though the latter seem to reflect the data more accurately.

For an example I have made some mock code (below) which just simulates some random numbers from a normal distribution, bins them into a histogram and compares this histogram to the actual underlying distribution.

If we look at MISE with respect to bin size we get this:

So the tiny bins have the smallest error. However, if we plot some examples:

The red line is the actual probability distribution, the blue line is a histogram built using the smallest bin size (with the smallest MISE), and the green line is a histogram built using a larger bin size (with a larger MISE, but it 'looks' closer to the real distribution).

So what's going on? Is this just a property of MISE or am I making a mistake? Some areas where I think I might have made a mistake:

- Before calculating MISE I sum-normalise both distributions. This makes sense to me, as I should be comparing probability distributions, but maybe they should be normalised in a different manner?
- Sometimes when MISE is expressed there is also an 'Expected Value' coefficient which I have not been able to identify. This seems to be an average, but an average of what? I think this might fix the problem by scaling the MISE according to the average bin contents but I'm not sure how to apply it.

Any help would be greatly appreciated,

NP.

% distribution mean
mu = 0;
% distribution standard deviation (called sigma so the built-in std is not shadowed)
sigma = 2;
% bin sizes we want to test
bin_size = 0.1:0.1:3;
% random values from this distribution
vals = normrnd(mu,sigma,100,1);
% preallocate
mise_values = NaN(length(bin_size),1);
% run through every bin size
for bb = 1:length(bin_size)
    % values to evaluate distributions at
    xi = -10:bin_size(bb):10;
    % histogram of values
    kpdf = histcounts(vals,xi);
    % locations of bin centers (so PDF will match histogram)
    xi2 = movmean(xi,2,'Endpoints','discard');
    % underlying probability distribution
    updf = normpdf(xi2,mu,sigma);
    % mean integrated squared error between the histogram and the PDF
    % first normalise both
    updf = updf ./ nansum(updf);
    kpdf = kpdf ./ nansum(kpdf);
    % calculate MISE
    mise_values(bb) = sum( sum( (updf - kpdf).^2 ) ) .* bin_size(bb);
end

% plot MISE vs bin size
figure
scatter(bin_size,mise_values,'k');
refline

% plot different distributions
figure
xi = -10:0.1:10;
plot(xi,normpdf(xi,mu,sigma),'r'); hold on;
% plot 'best' bin size
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'b')
% plot a better one
xi = -10:0.8:10;
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'g')

John D'Errico on 23 Jul 2020 (edited on 23 Jul 2020)

Your error is a subtle one, but it is important to understand why it happens.

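% sample from a standard normal and bin it
x = randn(1000,1);
xi = -5:0.1:5;
% histogram scaled so that its area is 1
histogram(x,xi,'Normalization','pdf')
hold on
fplot(@(x) normpdf(x))
% bin centers and raw counts, scaled so the counts sum to 1
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(x,xi);
plot(xi2,f./nansum(f),'r')
legend('histogram - pdf normalization','true pdf','histogram - relative counts')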

The red curve at the bottom (look carefully, it is hard to see there) is the one you plotted. It is a simple relative number of counts per bin, so normalized to sum to 1. However, a pdf is normalized to have unit area.
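
To spell out the two normalizations (the symbols are introduced here only for illustration: $n_k$ is the count in bin $k$, $n$ is the total number of samples, and $h$ is the bin width):

```latex
\begin{align*}
\text{relative counts:}\quad   & p_k = \frac{n_k}{n},          & \sum_k p_k          &= 1, \\
\text{pdf normalization:}\quad & \hat f_k = \frac{n_k}{n\,h},  & \sum_k \hat f_k\, h &= 1.
\end{align*}
```

The two differ by the factor $1/h$, so with a bin width of 0.1 the relative counts sit a factor of 10 below the pdf.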

Instead, see the difference here:

figure
% dx is the bin width used to build xi
dx = 0.1;
plot(xi2,f./nansum(f)/dx,'r')
hold on
fplot(@(x) normpdf(x))
legend('histogram - pdf normalization','true pdf')

Do you see the difference? I used the same data, but now the histogram is properly normalized, in a way that is consistent with a pdf.

While you think it makes sense for the simple frequency histogram to sum to 1, it was NOT normalized to INTEGRATE to have an area of 1. That only happened when I scaled it by dividing by dx.
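
With the histogram on the density scale it can be compared with the true pdf directly. For reference, MISE is usually written as follows, where $\hat f_h$ is the pdf-normalized histogram with bin width $h$ and $f$ is the true pdf (this notation is assumed here, it is not taken from the question):

```latex
\mathrm{MISE}(h) = \mathbb{E}\left[\int \bigl(\hat f_h(x) - f(x)\bigr)^2 \, dx\right]
```

The expectation is taken over repeated random samples of the data, so a single simulated data set gives only one realization of the integrated squared error, not its mean.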

As far as the smaller bin size being better, that should just reflect the idea that a smaller bin size can better approximate the true distribution.
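
One way to check how the bin size behaves once the histogram is normalized as a pdf is to repeat the simulation many times and average the integrated squared error. The following is only a sketch under assumptions that are not in the original post: the number of repetitions nRep, the fixed random seed, and evaluating the true pdf at the bin centers (a midpoint approximation to the integral).

```matlab
% Sketch: approximate MISE of a pdf-normalized histogram by averaging the
% integrated squared error over repeated simulated samples.
rng(1);                          % fixed seed (assumption) so the run is repeatable
mu = 0; sigma = 2;               % distribution parameters from the question
n = 100;                         % sample size from the question
nRep = 200;                      % number of repetitions (assumption)
binSizes = 0.1:0.1:3;            % bin sizes from the question
mise = zeros(numel(binSizes),1);

for bb = 1:numel(binSizes)
    h = binSizes(bb);
    edges = -10:h:10;
    centers = edges(1:end-1) + h/2;
    % true pdf at the bin centers (midpoint approximation)
    truePdf = normpdf(centers,mu,sigma);
    ise = zeros(nRep,1);
    for rr = 1:nRep
        vals = mu + sigma*randn(n,1);
        % histogram heights scaled so that the total area is 1
        est = histcounts(vals,edges,'Normalization','pdf');
        % integrated squared error for this one sample
        ise(rr) = sum((est - truePdf).^2) * h;
    end
    % average over samples approximates the expectation in MISE
    mise(bb) = mean(ise);
end

figure
plot(binSizes,mise,'k.-')
xlabel('bin size'); ylabel('approximate MISE')
```

Plotting mise against binSizes then shows how the averaged error depends on bin size under the pdf normalization, which can be compared with the single-sample curve from the question.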
