randperm non uniformly distributed

AbioEngineer on 15 Aug 2019
Commented: AbioEngineer on 15 Aug 2019
I want to sample from the integers 1 through 56 without replacement. Neither randperm nor datasample with 'Replace',false gives a uniformly distributed set when I iterate many times. Why is the last bin in the histogram double the size of the rest?
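perms = zeros(10000,6);   % note: this name shadows the built-in perms function
samps = zeros(10000,6);
[rp, cp] = size(perms);
for p = 1:rp
    % 6 distinct integers drawn from 1..56
    permstemp = randperm(56,6);
    perms(p,:) = permstemp;
end
[rs, cs] = size(samps);
for s = 1:rs
    % same draw via datasample, without replacement
    sampstemp = datasample(1:56,6,'Replace',false);
    samps(s,:) = sampstemp;
end
histogram(perms(1:end))   % default binning
figure                    % new figure so the second plot does not replace the first
histogram(samps(1:end))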
[Image: nonuniform.png]

Accepted Answer

John D'Errico on 15 Aug 2019
Sigh. This is NOT a question of non-uniformity. Just a question of not understanding how to recognize non-uniformity, and partially how to understand a histogram.
If you create a histogram with too few bins, some bins will collect more than one distinct value.
It turns out that histogram decided to use bin edges of 1:56 here, which makes only 55 bins, so the last bin got used for twice as many samples.
Note the difference between these two calls to histogram:
histogram(perms(1:end))
histogram(perms(1:end),1:56)
histogram(perms(1:end),1:57)
The first two produce the same results, so it appears the default bin edges were 1:56. However, when I gave it another bin edge, up to 57, everything appears normal.
So what happens when I have bin edges 1:56? There are integer events at 56 and at 55, and the last bin held all events that were either 55 OR 56. Bin number 1, by contrast, held only the events that were strictly a 1. When I gave it one more bin to use for the histogram, things were fine.
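The effect described above can be checked numerically without plotting. The following is a minimal sketch, assuming histcounts applies the same bin-edge rules as histogram; the variable names draws, counts55 and counts56 are made up for illustration, and the fixed seed is only there to make the run repeatable.

```matlab
rng(0);                                  % fixed seed, only so the sketch is repeatable
draws = zeros(10000, 6);
for p = 1:size(draws, 1)
    draws(p, :) = randperm(56, 6);       % 6 distinct integers from 1..56
end

counts55 = histcounts(draws(:), 1:56);   % 55 bins; the last bin is [55, 56]
counts56 = histcounts(draws(:), 1:57);   % 56 bins; the last bin is [56, 57]

numel(counts55)                          % 55
numel(counts56)                          % 56

% The last bin of counts55 holds every 55 and every 56 combined.
isequal(counts55(end), counts56(55) + counts56(56))   % true
```

With edges 1:57 every integer from 1 to 56 gets a bin of its own, so the doubled last bar disappears.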
So before you claim non-uniformity, think about whether the test you are using that asserts non-uniformity might be flawed.
3 Comments
Steven Lord on 15 Aug 2019
John is correct. As stated in the histogram documentation page, "Each bin includes the left edge, but does not include the right edge, except for the last bin which includes both edges."
Before John added that last bin edge at 57, the last bin was [55, 56] and the next-to-last bin was [54, 55). So the last bin counted two distinct values from the data.
After John added that last bin edge at 57, the last bin is [56, 57] and the next-to-last bin is [55, 56). Each of the last two bins now counts only one distinct value from the data.
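The edge rule quoted above is easy to see on a tiny data set. This is an illustrative sketch only, using histcounts (assumed here to bin the same way as histogram) and a made-up vector x:

```matlab
x = [1 2 2 3 3 3];               % one 1, two 2s, three 3s

c2 = histcounts(x, [1 2 3])      % bins [1,2) and [2,3]         -> [1 5]
c3 = histcounts(x, [1 2 3 4])    % bins [1,2), [2,3) and [3,4]  -> [1 2 3]
```

In c2 the closed last bin [2,3] absorbs both the 2s and the 3s; adding one more edge in c3 separates them.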
AbioEngineer on 15 Aug 2019
Yep! Thank you for the answer and comment! I can't believe I forgot to set the bin size properly. I remember back in r2014 there was an issue with random integers that I had to work around more cleverly, and thought my current problem tessellated with that old one.

More Answers (1)

AbioEngineer on 15 Aug 2019
I'm an idiot, there were 55 bins in the above image... changing h.NumBins = 56 solves the problem.
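Besides setting the bin count by hand, integer data can be binned with the 'BinMethod','integers' option, which places the bin edges halfway between integers. This option is not mentioned in the thread; the sketch below is one possible approach, with made-up variable names (draws, counts, tally, expected), and it cross-checks the bin counts against an accumarray tally that involves no bin edges at all.

```matlab
rng(0);                                   % fixed seed for repeatability only
draws = zeros(10000, 6);
for p = 1:size(draws, 1)
    draws(p, :) = randperm(56, 6);        % 6 distinct integers from 1..56
end

% Edges at 0.5, 1.5, ..., 56.5 give one unit-width bin per integer.
counts = histcounts(draws(:), 'BinMethod', 'integers');

% Independent tally: element k counts how often the value k was drawn.
tally = accumarray(draws(:), 1, [56 1]);

isequal(counts(:), tally)                 % true when both 1 and 56 occur in the data
expected = numel(draws) / 56;             % average count per value, about 1071.4
max(abs(counts - expected)) / expected    % largest relative deviation from that average
```

The same name-value pair can be passed to histogram, for example histogram(draws(:),'BinMethod','integers'), to get the corresponding plot.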

Version

R2018a
