Sampling from a population so that the sample has a target mean/median.

Hi all,
Is there a direct way to randomly draw samples from an existing population multiple times so that each sample has a target mean or median for a given metric?
Example:
Dataset = 400 observations X 10 metrics
I want to draw 20 observations (with replacement) N times so that, for each 20-observation sample, the mean or median of metric 1 equals a target value, and then calculate the mean of the other 9 metrics for each of these samples.
Is there a way to do such sampling in MATLAB?
Thank you!
  3 Comments
Antonios Asiminas on 18 Apr 2022
Apologies for the confusion.
What I mean is: sample from Dataset(:,1) and get the indexes of the elements of each sample (idx) so that each sample has mean(Dataset(idx, 1)) == target, then calculate mean(Dataset(idx, 2)), mean(Dataset(idx, 3)),... mean(Dataset(idx, 10)). Repeat that N times.
The "or median" was meant to cover the possibility that there is a solution to this problem with a median target rather than a mean target.
I thought of trying a while loop: sample, check whether the mean of the sample is the target (or close enough), and then calculate the means of the other metrics. This is not a nice and certainly not a fast solution, though...
I hope this makes more sense now.
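The brute-force loop described in this comment could be sketched roughly as follows. The dummy data, the target value, the tolerance `tol`, and the attempt cap `maxTries` are all assumptions made for illustration; none of them come from the original post.

```matlab
% Rejection-sampling sketch: redraw until the mean of metric 1 is close to the target.
rng(0);                               % fixed seed so the run is repeatable
Dataset = rand(400, 10);              % assumed dummy data: 400 observations x 10 metrics
target = 0.55;                        % assumed target mean for metric 1
tol = 0.01;                           % assumed "close enough" tolerance
k = 20;                               % observations per sample
N = 50;                               % number of accepted samples wanted
maxTries = 1e6;                       % assumed cap so the loop always terminates

nRows = size(Dataset, 1);
sampleMeans = nan(N, 1);              % mean of metric 1 for each accepted sample
otherMeans = nan(N, 9);               % means of metrics 2..10 for each accepted sample
nAccepted = 0;
nTries = 0;
while nAccepted < N && nTries < maxTries
    nTries = nTries + 1;
    idx = randi(nRows, k, 1);         % draw k row indices with replacement
    mu = mean(Dataset(idx, 1));
    if abs(mu - target) <= tol        % accept only samples that hit the target
        nAccepted = nAccepted + 1;
        sampleMeans(nAccepted) = mu;
        otherMeans(nAccepted, :) = mean(Dataset(idx, 2:10), 1);
    end
end
```

The further the target lies from the population mean, and the tighter `tol` is, the more draws are rejected, which is why this approach gets slow.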
Image Analyst on 18 Apr 2022
Not really. How would you compute idx? And I'm not sure what N is when you say that you need to get the 10 means N times.
I'm thinking that what you really want is what Bruno described: get a list of indexes that match a target or fall in a target range, and then get the means of the other columns, something like
rowsOfInterest = Dataset(:, 1) == target;
theMeans = mean(Dataset(rowsOfInterest, :), 1);
That would give you the mean of all columns but only for those rows where column 1 is your target value.
Of course you could also use grpstats() or groupsummary() or splitapply() to do the same thing.
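Because an exact `==` comparison rarely matches floating-point data, the "target range" variant mentioned above may be more practical. A minimal sketch, assuming dummy data and a hypothetical half-width `tol` for the range:

```matlab
% Select rows whose first column lies within tol of the target, then average all columns.
rng(1);                                               % fixed seed for repeatability
Dataset = rand(400, 10);                              % assumed dummy data
target = 0.5;                                         % assumed target value for column 1
tol = 0.05;                                           % assumed half-width of the target range
rowsOfInterest = abs(Dataset(:, 1) - target) <= tol;  % logical mask, 400x1
theMeans = mean(Dataset(rowsOfInterest, :), 1);       % 1x10 row of column means
```

Note that this selects a fixed subset of rows rather than drawing random samples, so it gives one set of means instead of N.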

Accepted Answer

Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 18 Apr 2022
This is for the "mean" (code edited to fix a bug)
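% Generate 1e6 dummy data points for testing
A = 10 + sum(30*rand(1e6,3), 2);
% Number of subsamples to be drawn from A
m = 1000;
targetmean = 35; % target mean
meanA = mean(A);
% Sort so that the values on the target side of the mean come last
% (sortDir rather than dir, to avoid shadowing the built-in dir function)
if targetmean > meanA
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
As = sort(A, sortDir);
A1 = As(1);
Aend = As(end);
t = (As-A1)/(Aend-A1); % values rescaled to [0,1]
nA = length(A);
% If you give a non-attainable targetmean, you get an error here or a NaN p
EspFun = @(p) sum(As.*t.^p) / sum(t.^p); % weighted mean with weights t.^p
p = fzero(@(p) EspFun(p)-targetmean, [0 1000]);
if isnan(p)
    error('target mean not possible with this formulation')
end
% Draw indices with a density that grows like (idx/nA)^p towards the end of As
idx = ceil(nA*rand(1,m).^(1/(p+1)));
Asubsample = As(idx);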
% Check
mean(Asubsample)
ans = 34.5036
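To connect this back to the original question (means of the other nine metrics), the positions in the sorted vector have to be mapped back to rows of the dataset, which the second output of `sort` allows. A minimal sketch, assuming a dummy 400-by-10 `Dataset` and a fixed exponent `p = 2` that merely stands in for the value `fzero` would return:

```matlab
% Map indices drawn on the sorted first column back to rows of the full dataset.
rng(2);                                       % fixed seed for repeatability
Dataset = rand(400, 10);                      % assumed dummy data
nA = size(Dataset, 1);
[As, perm] = sort(Dataset(:, 1), 'ascend');   % perm(i) is the original row of As(i)
p = 2;                                        % assumed exponent (placeholder for the fzero result)
m = 20;                                       % observations per sample
idx = ceil(nA * rand(m, 1).^(1/(p+1)));       % positions in the sorted vector
rows = perm(idx);                             % corresponding rows of Dataset
otherMeans = mean(Dataset(rows, 2:10), 1);    % means of metrics 2..10 for this sample
```

Repeating the last three lines N times would give one row of `otherMeans` per sample.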
  2 Comments
Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 18 Apr 2022
If you want to draw without replacement, you have to set m larger than the desired subsample size (cardinality):
desiredsubsamplecardinal = 100;
m = ceil(1.1*desiredsubsamplecardinal);
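Then, later, do:
% 'stable' keeps the draw order; a sorted unique() followed by truncation
% would systematically drop the largest indices and bias the subsample
Asubsample = As(unique(idx, 'stable'));
if length(Asubsample) >= desiredsubsamplecardinal
    Asubsample = Asubsample(1:desiredsubsamplecardinal);
else
    % too few distinct indices: retry
end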
Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 19 Apr 2022
This is for the "median" (code edited to fix a bug)
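% Generate 1e6 dummy data points for testing
A = 10 + sum(30*rand(1e6,3), 2);
% Number of subsamples to be drawn from A
m = 1000;
targetmedian = 35; % target median
medianA = median(A);
nA = length(A);
% thalf is the fraction of the sorted data that precedes the target value
% (sortDir rather than dir, to avoid shadowing the built-in dir function)
if medianA > targetmedian
    sortDir = 'ascend';
    thalf = sum(A <= targetmedian) / nA;
else
    sortDir = 'descend';
    thalf = sum(A >= targetmedian) / nA;
end
As = sort(A, sortDir);
% Exponent chosen so that half of the draws fall at or below rank thalf*nA
p = max(log(thalf)/log(0.5), 0);
idx = ceil(nA*rand(1,m).^p);
Asubsample = As(idx);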
% Check
median(Asubsample)
ans = 35.5359
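The choice of exponent in the median code can be checked with a short derivation. Here $U$ is the uniform random number from `rand`, $R = U^{p}$ is the normalised rank `idx/nA` (ignoring the rounding by `ceil`), and $t_{1/2}$ is `thalf`; the symbols are introduced only for this note.

```latex
\begin{aligned}
P\left(R \le t_{1/2}\right)
  &= P\left(U^{p} \le t_{1/2}\right)
   = P\left(U \le t_{1/2}^{1/p}\right)
   = t_{1/2}^{1/p}, \qquad p > 0, \\
t_{1/2}^{1/p} = \tfrac{1}{2}
  &\;\Longrightarrow\;
  p = \frac{\ln t_{1/2}}{\ln \tfrac{1}{2}} .
\end{aligned}
```

So half of the drawn indices fall at or below rank `thalf*nA`, which is where the sorted vector `As` crosses the target value, and the sample median lands near the target up to sampling noise.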

More Answers (1)

Image Analyst on 18 Apr 2022
No, not in general. The target mean may not exist in the population. If my target mean weight is 10 kg and my population is the weights of elephants, I can sample the elephants' weights as often as I like and never get a mean of 10 kg.
Even if the targets are definitely in your population, you can't guarantee any particular mean if you pick samples randomly. You may, however, get close to the population mean. If you want a specific mean, I think you'd have to sort your population and then sample around the target mean, but then you're no longer sampling randomly.
  9 Comments
Beorn Nijenhuis on 9 Aug 2022
@Bruno Luong This was a helpful script for me. I had a cohort of n = 280 samples of people's ages (15 < age < 75) and needed a subsample of 60 with a mean of 50. It worked. I have a question, though: when I run this code I notice a tolerance in the output. With n = 280 the tolerance was approximately ±7 years. How is this tolerance calculated, so that I can write it into the methods section of my paper?
Bruno Luong on 9 Aug 2022
I don't know how it is calculated, but it's due to the fact that the sample is too sparse. My method assumes that the data are well approximated by a uniform distribution, so it works well when one has a big sample of data.
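One way to put a number on this spread empirically, rather than analytically, is to repeat the draw many times and look at the standard deviation of the resulting sample means. The sketch below is a variant, not the code from the accepted answer: it assumes dummy ages, and it replaces the value-based weights `t.^p` with the exact probability of each sorted position under `idx = ceil(nA*U^(1/(p+1)))`, so that no uniformity assumption is needed. The names `probs`, `expectedMean`, and `nRep` are hypothetical.

```matlab
% Estimate the run-to-run spread of the subsample mean by repeated drawing.
rng(3);                                  % fixed seed for repeatability
ages = 15 + 60*rand(280, 1);             % assumed dummy cohort, ages in (15, 75)
targetmean = 50;                         % target mean of each subsample
m = 60;                                  % subsample size
nRep = 2000;                             % assumed number of repetitions

nA = numel(ages);
if targetmean > mean(ages)
    As = sort(ages, 'ascend');
else
    As = sort(ages, 'descend');
end
r = (0:nA)' / nA;                        % normalised rank boundaries
% P(idx == i) = (i/nA)^(p+1) - ((i-1)/nA)^(p+1) for idx = ceil(nA*U^(1/(p+1)))
probs = @(p) diff(r.^(p+1));
expectedMean = @(p) sum(As .* probs(p)); % exact expected value of one draw
p = fzero(@(p) expectedMean(p) - targetmean, [0 1000]);

sampleMeans = zeros(nRep, 1);
for k = 1:nRep
    idx = ceil(nA * rand(m, 1).^(1/(p+1)));   % draw with replacement
    sampleMeans(k) = mean(As(idx));
end
spread = std(sampleMeans);               % empirical standard deviation of the subsample mean
```

With this variant the expected subsample mean equals the target by construction, and `spread` describes how far a single subsample of size `m` typically deviates from it for this particular data set.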

Release

R2021a