general statistics problem: how to best characterize non-normal distributions

Even though it is not directly MATLAB related, I figured I would pose this question to the MATLAB community because there are a bunch of smart and helpful people here :D
I have looked and looked but I cannot find a straightforward test or method to characterize a distribution that fails a normality test. I have read several peer-reviewed scientific journal articles where this does not stop authors from giving a mean and standard deviation (!) but I think that is a bad thing to do.
My current approach is to get a kernel smoothing density estimate of the distribution using a function I wrote around the built-in ksdensity() function, and play with the smoothing window width until it gives something that nicely portrays the data (not too spiky, not too round). I then give the peak value of the kernel estimate as my "mean" (i.e. the one number people will look at and prematurely judge everything by). The only way I know to then characterize the distribution width or deviation would be to give a full width at half maximum. Of course this is not good, because the distribution tends not to be symmetric around the peak and the width is often on the order of the peak value in magnitude.
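A minimal sketch of the kernel-density approach described above, assuming the Statistics Toolbox is available. The synthetic gamma sample and all variable names are illustrative stand-ins (not the actual data or wrapper function), and the default bandwidth is used instead of a hand-tuned one:

```matlab
rng(1);                               % reproducible synthetic data (assumption)
x = gamrnd(2, 1.5, 500, 1);           % skewed sample standing in for real data

[f, xi] = ksdensity(x);               % density estimate on the default evaluation grid
[fmax, imax] = max(f);
xmode = xi(imax);                     % location of the peak, i.e. a mode estimate

% Full width at half maximum, read off the grid (assumes a single peak)
above = find(f >= fmax/2);
fwhm = xi(above(end)) - xi(above(1));

fprintf('mode ~ %.3g, FWHM ~ %.3g\n', xmode, fwhm);
```

Both numbers depend on the bandwidth, which is one reason they are awkward to report as a summary.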
So people I am working with want to see some kind of error bars, and I have no idea what to give them to make them happy.
This is a recurring theme in my current work and I am desperate to find a good solution, so any pointers would be greatly appreciated. I am sure I am not the only one who has to deal with non-gaussian distributions.
If you want to see an example of one of these distributions, there are a couple in Figure 3 in the paper you can find here:
Thanks in advance, Rory

Accepted Answer

Andrew Newell on 10 Jun 2011
You should NOT use the peak of your distribution to estimate the mean, because it is not the mean. It is the mode.
Since your distribution is skewed, it might be better to use the geometric mean or harmonic mean (see Measures of central tendency). You could also estimate some measure of dispersion and shape.
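A short sketch of how these measures could be computed, assuming the Statistics Toolbox and a positive, right-skewed synthetic sample (the data and variable names are made up for illustration):

```matlab
rng(1);                         % reproducible synthetic data (assumption)
x = gamrnd(2, 1.5, 500, 1);     % positive, right-skewed sample

% Measures of central tendency
m_arith = mean(x);
m_geo   = geomean(x);           % geometric mean, needs positive data
m_harm  = harmmean(x);          % harmonic mean
m_med   = median(x);

% Measures of dispersion and shape
spread  = iqr(x);               % interquartile range
skew    = skewness(x);          % positive for a right-skewed sample
kurt    = kurtosis(x);          % equals 3 for a normal distribution

fprintf('mean %.3g, geomean %.3g, harmmean %.3g, median %.3g\n', ...
        m_arith, m_geo, m_harm, m_med);
fprintf('IQR %.3g, skewness %.3g, kurtosis %.3g\n', spread, skew, kurt);
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the gap between them is itself a rough indicator of skew.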
For estimating the errors in these statistics, you could use the bootstrap or the jackknife (see Resampling Statistics).
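As a rough illustration of both resampling methods, here is a sketch using the Statistics Toolbox functions bootci, bootstrp and jackknife. The synthetic data, the choice of 2000 resamples and the choice of statistics are assumptions made for the example:

```matlab
rng(1);                              % reproducible synthetic data (assumption)
x = gamrnd(2, 1.5, 500, 1);          % skewed sample standing in for real data

% Bootstrap: 95% confidence interval for the median from 2000 resamples
ci = bootci(2000, @median, x);       % ci(1) is the lower bound, ci(2) the upper

% Bootstrap standard error of the same statistic
bootstat = bootstrp(2000, @median, x);
se_boot = std(bootstat);

% Jackknife: leave-one-out estimates, here for the geometric mean
% (the jackknife is known to behave poorly for the median)
jackstat = jackknife(@geomean, x);
n = numel(x);
se_jack = sqrt((n - 1) / n * sum((jackstat - mean(jackstat)).^2));

fprintf('median CI [%.3g, %.3g], bootstrap SE %.3g\n', ci(1), ci(2), se_boot);
fprintf('jackknife SE of the geometric mean %.3g\n', se_jack);
```

The interval from bootci can be asymmetric around the estimate, which suits skewed data better than a symmetric plus-or-minus error bar.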
You could also explore MATLAB's collection of distributions to see if any look like your data (see Distribution Reference). For example, some of the curves look like the Gamma distribution. However, each distribution is a model of a particular kind of statistical process, so ideally you should understand what a distribution represents before using it.
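For instance, a Gamma fit might look like the sketch below. It assumes the Statistics Toolbox, and the data are drawn from a gamma distribution purely for illustration, so the good fit here says nothing about whether a Gamma model suits real measurements:

```matlab
rng(1);                             % reproducible synthetic data (assumption)
x = gamrnd(2, 1.5, 500, 1);         % sample standing in for real data

% Maximum likelihood fit: phat = [shape, scale]
% pci is 2-by-2: row 1 holds lower, row 2 upper 95% confidence bounds
[phat, pci] = gamfit(x);

fprintf('shape = %.3g [%.3g, %.3g]\n', phat(1), pci(1,1), pci(2,1));
fprintf('scale = %.3g [%.3g, %.3g]\n', phat(2), pci(1,2), pci(2,2));
```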
  2 Comments
Rory Staunton on 11 Jun 2011
I shouldn't have written "mean", not even in scare quotes---I know the peak is not the mean and I have never actually conflated the two, until now apparently.
Thanks for your help and I will look into your suggestions, especially the resampling statistics, as I am unfamiliar with bootstrap and jackknife methods.
Andrew Newell on 11 Jun 2011
Sorry for overlooking the scare quotes. Notice, by the way, that all the links are MATLAB links. That makes this a MATLAB question!


More Answers (2)

Tom Lane on 12 Jun 2011
Some other things I might consider:
1. Look at distributions of the log(data).
2. Consider using the median and quartiles (it may be more intuitive to use the interquartile range) or other quantiles. It may be possible to find theoretical ways to compute confidence intervals for those quantities, but the bootstrap approach may be adequate. Also Google for "five number summary."
3. There are larger families of distributions that include the normal as a special case. Look into the Johnson and Pearson families. There are Statistics Toolbox functions johnsrnd and pearsrnd for generating random samples from these distributions, but the "fitting" step is simply computing quantiles or moments.
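The three suggestions above can be sketched as follows, assuming the Statistics Toolbox. The lognormal test data, the 2000 bootstrap resamples and the variable names are illustrative assumptions, and only the Pearson (moment-based) route is shown for item 3:

```matlab
rng(1);                                   % reproducible synthetic data (assumption)
x = lognrnd(0, 0.6, 500, 1);              % positive, skewed sample

% 1. Look at the log of the data; skewness near 0 suggests a more symmetric shape
skew_raw = skewness(x);
skew_log = skewness(log(x));

% 2. Median and quartiles, with percentile-bootstrap 95% intervals for each
p = [0.25 0.50 0.75];
q = quantile(x, p);
qfun = @(d) quantile(d, p);
ci = bootci(2000, {qfun, x}, 'type', 'per');   % row 1 lower, row 2 upper bounds

% 3. Pearson system: match the first four sample moments and draw new samples
[r, type] = pearsrnd(mean(x), std(x), skewness(x), kurtosis(x), 1000, 1);

fprintf('skewness: raw %.3g, log scale %.3g\n', skew_raw, skew_log);
fprintf('quartiles: %.3g  %.3g  %.3g\n', q(1), q(2), q(3));
fprintf('Pearson type selected: %d\n', type);
```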
-- Tom
  1 Comment
Andrew Newell on 12 Jun 2011
See http://www.mathworks.com/help/toolbox/stats/br5k833-1.html#br5k833-2 for the Johnson and Pearson distributions.



bym on 10 Jun 2011
I think a good distribution would be the Weibull, which is available in the Statistics Toolbox. You could then use the distribution's parameters to compare datasets rather than the mean and standard deviation:
doc wblfit
You can get confidence intervals for the parameters. Would that suffice for error bars?
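A minimal sketch of that idea, assuming the Statistics Toolbox; the Weibull test data and variable names are invented for the example:

```matlab
rng(1);                              % reproducible synthetic data (assumption)
x = wblrnd(3, 1.5, 500, 1);          % sample standing in for real data

% Maximum likelihood fit: parmhat = [scale a, shape b]
% parmci is 2-by-2: row 1 holds lower, row 2 upper 95% confidence bounds
[parmhat, parmci] = wblfit(x);

fprintf('scale a = %.3g [%.3g, %.3g]\n', parmhat(1), parmci(1,1), parmci(2,1));
fprintf('shape b = %.3g [%.3g, %.3g]\n', parmhat(2), parmci(1,2), parmci(2,2));
```

The intervals describe uncertainty in the fitted parameters, not the spread of the data, so whether they work as error bars depends on what the readers expect the bars to mean.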
