A suitable method to detect outliers in a non-normally distributed dataset?
My understanding is that the quartiles (interquartile range) method is a suitable way to detect outliers in a dataset that is not normally distributed (mine is skewed). However, that method flags as outliers points that are visibly part of the main cluster of data.
Alternatively, I tried the mean plus three-sigma method, i.e. isoutlier(x,'mean'), which seems to give a better result, even though it is reportedly recommended for normal distributions only. However, as the plots show, it misses a few other outliers: points that are still fairly close to the cluster of data but visibly separated from it.
The two figures below show the difference between the two methods.
Is there a more suitable method (and definition) available in MATLAB to detect all the outliers in the plot below?
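% According to the following tests, my dataset is not normally distributed,
% which suggests that a 'good' method to detect outliers in it
% would be the 'quartiles' one.
% H = 1 means the test rejects the null hypothesis of normality.
% Note: kstest(x) tests against the standard normal N(0,1), so it rejects
% unless x is standardized first, e.g. kstest(zscore(x)).
% Note: cmtest and swtest are not built-in MATLAB functions; they are
% presumably user-contributed (e.g. File Exchange) and must be on the path.
disp('Normality Test Name H')
fprintf('Kolmogorov-Smirnov Test %d\n',double(kstest(x)))
fprintf('Anderson-Darling Test %d\n',double(adtest(x)))
fprintf('Cramer-Von Mises Test %d\n',double(cmtest(x)))
fprintf('Shapiro-Wilk Test %d\n',double(swtest(x)))
fprintf('Jarque-Bera Test %d\n',double(jbtest(x)))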
% detect and plot outliers with the method 'mean'
% detect and plot outliers with the method 'quartiles'
Normality Test Name H
Kolmogorov-Smirnov Test 1
Warning: P is less than the smallest tabulated value, returning 0.0005.
Anderson-Darling Test 1
Cramer-Von Mises Test 1
Shapiro-Wilk Test 1
Jarque-Bera Test 1
dpb on 1 May 2023
Edited: dpb on 1 May 2023
You made it unnecessarily difficult to help by not attaching the data in a usable form, but...
ALWAYS visualize your data first. The second plot doesn't look too bad for a lognormal at first blush, although the probability plot indicates the tails are a little long, so the data are more extreme than "just" a lognormal.
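Since the original data were not attached, here is only a minimal sketch of that first look, using a synthetic lognormal sample as a stand-in (the sample, its parameters, and the variable names are assumptions, not the poster's data). It uses probplot from the Statistics and Machine Learning Toolbox; points that bend away from the reference line in the tails are the ones that deserve a closer look.

```matlab
% Synthetic stand-in for the unattached data (assumption): a positive,
% right-skewed sample drawn from a lognormal distribution.
rng(0);                            % make the sketch reproducible
x = exp(0.5*randn(500,1) + 1);

% Histogram of the raw values to see the overall shape and the tails.
figure
histogram(x)
title('Raw data')

% Probability plot against a lognormal reference
% (requires the Statistics and Machine Learning Toolbox).
figure
probplot('lognormal', x)
title('Lognormal probability plot')
```

With real data, replace the synthetic x by the measured vector; probplot with 'lognormal' requires strictly positive values.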
BUT, unless you have some reason to know these points are truly outliers rather than a sample from an extreme distribution, it's not clear whether they really are "outliers" or just actual realizations of the underlying process.
The MATLAB toolset comes from the <NIST guidelines>; I'd recommend reading them thoroughly to get a better appreciation of the various tests built into MATLAB and their application.
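To get a feel for how much the built-in methods disagree, something along these lines may help. It is a rough sketch on synthetic data (an assumed lognormal sample with two planted extreme values, not the poster's dataset) that counts how many points each isoutlier method flags, both on the raw values and after a log transform. Working on the log scale is an assumption that only makes sense if the data are positive and roughly lognormal.

```matlab
% Synthetic stand-in for the unattached data (assumption): a right-skewed
% lognormal sample with two planted extreme values appended at the end.
rng(0);
x = [exp(0.5*randn(500,1) + 1); 40; 55];

% Detection methods documented for isoutlier.
methods = {'median','mean','quartiles','grubbs','gesd'};
nRaw = zeros(size(methods));       % points flagged on the raw scale
nLog = zeros(size(methods));       % points flagged on the log scale

for k = 1:numel(methods)
    nRaw(k) = nnz(isoutlier(x,      methods{k}));
    nLog(k) = nnz(isoutlier(log(x), methods{k}));
    fprintf('%-10s raw: %3d   log scale: %3d\n', methods{k}, nRaw(k), nLog(k));
end

% Logical mask for one choice, usable for plotting or indexing.
tfLog = isoutlier(log(x), 'quartiles');
```

None of these counts says which points are "really" outliers; they only show how sensitive the answer is to the definition chosen.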
It's a notoriously difficult issue. Without knowing much more about the provenance of the dataset, I'd be reluctant to recommend any particular procedure here. Your order statistics plots focus on the upper tail only; the lower tail is pretty much symmetric with it, which makes one wonder whether there is a reason for that and whether one should be looking at alternative distribution families...
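One way to start looking at alternative families is to fit a few candidates and compare them, for example by AIC. The following is a sketch only: the data are synthetic, the list of candidate families is an arbitrary assumption, and fitdist and negloglik require the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic stand-in for the unattached data (assumption).
rng(0);
x = exp(0.5*randn(500,1) + 1);

% Candidate families for positive, skewed data (an illustrative choice).
families = {'Lognormal','Gamma','Weibull','Loglogistic'};
aic = zeros(size(families));

for k = 1:numel(families)
    pd = fitdist(x, families{k});              % maximum likelihood fit
    % AIC = 2*(number of parameters) + 2*(negative log-likelihood)
    aic(k) = 2*pd.NumParameters + 2*negloglik(pd);
    fprintf('%-12s AIC = %.1f\n', families{k}, aic(k));
end

[~, best] = min(aic);
fprintf('Lowest AIC: %s\n', families{best});
```

A lower AIC only ranks the candidates against each other; a probability plot of the best candidate is still needed to judge whether the tails are actually described well.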