How to best determine the probability of a distribution given an outlying observation?

Question

Hi,

I have a classification problem. I have a set of data from a reference process (let's call that "known") and a set of data from a second process (let's call that "test").

Hypothesis 0 is that the test sample came from the same process as the "known", and will therefore have the same distribution.

Hypothesis 1 is that the test sample came from a different process. However, here is the catch: for all but one sample, this process has an identical distribution to the "known". Just one sample will be "suspiciously" low.

I will add a picture to better explain:

Here, the red histogram is the reference "known" distribution and the blue histogram is the questioned "test" distribution. I already know that the test came from a different process. It might not be completely clear because the histograms overlap, but the distributions match closely except for a single blue sample, which is suspiciously low.

What I need now is a method that takes each distribution and returns the probability that the extremely low blue value would be observed if the test sample came from the "known" distribution. I know how to calculate the probability of a single observation, but how do I properly account for the number of observations? Would a KS test be appropriate? It strikes me as stats 101, but it's been a while, and I don't want to get this wrong.

Thanks in advance.


Answer 1

Ilya on 12 Sep 2012

Edited: Ilya on 12 Sep 2012

If you know the reference distribution analytically, compute its cdf at the smallest observed value and call that value p. Under Hypothesis 0, the p-value for your test is the probability that at least one of N independent observations falls at or below that value. This is one minus the binomial probability of observing no successes in N trials, where N is the sample size and p is the success probability. That is, the p-value is 1-(1-p)^N.
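As a rough worked example of the formula above, the sketch below evaluates it for a smallest observation of -4 in a sample of 100. It assumes, purely for illustration, that the reference distribution is standard normal. The variable names are hypothetical, and the cdf is written with base MATLAB's erfc so that no toolbox is needed.

```matlab
% Worked example of the minimum-based p-value 1-(1-p)^N.
% Assumption: the reference distribution is standard normal.
xMin = -4;      % smallest observed value in the test sample
N    = 100;     % test sample size

% Standard normal cdf at xMin, written with erfc (base MATLAB).
p = 0.5 * erfc(-xMin / sqrt(2));

% Direct evaluation of the formula.
pValueDirect = 1 - (1 - p)^N;

% Equivalent form that keeps precision when p is very small.
pValueStable = -expm1(N * log1p(-p));

fprintf('p = %.4g, p-value = %.4g\n', p, pValueStable);
```

With these inputs the p-value comes out near 0.003. The two forms agree here; the expm1/log1p version is only a precaution for cases where p is so small that 1-p rounds to 1.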

Tim on 19 Sep 2012

Oh, so obvious now! Thank you. I was over-thinking it with the variance of the variance and all that jazz. My only excuses are lack of sleep and rusty stats - honestly, I avoid them when I can.

Answer 2

per isakson on 12 Sep 2012

See: FBD - "Find the Best Distribution" tool in the File Exchange

Tim on 12 Sep 2012

Thanks for your answer, per, but I'm not sure that this is what I'm looking for. I'll try and clarify with a simple code example.

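KnownSet = randn(1000,1);        % reference sample from the "known" process
TestSet1 = randn(100,1);         % test sample with no outlier
TestSet2 = [randn(99,1); -4];    % test sample with one suspiciously low value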

In this case, I know all three sets of data are mostly drawn from the same Gaussian distribution. However, TestSet2 has an outlier. The value -4 is very unlikely, and I'm hoping to use that single outlying value to provide a probability that each TestSet is purely from the same distribution as KnownSet. In this case, TestSet1 should have a high 'p-value', and TestSet2 should have a low 'p-value' and be rejected. I use the term p-value, but there might be something else.

FBD would help me determine the distribution of KnownSet (which I can assume is at least for the most part the same as that of the TestSets), but that is only the first step. How do I go from there to determining how likely/unlikely the set of observations is, given the distribution, and given the outlier?
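One possible way to go from the fitted distribution to a per-set probability is sketched below, using the three sets from the example above. It assumes that KnownSet is adequately described by a normal distribution with its own sample mean and standard deviation, and it ignores the uncertainty in those estimates. The helper names refCdf and minPValue are hypothetical, and the cdf is written with erfc so that only base MATLAB is needed.

```matlab
% Data from the example above.
KnownSet = randn(1000,1);
TestSet1 = randn(100,1);
TestSet2 = [randn(99,1); -4];

% Assumption: KnownSet is well described by a normal distribution,
% so its sample mean and standard deviation define the reference cdf.
mu    = mean(KnownSet);
sigma = std(KnownSet);

% Normal cdf via erfc (base MATLAB, no toolbox needed).
refCdf = @(x) 0.5 * erfc(-(x - mu) ./ (sigma * sqrt(2)));

% p-value of the smallest observation in x: 1-(1-p)^N with N = numel(x),
% evaluated in a form that stays accurate for very small p.
minPValue = @(x) -expm1(numel(x) * log1p(-refCdf(min(x))));

pValue1 = minPValue(TestSet1);
pValue2 = minPValue(TestSet2);
fprintf('TestSet1: %.4g\nTestSet2: %.4g\n', pValue1, pValue2);
```

TestSet2 should give a small value because of the -4. For TestSet1 the value will not necessarily be high: when the test sample really comes from the reference distribution, this p-value is spread roughly uniformly between 0 and 1, so it is only unlikely to be very small.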
