What is the correct input for the Two-sample Kolmogorov-Smirnov test, when I need to compare two histograms?

Question

Sim on 16 Jul 2024

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/2137638-what-is-the-correct-input-for-the-two-sample-kolmogorov-smirnov-test-when-i-need-to-compare-two-his

Answered: Divyam on 13 Sep 2024

Example 1. If the input data are quite similar, the output of the two-sample Kolmogorov-Smirnov test looks almost the same whether I pass the original data, "X" and "Y", or the bin counts "NX" and "NY" (returned by the histcounts function) as inputs:

% inputs
X = [2 5 7 10 11 12 13 14 16 17 18 19       22 23 24 29];
Y = [2 5      11 12 13 14 16 17 18 19 20 21 22 23 24 29];
[NX,edgesX] = histcounts(X,'NumBins',6);
[NY,edgesY] = histcounts(Y,'NumBins',6);
% plot
hold on
histogram(X,edgesX,'FaceAlpha',0.1,'EdgeAlpha',0.8)
histogram(Y,edgesY,'FaceAlpha',0.1,'EdgeAlpha',0.2)
% Two-sample Kolmogorov-Smirnov test
[h1,p1,ks2stat1] = kstest2(X,Y);     % <-- By using the original input data
[h2,p2,ks2stat2] = kstest2(NX,NY);   % <-- By using the Bin counts "NX" and "NY", returned from the "histcounts" function
table([[h1,p1,ks2stat1];[h2,p2,ks2stat2]] ,'VariableNames', {'h | p | ks2stat'},'RowNames', {'kstest2(X,Y)', 'kstest2(NX,NY)'})

% Result of Example 1
ans =
  2×1 table
                                    h | p | ks2stat              
                      ___________________________________________
    kstest2(X,Y)      0    0.999035232339821                0.125
    kstest2(NX,NY)    0    0.999956514899259    0.166666666666667

Example 2. If the input data are a bit different, the output of the two-sample Kolmogorov-Smirnov test does differ depending on whether I pass the original data, "X" and "Y", or the bin counts "NX" and "NY" (returned by the histcounts function) as inputs:

% inputs
X = [2 5 7 10 11 12 13 14 16 17 18 19       22 23 24 29];
Y = [2 5      11 12 13 14 16 17 18 19 20 21 22 23 24 29 29 29 29 29 29 29 29];
[NX,edgesX] = histcounts(X,'NumBins',6);
[NY,edgesY] = histcounts(Y,'NumBins',6);
% plot
hold on
histogram(X,edgesX,'FaceAlpha',0.1,'EdgeAlpha',0.8)
histogram(Y,edgesY,'FaceAlpha',0.1,'EdgeAlpha',0.2)
% Two-sample Kolmogorov-Smirnov test
[h1,p1,ks2stat1] = kstest2(X,Y);
[h2,p2,ks2stat2] = kstest2(NX,NY);
table([[h1,p1,ks2stat1];[h2,p2,ks2stat2]] ,'VariableNames', {'h | p | ks2stat'},'RowNames', {'kstest2(X,Y)', 'kstest2(NX,NY)'})

% Result of Example 2
ans =
  2×1 table
                                    h | p | ks2stat              
                      ___________________________________________
    kstest2(X,Y)      0    0.251817384522441    0.315217391304348
    kstest2(NX,NY)    0    0.809557310616653    0.333333333333333


Sim on 16 Jul 2024


I have just checked the kstest2(x1,x2) function by opening it in the Command Window:

>> open kstest2

I can see that the bin counts are already calculated internally, in the part where the empirical CDFs are derived:

% Calculate F1(x) and F2(x), the empirical (i.e., sample) CDFs.
...
binCounts1  =  histc (x1 , binEdges, 1);
binCounts2  =  histc (x2 , binEdges, 1);
...

Therefore, to the best of my understanding, the correct usage of kstest2(x1,x2) is only with the original input data "X" and "Y":

[h1,p1,ks2stat1] = kstest2(X,Y); % <-- By using the original input data

Passing the bin counts "NX" and "NY" instead would lead to a wrong result:

[h2,p2,ks2stat2] = kstest2(NX,NY);   % <-- By using the Bin counts "NX" and "NY", returned from the "histcounts" function

If anyone can confirm my reasoning, that would be very welcome!

(maybe, from @MathWorks Support Team as well?)
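To make the reasoning above concrete, here is a minimal sketch that computes the two-sample KS statistic by hand from the raw data of Example 1. It is not the kstest2 source, only an illustration of the same idea: evaluate both empirical CDFs at every pooled data point and take the largest absolute difference. The variable names "pooled", "FX", "FY" and "D" are chosen here for illustration.

```matlab
% Raw data from Example 1
X = [2 5 7 10 11 12 13 14 16 17 18 19 22 23 24 29];
Y = [2 5 11 12 13 14 16 17 18 19 20 21 22 23 24 29];

% Evaluate both empirical CDFs at every observed value of either sample
pooled = sort([X(:); Y(:)]);
FX = arrayfun(@(t) mean(X <= t), pooled);   % fraction of X at or below t
FY = arrayfun(@(t) mean(Y <= t), pooled);   % fraction of Y at or below t

% Two-sided KS statistic: largest vertical gap between the two ECDFs
D = max(abs(FX - FY));
fprintf('KS statistic from raw data: %.3f\n', D);
```

For these data the sketch gives 0.125, which matches the ks2stat reported for kstest2(X,Y) in Example 1. The binning happens inside the computation of the ECDFs, so the inputs have to be the observations themselves.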

Sim on 16 Jul 2024

Edited: Sim on 16 Jul 2024

Yes @Divyam, you are right! I just need to use original data "X" and "Y" (that you called the "vectors themselves")!! ...and kstest2 will do the rest of the work for me :-)

Many thanks! :-) :-)


Answer 1

Divyam on 13 Sep 2024


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/2137638-what-is-the-correct-input-for-the-two-sample-kolmogorov-smirnov-test-when-i-need-to-compare-two-his#answer_1515815

Hi @Sim,

The two-sample Kolmogorov-Smirnov test is used to test whether the data in two vectors come from the same continuous distribution.

To apply the Kolmogorov-Smirnov test correctly, you should pass the vectors "X" and "Y" themselves, since they are the actual samples from the distributions that the histograms depict. The bin-count variables "NX" and "NY", on the other hand, only summarize the shape of each distribution. kstest2 treats whatever it receives as a sample of observations, so passing the counts makes it compare the distribution of the count values rather than the distribution of the data, and the two do not agree in general (as is evident when you choose different input data).
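One way to see this is to mirror one histogram and repeat the test on the counts. The sketch below assumes shared bin edges 0:5:30 (picked here for illustration, not taken from the question) and uses fliplr to reverse the order of the "NX" counts. Because kstest2 only sees the set of count values, not which bin each count belongs to, the statistic should not change, even though the mirrored histogram has a different shape. It requires the Statistics and Machine Learning Toolbox for kstest2.

```matlab
% Raw data from Example 1
X = [2 5 7 10 11 12 13 14 16 17 18 19 22 23 24 29];
Y = [2 5 11 12 13 14 16 17 18 19 20 21 22 23 24 29];

% Assumption: common bin edges chosen for illustration
edges = 0:5:30;
NX = histcounts(X, edges);
NY = histcounts(Y, edges);

% Mirror the X histogram: same count values, opposite bin order
NXmirrored = fliplr(NX);

% kstest2 treats the counts as observations, so bin order is ignored
[~, ~, dOriginal] = kstest2(NX, NY);
[~, ~, dMirrored] = kstest2(NXmirrored, NY);
fprintf('counts as input: %.4f, mirrored counts as input: %.4f\n', ...
    dOriginal, dMirrored);
```

Both calls work on the same multiset of count values, so they return the same statistic. A test that cannot tell a histogram from its mirror image is not comparing the histograms.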

Note: I am answering this question to improve its visibility for anyone who comes to the community with a similar query.
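If only the histograms are available and the raw data are not, a rough alternative is to compare the cumulative relative frequencies bin by bin. This is a sketch under stated assumptions, not something kstest2 does for you: both histograms must share the same bin edges, the counts below assume edges 0:5:30 applied to the Example 1 data, and the names "cdfX", "cdfY" and "Dbinned" are illustrative.

```matlab
% Assumed bin counts of the Example 1 data on shared edges 0:5:30
NX = [1 2 5 4 3 1];
NY = [1 1 4 4 5 1];

% Cumulative relative frequency at the right edge of each bin
cdfX = cumsum(NX) / sum(NX);
cdfY = cumsum(NY) / sum(NY);

% Largest gap between the two binned CDFs
Dbinned = max(abs(cdfX - cdfY));
fprintf('Binned KS-type statistic: %.3f\n', Dbinned);
```

For these counts the result is 0.125, the same as the raw-data statistic in Example 1. In general the binned value can only be equal to or smaller than the raw-data statistic, because the CDFs are compared at the bin edges only. The usual KS p-value assumes continuous, unbinned data, so it should be treated with caution for binned input.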
