Data usage of the function iforest during sampling

2 views (last 30 days)
Simone Ravaioli
Simone Ravaioli on 7 Dec 2023
Commented: Drew on 12 Dec 2023
In the built-in function iforest we can set the hyperparameter 'NumObservationsPerLearner', which is the number of observations for each isolation tree.
When a point of the dataset is selected to train an isolation tree, can it also be used to train another tree, or is it removed from the data available for training the next tree?
If it cannot be reused, how does the function behave when 'NumObservationsPerLearner * NumLearners' is greater than the number of points in the given dataset?

Accepted Answer

Drew
Drew on 11 Dec 2023
Regarding your main question: "When a point of the dataset is selected to train an isolation tree, can it be used to train another tree?", the short answer is yes, that point (or observation) can be used to train a different tree. The sampling process for each tree begins with the full set of data points/observations.
  • NumObservationsPerLearner (number of observations for each isolation tree) — Each isolation tree corresponds to a subset of training observations. For each tree, iforest samples min(N,256) number of observations from the training data without replacement, where N is the number of training observations. The isolation forest algorithm performs well with a small sample size because it helps to detect dense anomalies and anomalies close to normal points. However, you need to experiment with the sample size if N is small. For an example, see Examine NumObservationsPerLearner for Small Data.
So, given that each tree begins with the full dataset, it is no problem to train an isolation forest of 100 trees with NumObservationsPerLearner of 149 using a data set of only 150 observations. See the doc section Examine NumObservationsPerLearner for Small Data for more info.
load fisheriris
size(meas)
ans = 1×2
150 4
[forest,tf,scores] = iforest(meas,NumObservationsPerLearner=149);
forest
forest =
  IsolationForest

       CategoricalPredictors: []
       ContaminationFraction: 0
              ScoreThreshold: 0.6603
                 NumLearners: 100
   NumObservationsPerLearner: 149
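As a quick sanity check (assuming the forest object returned by the call above), the per-tree samples add up to far more draws than there are observations, which is only possible if observations are reused across trees:
totalDraws = forest.NumLearners * forest.NumObservationsPerLearner   % 100 * 149 = 14900 draws
size(meas,1)                                                         % from only 150 observations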
If this answer helps you, please remember to accept the answer.
  2 Comments
Francesco Bellucci
Francesco Bellucci on 12 Dec 2023
I completely agree with your statement, but I would personally correct the previous sentence like this:
For each tree, iforest samples min(N,256) number of observations from the training data without replacement, "if it's possible", where N is the number of training observations.
When sampling without replacement is applied to training data, it means that, when creating the smaller training set (or subset) used to grow each binary tree, each observation is selected at most once and is not returned to the training data pool.
So, in the case you mentioned before, if you train the IsolationForest model with NumLearners: 100 and NumObservationsPerLearner: 149 without replacement, you would draw a total of 14,900 samples.
The dataset under consideration contains only 150 observations, so sampling without replacement is impossible.
Drew
Drew on 12 Dec 2023
The sampling from the training data is reset for every tree. That is why the number of samples per tree is bounded by the lower of N or 256. The phrase "without replacement" applies to the sampling process for one tree. When the sampling process begins for the next tree, the process starts over with all of the training samples.
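For intuition, here is a minimal sketch (not the actual iforest source) of the sampling scheme described above: within each tree the subsample is drawn without replacement, but the pool of candidate indices is reset to the full dataset before the next tree is sampled.
N = 150;                  % total number of training observations (example value)
numTrees = 100;           % corresponds to NumLearners
numPerTree = min(N,149);  % corresponds to NumObservationsPerLearner, capped at N
for t = 1:numTrees
    % Each tree starts again from the full index set 1:N and draws its
    % subsample without replacement within this tree only.
    idx = randperm(N, numPerTree);
    % ... grow isolation tree t on meas(idx,:) ...
end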


More Answers (0)
