sequentialfs

Sequential feature selection

Syntax

inmodel = sequentialfs(fun,X,y)
inmodel = sequentialfs(fun,X,Y,Z,...)
[inmodel,history] = sequentialfs(fun,X,...)
[...] = sequentialfs(...,param1,val1,param2,val2,...)

Description

inmodel = sequentialfs(fun,X,y) selects a subset of features from the data matrix X that best predict the data in y by sequentially selecting features until there is no improvement in prediction. Rows of X correspond to observations; columns correspond to variables or features. y is a column vector of response values or class labels for each observation in X. X and y must have the same number of rows. fun is a function handle to a function that defines the criterion used to select features and to determine when to stop. The output inmodel is a logical vector indicating which features are finally chosen.

Starting from an empty feature set, sequentialfs creates candidate feature subsets by sequentially adding each of the features not yet selected. For each candidate feature subset, sequentialfs performs 10-fold cross-validation by repeatedly calling fun with different training subsets of X and y, XTRAIN and ytrain, and test subsets of X and y, XTEST and ytest, as follows:

criterion = fun(XTRAIN,ytrain,XTEST,ytest)

XTRAIN and ytrain contain the same subset of rows of X and y, while XTEST and ytest contain the complementary subset of rows. XTRAIN and XTEST contain the data taken from the columns of X that correspond to the current candidate feature set.

Each time it is called, fun must return a scalar value criterion. Typically, fun uses XTRAIN and ytrain to train or fit a model, then predicts values for XTEST using that model, and finally returns some measure of distance, or loss, of those predicted values from ytest. In the cross-validation calculation for a given candidate feature set, sequentialfs sums the values returned by fun and divides that sum by the total number of test observations. It then uses that mean value to evaluate each candidate feature subset.

Typical loss measures include sum of squared errors for regression models (sequentialfs computes the mean-squared error in this case), and the number of misclassified observations for classification models (sequentialfs computes the misclassification rate in this case).

    Note:   sequentialfs divides the sum of the values returned by fun across all test sets by the total number of test observations. Accordingly, fun should not divide its output value by the number of test observations.
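As an illustration of the note above, the following is a minimal sketch of a regression criterion that returns the sum of squared errors on the test fold and leaves the averaging to sequentialfs. The name sse, the synthetic data, and the intercept column are assumptions made for this example; they are not part of sequentialfs.

```matlab
% Criterion: SUM (not mean) of squared test errors for a least-squares fit.
% A column of ones is prepended so that the fit includes an intercept.
sse = @(XTRAIN,ytrain,XTEST,ytest) sum((ytest - ...
    [ones(size(XTEST,1),1) XTEST] * ([ones(size(XTRAIN,1),1) XTRAIN] \ ytrain)).^2);

% Synthetic data (assumed for illustration): y depends on columns 2 and 5 only.
rng(0);
X = randn(100,6);
y = 2*X(:,2) - 3*X(:,5) + 0.1*randn(100,1);

% Default settings: forward search with 10-fold cross-validation.
inmodel = sequentialfs(sse,X,y);
selected = find(inmodel)
```

Because the cross-validation partition is random, noise columns may occasionally be selected in addition to the informative ones.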

After computing the mean criterion values for each candidate feature subset, sequentialfs chooses the candidate feature subset that minimizes the mean criterion value. This process continues until adding more features does not decrease the criterion.
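The evaluation described above can be sketched by hand for a single forward step. This is an illustrative approximation under stated assumptions (synthetic data, a least-squares criterion, one shared 10-fold partition), not the internal implementation of sequentialfs.

```matlab
% Synthetic data (assumed for illustration): only column 3 is informative.
rng(1);
X = randn(120,5);
y = X(:,3) + 0.2*randn(120,1);

% Criterion: sum of squared test errors for a least-squares fit with intercept.
fun = @(XT,yT,Xt,yt) sum((yt - ...
    [ones(size(Xt,1),1) Xt] * ([ones(size(XT,1),1) XT] \ yT)).^2);

c = cvpartition(size(X,1),'KFold',10);   % one partition shared by all candidates
selected = false(1,size(X,2));           % start from the empty feature set
candidates = find(~selected);
meanCrit = zeros(size(candidates));

for j = 1:numel(candidates)
    cols = selected;
    cols(candidates(j)) = true;          % candidate subset: selected plus one new feature
    total = 0;
    for k = 1:c.NumTestSets
        tr = training(c,k);              % logical index of training rows for fold k
        te = test(c,k);                  % logical index of test rows for fold k
        total = total + fun(X(tr,cols),y(tr),X(te,cols),y(te));
    end
    meanCrit(j) = total / size(X,1);     % divide by the total number of test observations
end

% The candidate with the smallest mean criterion is added to the model.
[bestCrit,idx] = min(meanCrit);
bestFeature = candidates(idx)
```

Repeating this step with the updated selected vector, and stopping once the best mean criterion no longer decreases, gives the forward search described above.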

inmodel = sequentialfs(fun,X,Y,Z,...) allows any number of input variables X, Y, Z, ... . sequentialfs chooses features (columns) only from X, but otherwise imposes no interpretation on X, Y, Z, ... . All data inputs, whether column vectors or matrices, must have the same number of rows. sequentialfs calls fun with training and test subsets of X, Y, Z, ... as follows:

criterion = fun(XTRAIN,YTRAIN,ZTRAIN,...,
                XTEST,YTEST,ZTEST,...)

sequentialfs creates XTRAIN, YTRAIN, ZTRAIN, ... , XTEST, YTEST, ZTEST, ... by selecting subsets of the rows of X, Y, Z, ... . fun must return a scalar value criterion, but may compute that value in any way. Elements of the logical vector inmodel correspond to columns of X and indicate which features are finally chosen.
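For example, a third input can carry per-observation weights. The sketch below assumes a hypothetical weight vector w and a weighted least-squares criterion named wsse; sequentialfs splits w by rows along with X and y and passes the pieces to the function without interpreting them.

```matlab
% Synthetic data (assumed for illustration): y depends on columns 1 and 4.
rng(2);
n = 150;
X = randn(n,5);
w = 0.5 + rand(n,1);                     % hypothetical positive observation weights
y = X(:,1) - 2*X(:,4) + 0.1*randn(n,1);

% Six arguments: training pieces of X, y, w followed by the test pieces.
% lscov(A,b,w) fits weighted least squares; the criterion is the weighted
% sum of squared test errors.
wsse = @(XT,yT,wT,Xt,yt,wt) sum(wt .* (yt - ...
    [ones(size(Xt,1),1) Xt] * lscov([ones(size(XT,1),1) XT],yT,wT)).^2);

% Features are chosen only from the columns of X.
inmodel = sequentialfs(wsse,X,y,w);
selected = find(inmodel)
```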

[inmodel,history] = sequentialfs(fun,X,...) returns information on which feature is chosen at each step. history is a scalar structure with the following fields:

  • Crit — A vector containing the criterion values computed at each step.

  • In — A logical matrix in which row i indicates the features selected at step i.
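As a sketch of how history might be used, the following recovers the order in which features entered the model during a default forward search. The data and the criterion sse are assumptions for illustration; the loop relies on each forward step adding exactly one feature, so it does not apply unchanged when 'keepin' or 'nullmodel' is set.

```matlab
% Synthetic data (assumed for illustration): y depends on columns 2 and 5.
rng(3);
X = randn(100,6);
y = 2*X(:,2) - 3*X(:,5) + 0.1*randn(100,1);

% Criterion: sum of squared test errors for a least-squares fit with intercept.
sse = @(XT,yT,Xt,yt) sum((yt - ...
    [ones(size(Xt,1),1) Xt] * ([ones(size(XT,1),1) XT] \ yT)).^2);

[inmodel,history] = sequentialfs(sse,X,y);

% Compare consecutive rows of history.In to find the feature added at each step.
nSteps = size(history.In,1);
added = zeros(1,nSteps);
prev = false(1,size(X,2));
for i = 1:nSteps
    added(i) = find(history.In(i,:) & ~prev);
    prev = history.In(i,:);
end

% One row per step: added column, criterion value after adding it.
disp([added(:) history.Crit(:)])
```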

[...] = sequentialfs(...,param1,val1,param2,val2,...) specifies optional parameter name/value pairs from the following table.

Parameter    Value
'cv'

The validation method used to compute the criterion for each candidate feature subset.

  • When the value is a positive integer k, sequentialfs uses k-fold cross-validation without stratification.

  • When the value is an object of the cvpartition class, other forms of cross-validation can be specified.

  • When the value is 'resubstitution', the original data are passed to fun as both the training and test data to compute the criterion.

  • When the value is 'none', sequentialfs calls fun as criterion = fun(X,Y,Z,...), without separating test and training sets.

The default value is 10, that is, 10-fold cross-validation without stratification.

So-called wrapper methods use a function fun that implements a learning algorithm. These methods usually apply cross-validation to select features. So-called filter methods use a function fun that measures characteristics of the data (such as correlation) to select features.

'mcreps'

A positive integer indicating the number of Monte-Carlo repetitions for cross-validation. The default value is 1. The value must be 1 if the value of 'cv' is 'resubstitution' or 'none'.

'direction'

The direction of the sequential search. The default is 'forward'. A value of 'backward' specifies an initial candidate set including all features and an algorithm that removes features sequentially until the criterion increases.

'keepin'

A logical vector or a vector of column numbers specifying features that must be included. The default is empty.

'keepout'

A logical vector or a vector of column numbers specifying features that must be excluded. The default is empty.

'nfeatures'

The number of features at which sequentialfs should stop. inmodel includes exactly this many features. The default value is empty, indicating that sequentialfs should stop when a local minimum of the criterion is found. A nonempty value overrides values of 'MaxIter' and 'TolFun' in 'options'.

'nullmodel'

A logical value, indicating whether or not the null model (containing no features from X) should be included in feature selection and in the history output. The default is false.

'options'

Options structure for the iterative sequential search algorithm, as created by statset. sequentialfs uses the following statset parameters:

  • Display — Amount of information displayed by the algorithm. The default is 'off'.

  • MaxIter — Maximum number of iterations allowed. The default is Inf.

  • TolFun — Termination tolerance for the objective function value. The default is 1e-6 if 'direction' is 'forward'; 0 if 'direction' is 'backward'.

  • TolTypeFun — Use absolute or relative objective function tolerances. The default is 'rel'.

  • UseParallel — Set to true to compute in parallel. Default is false.

  • UseSubstreams — Set to true to compute in parallel in a reproducible fashion. Default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.

  • Streams — A RandStream object or cell array consisting of one such object. If you do not specify Streams, sequentialfs uses the default stream.
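The following sketch combines several of the parameters from the table. The particular values (five folds, column 3 kept in, column 1 kept out, a linear discriminant criterion) are arbitrary choices for illustration, not recommendations.

```matlab
load fisheriris
X = meas;                                % 150-by-4 measurements
y = species;                             % class labels

% Criterion: number of misclassified test observations (a sum, not a rate).
fun = @(XT,yT,Xt,yt) sum(~strcmp(yt,classify(Xt,XT,yT,'linear')));

% A cvpartition object built from the class labels gives stratified folds.
c = cvpartition(y,'KFold',5);

% Forward search that always keeps column 3 and stops at exactly two features.
in1 = sequentialfs(fun,X,y,'cv',c,'nfeatures',2,'keepin',3);

% Backward search that never considers column 1, printing each iteration.
opts = statset('Display','iter');
in2 = sequentialfs(fun,X,y,'cv',c,'direction','backward', ...
    'keepout',1,'options',opts);

% Resubstitution: the same data serve as training and test sets.
% 'mcreps' must be 1 in this case.
in3 = sequentialfs(fun,X,y,'cv','resubstitution','mcreps',1);
```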

Examples

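Perform sequential feature selection for classification of noisy features. Because X contains random noise columns and the cross-validation partition is random, the exact output varies between runs: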

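% Load Fisher's iris data: meas (150-by-4 measurements) and species (labels)
load fisheriris;
% Embed the four real measurements among six columns of pure noise
X = randn(150,10);
X(:,[1 3 5 7]) = meas;
y = species;

% Stratified 10-fold cross-validation partition
c = cvpartition(y,'KFold',10);
% Print each step of the search
opts = statset('display','iter');
% Criterion: number of misclassified test observations (a sum, not a rate)
fun = @(XT,yT,Xt,yt)...
      (sum(~strcmp(yt,classify(Xt,XT,yT,'quadratic'))));

[fs,history] = sequentialfs(fun,X,y,'cv',c,'options',opts)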

Start forward sequential feature selection:
Initial columns included: none
Columns that can not be included: none
Step 1, added column 7, criterion value 0.04
Step 2, added column 5, criterion value 0.0266667
Final columns included:  5 7 

fs =
     0  0  0  0  1  0  1  0  0  0
history = 
      In: [2x10 logical]
    Crit: [0.0400 0.0267]

history.In
ans =
     0  0  0  0  0  0  1  0  0  0
     0  0  0  0  1  0  1  0  0  0