Feature Selection QUBO (Quadratic Unconstrained Binary Optimization)

Note

Installation Required: This functionality requires MATLAB Support Package for Quantum Computing.

Feature Selection Overview

Feature selection is a dimensionality reduction technique used to select a subset of features (predictor variables) that provide the best predictive power in modeling a set of data for classification or regression.

Feature selection can be used to:

Prevent overfitting — Avoid modeling with an excessive number of features that are more susceptible to rote-learning-specific training examples.
Reduce model size — Increase computational performance with high-dimensional data or prepare a model for embedded deployment when memory might be limited.
Improve interpretability — Use fewer features, which might help identify those that affect model behavior.

Based on Mücke, Heese, Müller, Wolter, and Piatkowski [1], you can perform feature selection by using a QUBO model to select relevant predictors. Specifically, given a set of N scalar observations y(i), where the observations are associated with p variables (the predictors) x(i,j) for i = 1 through N and j = 1 through p. The problem is to find a subset of the p predictors such that the resulting model gives accurate predictions of the observations.

QUBO Data Model

As noted in [1], the suggested process for creating a QUBO data model is as follows:

Specify the input data as an N-by-p matrix X, where each row is one data point.
Specify the response data as an N-by-1 vector Y, where Y(i) is the response to the row X(i,:).
Calculate the matrix R(i,j) for i and j from 1 through p as the mutual information between columns i and j of X. R(i,j) represents the redundancy between pairs of features. The mutual information is a nonnegative quantity that can be estimated by binning the data and counting entries in each bin to estimate the associated probabilities. For more information, see https://en.wikipedia.org/wiki/Mutual_information.
Similarly, calculate the vector J(i) for i from 1 through p as the mutual information between column i of X and the response variable Y. The quantity J(i) is nonnegative and represents the strength of the coupling between column i and the response Y.
For each value a from 0 through 1, define the QUBO as
Q_a = (1 – a)R – diag(aJ).

Because each R and J entry is nonnegative, the minimum of the QUBO problem is at the point [1,1,…,1] for a = 1 and the point [0,0,…,0] for a = 0. In [1], Mücke proves that in nontrivial problems, for each k from 0 through p, there is a value a such that the number of nonzero entries in the solution to Q_a is equal to k. In other words, for each number of predictors k, there is a QUBO problem Q_a whose solution has exactly k nonzero entries. These entries represent the data columns in X that best match the Y data while not matching the other data columns.

Create Feature Selection Model with Synthetic Data

Create synthetic data for feature selection and then create a QUBO model to select the features. To train the resulting regression model, you need a Statistics and Machine Learning Toolbox™ license.

Create a synthetic data set using the iSyntheticData1 helper function.
```
[N,p,X,Y] = iSyntheticData1;
```
View the number of data points and predictors.
```
N
```
```
    N = 10000
```
```
p
```
```
    p = 30
```
Select five features, the number of useful features in this example. In general, you can try to find how many useful features exist by cross-validating models with different numbers of features.
```
K = 5;
```

Create the R matrix and J vector from the data by using 20 bins for the data, and estimating the mutual information by counting the number of entries in each bin. Use the iBinXY, iComputeMIXX, and iComputeMIXY helper functions to create the matrix and vector.

nbins = 20;
[binnedXInfo,binnedYInfo] = iBinXY(X,Y,nbins);
binnedX = binnedXInfo.binned;
nbinsX = binnedXInfo.nbins;
binnedY = binnedYInfo.binned;
nbinsY = binnedYInfo.nbins;
R0 = iComputeMIXX(binnedX,nbinsX);
J = iComputeMIXY(binnedX,binnedY,nbinsX,nbinsY);
% Scale $R$ to make $\alpha = 0.5$ correspond to equal redundancy and
% relevance terms.
R = R0/(K-1);

Find the appropriate value of a for the QUBO Q_a = (1 – a)R – diag(aJ). To do so, solve for the number of nonzero elements in the solution by using the howmany helper function, and then use fzero to set that number equal to K.
```
fun = @(alpha)howmany(alpha,R,J) - K;
alphasol = fzero(fun,[0 1]);

[~,xsol] = howmany(alphasol,R,J);
```

View the selected predictors.

find(xsol.BestX)

Train a regression tree, first using the full set of 30 predictors and then using the selected set of five predictors. Specify a common holdout value of 0.2, and compare the cross-validation losses. First train the regression tree using all the predictors. Training a regression tree requires a Statistics and Machine Learning Toolbox license.
```
rng default
mdl = fitrtree(X,Y,CrossVal="on",Holdout=0.2);
kfoldLoss(mdl)
```
```
ans = 0.0041
```
```
rng default
X2 = X(:,find(xsol.BestX));
mdl2 = fitrtree(X2,Y,CrossVal="on",Holdout=0.2);
kfoldLoss(mdl2)
```
```
ans = 0.0032
```
Although both regression trees have a low loss value, the regression tree that uses only five of the 30 predictors has a lower loss value.

Helper Functions

This code creates the iSyntheticData1 helper function.

function [N,p,X,y] = iSyntheticData1
rng default
N = 10000;
p = 30;
useful = [6,8,9,11,13];
C = randn(p,p);
R = corrcov(C'*C);
X = mvnrnd(zeros(p,1),R,N);
% Make features 15 to 19 highly correlated with useful features:
% 15 -> 6
% 16 -> 8
% 17 -> 9
% 18 -> 11
% 19 -> 13
corrStd = 0.1;
X(:,15:19) = X(:,useful) + corrStd*randn(N,5);
noiseStd = 0.1;
t = 0.5*cos(X(:,11)) + sin(X(:,9).*X(:,8)) + 0.5*X(:,13).*X(:,6) + noiseStd*randn(N,1);
y = rescale(t,0,1);
X = zscore(X);
end

This code creates the iBinXY helper function. Note that this helper function uses the iBinPredictors helper function.

function [binnedXInfo,binnedYInfo] = iBinXY(X,Y,nbins)
binnedXInfo = iBinPredictors(X,nbins);
binnedYInfo = iBinPredictors(Y,nbins);
end

This code creates the iComputeMIXX helper function. Note that this helper function uses the iComputeMIXIXJ helper function.

function Q = iComputeMIXX(binnedX,nbinsX)
p = size(binnedX,2);
Q = zeros(p,p);
for i = 1:p
    for j = i+1:p
        Xi = binnedX(:,i);
        Xj = binnedX(:,j);
        nbinsXi = nbinsX(i);
        nbinsXj = nbinsX(j);
        Q(i,j) = iComputeMIXIXJ(Xi,Xj,nbinsXi,nbinsXj);
    end
end
Q = Q + Q';
end

This code creates the iComputeMIXIXJ helper function.

function mi = iComputeMIXIXJ(Xi,Xj,nbinsXi,nbinsXj)
N = size(Xi,1);
NXY = zeros(nbinsXi,nbinsXj);
for n = 1:N
    a = Xi(n);
    b = Xj(n);
    NXY(a,b) = NXY(a,b) + 1;
end
NX = sum(NXY,2);
NY = sum(NXY,1);
NXNYDivN = NX.*NY/N;
nonzeroID = NXY ~= 0;
nonzeroNXY = NXY(nonzeroID);
nonzeroNXNYDivN = NXNYDivN(nonzeroID);
mi = sum((nonzeroNXY./N).*log(nonzeroNXY./nonzeroNXNYDivN),'all');
end

This code creates the iComputeMIXY helper function. Note that this helper function uses the iComputeMIXIXJ helper function.

function f = iComputeMIXY(binnedX,binnedY,nbinsX,nbinsY)
p = size(binnedX,2);
f = zeros(p,1);
for i = 1:p
    Xi = binnedX(:,i);
    nbinsXi = nbinsX(i);
    f(i) = iComputeMIXIXJ(Xi,binnedY,nbinsXi,nbinsY);
end
end

This code creates the howmany helper function.

function [n,result] = howmany(alpha,R,J)
Q = qubo((1-alpha)*R - alpha*diag(J));
result = solve(Q);
n = sum(result.BestX);
end

This code creates the iBinPredictors helper function. Note that this helper function uses the iDiscretize helper function.

function binnedInfo = iBinPredictors(X,nbins)
[N,p] = size(X);
binnedX = zeros(N,p);
edgesX = cell(1,p);
iscatX = false(1,p);
nbinsX = zeros(1,p);
istbl = istable(X);
for i = 1:p
    if istbl
        oneX = X{:,i};
    else
        oneX = X(:,i);
    end
    nbinsi = min(nbins, numel(unique(oneX)));
    [binnedX(:,i),edgesX{i},iscatX(i),nbinsX(i)] = iDiscretize(oneX,nbinsi);
end
binnedInfo = struct;
binnedInfo.binned = binnedX;
binnedInfo.edges = edgesX;
binnedInfo.iscat = iscatX;
binnedInfo.nbins = nbinsX;
end

This code creates the iDiscretize helper function.

function [binnedx,edges,iscat,nbins] = iDiscretize(x,nbins)
if iscategorical(x)
    binnedx = double(x);
    edges = [];
    iscat = true;
    nbins = numel(unique(x));
else
    probs = [0,(1:(nbins-1))/nbins,1];
    edges = quantile(x,probs);
    binnedx = discretize(x,edges,IncludedEdge="right");
    iscat = false;
end
end

References

[1] Mücke, S., R. Heese, S. Müller, M. Wolter, and N. Piatkowski. Quantum Feature Selection. arXiv:2203.13261v1, March 2022. Available at https://arxiv.org/pdf/2203.13261.

External Websites

https://en.wikipedia.org/wiki/Mutual_information