Genetic algorithm: about the ga function in MATLAB and feature selection

I am using a genetic algorithm for feature selection, with the built-in function of MATLAB.
In the above function, where do I get the selected features from? What value is stored in x, and what is it used for? Do I have to include other steps for feature selection? What should nvars be? Is it the number of features input to the genetic algorithm? Kindly clarify. Thanks in advance.

Accepted Answer

Walter Roberson on 13 Apr 2022

The function fun that you pass in is responsible for accepting trial vectors of model parameters and evaluating the "cost" associated with those particular parameters. The return value, x, is the vector of model parameters that resulted in the lowest "cost".
In the case of feature selection, the trial parameters could include a vector of integer decision variables, restricted to 0 or 1, with 0 meaning that the corresponding feature is not selected and 1 meaning that it is selected. To select no more than N features, you could add a linear constraint that the sum of those decision variables is <= N.
Because the return value is the set of trial parameters that resulted in the lowest cost, if you use integer decision variables as I describe, that section of the output vector tells you which features were selected (1) or not (0).
The return value from ga() does not inherently tell you about selected features: you have to arrange your function so that the set of selected features can be calculated from the inputs, and then, once you get the best parameters out, use them to determine which features were selected.
Another approach, instead of binary decision variables, would be to use a vector of integer-constrained variables, each between 1 and the number of features, effectively listing which features are selected.
nvars is the total number of model parameters that are to be varied. The output will be of the length indicated by nvars. This is not necessarily the same as the number of features, since you might have extra variables that are not used as decision variables, or you might have chosen to encode by feature number instead of by binary decision variables.
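As an illustration of the binary decision variable approach described above, here is a minimal sketch. It assumes the Global Optimization Toolbox and the Statistics and Machine Learning Toolbox are installed. The random data, the holdout split, and the names selectionCost and maxFeatures are placeholders invented for this example; they are not from the original question.

```matlab
% Sketch only: binary-mask feature selection with ga().
% X, y, maxFeatures and selectionCost are illustrative placeholders.
rng(0);                                   % reproducible random data
X = rand(200, 10);                        % 200 observations, 10 features
y = double(X(:,2) + X(:,7) > 1);          % labels that depend on features 2 and 7

cv     = cvpartition(y, 'HoldOut', 0.3);  % hold out 30% for evaluating each trial
Xtrain = X(training(cv), :);  ytrain = y(training(cv));
Xtest  = X(test(cv), :);      ytest  = y(test(cv));

nvars       = size(X, 2);                 % one 0/1 decision variable per feature
maxFeatures = 4;                          % select no more than this many
A  = ones(1, nvars);  b = maxFeatures;    % linear constraint: sum(x) <= maxFeatures
lb = zeros(1, nvars); ub = ones(1, nvars);
intcon = 1:nvars;                         % every variable is integer-constrained

fun = @(x) selectionCost(x, Xtrain, ytrain, Xtest, ytest);
x   = ga(fun, nvars, A, b, [], [], lb, ub, [], intcon);

selectedFeatures = find(x > 0.5)          % indices of the features marked 1

function cost = selectionCost(x, Xtrain, ytrain, Xtest, ytest)
    mask = x > 0.5;                       % trial vector -> logical column mask
    if ~any(mask)
        cost = 1;                         % no features selected: worst possible error
        return
    end
    mdl  = fitcknn(Xtrain(:, mask), ytrain, 'NumNeighbors', 5);
    pred = predict(mdl, Xtest(:, mask));
    cost = mean(pred ~= ytest);           % misclassification rate = 1 - accuracy
end
```

Here nvars equals the number of features because each feature gets its own 0/1 variable, and the selected features are read back from x after ga() returns.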

13 Comments

Thank you, Walter Roberson. As I am new to this field, I am having some difficulty with the code. Kindly explain using this example code:
data=rand(1250,37);
%split into training and testing
traindata=rand(875,36);
testdata=rand(375,36);
trainlabel=rand(875,1);
testlabel=rand(375,1);
%knn model
mdl=fitcknn(traindata, trainlabel, 'NumNeighbors', 5) ;
fun=@(t,p)mse(mdl,testdata, testlabel) ;
ga_out=ga(fun, nvars, options);
function mse_out=mse(mdl,testdata, testlabel)
predictknn=predict(mdl, testdata) ;
mse_out=mean(sum((testlabel-predictknn).^2,2)) ;
end
Now, is the function handle defined correctly? Will ga iteratively update the values so that the mean squared error is minimized? And how do we get the selected features from this? Kindly help.
Thanks in advance
fun=@(t,p)mse(mdl,testdata, testlabel) ;
That function expects up to two inputs.
It then ignores the inputs, and calculates mse for the exact same values of mdl, testdata, and testlabel each time. Nothing useful will be done with that as it is exactly the same calculation each time it is attempted.
Note also that ga() will only pass one parameter to fun: the vector of trial values. That does not matter here, because the function ignores whatever inputs it is passed.
The ga() call as you have written it does not help select features.
There are three major ways to do feature selection here:
  • you could interpret the vector of trial inputs as a binary vector that indicates which features to include; in this case you would pass a number of variables equal to the number of features, set their lower bound to 0 and their upper bound to 1, and mark them as integer. The 1 values in the output vector show which features are selected.
  • you could interpret the vector of trial inputs as a vector of integer indices that indicate which features are active; in this case you would pass a number of variables equal to the number of features you want to select, set their lower bound to 1 and their upper bound to the number of features, and mark them as integer. The output vector is the vector of indices of the selected features.
  • you could interpret the vector of trial inputs as a vector of weights, one per feature; in this case you would pass a number of variables equal to the number of features, set the lower bound to 0 and the upper bound to 1, and not mark them as integer. The output vector would be the vector of weights; you would sort it in descending order and take the highest-weighted entries as your features.
All three of these would require some adjustment to your function that calculates mse.
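To make the three encodings concrete, here is a small sketch of how a trial vector could be turned into a list of feature indices in each case. The example vectors and the names xMask, xIdx, xW and k are made up for illustration; inside a cost function you would then index the data with the result, for example traindata(:, selMask).

```matlab
% Sketch only: turning a ga() trial vector into a list of feature indices.
% The three vectors below are made-up examples, one per encoding.
nFeatures = 6;

% 1) Binary mask: one 0/1 variable per feature.
xMask   = [0 1 0 0 1 1];
selMask = find(xMask > 0.5);              % features 2, 5 and 6

% 2) Index list: one integer in 1..nFeatures per slot.
xIdx   = [5 2 5];
selIdx = unique(xIdx);                    % features 2 and 5 (duplicates collapse)

% 3) Weights: one real value in [0,1] per feature; keep the k largest.
xW = [0.10 0.80 0.30 0.05 0.90 0.20];
k  = 2;
[~, order] = sort(xW, 'descend');
selW = sort(order(1:k));                  % features 2 and 5
```

Note that with the index-list encoding, nothing stops ga() from proposing the same index twice, so a trial vector may describe fewer distinct features than it has slots.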
Thank you, Walter Roberson. As you suggested, I have done the feature selection with the help of the genetic algorithm. Now I have a doubt: I have used error=1-accuracy as the fitness function. Theory says that individuals with high fitness are used for reproduction, so how will that work in my case?
MATLAB optimization mostly uses cost functions instead of fitness functions, and minimizes the cost. The way you define error results in lower error when the accuracy is higher, which is perfect for a cost function. But it is not a fitness function in the sense that high fitness is better; accuracy would be your fitness in that case. ga() uses cost rather than fitness.
Thank you, Walter Roberson, for your clarification. I also have a doubt about the calculation of the error, error=1-accuracy: do we have to use only the training accuracy, or can we also use the testing accuracy?
You should only use testing accuracy for this purpose, in order to prevent over-training.
Thank you so much for your guidance
Hello Walter Roberson, I have a doubt about choosing the mutation and crossover operators. I have used a mutation rate of 0.3 and a crossover rate of 0.5. How does this work? How many variables are mutated and crossed over?
If you have a crossover rate of 0.5 then ga() first holds on to a selection of "elite" population elements unchanged. Then it selects a 0.5 fraction (half) of the non-elite elements to apply cross-over to; the other half will have mutation applied. When a population member is chosen for cross-over, a random integer is used to select the cross-over point. I do not recall there being any way to control the locations of cross-overs.
Non-elite population members that are not chosen for cross-over are mutated. The details of the mutation depend on which mutation function is being used; see https://www.mathworks.com/help/gads/genetic-algorithm-options.html#f6633 .
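As a rough worked example of that split, the arithmetic can be sketched as follows. The population size and elite count are illustrative values, not recommendations, and the rounding step assumes ga() rounds the crossover count to the nearest integer as its documentation describes.

```matlab
% Sketch only: how a population splits into elite, crossover and mutation
% children. The three settings below are illustrative, not recommendations.
populationSize    = 50;
eliteCount        = 2;
crossoverFraction = 0.5;

nNonElite  = populationSize - eliteCount;            % 48 non-elite children
nCrossover = round(crossoverFraction * nNonElite);   % 24 created by crossover
nMutation  = nNonElite - nCrossover;                 % 24 created by mutation

fprintf('elite %d, crossover %d, mutation %d\n', eliteCount, nCrossover, nMutation);

% The same settings expressed as ga options (needs Global Optimization Toolbox).
% With mutationuniform, the second cell entry is the probability that each
% entry of an individual chosen for mutation gets replaced.
options = optimoptions('ga', ...
    'PopulationSize',    populationSize, ...
    'EliteCount',        eliteCount, ...
    'CrossoverFraction', crossoverFraction, ...
    'MutationFcn',       {@mutationuniform, 0.3});
```

So the crossover fraction controls how many children are produced by each operator, while a mutation rate such as 0.3 only has meaning as a parameter of a particular mutation function. ga() may substitute its own operators when integer constraints are present, so check the documentation for your release.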
Thank you for your response. I have a doubt regarding optimization types. GA can be used for both constrained and unconstrained optimization. In my case, where the error function is used as the cost function, which type of optimization is it: constrained or unconstrained?
ga() is considered constrained if any of the following are true:
  • you specify that the population type is bitstring
  • you pass in non-empty A b linear inequality
  • you pass in non-empty Aeq beq linear equality
  • you pass in any lower bound that is not -inf, or any upper bound that is not +inf
  • you pass in a nonlinear constraint function that uses any input to calculate any output
  • you mark any variable as integer constrained
Generally speaking, if you call ga with more than two parameters, there is a good chance that you are using constraints (unless all parameters from the third on are empty, except the options parameter).
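For comparison, here is a minimal sketch of an unconstrained call next to a constrained one. The toy cost function is invented for the example and has nothing to do with feature selection; the second call uses bounds and integer constraints in the same way a binary feature-selection setup would.

```matlab
% Sketch only: the same toy cost function called without and with constraints.
cost  = @(x) sum((x - 0.3).^2);   % illustrative cost, not a feature-selection cost
nvars = 4;

% Unconstrained: only fun and nvars are supplied.
xFree = ga(cost, nvars);

% Constrained: bounds plus integer constraints, as in binary feature selection.
lb = zeros(1, nvars);  ub = ones(1, nvars);
xBinary = ga(cost, nvars, [], [], [], [], lb, ub, [], 1:nvars);
```

A feature-selection run that uses 0/1 decision variables therefore counts as constrained, even if no linear or nonlinear constraints are supplied.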
Thank you. I have another doubt: which test is suitable for finding the statistical significance between features?
I have a dataset in the form of an (m,n) matrix, where m is the number of observations and n is the number of features. These m samples fall into c classes, for example c=4. My question is: can I run a statistical significance test with the whole matrix as input, or do I have to separate it by class? Which method is preferred in my case: Student's t-test, ANOVA, or some other test?

Version

R2020b