Hello, I have a 54000 x 10 matrix. I want to split it into 70% training and 30% testing. What is the easiest way to do that?

1 Comment

Delvan Mjomba on 6 Jun 2019
Use the randperm command to ensure random splitting. It's very easy.
For example, if you have 150 items to split for training and testing, proceed as below:
% data is your 150-row matrix
indices = randperm(150);
Trainingset = data(indices(1:105),:);
Testingset = data(indices(106:end),:);
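The same idea scales to the 54000 x 10 matrix from the question. The following is a minimal sketch, assuming the matrix is stored in a variable named data (filled with random stand-in values here) and that rounding the training count to the nearest integer is acceptable:

```matlab
% Stand-in for the real 54000 x 10 matrix (assumption: variable name "data")
data = rand(54000, 10);

n = size(data, 1);              % number of rows (observations)
nTrain = round(0.7 * n);        % 70% of the rows go to training

indices = randperm(n);          % random permutation of 1..n
Trainingset = data(indices(1:nTrain), :);
Testingset  = data(indices(nTrain+1:end), :);
```

Because every row index appears exactly once in the permutation, the two sets are disjoint and together cover all rows.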


Accepted Answer

Akira Agata on 18 Jan 2018
Edited: the cyclist on 16 Aug 2022

25 votes

I would recommend using cvpartition, like:
% Sample data (54000 x 10)
data = rand(54000,10);
% Cross validation (train: 70%, test: 30%)
cv = cvpartition(size(data,1),'HoldOut',0.3);
idx = cv.test;
% Separate to training and test data
dataTrain = data(~idx,:);
dataTest = data(idx,:);

11 Comments

abdulaziz marie on 18 Jan 2018
That's perfect, thank you.
vetri L on 5 Mar 2019
Dear Akira,
When I do this type of partition, I get a different classification accuracy every time I run it. Why does this happen, and how can I avoid it?
Hi vetri-san,
Good question. This happens because cvpartition splits the data into dataTrain and dataTest randomly, and the state of the random number generator differs each time you run the code. That is why the classification accuracy changes on every run.
To avoid this, explicitly initialize the random number generator before you run the code. Please insert the following line before cvpartition.
rng('default');
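As an illustration of the effect, here is a minimal sketch (using random stand-in data, as in the answer above) that resets the generator before each of two partitions and checks that both select the same test rows:

```matlab
data = rand(54000, 10);          % stand-in data, as in the answer above

rng('default');                  % reset the generator to its default seed
cv1 = cvpartition(size(data,1), 'HoldOut', 0.3);

rng('default');                  % reset again before the second partition
cv2 = cvpartition(size(data,1), 'HoldOut', 0.3);

% Compare the logical test-row vectors of both partitions
sameSplit = isequal(test(cv1), test(cv2));
```

Any fixed seed, for example rng(42), should serve the same purpose; the key point is that the generator is reset to a known state before cvpartition is called.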
Myraj on 13 Apr 2019
Edited: Myraj on 18 Apr 2019
Hi Sir Akira,
I have some questions and hope you can reply.
1/ The data used in this question is a matrix (54000*10). Can I use cvpartition with image data?
2/ Also, can we put your code in a "for loop", for example, to run it several times automatically instead of doing it manually?
3/ What is the difference between cvpartition and the 'randomized' option in splitEachLabel, like this:
[dataTrain,dataTest] = splitEachLabel(data,0.8,'randomized');
Is it correct? Does it work like cvpartition?
emmanuel adewumi on 10 Jul 2020
Hi Akira,
Can I use cvpartition to split data sets that will be used for a regression model?
Akira Agata on 28 Nov 2020
> Myraj-san
1) Yes, you can use cvpartition for such a task, too. But if your dataset consists of a large number of image files, I would recommend using imageDatastore and splitEachLabel.
2) Yes.
3) cvpartition, when called with only the number of observations as above, randomly splits the dataset into training and test sets. On the other hand, splitEachLabel splits the dataset while keeping the label ratio in the outputs as equal as possible.
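A label-preserving split can also be obtained from cvpartition itself: when it is given a vector of class labels instead of a row count, the holdout split is stratified by class. A minimal sketch, assuming a hypothetical label vector named labels with an imbalanced 90/10 class ratio:

```matlab
rng('default');                         % reproducible illustration
labels = [ones(900,1); 2*ones(100,1)];  % hypothetical labels: 90% class 1, 10% class 2

% Passing the label vector makes the 70/30 holdout stratified by class
cvStrat = cvpartition(labels, 'HoldOut', 0.3);
testIdx = test(cvStrat);

% Fraction of class 2 in each subset (both should stay close to 0.10)
fracTrain = mean(labels(~testIdx) == 2);
fracTest  = mean(labels(testIdx)  == 2);
```

With a plain row count, as in the accepted answer, the class ratio in each subset is left to chance, which matters mostly for small or strongly imbalanced datasets.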
Akira Agata on 28 Nov 2020
> emmanuel adewumi-san,
Yes, of course! Also, I would recommend using the 'CrossVal' option available in many regression functions such as fitrsvm. (Which solution is better depends on your task.)
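For comparison, here is a minimal sketch of both routes for a regression model. The predictors X, the response y, and the linear relationship between them are hypothetical stand-ins chosen only so that the example runs on its own:

```matlab
rng('default');
X = rand(200, 3);                              % hypothetical predictors
y = X * [1; -2; 0.5] + 0.05 * randn(200, 1);   % hypothetical response

% Route 1: 'CrossVal','on' requests 10-fold cross-validation by default
cvMdl = fitrsvm(X, y, 'CrossVal', 'on');
mseCV = kfoldLoss(cvMdl);                      % cross-validated mean squared error

% Route 2: a single 70/30 holdout split via cvpartition
cv = cvpartition(size(X,1), 'HoldOut', 0.3);
mdl = fitrsvm(X(training(cv),:), y(training(cv)));
mseTest = loss(mdl, X(test(cv),:), y(test(cv)));
```

The first route uses every observation for both training and validation across the folds, while the second keeps a single untouched test set.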
gaurang solanki on 25 Feb 2021
Hi, I am using LIBSVM and I trained a model through MATLAB commands, but I don't know how to create the training and testing files. Can anyone please guide me on how to do this, so that I can test my model and get results? Thank you for reading this.
Shehbaz Aslam on 4 Sep 2021
I have 600001*4 data in Excel. Using this, the data splits into 70% training and 30% testing, but the values in each column appear changed after applying this function. For example, my third column contains values of 40, but in the generated training and testing data they become 0.2, 0.3, or 0.4. Why do these values change? Please help. I simply want to divide the 600001*4 data into training and testing data so that I can train and test an ANFIS controller. Thanks
Rishikesh Shetty on 9 Jan 2023
Hi Akira,
Thank you for this straightforward approach.
After following these steps, I was able to predict my model accuracy as expected.
My next question is: how do I split my data for all possible combinations?
For example, I have a 13*2 array that will split 70/30 into 9*2 (training) and 4*2 (testing). I would like to repeat this split for all possible combinations (13C9) and then obtain an average of the model prediction accuracy.
Any advice is deeply appreciated.
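For a dataset this small, the exhaustive enumeration can be sketched with nchoosek. This is only an illustration, assuming random stand-in values for the 13 x 2 array; the model fitting step is left as a comment because the thread does not specify the model:

```matlab
n = 13;                          % number of observations, as in the question
nTrain = 9;                      % 9 training rows, leaving 4 test rows
data = rand(n, 2);               % stand-in for the 13 x 2 array

combos = nchoosek(1:n, nTrain);  % each row is one choice of training rows
nSplits = size(combos, 1);       % 13C9 = 715 splits

for k = 1:nSplits
    trainIdx = combos(k, :);
    testIdx  = setdiff(1:n, trainIdx);   % the 4 rows not used for training
    dataTrain = data(trainIdx, :);
    dataTest  = data(testIdx, :);
    % Fit the model on dataTrain, evaluate on dataTest, store the accuracy
end
```

The number of combinations grows very quickly with the number of rows, so this approach is only practical for very small datasets.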
Abhijit Bhattacharjee on 4 Mar 2023
Rishikesh,
The cvpartition function randomizes the selection of the training and test datasets, so to get a new random combination, just run it again. I am not sure it is advisable to try all combinatorial possibilities, as it is questionable whether that will return a much better model than you could get with considerably less effort. Just retrain with a new random partitioning a few times (say 10 times) and average the results. Strictly speaking, this is repeated holdout validation rather than k-fold cross-validation; for k-fold, where the data are divided into k disjoint folds and each fold serves once as the test set, cvpartition offers the 'KFold' option.
Best,
Abhijit
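To make the distinction between repeated random holdout splits and k-fold cross-validation concrete, here is a minimal sketch using random stand-in data. The training and evaluation step is left as a comment, since it depends on the model:

```matlab
rng('default');
data = rand(54000, 10);             % stand-in data, as in the accepted answer
n = size(data, 1);

% Repeated holdout: 10 independent random 70/30 splits
nRepeats = 10;
for r = 1:nRepeats
    cv = cvpartition(n, 'HoldOut', 0.3);
    dataTrain = data(training(cv), :);
    dataTest  = data(test(cv), :);
    % Train and evaluate here, then store the accuracy for repeat r
end

% k-fold: 10 disjoint folds, so each row is tested exactly once
cvK = cvpartition(n, 'KFold', 10);
timesTested = zeros(n, 1);
for k = 1:cvK.NumTestSets
    timesTested = timesTested + test(cvK, k);
end
```

In the repeated holdout loop a given row may land in the test set several times or never, whereas in the k-fold partition timesTested equals 1 for every row.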


More Answers (4)

Gilbert Temgoua on 19 Apr 2022
Edited: Gilbert Temgoua on 20 Apr 2022

4 votes

I find dividerand very straightforward, see below:
% randomly select indexes to split data into 70%
% training set, 0% validation set and 30% test set.
[train_idx, ~, test_idx] = dividerand(54000, 0.7, 0, 0.3);
% slice training data with train indexes
% (take training indexes in all 10 features);
% x is the 54000 x 10 data matrix
x_train = x(train_idx, :);
% select test data
x_test = x(test_idx, :);

1 Comment

uma on 28 Apr 2022
How can I split the data into trainx, trainy, testx, testy format, where trainx and trainy have the same first dimension, and testx and testy also have the same first dimension? For example, I have a 1000*9 dataset. trainx should contain 1000*9, trainy should contain 1000*1, testx should contain 473*9, and testy should contain 473*1.
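One way to keep features and labels aligned is to index both with the same row selection. The following is a minimal sketch, assuming a hypothetical feature matrix X (1000 x 9), a hypothetical label vector y (1000 x 1), and a 70/30 split rather than the sizes quoted in the comment:

```matlab
rng('default');
X = rand(1000, 9);                 % hypothetical features, 1000 x 9
y = randi([0 1], 1000, 1);         % hypothetical labels, 1000 x 1

cv = cvpartition(size(X,1), 'HoldOut', 0.3);
trainIdx = training(cv);
testIdx  = test(cv);

% Index features and labels with the same logical vectors
trainx = X(trainIdx, :);   trainy = y(trainIdx);
testx  = X(testIdx, :);    testy  = y(testIdx);
```

Since trainx and trainy are cut with the same index vector, their first dimensions always match, and the same holds for testx and testy.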


Vrushal Shah on 14 Mar 2019

3 votes

If we want to split the dataset into training and testing phases, what is the best option to do that?
Jere Thayo on 28 Oct 2022

0 votes

What if both training and testing data are already in files, i.e. X_train.mat, y_train.mat, x_test.mat and y_test.mat?
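If the data are already separated into files, no further splitting should be needed; the files can simply be loaded. Here is a minimal sketch, which first writes small stand-in files to a temporary folder so that it runs on its own. The variable names stored inside the real files are unknown, so taking the first variable in each file is an assumption:

```matlab
% Stand-in files so the sketch runs on its own (real files would already exist)
folder = tempdir;
X_train = rand(70, 4);   y_train = rand(70, 1);
save(fullfile(folder, 'X_train.mat'), 'X_train');
save(fullfile(folder, 'y_train.mat'), 'y_train');

% load returns a struct with one field per variable stored in the file
S = load(fullfile(folder, 'X_train.mat'));
names = fieldnames(S);          % inspect which variables the file contains
Xtr = S.(names{1});             % assumption: the first variable holds the data

T = load(fullfile(folder, 'y_train.mat'));
tnames = fieldnames(T);
ytr = T.(tnames{1});
```

The test files can be read the same way, after which the model is trained on the training pair and evaluated on the test pair.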
Syed Iftikhar on 1 Jan 2023

0 votes

I have an input variable named 's' in which I have data only in columns. The size is 1000000. I want to split off 20% of it for testing and save that data in another variable, because I am going to use the test data in a Python script. Any idea how to do this?
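One possible approach is to pick 20% of the rows with randperm and write them to a CSV file that a Python script can read. This is a minimal sketch, assuming s is a 1000000 x 1 column vector (random stand-in values here); the output file name s_test.csv is hypothetical:

```matlab
rng('default');
s = rand(1000000, 1);              % stand-in for the column data in "s"

n = numel(s);
nTest = round(0.2 * n);            % 20% of the samples for testing
idx = randperm(n);

sTest  = s(idx(1:nTest));          % 20% test portion
sTrain = s(idx(nTest+1:end));      % remaining 80%

% Write the test portion to a CSV file (hypothetical file name)
outFile = fullfile(tempdir, 's_test.csv');
writematrix(sTest, outFile);
```

On the Python side, a CSV file like this can be read with, for example, numpy.loadtxt. Saving to a .mat file with save is an alternative if the Python script reads MATLAB files.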
