NN validation and data partition

laplace laplace on 14 Jan 2013
  • 1) I have 50 data points; I use 35 to train the network and 15* to test.
  • 2) I take these 35 data points and split them into 7 folds of 5 points each.
  • For i = 1,...,7 I pick fold i (5 data points) to test/validate and the rest (30 points) to train the network.
  • 3) So now I have created 8 networks: the original and another 7 due to the data partition I made.
  • 3a) I want to save these 7 networks and be able to manipulate them.
  • 3b) I want to take the original 15 data points* and run them through the 7 networks as test/validation data.

Accepted Answer

Greg Heath on 16 Jan 2013
There is an important difference between validation and testing.
In the ideal scenario:
There is only 1 subset, to be used for both adequate training and unbiased estimation of performance on unseen data. The subset is assumed to be randomly drawn from a general population and to be sufficiently large and diverse to represent the salient I/O features of that population so well that using the same data to test the performance of a net trained with it yields a relatively unbiased estimate of performance on unseen data from the same general population.
Typically, however, this requires more data than is available, and the scenario has to be modified so that a good network can be designed and a relatively unbiased estimate of network performance on nondesign data can still be obtained.
A common approach is based on a division of the data into 3 separate subsets for training, validation and testing.
Data = Design + Test
Design = Train + Val
The 3 subsets are all assumed to be sufficiently large random draws from the general population. Accurate weight estimation generally requires the training set to be at least ~63% (1 - 1/exp(1)) of the total. The validation and test sets are generally of similar size. (The MATLAB default split is 0.7/0.15/0.15.)
The test set is used ONCE AND ONLY ONCE to estimate network performance on population nondesign data. If this is unsatisfactory, redesigns require re-randomization of the data using a different initial RNG state.
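For example, the 3-way division can be set with something like the following sketch (x and t are assumed to be the input and target matrices; the ratios shown are the defaults):
net = fitnet(10);                    % e.g., 10 hidden nodes
net.divideFcn = 'dividerand';        % random division (the default)
net.divideParam.trainRatio = 0.70;   % Train
net.divideParam.valRatio   = 0.15;   % Val   (Design = Train + Val)
net.divideParam.testRatio  = 0.15;   % Test  (used once, for the final estimate)
[net, tr] = train(net, x, t);
tr.best_tperf                        % test-set MSE at the best (stopping) epoch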
The training set is used to estimate weights given a set of training parameters.
The training and validation sets are used repeatedly to both estimate weights (training set) and determine an adequate set of training parameters (validation set). Very often a set of default values is used for number of hidden nodes, learning rate, momentum constant, maximum number of epochs, etc., and the emphasis is on determining when to stop training.
This is called validation stopping (AKA "stopped training" and "early stopping"). The validation error is used to stop training when it fails to decrease for 6 (the MATLAB default) consecutive epochs.
In other words, the training subset is used to obtain a set of weights that EITHER minimizes the validation error, minimizes the training error, or achieves a specified training error goal.
The training error is always biased.
If the validation error causes training to stop, or is used repeatedly to determine training parameters, the validation error is also biased (although usually not nearly as much as the training error).
The test set error is unbiased because it is completely independent of design (training and validation).
The test set error is used to estimate performance on the rest of the unseen data in the population.
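A small sketch of validation stopping and the three resulting error estimates (again, x and t assumed defined; 10 hidden nodes is just a placeholder):
net = fitnet(10);
net.trainParam.max_fail = 6;   % stop after 6 consecutive validation failures (default)
[net, tr] = train(net, x, t);
tr.best_epoch                  % epoch at which the validation error was minimal
tr.best_perf                   % training MSE at that epoch (always biased)
tr.best_vperf                  % validation MSE (biased by the stopping rule)
tr.best_tperf                  % test MSE (unbiased estimate for unseen data)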
When the data is not sufficiently large to achieve reasonable sizes for the 3 subsets, I suggest first using all of the data, without division, to design ~100 networks that differ by number of hidden nodes (e.g., 0:9) and 10 different random weight initializations. This typically takes less than a minute or two. Use the 10x10 performance tabulations to guide further designs.
I have posted several examples in the NEWSGROUP and ANSWERS.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Regarding your proposal: Random data division and weight initialization are defaults in MATLAB feedforward nets. Therefore, all you have to do is set
1. The initial RNG state (e.g., rng(0)) so you can duplicate experiments
2. The range of hidden nodes (e.g., j = 1:10)
3. The number of weight initialization trials (e.g., i = 1:10)
4. The data division ratio
5. The MSEgoal
and use a double loop over H and weight initialization to obtain the 3 error estimates from the training record, tr, and store them in 10x10 matrices.
From the matrices you can easily see the smallest number of hidden nodes that yields good performances most of the time.
You can either average over those good performances or make more runs for that particular value of hidden nodes to get a final estimate of mean error and standard deviation.
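For illustration, that double loop might be sketched roughly as follows (x and t are assumed to be the input and target matrices; the loop bounds are examples, and the MSEgoal is simply set to 0 here so that validation stopping controls training):
rng(0)                                   % 1. reproducible initial RNG state
Hvec    = 1:10;                          % 2. range of hidden nodes
Ntrials = 10;                            % 3. weight initialization trials per H
MSEtrn = zeros(Ntrials, numel(Hvec));    % training error estimates
MSEval = zeros(Ntrials, numel(Hvec));    % validation error estimates
MSEtst = zeros(Ntrials, numel(Hvec));    % test error estimates
for j = 1:numel(Hvec)
    for i = 1:Ntrials
        net = fitnet(Hvec(j));               % random division & weights by default
        net.divideParam.trainRatio = 0.70;   % 4. data division ratio
        net.divideParam.valRatio   = 0.15;
        net.divideParam.testRatio  = 0.15;
        net.trainParam.goal = 0;             % 5. MSEgoal (0 = rely on validation stopping)
        [net, tr] = train(net, x, t);
        MSEtrn(i,j) = tr.best_perf;          % the 3 error estimates from tr
        MSEval(i,j) = tr.best_vperf;
        MSEtst(i,j) = tr.best_tperf;
    end
end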
Hope this helps.
Thank you for formally accepting my answer.
Greg
  2 Comments
laplace laplace on 18 Feb 2013
Really helpful answer, but could you write some code so I can better understand your points? Thanks in advance.
Greg Heath on 19 Feb 2013
Edited: Greg Heath on 19 Feb 2013
Will send two draft outlines shortly:
1. Estimating the optimum No. of hidden nodes, Hopt, using ALL of the data
2. Estimating the generalization error on unseen data using Hopt and 7-fold XVAL.
What are dimensionalities of input & output? (I and O)
Have you plotted and surveyed the data?
What is the average variance of the target variables?
MSE00 = mean(var(t',1)) % biased (N divisor)
MSE00a = mean(var(t')) % unbiased (N-1 divisor)
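For example, assuming x and t hold one example per column, all four numbers can be obtained with something like:
[I, N] = size(x)            % input dimensionality and No. of examples
[O, N] = size(t)            % output dimensionality
MSE00  = mean(var(t',1))    % biased reference MSE (N divisor)
MSE00a = mean(var(t'))      % unbiased reference MSE (N-1 divisor)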
Please respond with those 4 numbers ASAP.
Thanks
Greg

More Answers (3)

Greg Heath on 19 Feb 2013
1. Choose candidate values for Hopt, the optimal No. of hidden nodes, by using ALL of the data and looping over Ntrials candidate values for each H in Hmin:dH:Hmax.
Neq = N*O % No. of training equations
Nw = (I+1)*H+(H+1)*O % No. of unknowns (weights)
Ndof = Neq - Nw % No. of estimation degrees-of-freedom
MSE = SSE/Neq % Biased MSE estimate
MSEa = SSE/Ndof % Unbiased ("a"djusted) MSE estimate
Hub = -1 + ceil( (Neq-O)/(I+O+1) ) % Ndof > 0 (Neq > Nw) upper bound
Choose Hmin, dH, Hmax <= Hub
numH = length(Hmin:dH:Hmax)
Ntrials = 10 % No. of weight initializations per H value
2. Use ALL of the data for training to choose Hopt from Ntrials*numH candidate nets
a. rng(0) % Initialize random number generator
b. Outer Loop over h = Hmin:dH:Hmax (j=1:numH)
c. Inner loop over i = 1:Ntrials % of weight initializations
d. net.divideFcn = 'dividetrain'
e. MSEgoal = 0.01*Ndof*MSE00a/Neq
f. net.trainParam.goal = MSEgoal;
g. net = fitnet(h);
h. [net tr] = train(net,x,t);
i. Nepochs(i,j)= tr.best_epoch
j. MSE = tr.best_perf
k. NMSE = MSE/MSE00
l. R2(i,j) = 1-NMSE % Biased
m. MSEa = Neq*MSE/Ndof
n. NMSEa = MSEa/MSE00a
o. R2a(i,j) = 1-NMSEa % Unbiased
3. Estimating Hopt
a. Tabulate Nepochs, R2, and R2a in 3 Ntrials-by-numH matrices.
b. Choose (i,j)_opt and Hopt from the maximum of R2a.
c. Redesign net_opt by reinitializing the RNG and calling it repeatedly to get the same initial state as the (i,j)_opt run.
NOTE: (i,j)_opt, Hopt and net_opt can be obtained within the loop. However, perusing the 3 tabulations is enlightening.
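Put together, steps 1-3 might look roughly like the following sketch (it assumes x is the I-by-N input matrix and t the O-by-N target matrix, and picks illustrative values for Hmin, dH and Hmax):
[I, N] = size(x);
[O, ~] = size(t);
Neq    = N*O;                                 % No. of training equations
MSE00  = mean(var(t',1));                     % biased reference MSE (N divisor)
MSE00a = mean(var(t'));                       % unbiased reference MSE (N-1 divisor)
Hub  = -1 + ceil( (Neq-O)/(I+O+1) );          % upper bound so that Ndof > 0
Hmin = 1; dH = 1; Hmax = min(10, Hub);        % illustrative choices, Hmax <= Hub
Hvec = Hmin:dH:Hmax;
numH = length(Hvec);
Ntrials = 10;                                 % weight initializations per H
Nepochs = zeros(Ntrials, numH);
R2      = zeros(Ntrials, numH);
R2a     = zeros(Ntrials, numH);
rng(0)                                        % a. initialize the RNG
for j = 1:numH                                % b. outer loop over hidden nodes
    h    = Hvec(j);
    Nw   = (I+1)*h + (h+1)*O;                 % No. of unknown weights
    Ndof = Neq - Nw;                          % degrees of freedom (> 0 since h <= Hub)
    for i = 1:Ntrials                         % c. inner loop over weight initializations
        net = fitnet(h);                      % g.
        net.divideFcn = 'dividetrain';        % d. use ALL data for training
        net.trainParam.goal = 0.01*Ndof*MSE00a/Neq;   % e.-f. MSEgoal
        [net, tr] = train(net, x, t);         % h.
        Nepochs(i,j) = tr.best_epoch;         % i.
        MSE  = tr.best_perf;                  % j. training MSE
        R2(i,j)  = 1 - MSE/MSE00;             % k.-l. biased R^2
        MSEa = Neq*MSE/Ndof;                  % m. adjusted MSE
        R2a(i,j) = 1 - MSEa/MSE00a;           % n.-o. unbiased R^2
    end
end
% 3. Choose Hopt from the maximum of R2a
[~, ind]     = max(R2a(:));
[iopt, jopt] = ind2sub(size(R2a), ind);
Hopt = Hvec(jopt)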

laplace laplace on 19 Feb 2013
data: inputs 1x4 columns; outputs 0 or 1
  3 Comments
laplace laplace on 20 Feb 2013
I have 50 4x1 matrices that I want to target to 1 and 0. Why does this make no sense?
laplace laplace on 9 Mar 2013
Please answer: why does this make no sense?

laplace laplace on 17 Apr 2013
You still didn't answer me.
  1 Comment
Greg Heath on 24 Apr 2013
Sorry I didn't see this until now.
You said
data: inputs 1x4 columns; outputs 0 or 1
whereas your input matrix has dimensions 4x50 containing 4x1 vectors.
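For example, the expected layout is roughly as follows (hypothetical placeholder data, with patternnet as the usual choice for 0/1 targets):
x = rand(4, 50);                % 50 input column vectors of dimension 4 (I = 4, N = 50)
t = double(rand(1,50) > 0.5);   % 1x50 row of 0/1 targets (O = 1)
net = patternnet(10);           % 10 hidden nodes, for illustration
[net, tr] = train(net, x, t);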
