I am performing system identification using neural networks with 5 inputs and 1 output. NARX networks give good results when the gradients stay stable during training. However, I often run into exploding/vanishing gradient problems when training a NARX network in closed loop. I can observe this in the nntraintool window: the gradient diverges and becomes unstable, then the maximum mu stopping criterion is triggered and training ends prematurely. Note that I first train the NARX network in open loop with a performance goal of 1e-09. I then close the loop and retrain in closed-loop form, using the open-loop weights and biases as initial values.
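For reference, the open-loop-then-closed-loop workflow described above can be sketched roughly as follows. This is a minimal sketch, not the actual setup: the synthetic data, the delays `1:2`, and the hidden layer size of 10 are all placeholder assumptions, and it relies on the default `trainlm` training function, which is the one that exposes the mu stopping criterion.

```matlab
% Placeholder data (assumption): 5 inputs, 1 output, 500 time steps.
rng(0);
N = 500;
U = rand(5, N);
Y = filter(0.2, [1 -0.8], sum(U, 1));   % simple stable dynamic response
X = con2seq(U);                         % convert matrices to cell sequences
T = con2seq(Y);

% Open-loop (series-parallel) training.
net = narxnet(1:2, 1:2, 10);            % delays and hidden size are illustrative
net.trainParam.goal = 1e-9;             % performance goal from the description
[Xo, Xio, Aio, To] = preparets(net, X, {}, T);
net = train(net, Xo, To, Xio, Aio);

% Close the loop; the open-loop weights and biases carry over as initial values.
netc = closeloop(net);
[Xc, Xic, Aic, Tc] = preparets(netc, X, {}, T);

% Retrain in closed loop. With trainlm, training stops once mu exceeds mu_max.
disp(netc.trainParam.mu_max);
netc = train(netc, Xc, Tc, Xic, Aic);
```

The premature stop shows up in the last `train` call, where the gradient reported in nntraintool becomes unstable before the mu limit is reached.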
I would like to keep the NARX network architecture, since it performs quite well on new data when training isn't interrupted by an exploding gradient. Do you have any strategies or examples for avoiding exploding/vanishing gradient problems with NARX networks? I can't seem to find any discussions, documentation, or examples covering this issue.
One workaround that I've read about is to use a leaky ReLU activation function. However, I do not see an option like this for a NARX network's hidden layer transfer function. The closest I can find is 'poslin', but I still run into similarly unstable gradients. Another workaround I've seen is to use an LSTM or GRU network. However, none of the LSTM/GRU networks that I have trained has reached the same level of performance as the NARX network.
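For context, switching the hidden layer to 'poslin' looks roughly like this. It is only a sketch: the delays and hidden layer size are placeholder assumptions, and layer 1 is assumed to be the hidden layer, as in a default `narxnet`.

```matlab
% Illustrative network; delays and hidden size are placeholders.
net = narxnet(1:2, 1:2, 10);

% The default hidden layer transfer function is 'tansig'.
disp(net.layers{1}.transferFcn);

% Use the positive linear (ReLU-like) transfer function instead.
net.layers{1}.transferFcn = 'poslin';
```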