PPO Reinforcement Learning Agent doesn't learn

64 views (last 30 days)
Francesco Mogetti on 16 Jan 2022
Commented: Sourabh on 10 Dec 2023
Hi, I am trying to design a reinforcement learning algorithm that performs a Moon landing inside a defined region.
The algorithm I implemented is PPO, with the environment modeled in Simulink. The model is continuous. The action from the RL Agent Simulink block is the thrust, and the observation is the state (position and velocity). The reward is also designed in a continuous way, with penalties outside some boundaries (using the "exteriorPenalty" function), a reward inside the boundaries (using exponential functions), and some additional penalties on velocities and actions, properly weighted.
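Roughly, the reward has the shape sketched below. This is only a simplified sketch: the weights w1..w4, the boundary vectors posMin/posMax and the exact shaping terms are placeholders, not the values used in my model.
function r = landingReward(pos,vel,thrust,posMin,posMax,w1,w2,w3,w4)
    % penalty when the lander is outside the allowed region
    pOut = exteriorPenalty(pos,posMin,posMax,'quadratic');
    % exponential reward that grows as the lander approaches the centre of the region
    target = (posMin + posMax)/2;
    rIn = exp(-norm(pos - target));
    % weighted sum: reward inside, penalties outside and on velocity/control effort
    r = w1*rIn - w2*sum(pOut) - w3*norm(vel)^2 - w4*norm(thrust)^2;
end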
The model seems to work, but the agent doesn't learn as it is supposed to. I played with the PPO options to avoid local minima and increase exploration, hoping to help the agent find the optimal conditions. After many episodes the reward is supposed to increase, but instead it keeps oscillating between near-optimal values and worst cases. I know that training can take a lot of time because of the large environment; on the other hand, after a while I expect to see better behavior, especially since the reward values are so high in some cases.
My questions are: how do I read the plots in the Reinforcement Learning Episode Manager properly? Which parameters should I change to help the agent understand what is better? Any other comments are welcome!
Thanks for helping!
Here is my code for generating the actor and the critic, with the corresponding options:
% Actor network (as posted): a single path ending in a softmax over 2*numAct values
actPath = [
    sequenceInputLayer(numObs,'Normalization','none','Name','obs')
    fullyConnectedLayer(50,'Name','fc1act')
    dropoutLayer(0.2,'Name','drop1act')
    layerNormalizationLayer('Name','norm1act')
    reluLayer('Name','relu1act')
    lstmLayer(8,'OutputMode','sequence','Name','lstmact')
    layerNormalizationLayer('Name','norm2act')
    fullyConnectedLayer(2*numAct,'Name','fcoutput')
    layerNormalizationLayer('Name','norm3act')
    softmaxLayer('Name','SoftactionProb')];
% Critic (value) network: maps the observation sequence to a scalar state value
obsPath = [
    sequenceInputLayer(numObs,'Normalization','none','Name','obs')
    fullyConnectedLayer(100,'Name','fc1obs')
    dropoutLayer(0.2,'Name','drop1obs')
    layerNormalizationLayer('Name','norm1obs')
    reluLayer('Name','relu1obs')
    fullyConnectedLayer(22,'Name','fc2obs')
    dropoutLayer(0.2,'Name','drop2obs')
    layerNormalizationLayer('Name','norm2obs')
    reluLayer('Name','relu2obs')
    fullyConnectedLayer(5,'Name','fc3obs')
    dropoutLayer(0.2,'Name','drop3obs')
    layerNormalizationLayer('Name','norm3obs')
    reluLayer('Name','relu3obs')
    lstmLayer(8,'OutputMode','sequence','Name','lstmobs')
    layerNormalizationLayer('Name','norm4obs')
    fullyConnectedLayer(1,'Name','fcvalue')];
% Representation options shared by actor and critic
opts1 = rlRepresentationOptions('LearnRate',5e-3,'GradientThreshold',10,'UseDevice','gpu');
actor = rlStochasticActorRepresentation(actPath,obsInfo,actInfo,'Observation','obs',opts1)
critic = rlValueRepresentation(obsPath,obsInfo,'Observation','obs',opts1)
% PPO agent options
opts2 = rlPPOAgentOptions('ExperienceHorizon',200, ...
    'SampleTime',0.25, ...
    'MiniBatchSize',32, ...
    'EntropyLossWeight',0.5, ...
    'AdvantageEstimateMethod','gae', ...
    'GAEFactor',0.8, ...
    'NormalizedAdvantageMethod','current');
agent = rlPPOAgent(actor,critic,opts2)
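For completeness, this is roughly how I connect the agent to the Simulink model and start training; the model name and block path below are placeholders, not my exact script.
env = rlSimulinkEnv('landerModel','landerModel/RL Agent',obsInfo,actInfo);
trainOpts = rlTrainingOptions('MaxEpisodes',5000, ...
    'MaxStepsPerEpisode',200, ...
    'ScoreAveragingWindowLength',50, ...   % window used for the average-reward curve
    'Plots','training-progress');          % opens the Episode Manager
trainStats = train(agent,env,trainOpts);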
  1 Comment
Siphiwe Phetla on 15 Oct 2022
Hi. I have the exact same problem with a different custom Simulink environment. Have you solved it yet?


Answers (1)

Muhammad Fairuz Abdul Jalal on 24 Mar 2023
Edited: Muhammad Fairuz Abdul Jalal on 24 Mar 2023
I believe one of the main reasons is the hyperparameter settings (L2 regularization factor, learning rate, clipping threshold, etc.). You need to do some investigation into which values suit your model best.
For instance, my training only worked well when the learning rate was set to 3e-04; other values were not optimal.
In my case, I have seen PPO give quite low rewards for a couple of thousand episodes and only roughly start picking up after 4k or 5k episodes. Your model may behave differently.
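As an example, these are the kinds of settings I mean; the values below are only common starting points, not guaranteed to suit your model, and are the parameters you would sweep:
reprOpts = rlRepresentationOptions('LearnRate',3e-4, ...
    'GradientThreshold',1, ...
    'L2RegularizationFactor',1e-4);
agentOpts = rlPPOAgentOptions('SampleTime',0.25, ...
    'ExperienceHorizon',512, ...
    'MiniBatchSize',64, ...
    'ClipFactor',0.2, ...            % clipping threshold
    'EntropyLossWeight',0.01, ...
    'AdvantageEstimateMethod','gae', ...
    'GAEFactor',0.95, ...
    'DiscountFactor',0.99);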
  3 Comments
Muhammad Fairuz Abdul Jalal
Hi @Rawan.
I did it by changing the hyperparameters manually and plotting the graphs to compare them.
I followed a process similar to the journal paper by Andrychowicz et al., "What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study". It also gives tips and suggestions on the ranges you can start from.
This paper really helped me look into PPO hyperparameter tuning in detail. I really hope it helps you in some way.
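A minimal sketch of that kind of manual sweep: createAgent below is a placeholder for your own actor/critic/agent construction, and env is your Simulink environment, so adapt it to your setup.
learnRates = [1e-4 3e-4 1e-3];
results = cell(size(learnRates));
for k = 1:numel(learnRates)
    agent = createAgent(learnRates(k));      % placeholder: build the agent with this learning rate
    trainOpts = rlTrainingOptions('MaxEpisodes',5000, ...
        'MaxStepsPerEpisode',200, ...
        'StopTrainingCriteria','EpisodeCount', ...
        'StopTrainingValue',5000, ...
        'Plots','none');
    results{k} = train(agent,env,trainOpts); % returns the episode statistics
end
% compare the averaged reward curves of the different runs
figure; hold on
for k = 1:numel(learnRates)
    plot(results{k}.AverageReward,'DisplayName',sprintf('LearnRate = %g',learnRates(k)));
end
xlabel('Episode'); ylabel('Average reward'); legend show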
Sourabh on 10 Dec 2023
This might be a silly doubt, but doesn't PPO require the actor network to output both a mean and a standard deviation? Here there is only a single output path in the actor network.
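For reference, a continuous Gaussian PPO actor built with rlStochasticActorRepresentation usually has a bounded mean path and a nonnegative standard-deviation path concatenated into one output. A minimal non-recurrent sketch (the layer names and the maxThrust bound are assumptions, not taken from the code above):
% common feature path
commonPath = [
    featureInputLayer(numObs,'Normalization','none','Name','obs')
    fullyConnectedLayer(64,'Name','fcCommon')
    reluLayer('Name','reluCommon')];
% mean path: bounded to the action range with tanh + scaling
meanPath = [
    fullyConnectedLayer(numAct,'Name','fcMean')
    tanhLayer('Name','tanhMean')
    scalingLayer('Name','scaleMean','Scale',maxThrust)];   % maxThrust: assumed action bound
% standard-deviation path: softplus keeps the values positive
stdPath = [
    fullyConnectedLayer(numAct,'Name','fcStd')
    softplusLayer('Name','splusStd')];
net = layerGraph(commonPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);
net = addLayers(net,concatenationLayer(1,2,'Name','meanStd'));  % output = [mean; std]
net = connectLayers(net,'reluCommon','fcMean');
net = connectLayers(net,'reluCommon','fcStd');
net = connectLayers(net,'tanhMean','meanStd/in1');
net = connectLayers(net,'splusStd','meanStd/in2');
actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation','obs',opts1);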


Version

R2021b
