Soft Actor Critic loses learning after some training

Wilson Salomão on 11 Oct 2021
Answered: Alan on 29 May 2024
I'm using the soft actor critic agent from MATLAB in a custom environment, and I am observing that it gets really bad after some training time.
My custom environment simulates a robot moving in a scenario with obstacles. If the robot collides with one of the obstacles, the episode terminates with a reward of -100, and if the robot reaches a target spot in the scenario, the episode terminates with a reward of +100. The agent seems to learn how to get the reward, but then loses all learning after some iterations.
This is a screenshot from one example:
Can someone help me to understand what is happening?

Answers (1)

Alan on 29 May 2024
Hi Wilson,
It looks like the robot is heading straight into obstacles, which terminates the episode and settles the average reward at -100 at the end of the curve. There are two areas you could focus on to improve the SAC agent's learning:
I. Reward Shaping
The reward function is an important factor that determines whether the SAC agent can learn effectively. You could add several components to the reward function:
  1. Provide a higher reward when the robot moves closer to the expected trajectory. This would incentivize the robot to take the desired route.
  2. Provide a small constant penalty for each step. This would encourage the robot to reach the desired target in as few steps as possible.
  3. Provide a small penalty proportional to the power expended by the robot, so that it reaches the target with minimal energy. This could be calculated from the torques applied by the robot during each step.
Also, ensure that the gradient of the reward function points away from obstacles and towards the desired trajectory. To get a better idea of reward shaping, refer to module 2.3 of the Reinforcement Learning Onramp, a free training course: https://matlabacademy.mathworks.com/details/reinforcement-learning-onramp/reinforcementlearning
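As an illustration, here is a minimal sketch of such a shaped reward function. The variable names (distToTarget, prevDistToTarget, minObstacleDist, torques) and the weighting factors are assumptions about quantities your environment already computes, so adapt them to your setup:
% Minimal sketch of a shaped reward; all inputs are assumed to be
% quantities your environment step function already computes.
function reward = shapedReward(distToTarget, prevDistToTarget, ...
        minObstacleDist, torques, isCollision, isAtTarget)
    % 1. Reward progress towards the target (positive when moving closer)
    reward = 10*(prevDistToTarget - distToTarget);
    % 2. Small constant penalty per step to encourage short paths
    reward = reward - 0.1;
    % 3. Small penalty on expended power to encourage energy efficiency
    reward = reward - 0.01*sum(torques.^2);
    % Penalty that grows near obstacles, so the reward gradient
    % points away from them
    reward = reward - 1/max(minObstacleDist, 0.1);
    % Terminal rewards from the original setup
    if isCollision
        reward = reward - 100;
    elseif isAtTarget
        reward = reward + 100;
    end
end
With a shaped reward like this, the agent receives informative feedback at every step instead of only at episode termination.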
The onramp course also includes a set of instructions that walks through reward shaping in more detail.
II. Hyperparameter Tuning
There are many hyperparameters in the rlSACAgentOptions that can be tuned. Here are a few:
  1. TargetUpdateFrequency: A target update rate that is too low or too high can lead to instability during learning.
  2. ExperienceBufferLength: Try increasing the experience buffer length so that the agent can learn from a wider variety of experiences. This also helps ensure consistent performance across different states of the environment. (A snippet that sets both of these options appears after the entropy example below.)
  3. EntropyWeightOptions: The larger the entropy weight, the more the agent explores. The options related to entropy can be modified in the EntropyWeightOptions property. The following is an example snippet:
% actionSpaceDimension is the number of elements in the action vector
entropyOptions = rl.option.EntropyWeightOptions( ...
    'EntropyWeight', 1, ...                      % initial entropy weight (higher = more exploration)
    'LearnRate', 3e-4, ...                       % learning rate for the entropy weight
    'TargetEntropy', -actionSpaceDimension, ...  % common heuristic: negative of the action dimension
    'Algorithm', 'adam', ...
    'GradientThreshold', Inf);

agentOptions = rlSACAgentOptions( ...
    'EntropyWeightOptions', entropyOptions, ...
    ... % Other SAC agent options
    );
The reason why the target entropy is set to the negative of the size of the action space is explained here: https://stats.stackexchange.com/questions/561624/choosing-target-entropy-for-soft-actor-critic-sac-algorithm
Other parameters can be explored on the documentation page for rlSACAgentOptions: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rlsacagentoptions.html
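For instance, the options from points 1 and 2 above can be set on the same agentOptions object. This is only a sketch; the values are illustrative starting points rather than tested recommendations:
% Illustrative starting values, not tested recommendations
agentOptions.TargetUpdateFrequency = 1;    % how often the target critics are updated
agentOptions.ExperienceBufferLength = 1e6; % a larger buffer gives more diverse replay data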
The agentOptions object can then be supplied to the rlSACAgent function when creating the SAC agent:
agent = rlSACAgent(actor, critic, agentOptions);
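As a usage example, training might then look like the following sketch. The environment variable env and the training option values are assumptions for illustration:
% Assumes env is your custom robot environment object
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 5000, ...
    'MaxStepsPerEpisode', 500, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 90);
trainingStats = train(agent, env, trainOpts);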
I hope this helped!

Version: R2021a
