Soft Actor Critic loses learning after some training

Wilson Salomão on 11 Oct 2021
Answered: Alan on 29 May 2024
I'm using the soft actor critic agent from MATLAB in a custom environment, and I am observing that it gets really bad after some training time.
My custom environment simulates a robot moving in a scenario with obstacles. If the robot collides with one of the obstacles, the episode terminates with a reward of -100, and if the robot reaches a target spot in the scenario, the episode terminates with a reward of +100. The agent seems to learn how to get the reward, but then loses all learning after some iterations.
This is a screenshot from one example:
Can someone help me to understand what is happening?

Answers (1)

Alan on 29 May 2024
Hi Wilson,
It looks like the robot is heading straight into obstacles, which terminates the episode and settles the average reward at -100 at the end of the curve. There are two areas you could focus on to improve the SAC agent's learning:
I. Reward Shaping
The reward function is an important factor that determines whether the SAC agent can learn effectively. You could add several components to the reward function:
  1. Provide a higher reward when the robot moves closer to the expected trajectory. This would incentivize the robot to take the desired route.
  2. Provide a small constant penalty for each step. This would encourage the robot to reach the desired target in as few steps as possible.
  3. Provide a small penalty proportional to the power expended by the robot, so that it reaches the target with minimal energy. This could be calculated from the torques applied by the robot during each step.
Also, ensure that the gradient of the reward function points away from obstacles and towards the desired trajectory. To get a better idea of reward shaping, refer to module 2.3 of the Reinforcement Learning Onramp, a free training course: https://matlabacademy.mathworks.com/details/reinforcement-learning-onramp/reinforcementlearning
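As an illustration, here is a minimal sketch of such a shaped reward function. The variable names (distToTarget, prevDistToTarget, minObstacleDist, torques) and the weighting factors are assumptions about quantities your environment already computes, so adapt them to your setup:
% Minimal sketch of a shaped reward; all inputs are assumed to be
% quantities your environment step function already computes.
function reward = shapedReward(distToTarget, prevDistToTarget, ...
        minObstacleDist, torques, isCollision, isAtTarget)
    % 1. Reward progress towards the target (positive when moving closer)
    reward = 10*(prevDistToTarget - distToTarget);
    % 2. Small constant penalty per step to encourage short paths
    reward = reward - 0.1;
    % 3. Small penalty on expended power to encourage energy efficiency
    reward = reward - 0.01*sum(torques.^2);
    % Penalty that grows near obstacles, so the reward gradient
    % points away from them
    reward = reward - 1/max(minObstacleDist, 0.1);
    % Terminal rewards from the original setup
    if isCollision
        reward = reward - 100;
    elseif isAtTarget
        reward = reward + 100;
    end
end
With a shaped reward like this, the agent receives informative feedback at every step instead of only at episode termination.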
The onramp course also includes a set of instructions that walks through reward shaping in more detail.
II. Hyperparameter Tuning
There are many hyperparameters in the rlSACAgentOptions that can be tuned. Here are a few:
  1. TargetUpdateFrequency: A target update rate that is too low or too high can lead to instability during learning.
  2. ExperienceBufferLength: Try increasing the experience buffer length so that the agent can learn from a wider variety of experiences. This also helps ensure consistent performance across different states of the environment. (A snippet that sets both of these options appears after the entropy example below.)
  3. EntropyWeightOptions: The larger the entropy weight, the more the agent explores. The options related to entropy can be modified in the EntropyWeightOptions property. The following is an example snippet:
% actionSpaceDimension is the number of elements in the action vector
entropyOptions = rl.option.EntropyWeightOptions( ...
    'EntropyWeight', 1, ...                      % initial entropy weight (higher = more exploration)
    'LearnRate', 3e-4, ...                       % learning rate for the entropy weight
    'TargetEntropy', -actionSpaceDimension, ...  % common heuristic: negative of the action dimension
    'Algorithm', 'adam', ...
    'GradientThreshold', Inf);

agentOptions = rlSACAgentOptions( ...
    'EntropyWeightOptions', entropyOptions, ...
    ... % Other SAC agent options
    );
The reason why the target entropy is set to the negative of the size of the action space is explained here: https://stats.stackexchange.com/questions/561624/choosing-target-entropy-for-soft-actor-critic-sac-algorithm
Other parameters can be explored on the documentation page for rlSACAgentOptions: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rlsacagentoptions.html
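For instance, the options from points 1 and 2 above can be set on the same agentOptions object. This is only a sketch; the values are illustrative starting points rather than tested recommendations:
% Illustrative starting values, not tested recommendations
agentOptions.TargetUpdateFrequency = 1;    % how often the target critics are updated
agentOptions.ExperienceBufferLength = 1e6; % a larger buffer gives more diverse replay data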
The agentOptions object can then be supplied to the rlSACAgent function when creating the SAC agent:
agent = rlSACAgent(actor, critic, agentOptions);
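As a usage example, training might then look like the following sketch. The environment variable env and the training option values are assumptions for illustration:
% Assumes env is your custom robot environment object
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 5000, ...
    'MaxStepsPerEpisode', 500, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 90);
trainingStats = train(agent, env, trainOpts);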
I hope this helped!

Version: R2021a
