PPO issue with Nans
I am working on a project in which we simulate a "flock" and insert a rogue agent that is trained to match the flock's behavior. Right now there is a simple reward function based on matching the positions of the agents. The problem I am having is that after some number of episodes (the number changes when I adjust parameters such as the learning rate and clip factor), NaNs are introduced. I have verified that they are not created inside the plant or in any code I introduced, so they appear to come from the policy itself.
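One way I have been checking where the NaNs appear (a sketch, assuming the PPO agent variable `agentA` created below, and that the learnable parameters come back as dlarrays) is to inspect the actor's weights after training:

```matlab
% Sketch: check whether any of the actor network's learnable parameters
% have become NaN. Assumes an already-created agent variable agentA.
actor = getActor(agentA);                 % extract the actor from the agent
params = getLearnableParameters(actor);   % cell array of parameter arrays
for k = 1:numel(params)
    % extractdata assumes the parameters are dlarrays
    if any(isnan(extractdata(params{k})), 'all')
        fprintf("NaN found in actor parameter %d\n", k);
    end
end
```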
Train Multiple Agents to Perform Collaborative Task
This example is modified from the multi-agent training Simulink® environment example in which you train two agents to collaboratively perform the task of moving an object. I changed it to have only one agent (Rogue) that tries to interact with a flock of agents following the flocking control laws from Tanner et al. (2003).
First we set all the parameter values, such as initial conditions, mass, interaction radius, and coefficients for the control laws.
rng(10); % seed the random number generator so results are repeatable
rlCollaborativeTaskParams_Esposito % a simple script with parameter values
Open the Simulink model if desired.
mdl = "rlCollaborativeTask_esposito";
open_system(mdl)
Environment
% Number of observations
numObs = N*4+2; % I think this comes from the sum of all the states (4 per agent) plus the inputs (Fx and Fy)
% Number of actions
numAct = 2; %inputs
% I/O specifications for each agent
oinfo = rlNumericSpec([numObs,1]);
ainfo = rlNumericSpec([numAct,1], ...
UpperLimit= maxU, ...
LowerLimit= -maxU);
oinfo.Name = "observations";
ainfo.Name = "forces";
blks = ["rlCollaborativeTask_esposito/Agent A"];
obsInfos = oinfo;
actInfos = ainfo;
env = rlSimulinkEnv(mdl,blks,obsInfos,actInfos);
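Before training, it can help to confirm that the specifications match the model. The toolbox function validateEnvironment runs a short simulation and errors if the observation or action dimensions disagree:

```matlab
% Optional sanity check: briefly simulate the environment to confirm
% that oinfo/ainfo match what the Simulink model actually produces.
validateEnvironment(env)
```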
The reset function resetRobots_Esposito calls rlCollaborativeTaskParams_Esposito, which ensures that the robots start from random initial positions at the beginning of each episode. A plotting function is also called inside it.
env.ResetFcn = @(in) resetRobots_Esposito(in,R,boundaryR, N);
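For reference, the reset function follows the usual Simulink ResetFcn pattern. A minimal sketch (the variable names x0 and y0 are illustrative, not the ones in my script, and R is unused here) looks like:

```matlab
% Sketch of a ResetFcn: `in` is a Simulink.SimulationInput object; the
% function randomizes initial positions inside the boundary and returns
% the modified input. Variable names x0/y0 are illustrative.
function in = resetRobotsSketch(in, R, boundaryR, N)
    theta = 2*pi*rand(N,1);            % random angles
    r = boundaryR*sqrt(rand(N,1));     % random radii, uniform over the disk
    in = setVariable(in, "x0", r.*cos(theta));
    in = setVariable(in, "y0", r.*sin(theta));
end
```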
Agents
This example uses a Proximal Policy Optimization (PPO) agent with a continuous action space. The agent applies external forces on the robot that result in motion. To learn more about PPO agents, see Proximal Policy Optimization Agents.
The agent collects experiences until the experience horizon is reached. After trajectory completion, the agent learns from mini-batches of experiences. An objective function clip factor of 0.2 is used to improve training stability, and a discount factor of 0.99 is used to encourage long-term rewards.
Specify the agent options for this example.
agentOptions = rlPPOAgentOptions(...
ExperienceHorizon=600,...
ClipFactor=0.2,...
EntropyLossWeight=0.01,...
MiniBatchSize=300,...
NumEpoch=4,...
AdvantageEstimateMethod="gae",...
GAEFactor=0.95,...
SampleTime=Ts,...
DiscountFactor=0.99);
Set the learning rate for the actor and critic.
agentOptions.ActorOptimizerOptions.LearnRate = .00001;
agentOptions.CriticOptimizerOptions.LearnRate = .00001;
The action limits are already enforced through the UpperLimit and LowerLimit of ainfo above, so they do not need to be set on the actor separately (and at this point the actor does not exist yet).
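One option worth trying against the NaN problem is gradient clipping through the optimizer options. GradientThreshold is a standard rlOptimizerOptions property; the value 1 below is only an illustrative starting point:

```matlab
% Optional: clip gradients to reduce the chance of the policy update
% diverging into NaNs. The threshold of 1 is an illustrative value.
agentOptions.ActorOptimizerOptions.GradientThreshold = 1;
agentOptions.CriticOptimizerOptions.GradientThreshold = 1;
```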
Create the agents using the default agent creation syntax. For more information see rlPPOAgent.
agentA = rlPPOAgent(oinfo, ainfo, ...
rlAgentInitializationOptions(NumHiddenUnit= 20), agentOptions);
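After creating the agent, a quick sanity check (a sketch; the random observation is just a placeholder) is to evaluate the untrained policy once and confirm the action is finite:

```matlab
% Evaluate the untrained policy on a placeholder observation.
obs = {rand(numObs,1)};          % random observation of the right size
act = getAction(agentA, obs);    % returns a cell array of actions
assert(all(isfinite(act{1})), "Initial policy already produces non-finite actions")
```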
Training
With only one agent, I modified this significantly from the example. For more information on multi-agent training, type help rlMultiAgentTrainingOptions. That example includes options like this:
opts = rlMultiAgentTrainingOptions( ...
    AgentGroups={[1,2], 4, [3,5]}, ...
    LearningStrategy=["centralized","decentralized","centralized"])
But with only one agent you use rlTrainingOptions. (NOTE: I kept getting an error about parallelization options.)
trainOpts = rlTrainingOptions(...
MaxEpisodes=5000,...
MaxStepsPerEpisode=100,...
ScoreAveragingWindowLength=30,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=200);
Train the agent using the train function. Training can take several hours to complete depending on the available computational power. To save time, load the MAT-file that contains a pretrained agent. To train the agent yourself, set doTraining to true.
doTraining = true;
if doTraining
trainResults = train(agentA,env,trainOpts);
else
load("TrainedRogueToMatchHeading.mat");
end
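After (or instead of) training, the agent can be simulated in the environment to inspect its behavior. rlSimulationOptions and sim are the standard toolbox functions; the step count below is illustrative:

```matlab
% Simulate the (pre)trained agent for one episode to inspect its behavior.
simOpts = rlSimulationOptions(MaxSteps=100);   % illustrative step count
experience = sim(env, agentA, simOpts);
```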