RL Toolbox: Proximal Policy Optimisation

Robert Gordon on 8 Aug 2019
Commented: Weihao Yuan on 22 Aug 2020
I just wanted to ask if anyone is aware of a proximal policy optimisation (PPO) reinforcement learning implementation available for the MATLAB RL Toolbox. I know that you can create a custom agent class, but I wanted to see if anyone else has implemented it before?
Thanks!

Answers (1)

Emmanouil Tzorakoleftherakis
Hi Robert,
Reinforcement Learning Toolbox in R2019b has a PPO implementation for discrete action spaces. Future releases will include continuous action spaces as well.
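For reference, here is a minimal sketch of what setting up a discrete-action PPO agent can look like, using the built-in cart-pole environment as a stand-in for your own model; the layer sizes and option values are illustrative, and the representation syntax shown is the one introduced in R2020a:

% minimal sketch: discrete-action PPO agent on the predefined cart-pole environment
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numAct = numel(actInfo.Elements);
% critic estimates the state value V(s), so it only takes the observation as input
criticNet = [imageInputLayer([numObs 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(64,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput')];
critic = rlValueRepresentation(criticNet,obsInfo,'Observation',{'state'});
% actor has one output per element of the discrete action set
actorNet = [imageInputLayer([numObs 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(64,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(numAct,'Name','ActorOutput')];
actor = rlStochasticActorRepresentation(actorNet,obsInfo,actInfo,'Observation',{'state'});
% PPO-specific options (clip factor, epochs per update, etc.)
agentOpts = rlPPOAgentOptions('ExperienceHorizon',512,'ClipFactor',0.2, ...
    'EntropyLossWeight',0.01,'MiniBatchSize',64,'NumEpoch',3);
agent = rlPPOAgent(actor,critic,agentOpts);

The agent can then be trained as usual with train(agent,env,rlTrainingOptions(...)).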
I hope this helps.
  6 Comments
Camilo Manrique on 26 Mar 2020
Edited: Camilo Manrique on 26 Mar 2020
It worked indeed, you are right. I completely forgot that PPO uses a value function approach instead of the Q-value function used with DDPG. Thank you very much for your help.
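To make the distinction concrete, here is a minimal sketch of the two critic types (using the R2020a representation objects and, for illustration, the 3-by-1 observation spec and scalar action spec from the ACC example discussed below):

% PPO critic: state-value V(s), observation input only
vNet = [imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(64,'Name','fc')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','value')];
vCritic = rlValueRepresentation(vNet,observationInfo,'Observation',{'observation'});
% DDPG critic: action-value Q(s,a), observation and action inputs
obsPath = [imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(64,'Name','obsFC')];
actPath = [imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(64,'Name','actFC')];
comPath = [additionLayer(2,'Name','add')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','qvalue')];
qNet = layerGraph(obsPath);
qNet = addLayers(qNet,actPath);
qNet = addLayers(qNet,comPath);
qNet = connectLayers(qNet,'obsFC','add/in1');
qNet = connectLayers(qNet,'actFC','add/in2');
qCritic = rlQValueRepresentation(qNet,observationInfo,actionInfo, ...
    'Observation',{'observation'},'Action',{'action'});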
Weihao Yuan on 22 Aug 2020
Hi Emmanouil, I encountered a similar problem when applying PPO to the ACC model from the DDPG example.
Environment
mdl = 'rlACCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];
% create the observation info
observationInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
observationInfo.Name = 'observations';
observationInfo.Description = 'information on velocity error and ego velocity';
% create the action info
actionInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actionInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
Critic
predefinedWeightsandBiases = false;
if predefinedWeightsandBiases
    load('PredefinedWeightsAndBiases.mat');
else
    createNetworkWeights;
end
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(200,'Name','CriticFC1', ...
        'Weights',weights.criticFC1, ...
        'Bias',bias.criticFC1)
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(100,'Name','CriticFC2', ...
        'Weights',weights.criticFC2, ...
        'Bias',bias.criticFC2)
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(1,'Name','CriticOutput', ...
        'Weights',weights.criticOut, ...
        'Bias',bias.criticOut)];
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor',1e-4);
critic = rlValueRepresentation(criticNetwork,observationInfo, ...
    'Observation',{'observation'},criticOptions);
Actor
% observation path layers (3 by 1 input and a 2 by 1 output)
actorNetwork = [imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(2,'Name','infc')];
% path layers for the mean value (2 by 1 input and 2 by 1 output)
% using scalingLayer to scale the range
meanPath = [tanhLayer('Name','tanh')
    scalingLayer('Name','ActorScaling','Scale',2.5,'Bias',-0.5)];
% path layers for the variance (2 by 1 input and output)
% using softplusLayer to make it nonnegative
variancePath = softplusLayer('Name','Softplus');
% concatenate the two inputs (along dimension #3) to form a single (4 by 1) output layer
outLayer = concatenationLayer(3,2,'Name','gaussPars');
% add layers to the network object
net = layerGraph(actorNetwork);
net = addLayers(net,meanPath);
net = addLayers(net,variancePath);
net = addLayers(net,outLayer);
% connect layers
net = connectLayers(net,'infc','tanh/in');               % connect output of inPath to meanPath input
net = connectLayers(net,'infc','Softplus/in');           % connect output of inPath to variancePath input
net = connectLayers(net,'ActorScaling','gaussPars/in1'); % connect output of meanPath to gaussPars input #1
net = connectLayers(net,'Softplus','gaussPars/in2');     % connect output of variancePath to gaussPars input #2
% plot network
plot(net)
However, the agent stopped training at the 50th episode:
Error
Error using rl.env.AbstractEnv/simWithPolicy (line 70)
An error occurred while simulating "rlACCMdl" with the agent "agent".
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Caused by:
Error using rl.env.SimulinkEnvWithAgent>localHandleSimoutErrors (line 689)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error using rl.env.SimulinkEnvWithAgent>localHandleSimoutErrors (line 689)
Standard deviation must be nonnegative. Ensure your representation always outputs nonnegative values for outputs that correspond to the standard deviation.
I tried to find the cause of this bug but failed. I would really appreciate it if you could look into it for me. Thanks a lot.
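For comparison, here is a minimal sketch of an actor network for this single acceleration action in which only a softplus-constrained path feeds the output that the agent reads as the standard deviation (the hidden-layer width is illustrative):

% sketch only: one scaled mean output and one nonnegative standard-deviation output
obsPath = [imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(64,'Name','fc')
    reluLayer('Name','relu')];
meanPath = [fullyConnectedLayer(1,'Name','meanFC')
    tanhLayer('Name','tanh')
    scalingLayer('Name','ActorScaling','Scale',2.5,'Bias',-0.5)]; % maps tanh output to [-3,2]
stdPath = [fullyConnectedLayer(1,'Name','stdFC')
    softplusLayer('Name','Softplus')]; % keeps the standard deviation nonnegative
outLayer = concatenationLayer(3,2,'Name','gaussPars'); % [mean; std]
net = layerGraph(obsPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);
net = addLayers(net,outLayer);
net = connectLayers(net,'relu','meanFC');
net = connectLayers(net,'relu','stdFC');
net = connectLayers(net,'ActorScaling','gaussPars/in1');
net = connectLayers(net,'Softplus','gaussPars/in2');
actor = rlStochasticActorRepresentation(net,observationInfo,actionInfo, ...
    'Observation',{'observation'});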
