# Generate Reward Function from a Model Predictive Controller for a Servomotor

This example shows how to automatically generate a reward function from cost and constraint specifications defined in a model predictive controller object. You then use the generated reward function to train a reinforcement learning agent.

### Introduction

You can use the `generateRewardFunction` function to generate a reward function for reinforcement learning, starting from the cost and constraints specified in a model predictive controller. The resulting reward signal is the negative of the sum of the cost (as defined by the objective function) and the constraint violation penalties, evaluated at the current state of the environment.

This example is based on the DC Servomotor with Constraint on Unmeasured Output (Model Predictive Control Toolbox) example, in which you design a model predictive controller for a DC servomechanism under voltage and shaft torque constraints. Here, you convert the cost and constraint specifications defined in the `mpc` object into a reward function and use it to train an agent to control the servomotor.

Open the Simulink model for this example, which is based on the above MPC example but has been modified for reinforcement learning.

```
mdl = "rl_motor";
open_system(mdl);
```

### Create Model Predictive Controller

Use a helper function to create the open-loop dynamic model of the motor, `plant`, and the maximum admissible torque, `tau`.

`[plant,tau] = mpcmotormodel;`

Specify input and output signal types for the MPC controller. The shaft angular position is measured as the first output. The second output, torque, is unmeasured.

`plant = setmpcsignals(plant, MV=1, MO=1, UO=2);`

Specify constraints on the manipulated variable, and define a scale factor.

`MV = struct(Min=-220, Max=220, ScaleFactor=440);`

Impose torque constraints during the first three prediction steps, and specify scale factors for both shaft position and torque.

```
OV = struct(Min={-Inf, [-tau;-tau;-tau;-Inf]}, ...
            Max={Inf, [tau;tau;tau;Inf]}, ...
            ScaleFactor={2*pi, 2*tau});
```

Specify weights for the quadratic cost function to achieve angular position tracking. Set the weight for the torque to zero, thereby allowing it to float within its constraint.

`Weights = struct(MV=0, MVRate=0.1, OV=[0.1 0]);`

Create an MPC controller for the `plant` model with a sample time of `0.1` s, a prediction horizon of `10` steps, and a control horizon of `2` steps, using the previously defined structures for the weights, manipulated variables, and output variables.

`mpcobj = mpc(plant, 0.1, 10, 2, Weights, MV, OV);`

Display the controller specifications.

`mpcobj`
```
MPC object (created on 03-Mar-2023 10:25:44):
---------------------------------------------
Sampling time:      0.1 (seconds)
Prediction Horizon: 10
Control Horizon:    2

Plant Model:
                                      --------------
      1  manipulated variable(s)   -->|  4 states  |
                                      |            |-->  1 measured output(s)
      0  measured disturbance(s)   -->|  1 inputs  |
                                      |            |-->  1 unmeasured output(s)
      0  unmeasured disturbance(s) -->|  2 outputs |
                                      --------------
Indices:
  (input vector)    Manipulated variables: [1 ]
  (output vector)        Measured outputs: [1 ]
                       Unmeasured outputs: [2 ]

Disturbance and Noise Models:
        Output disturbance model: default (type "getoutdist(mpcobj)" for details)
         Measurement noise model: default (unity gain after scaling)

Weights:
        ManipulatedVariables: 0
    ManipulatedVariablesRate: 0.1000
             OutputVariables: [0.1000 0]
                         ECR: 10000

State Estimation:  Default Kalman Filter (type "getEstimator(mpcobj)" for details)

Constraints:
 -220 <= MV1 (V) <= 220, MV1/rate (V) is unconstrained, MO1 (rad) is unconstrained
 -78.54 <= UO1 (Nm)(t+1) <= 78.54
 -78.54 <= UO1 (Nm)(t+2) <= 78.54
 -78.54 <= UO1 (Nm)(t+3) <= 78.54
           UO1 (Nm)(t+4) is unconstrained

Use built-in "active-set" QP solver with MaxIterations of 120.
```

The controller operates on a plant with 4 states, 1 input (voltage) and 2 output signals (angle and torque) and has the following specifications:

• The cost function weights for the manipulated variable, manipulated variable rate and output variables are 0, 0.1 and [0.1 0] respectively.

• The manipulated variable is constrained between -220V and 220V.

• The manipulated variable rate is unconstrained.

• The first output variable (angle) is unconstrained but the second (torque) is constrained between -78.54 Nm and 78.54 Nm in the first three prediction time steps and unconstrained in the fourth step.

Note that for reinforcement learning, only the constraint specifications from the first prediction time step are used, because the reward is computed for a single time step.
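The reduction to the first prediction step can be sketched in plain MATLAB. The snippet below is an illustration only: it rebuilds the output bounds from this example as cell arrays and keeps the first entry of each horizon-varying bound. The variable names `ovMin`, `ovMax`, `yminStep1`, and `ymaxStep1` are hypothetical, and `tau` is hard-coded to the value shown in the controller display.

```matlab
% Illustrative sketch: keep only the first-step bound of each output.
tau = 78.5398163397448;                 % torque limit (Nm), from the controller display

% Bounds per output; the torque bound varies over the prediction horizon.
ovMin = {-Inf, [-tau; -tau; -tau; -Inf]};
ovMax = { Inf, [ tau;  tau;  tau;  Inf]};

% A single-step reward can only use the bound at the first prediction step.
yminStep1 = cellfun(@(v) v(1), ovMin);  % [-Inf -tau]
ymaxStep1 = cellfun(@(v) v(1), ovMax);  % [ Inf  tau]

disp([yminStep1; ymaxStep1])
```

These values correspond to the `ymin` and `ymax` vectors that appear in the reward function file shown later in this example.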

### Generate the Reward Function

Generate the reward function code from specifications in the `mpc` object using `generateRewardFunction`. The code is displayed in the MATLAB Editor.

```generateRewardFunction(mpcobj) ```

The generated reward function is a starting point for reward design. You can modify the function with different penalty function choices and tune the weights. For this example, make the following changes to the generated code:

• Scale the original cost weights `Qy` and `Qmvrate` by factors of `50` and `10`, respectively.

• Scale the penalty weights `Wy`, `Wmv`, and `Wmvrate` by factors of `1e-2`, `1e-3`, and `1e-3`, respectively.

• The default exterior penalty function method is `step`. Change the method to `quadratic`.

After you make changes, the cost and penalty specifications should be as follows:

```
Qy = 50 * [0.1 0];
Qmv = 0;
Qmvrate = 10 * 0.1;

Wy = 1e-2 * [1 1];
Wmv = 1e-3;
Wmvrate = 1e-3;

Py = Wy * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
```
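To see why the choice of penalty method matters, the following sketch compares a step and a quadratic exterior penalty on a torque value slightly outside its bound. The anonymous functions `stepPenalty` and `quadPenalty` are hypothetical stand-ins written for this illustration, not the toolbox implementation of `exteriorPenalty`. They assume that the step method returns 1 for any violated bound and that the quadratic method returns the squared distance outside the bound.

```matlab
% Hypothetical stand-ins for the 'step' and 'quadratic' exterior penalties.
stepPenalty = @(x,lo,hi) double(x < lo | x > hi);
quadPenalty = @(x,lo,hi) max(lo - x, 0).^2 + max(x - hi, 0).^2;

tau  = 78.5398163397448;      % torque limit (Nm)
ymin = [-Inf; -tau];          % angle unconstrained, torque bounded
ymax = [ Inf;  tau];
Wy   = 1e-2 * [1 1];          % penalty weights used in this example

y = [0.5; 80];                % torque exceeds its upper bound by about 1.46 Nm

PyStep = Wy * stepPenalty(y, ymin, ymax);   % constant, regardless of violation size
PyQuad = Wy * quadPenalty(y, ymin, ymax);   % grows with the square of the violation

fprintf("step: %g, quadratic: %g\n", PyStep, PyQuad);
```

Under these assumptions, the step penalty is the same for a small and a large violation, whereas the quadratic penalty increases with the size of the violation, which tells the agent how far it is from the feasible region.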

For this example, the modified code has been saved in the MATLAB function file `rewardFunctionMpc.m`. Display the modified reward function.

`type rewardFunctionMpc.m`
```
function reward = rewardFunctionMpc(y,refy,mv,refmv,lastmv)
% REWARDFUNCTIONMPC generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y : Output variable from plant at step k+1
% refy : Reference output variable at step k+1
% mv : Manipulated variable at step k
% refmv : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
% - Reward computed based on first step in prediction horizon.
%   Therefore, signal previewing and control horizon settings are ignored.
% - Online cost and constraint update is not supported.
% - Custom cost and constraint specifications are not considered.
% - Time varying cost weights and constraints are not supported.
% - Mixed constraint specifications are not considered (for the MPC case).

% Reinforcement Learning Toolbox
% 02-Jun-2021 16:05:41

%#codegen

%% Specifications from MPC object
% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -78.5398163397448];
ymax = [Inf 78.5398163397448];
mvmin = -220;
mvmax = 220;
mvratemin = -Inf;
mvratemax = Inf;

% Scale factors as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
Sy = [6.28318530717959 157.07963267949];
Smv = 440;

% Standard cost weights as specified in 'Weights' property
Qy = 50 * [0.1 0];
Qmv = 0;
Qmvrate = 10 * 0.1;

%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;

%% Penalty function weight (specify nonnegative)
Wy = 1e-2 * [1 1];
Wmv = 1e-3;
Wmvrate = 1e-3;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
Penalty = Py + Pmv + Pmvrate;

%% Compute reward
reward = -(Cost + Penalty);
end
```
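In equation form, the reward computed by `rewardFunctionMpc.m` can be summarized as follows. The symbols are notation introduced here for readability and mirror the variable names in the code: $S_y$ and $S_{mv}$ are the scale factors, $Q$ denotes the cost weights, $W$ denotes the penalty weights, $p(\cdot)$ is the exterior penalty function, and $\oslash$ denotes element-wise division.

$$
\begin{aligned}
\Delta y &= (y_{\mathrm{ref}} - y) \oslash S_y, \qquad
\Delta \mathit{mv} = \frac{\mathit{mv}_{\mathrm{ref}} - \mathit{mv}}{S_{mv}}, \qquad
\Delta \mathit{mv}_{\mathrm{rate}} = \frac{\mathit{mv} - \mathit{mv}_{\mathrm{last}}}{S_{mv}} \\
J &= \Delta y^{\top} \operatorname{diag}(Q_y^{2})\, \Delta y
   + Q_{mv}^{2}\, \Delta \mathit{mv}^{2}
   + Q_{mvrate}^{2}\, \Delta \mathit{mv}_{\mathrm{rate}}^{2} \\
P &= W_y\, p(y;\, y_{\min}, y_{\max})
   + W_{mv}\, p(\mathit{mv};\, \mathit{mv}_{\min}, \mathit{mv}_{\max})
   + W_{mvrate}\, p(\mathit{mv} - \mathit{mv}_{\mathrm{last}};\, \mathit{mvrate}_{\min}, \mathit{mvrate}_{\max}) \\
r &= -(J + P)
\end{aligned}
$$

Note that the weights enter the cost squared, so scaling `Qy` by `50` scales the corresponding cost term by `2500`.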

To integrate this reward function, open the MATLAB Function block in the Simulink model.

`open_system("rl_motor/Reward Function")`

Append the following line of code to the function and save the model.

`r = rewardFunctionMpc(y,refy,mv,refmv,lastmv);`

The MATLAB Function block will now execute `rewardFunctionMpc.m` during simulation.

For this example, the MATLAB Function block has already been modified and saved.
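As a quick sanity check of the cost term, the sketch below reproduces the tracking-cost computation from `rewardFunctionMpc.m` for a single hypothetical sample: an angle reference of $\pi$ rad with the shaft at rest at zero. The sample values are chosen for illustration and do not come from the simulation.

```matlab
% Scale factors and output weights as they appear in rewardFunctionMpc.m
Sy = [6.28318530717959 157.07963267949];
Qy = 50 * [0.1 0];

% Hypothetical sample: angle reference of pi rad, angle and torque at zero
refy = [pi; 0];
y    = [0; 0];

% Scaled tracking error and quadratic output cost
dy = (refy(:) - y(:)) ./ Sy';
Jy = dy' * diag(Qy.^2) * dy;

% Assuming mv equals lastmv and no bound is violated, the step reward is -Jy
reward = -Jy;
fprintf("Jy = %.4f, reward = %.4f\n", Jy, reward);
```

Here the scaled angle error is about `0.5` and the effective weight is `Qy(1)^2 = 25`, so the cost for this sample is about `6.25`.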

### Create a Reinforcement Learning Environment

The environment dynamics are modeled in the Servomechanism subsystem. For this environment,

• The observations are the reference signals (angle and torque), output variables (angle and torque), and their integrals from the last 3 time steps. The angle and torque signals are normalized by multiplying with the gain `[0.1 1/tau]`.

• The action is the voltage $V$ applied to the servomotor. The action values are limited between `-220` and `220`.

• The sample time is $T_s = 0.1$ s.

• The total simulation time is $T_f = 20$ s.

Specify the total simulation time and sample time.

```Tf = 20; Ts = 0.1;```

Create observation and action specifications for the environment.

```
numObs = 24;
numAct = 1;
oinfo = rlNumericSpec([numObs 1]);
ainfo = rlNumericSpec([numAct 1], ...
    LowerLimit=-220, ...
    UpperLimit=220);
```

Create the reinforcement learning environment using the `rlSimulinkEnv` function.

```
blk = "rl_motor/RL Agent";
env = rlSimulinkEnv(mdl, blk, oinfo, ainfo);
```

### Create a Reinforcement Learning Agent

Fix the random seed for reproducibility.

`rng(0)`

The agent used in this example is a twin-delayed deep deterministic policy gradient (TD3) agent. TD3 agents use two parametrized Q-value function approximators to estimate the value (that is, the expected cumulative long-term reward) of the policy. To model the parametrized Q-value function within both critics, use a neural network with two inputs (the observation and action) and one output (the value of the policy when taking a given action from the state corresponding to a given observation). For more information on TD3 agents, see Twin-Delayed Deep Deterministic (TD3) Policy Gradient Agents.

Define each network path as an array of layer objects. Assign names to the input and output layers of each path. These names allow you to connect the paths and then later explicitly associate the network input and output layers with the appropriate environment channel.

```
mainPath = [
    featureInputLayer(numObs)
    fullyConnectedLayer(128)
    concatenationLayer(1, 2, Name="concat")
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)];

actionPath = [
    featureInputLayer(numAct)
    fullyConnectedLayer(8, Name="fc_act")];

% Create layerGraph object and add layers
criticNet = layerGraph(mainPath);
criticNet = addLayers(criticNet, actionPath);

% Connect Layers
criticNet = connectLayers(criticNet, "fc_act", "concat/in2");
```

Plot the critic network structure.

`plot(criticNet);`

Create the critic function objects using `rlQValueFunction`. The critic function object encapsulates the critic by wrapping around the critic deep neural network. To make sure the critics have different initial weights, explicitly initialize each network before using them to create the critics.

```
% Convert the neural network to a dlnetwork object without initializing the networks
criticdlNet = dlnetwork(criticNet, Initialize=false);

% Create the two critic functions for the TD3 agent
critic1 = rlQValueFunction(initialize(criticdlNet), oinfo, ainfo);
critic2 = rlQValueFunction(initialize(criticdlNet), oinfo, ainfo);
```

TD3 agents use a parametrized deterministic policy over continuous action spaces, which is implemented by a continuous deterministic actor. This actor takes the current observation as input and returns as output an action that is a deterministic function of the observation.

To model the parametrized policy within the actor, use a neural network with one input layer (which receives the content of the environment observation channel, as specified by `oinfo`) and one output layer (which returns the action to the environment action channel, as specified by `ainfo`).

Define the network as an array of layer objects.

```
actorNet = [
    featureInputLayer(numObs)
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(numAct)];
```

Plot the actor network.

`plot(layerGraph(actorNet));`

Create a deterministic actor function that is responsible for modeling the policy of the agent. For more information, see `rlContinuousDeterministicActor`.

```
actordlNet = dlnetwork(actorNet);
actor = rlContinuousDeterministicActor(actordlNet, oinfo, ainfo);
```
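A quick way to confirm that the actor and critics accept the environment specifications is to evaluate them on a random observation. This sketch assumes that `actor`, `critic1`, `critic2`, and `numObs` from the preceding steps are in the workspace; the variable names `obs`, `act`, `q1`, and `q2` are introduced here for illustration.

```matlab
% Random observation with the dimensions given in oinfo (illustrative input)
obs = {rand(numObs, 1)};

% The actor returns its action as a cell array
act = getAction(actor, obs);

% Each critic returns a scalar Q-value for the observation-action pair
q1 = getValue(critic1, obs, act);
q2 = getValue(critic2, obs, act);

% Separately initialized critics are expected to give different values
disp([q1 q2])
```

If either call errors, the most likely cause is a mismatch between the network input sizes and the dimensions in `oinfo` or `ainfo`.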

Specify the agent options using `rlTD3AgentOptions`. The agent trains from an experience buffer of maximum capacity `1e6` by randomly selecting mini-batches of size `256`. The discount factor of `0.995` favors long-term rewards.

```
agentOpts = rlTD3AgentOptions(SampleTime=Ts, ...
    DiscountFactor=0.995, ...
    ExperienceBufferLength=1e6, ...
    MiniBatchSize=256);
```

Specify optimizer options for the actor and critic functions. For this example, you will choose a learn rate of `1e-3` and a gradient threshold of `1` for the actor and critics.

```
% Critic optimizer options
for idx = 1:2
    agentOpts.CriticOptimizerOptions(idx).LearnRate = 1e-3;
    agentOpts.CriticOptimizerOptions(idx).GradientThreshold = 1;
end

% Actor optimizer options
agentOpts.ActorOptimizerOptions.LearnRate = 1e-3;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
```

During training, the agent explores the action space using a Gaussian action noise model. Set the standard deviation and decay rate of the noise using the `ExplorationModel` property. The noise has an initial standard deviation of `100`, which decays exponentially at a rate of `1e-5` until it reaches a minimum value of `1e-3`. This favors exploration towards the beginning of training and exploitation in later stages.

```
agentOpts.ExplorationModel.StandardDeviationMin = 1e-3;
agentOpts.ExplorationModel.StandardDeviation = 100;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
```
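To get a feel for how slowly this noise decays, the sketch below estimates how many agent steps it takes for the standard deviation to reach its minimum. It assumes that the standard deviation is multiplied by `(1 - StandardDeviationDecayRate)` at every training step and then clipped at `StandardDeviationMin`; the variable names are illustrative.

```matlab
sigma0    = 100;     % initial standard deviation
sigmaMin  = 1e-3;    % minimum standard deviation
decayRate = 1e-5;    % decay rate per training step

% Assumed update: sigma_k = max(sigma0*(1 - decayRate)^k, sigmaMin)
% Solve sigma0*(1 - decayRate)^k = sigmaMin for k.
stepsToMin = ceil(log(sigmaMin/sigma0) / log(1 - decayRate));

% Express the result in full-length episodes of ceil(Tf/Ts) steps each
Tf = 20;
Ts = 0.1;
episodesToMin = stepsToMin / ceil(Tf/Ts);

fprintf("steps: %d, episodes: %.0f\n", stepsToMin, episodesToMin);
```

Under this assumption, the minimum is reached only after roughly 1.15 million steps, that is, several thousand full-length episodes, so the exploration noise stays well above its minimum for most of a typical training run.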

Create the TD3 agent using the actor and critic function objects. For more information on TD3 agents, see `rlTD3Agent`.

`agent = rlTD3Agent(actor, [critic1,critic2], agentOpts);`

### Train the Agent

To train the agent, first specify the training options using `rlTrainingOptions`. For this example, use the following options:

• Run the training for at most `5000` episodes, with each episode lasting at most `ceil(Tf/Ts)` time steps.

• Stop the training when the agent receives an average cumulative reward greater than `-7` over `20` consecutive episodes. At this point, the agent can track the reference signal.

```
trainOpts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=ceil(Tf/Ts), ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=-7, ...
    ScoreAveragingWindowLength=20);
```

Train the agent using the `train` function. Training this agent is a computationally intensive process that may take several minutes to complete. To save time while running this example, load a pretrained agent by setting `doTraining` to `false`. To train the agent yourself, set `doTraining` to `true`.

```
doTraining = false;
if doTraining
    trainResult = train(agent, env, trainOpts);
else
    load('rlDCServomotorTD3Agent.mat')
end
```

A snapshot of the training progress is shown in the following figure. You can expect different results due to inherent randomness in the training process.

### Validate Controller Response

To validate the performance of the trained agent, simulate the model and view the response in the Scope blocks. The reinforcement learning agent is able to track the reference angle while satisfying the constraints on torque and voltage.

`sim(mdl);`
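To complement the Scope plots with a number, you can also simulate the agent against the environment object and sum the rewards over the episode. This is a sketch that assumes `env`, `agent`, `Tf`, and `Ts` are in the workspace (including the assumption that the loaded MAT-file provides a variable named `agent`); `simOpts`, `experience`, and `totalReward` are names introduced here.

```matlab
% Simulate one episode of the same length used in training
simOpts = rlSimulationOptions(MaxSteps=ceil(Tf/Ts));
experience = sim(env, agent, simOpts);

% Reward is logged as a timeseries; sum its samples for the episode return
totalReward = sum(experience.Reward.Data);
disp(totalReward)
```

You can compare this value with the `StopTrainingValue` of `-7` used as the stopping criterion during training.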