# train

Train reinforcement learning agents within a specified environment

## Syntax

``trainStats = train(env,agents)``
``trainStats = train(agents,env)``
``trainStats = train(___,trainOpts)``
``trainStats = train(agents,env,prevTrainStats)``

## Description

`trainStats = train(env,agents)` trains one or more reinforcement learning agents within a specified environment, using default training options. Although `agents` is an input argument, after each training episode, `train` updates the parameters of each agent specified in `agents` to maximize their expected long-term reward from the environment. When training terminates, `agents` reflects the state of each agent at the end of the final training episode.
`trainStats = train(agents,env)` performs the same training as the previous syntax.


`trainStats = train(___,trainOpts)` trains `agents` within `env`, using the training options object `trainOpts`. Use training options to specify training parameters such as the criteria for terminating training, when to save agents, the maximum number of episodes to train, and the maximum number of steps per episode. You can use this syntax with any of the input argument combinations in the previous syntaxes.
`trainStats = train(agents,env,prevTrainStats)` resumes training from the last values of the agent parameters and the training results contained in `prevTrainStats`, obtained after a previous call to `train`.
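A minimal call pattern, assuming an agent and a compatible environment already exist in the MATLAB workspace:

```
% Train with default options, then plot the per-episode reward
trainStats = train(agent,env);
plot(trainStats.EpisodeIndex,trainStats.EpisodeReward)
```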

## Examples


Train the agent configured in the Train PG Agent to Balance Cart-Pole System example within the corresponding environment. The observation from the environment is a vector containing the position and velocity of a cart, as well as the angular position and angular velocity of the pole. The action is a scalar with two possible values (a force of either -10 or 10 newtons applied to the cart).

Load the file containing the environment and a PG agent already configured for it.

`load RLTrainExample.mat`

Specify some training parameters using `rlTrainingOptions`. These parameters include the maximum number of episodes to train, the maximum steps per episode, and the conditions for terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode. Instruct the training to stop when the average reward over the previous five episodes reaches 500. Create a default options set and use dot notation to change some of the parameter values.

```
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;
```

During training, the `train` command can save candidate agents that give good results. Further configure the training options to save an agent when the episode reward exceeds 500. Save the agent to a folder called `savedAgents`.

```
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";
```

Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so you can observe the training progress visually.

```
trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";
```

You are now ready to train the PG agent. For the predefined cart-pole environment used in this example, you can use `plot` to generate a visualization of the cart-pole system.

`plot(env)`

When you run this example, both this visualization and the Reinforcement Learning Episode Manager update with each training episode. Place them side by side on your screen to observe the progress, and train the agent. (This computation can take 20 minutes or more.)

`trainingInfo = train(agent,env,trainOpts);`

Episode Manager shows that the training successfully reaches the termination condition of a reward of 500 averaged over the previous five episodes. At each training episode, `train` updates `agent` with the parameters learned in the previous episode. When training terminates, you can simulate the environment with the trained agent to evaluate its performance. The environment plot updates during simulation as it did during training.

```
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
```

During training, `train` saves to disk any agents that meet the condition specified with `trainOpts.SaveAgentCriteria` and `trainOpts.SaveAgentValue`. To test the performance of any of those agents, you can load the data from the data files in the folder you specified using `trainOpts.SaveAgentDirectory`, and simulate the environment with that agent.

This example shows how to set up a multi-agent training session on a Simulink® environment. In the example, you train two agents to collaboratively perform the task of moving an object.

The environment in this example is a frictionless two-dimensional surface containing elements represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots attempt to move object C outside a circular ring of radius 8 m by applying forces through collision. All elements within the environment have mass and obey Newton's laws of motion. In addition, contact forces between the elements and the environment boundaries are modeled as spring-mass-damper systems. The elements can move on the surface through the application of externally applied forces in the X and Y directions. There is no motion in the third dimension, and the total energy of the system is conserved.

Set the random seed and create the set of parameters required for this example.

```
rng(10)
rlCollaborativeTaskParams
```

Open the Simulink model.

```
mdl = "rlCollaborativeTask";
open_system(mdl)
```

For this environment:

• The two-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.

• The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.

• The agents share the same observations: the positions and velocities of A, B, and C, and the action values from the last time step.

• The simulation terminates when object C moves outside the circular ring.

• At each time step, the agents receive the following reward:

$$
\begin{aligned}
r_A &= r_{\mathrm{global}} + r_{\mathrm{local},A}\\
r_B &= r_{\mathrm{global}} + r_{\mathrm{local},B}\\
r_{\mathrm{global}} &= 0.001\,d_C\\
r_{\mathrm{local},A} &= -0.005\,d_{AC} - 0.008\,u_A^2\\
r_{\mathrm{local},B} &= -0.005\,d_{BC} - 0.008\,u_B^2
\end{aligned}
$$

Here:

• $r_A$ and $r_B$ are the rewards received by agents A and B, respectively.

• $r_{\mathrm{global}}$ is a team reward that both agents receive as object C moves closer to the boundary of the ring.

• $r_{\mathrm{local},A}$ and $r_{\mathrm{local},B}$ are local penalties received by agents A and B, based on their distances from object C and the magnitudes of their actions from the last time step.

• $d_C$ is the distance of object C from the center of the ring.

• $d_{AC}$ and $d_{BC}$ are the distances between agent A and object C, and between agent B and object C, respectively.

• $u_A$ and $u_B$ are the action values of agents A and B from the last time step. The agents apply external forces on the robots that result in motion. Each action value lies in the range [-1, 1].
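The reward terms above can be sketched as a small MATLAB function. This is an illustrative helper, not part of the example model; it assumes positions are given as two-element [x y] vectors and interprets $u^2$ as the sum of squared force components.

```
function [rA,rB] = stepReward(posA,posB,posC,uA,uB)
% Illustrative computation of the per-step rewards for agents A and B.
dC  = norm(posC);         % distance of object C from the ring center
dAC = norm(posA - posC);  % distance between robot A and object C
dBC = norm(posB - posC);  % distance between robot B and object C
rGlobal = 0.001*dC;                       % shared team reward
rLocalA = -0.005*dAC - 0.008*sum(uA.^2);  % local penalty for A
rLocalB = -0.005*dBC - 0.008*sum(uB.^2);  % local penalty for B
rA = rGlobal + rLocalA;
rB = rGlobal + rLocalB;
end
```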

Environment

To create a multi-agent environment, specify the block paths of the agents using a string array. Also, specify the observation and action specification objects using cell arrays. The order of the specification objects in the cell array must match the order specified in the block path array. When agents are available in the MATLAB workspace at the time of environment creation, the observation and action specification arrays are optional. For more information on creating multi-agent environments, see `rlSimulinkEnv`.

Create the I/O specifications for the environment. In this example, the agents are homogeneous and have the same I/O specifications.

```
% Number of observations
numObs = 16;
% Number of actions
numAct = 2;
% Maximum value of externally applied force (N)
maxF = 1.0;
% I/O specifications for each agent
oinfo = rlNumericSpec([numObs,1]);
ainfo = rlNumericSpec([numAct,1], ...
    "UpperLimit", maxF, ...
    "LowerLimit", -maxF);
oinfo.Name = "observations";
ainfo.Name = "forces";
```

Create the Simulink environment interface.

```
blks = ["rlCollaborativeTask/Agent A", "rlCollaborativeTask/Agent B"];
obsInfos = {oinfo,oinfo};
actInfos = {ainfo,ainfo};
env = rlSimulinkEnv(mdl,blks,obsInfos,actInfos);
```

Specify a reset function for the environment. The reset function `resetRobots` ensures that the robots start from random initial positions at the beginning of each episode.

`env.ResetFcn = @(in) resetRobots(in,RA,RB,RC,boundaryR);`

Agents

This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The agents apply external forces on the robots that result in motion. To learn more about PPO agents, see Proximal Policy Optimization Agents.

The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-term rewards.

Specify the agent options for this example.

```
agentOptions = rlPPOAgentOptions(...
    "ExperienceHorizon",600,...
    "ClipFactor",0.2,...
    "EntropyLossWeight",0.01,...
    "MiniBatchSize",300,...
    "NumEpoch",4,...
    "AdvantageEstimateMethod","gae",...
    "GAEFactor",0.95,...
    "SampleTime",Ts,...
    "DiscountFactor",0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;
```

Create the agents using the default agent creation syntax. For more information, see `rlPPOAgent`.

```
initOpts = rlAgentInitializationOptions("NumHiddenUnit", 200);
agentA = rlPPOAgent(oinfo, ainfo, initOpts, agentOptions);
agentB = rlPPOAgent(oinfo, ainfo, initOpts, agentOptions);
```

Training

To train multiple agents, you can pass an array of agents to the `train` function. The order of agents in the array must match the order of agent block paths specified during environment creation. Doing so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.

You can train multiple agents in a decentralized or centralized manner. In decentralized training, agents collect their own set of experiences during the episodes and learn independently from those experiences. In centralized training, the agents share the collected experiences and learn from them together. The actor and critic functions are synchronized between the agents after trajectory completion.

To configure a multi-agent training session, you can create agent groups and specify a learning strategy for each group through the `rlMultiAgentTrainingOptions` object. Each agent group may contain unique agent indices, and the learning strategy can be `"centralized"` or `"decentralized"`. For example, you can use the following command to configure training for three agent groups with different learning strategies. The agents with indices `[1,2]` and `[3,5]` learn in a centralized manner, while agent `4` learns in a decentralized manner.

```
opts = rlMultiAgentTrainingOptions(...
    "AgentGroups",{[1,2], 4, [3,5]},...
    "LearningStrategy",["centralized","decentralized","centralized"])
```

For more information on multi-agent training, type `help rlMultiAgentTrainingOptions` in MATLAB.

You can perform decentralized or centralized training by running one of the following sections using the Run Section button.

1. Decentralized Training

To configure decentralized multi-agent training for this example:

• Automatically assign agent groups by setting the `AgentGroups` option to `"auto"`. This allocates each agent to a separate group.

• Specify the `"decentralized"` learning strategy.

• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.

• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or more.

```
trainOpts = rlMultiAgentTrainingOptions(...
    "AgentGroups","auto",...
    "LearningStrategy","decentralized",...
    "MaxEpisodes",1000,...
    "MaxStepsPerEpisode",600,...
    "ScoreAveragingWindowLength",30,...
    "StopTrainingCriteria","AverageReward",...
    "StopTrainingValue",-10);
```

Train the agents using the `train` function. Training can take several hours to complete, depending on the available computational power. To save time, load the MAT file `decentralizedAgents.mat`, which contains a set of pretrained agents. To train the agents yourself, set `doTraining` to `true`.

```
doTraining = false;
if doTraining
    decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
    load("decentralizedAgents.mat");
end
```

The following figure shows a snapshot of decentralized training progress. You can expect different results due to randomness in the training process.

2. Centralized Training

To configure centralized multi-agent training for this example:

• Allocate both agents (with indices `1` and `2`) in a single group. You can do this by specifying the agent indices in the `"AgentGroups"` option.

• Specify the `"centralized"` learning strategy.

• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.

• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or more.

```
trainOpts = rlMultiAgentTrainingOptions(...
    "AgentGroups",{[1,2]},...
    "LearningStrategy","centralized",...
    "MaxEpisodes",1000,...
    "MaxStepsPerEpisode",600,...
    "ScoreAveragingWindowLength",30,...
    "StopTrainingCriteria","AverageReward",...
    "StopTrainingValue",-10);
```

Train the agents using the `train` function. Training can take several hours to complete, depending on the available computational power. To save time, load the MAT file `centralizedAgents.mat`, which contains a set of pretrained agents. To train the agents yourself, set `doTraining` to `true`.

```
doTraining = false;
if doTraining
    centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
    load("centralizedAgents.mat");
end
```

The following figure shows a snapshot of centralized training progress. You can expect different results due to randomness in the training process.

Simulation

Once the training is finished, simulate the trained agents with the environment.

```
simOptions = rlSimulationOptions("MaxSteps",300);
exp = sim(env,[agentA agentB],simOptions);
```

For more information on agent simulation, see `rlSimulationOptions` and `sim`.

This example shows how to resume training a Q-learning agent using existing training data. For more information on Q-learning and SARSA agents, see Q-Learning Agents and SARSA Agents.

Create Grid World Environment

For this example, create the basic grid world environment.

`env = rlPredefinedEnv("BasicGridWorld");`

To randomize the initial state of the agent at the beginning of each episode, create a reset function that returns a random state number from the set of allowed initial states `x0`.

```
x0 = [1:12 15:17 19:22 24];
env.ResetFcn = @() x0(randi(numel(x0)));
```

Fix the random generator seed for reproducibility.

`rng(1)`

Create Q-Learning Agent

To create a Q-learning agent, first create a Q table using the observation and action specifications from the grid world environment, and then create a Q-value function based on the table.

```
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qVf = rlQValueFunction(qTable,getObservationInfo(env),getActionInfo(env));
```

Next, create a Q-learning agent using this Q-value function, and configure the epsilon-greedy exploration and the critic learning rate. For more information on creating Q-learning agents, see `rlQAgent` and `rlQAgentOptions`. Set the discount factor to 1.

```
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.2;
agentOpts.CriticOptimizerOptions.LearnRate = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 1e-3;
agentOpts.DiscountFactor = 1;
qAgent = rlQAgent(qVf,agentOpts);
```

Train Q-Learning Agent for 100 Episodes

To train the agent, first specify the training options. For more information, see `rlTrainingOptions`.

```
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 200;
trainOpts.MaxEpisodes = 1e6;
trainOpts.Plots = "none";
trainOpts.Verbose = false;
trainOpts.StopTrainingCriteria = "EpisodeCount";
trainOpts.StopTrainingValue = 100;
trainOpts.ScoreAveragingWindowLength = 30;
```

Train the Q-learning agent using the `train` function. Training can take several minutes to complete.

`trainingStats = train(qAgent,env,trainOpts);`

Display the index of the last episode.

`trainingStats.EpisodeIndex(end)`
```
ans = 100
```

Train Q-Learning Agent for 200 More Episodes

Set the training to stop after episode 300.

`trainingStats.TrainingOptions.StopTrainingValue = 300;`

Resume the training using the training data that exists in `trainingStats`.

`trainingStats = train(qAgent,env,trainingStats);`

Display the index of the last episode.

`trainingStats.EpisodeIndex(end)`
```
ans = 300
```

Plot episode reward.

```
figure()
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
title('Episode Reward')
xlabel('EpisodeIndex')
ylabel('EpisodeReward')
```

Display the final Q-value table.

```
qAgentFinalQ = getLearnableParameters(getCritic(qAgent));
qAgentFinalQ{1}
```

```
ans = 25x4 single matrix

    4.8373   10.0000   -1.3036    0.8020
    9.2058    4.3147   11.0000    5.8501
   10.0000    3.3987    4.5830   -6.4751
    6.3569    6.0000    8.9971    5.4393
    5.0433    5.8399    7.0067    4.1439
    5.1031    8.5228   10.9936    0.1200
    9.9616    8.8647   12.0000   10.0026
   11.0000    8.4131    8.6974    6.0001
   10.0000    6.9997    5.8122    8.5523
    7.1164    7.0019    8.0000    5.8196
      ⋮
```

Validate Q-Learning Results

To validate the training results, simulate the agent in the training environment.

Before running the simulation, visualize the environment and configure the visualization to maintain a trace of the agent states.

```
plot(env)
env.ResetFcn = @() 2;
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
```

Simulate the agent in the environment using the `sim` function.

`sim(qAgent,env)`

## Input Arguments


`agents` — Agents to train, specified as a reinforcement learning agent object, such as `rlACAgent` or `rlDDPGAgent`, or as an array of such objects.

If `env` is a multi-agent environment created with `rlSimulinkEnv`, specify agents as an array. The order of the agents in the array must match the agent order used to create `env`. Multi-agent training is not supported for MATLAB® environments.

Note

`train` updates `agents` at each training episode. When training terminates, `agents` reflects the state of each agent at the end of the final training episode. Therefore, because of continuing exploration, the rewards obtained by the final agents are not necessarily the highest achieved during the training process. To save agents during training, create an `rlTrainingOptions` object specifying the `SaveAgentCriteria` and `SaveAgentValue` properties and pass it to `train` as the `trainOpts` argument.

For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.

`env` — Environment in which the agents act, specified as a reinforcement learning environment object, such as an environment created using `rlPredefinedEnv` or `rlSimulinkEnv`.

When `env` is a Simulink environment, calling `train` compiles and simulates the model associated with the environment.

`trainOpts` — Training parameters and options, specified as either an `rlTrainingOptions` or an `rlMultiAgentTrainingOptions` object. Use this argument to specify parameters and options such as:

• Criteria for ending training

• Criteria for saving candidate agents

• How to display training progress

• Options for parallel computing

For details, see `rlTrainingOptions` and `rlMultiAgentTrainingOptions`.

`prevTrainStats` — Training episode data from a previous training session, specified as one of the following:

• `rlTrainingResult` object, when training a single agent.

• Array of `rlTrainingResult` objects when training multiple agents.

Use this argument to resume training from the exact point at which it stopped. Training then starts from the last values of the agent parameters and the training results obtained in the previous `train` call. `prevTrainStats` contains, as one of its properties, the `rlTrainingOptions` or `rlMultiAgentTrainingOptions` object specifying the training option set. Therefore, to restart the training with updated training options, first change the training options in `prevTrainStats` using dot notation. If the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.
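A minimal sketch of this workflow, assuming `agents`, `env`, and a previous result `prevTrainStats` exist in the workspace:

```
% Raise the episode budget before resuming; training then continues
% from the saved agent parameters and episode counter.
prevTrainStats.TrainingOptions.MaxEpisodes = 2000;
trainStats = train(agents,env,prevTrainStats);
```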

For details about the `rlTrainingResult` object properties, see the `trainStats` output argument.

## Output Arguments


`trainStats` — Training episode data, returned as one of the following:

• `rlTrainingResult` object, when training a single agent.

• Array of `rlTrainingResult` objects when training multiple agents.

The following properties pertain to the `rlTrainingResult` object:

`EpisodeIndex` — Episode numbers, returned as the column vector `[1;2;…;N]`, where `N` is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

`EpisodeReward` — Reward for each episode, returned as a column vector of length `N`. Each entry contains the reward for the corresponding episode.

`EpisodeSteps` — Number of steps in each episode, returned as a column vector of length `N`. Each entry contains the number of steps in the corresponding episode.

`AverageReward` — Average reward over the averaging window specified in `trainOpts`, returned as a column vector of length `N`. Each entry contains the average reward computed at the end of the corresponding episode.

`TotalAgentSteps` — Total number of agent steps in training, returned as a column vector of length `N`. Each entry contains the cumulative sum of the entries in `EpisodeSteps` up to that point.

`EpisodeQ0` — Critic estimate of the long-term reward using the current agent and the environment initial conditions, returned as a column vector of length `N`. Each entry is the critic estimate (Q0) for the agent of the corresponding episode. This field is present only for agents that have critics, such as `rlDDPGAgent` and `rlDQNAgent`.

`SimulationInfo` — Information collected during the simulations performed for training, returned as one of the following:

• For training in MATLAB environments, a structure containing the field `SimulationError`. This field is a column vector with one entry per episode. When the `StopOnError` option of `rlTrainingOptions` is `"off"`, each entry contains any errors that occurred during the corresponding episode.

• For training in Simulink environments, a vector of `Simulink.SimulationOutput` objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.

`TrainingOptions` — Training options set, returned as an `rlTrainingOptions` object when training a single agent, or an `rlMultiAgentTrainingOptions` object when training multiple agents.

## Tips

• `train` updates the agents as training progresses. To preserve the original agent parameters for later use, save the agents to a MAT-file.

• By default, calling `train` opens the Reinforcement Learning Episode Manager, which lets you visualize the progress of the training. The Episode Manager plot shows the reward for each episode, a running average reward value, and the critic estimate Q0 (for agents that have critics). The Episode Manager also displays various episode and training statistics. To turn off the Reinforcement Learning Episode Manager, set the `Plots` option of `trainOpts` to `"none"`.

• If you use a predefined environment for which there is a visualization, you can use `plot(env)` to visualize the environment. If you call `plot(env)` before training, then the visualization updates during training to allow you to visualize the progress of each episode. (For custom environments, you must implement your own `plot` method.)

• Training terminates when the conditions specified in `trainOpts` are satisfied. To terminate training in progress, in the Reinforcement Learning Episode Manager, click Stop Training. Because `train` updates the agent at each episode, you can resume training by calling `train(agent,env,trainOpts)` again, without losing the trained parameters learned during the first call to `train`.

• During training, you can save candidate agents that meet conditions you specify with `trainOpts`. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. `train` stores saved agents in a MAT-file in the folder you specify with `trainOpts`. Saved agents can be useful, for instance, to allow you to test candidate agents generated during a long-running training process. For details about saving criteria and saving location, see `rlTrainingOptions`.
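For example, here is a sketch of testing one saved candidate. The file and variable names are assumptions; check the actual contents of the folder specified by `trainOpts.SaveAgentDirectory`:

```
% Load a candidate agent saved during training and simulate it
data = load(fullfile("savedAgents","Agent220.mat"));  % hypothetical file name
simOpts = rlSimulationOptions("MaxSteps",500);
experience = sim(env,data.saved_agent,simOpts);       % assumed variable name
```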

## Algorithms

In general, `train` performs the following iterative steps:

1. Initialize `agent`.

2. For each episode:

1. Reset the environment.

2. Get the initial observation s0 from the environment.

3. Compute the initial action a0 = μ(s0).

4. Set the current action to the initial action (a ← a0) and set the current observation to the initial observation (s ← s0).

5. While the episode is not finished or terminated:

1. Step the environment with action a to obtain the next observation s' and the reward r.

2. Learn from the experience set (s,a,r,s').

3. Compute the next action a' = μ(s').

4. Update the current action with the next action (a ← a') and update the current observation with the next observation (s ← s').

5. Break if the episode termination conditions defined in the environment are met.

3. If the training termination condition defined by `trainOpts` is met, terminate training. Otherwise, begin the next episode.

The specifics of how `train` performs these computations depend on your configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so.
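The loop above can be sketched in MATLAB-style pseudocode. Here `resetEnv`, `stepEnv`, `policy`, `learn`, and `stopCriteriaMet` are hypothetical placeholders for internal operations, not toolbox functions:

```
for episode = 1:maxEpisodes
    s = resetEnv(env);               % reset; get initial observation s0
    a = policy(agent,s);             % initial action a0 = mu(s0)
    done = false;
    while ~done
        [sNext,r,done] = stepEnv(env,a);    % step with action a
        agent = learn(agent,s,a,r,sNext);   % learn from (s,a,r,s')
        a = policy(agent,sNext);            % next action a' = mu(s')
        s = sNext;                          % s <- s'
    end
    if stopCriteriaMet(trainOpts), break; end
end
```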

## Version History

Introduced in R2019a

