rlPrioritizedReplayMemory

Replay memory experience buffer with prioritized sampling

Since R2022b

Description

An off-policy reinforcement learning agent stores experiences in a circular experience buffer.

During training, the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:

S is the current observation of the environment.
A is the action taken by the agent.
R is the reward for taking action A.
S' is the next observation after taking action A.
D is the is-done signal after taking action A.

The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.

By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform prioritized sampling [1], which can improve sample efficiency when training your agent, use an rlPrioritizedReplayMemory object. For more information on prioritized sampling, see Algorithms.

For goal-conditioned tasks, you can also replace your experience buffer with one of the following hindsight replay memory objects.

rlHindsightReplayMemory — Uniform sampling of experiences and generation of hindsight experiences by replacing goals with goal measurements
rlHindsightPrioritizedReplayMemory — Prioritized nonuniform sampling of experiences and generation of hindsight experiences

Creation

Syntax

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo)

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength)

Description

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo) creates a prioritized replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively.

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength) sets the maximum length of the buffer by setting the MaxLength property.
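
As a brief illustration of the second syntax, the following sketch creates a buffer with a custom maximum length. The observation and action specifications are assumptions made up for this example (a 4-element continuous observation and a three-valued discrete action), not values taken from any particular environment.

```matlab
% Hypothetical specifications, chosen only for illustration.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Create a prioritized buffer that holds at most 50,000 experiences.
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,50000);

% Inspect the read-only size properties of the new, empty buffer.
disp(buffer.MaxLength)
disp(buffer.Length)
```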

Input Arguments

`obsInfo` — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data types, and names of the observation signals.

You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

Example: [rlFiniteSetSpec([-1 0 1]) rlNumericSpec([3 1])]

`actInfo` — Action specifications
specification object | array of specification objects

Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data types, and names of the action signals.

You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

Example: [rlFiniteSetSpec([0 1]) rlNumericSpec([1 1])]

Properties

`MaxLength` — Maximum buffer length
Read-only: `10000` (default) | nonnegative integer

This property is read-only.

Maximum buffer length, specified as a nonnegative integer.

To change the maximum buffer length, use the resize function.

Example: MaxLength=1e5
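
Because MaxLength is read-only, a direct assignment is not possible. The sketch below shows one way the resize function might be used instead. It assumes the call form resize(buffer,maxLength) and that the buffer is modified in place; the specifications are hypothetical.

```matlab
% Hypothetical specifications, chosen only for illustration.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

% Create a buffer with the default maximum length.
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo);

% Grow the buffer (assumed call form: resize(buffer,maxLength)).
resize(buffer,1e5);
disp(buffer.MaxLength)
```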

`Length` — Number of experiences in buffer
Read-only: `0` (default) | nonnegative integer

This property is read-only.

Number of experiences in buffer, specified as a nonnegative integer.

Example: Length=1000

`PriorityExponent` — Priority exponent
`0.6` (default) | nonnegative scalar less than or equal to 1

Priority exponent to control the impact of prioritization during probability computation, specified as a nonnegative scalar less than or equal to 1.

If the priority exponent is zero, the agent uses uniform sampling.

Example: PriorityExponent=0.5
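
To see how the exponent shapes the sampling distribution, the following toolbox-independent sketch applies the probability formula from the Algorithms section to a hypothetical set of four priorities. The priority values and the list of exponents are assumptions for illustration only.

```matlab
% Hypothetical priorities of four stored experiences.
p = [0.1 0.5 1.0 2.0];

% Exponents to compare: 0 (uniform), 0.6 (default), 1 (fully proportional).
alphas = [0 0.6 1];

% Row k holds the sampling probabilities obtained with alphas(k).
P = zeros(numel(alphas),numel(p));
for k = 1:numel(alphas)
    P(k,:) = p.^alphas(k) ./ sum(p.^alphas(k));
end
disp(P)
```

With an exponent of 0, every experience receives probability 0.25. As the exponent grows toward 1, the probabilities move toward being directly proportional to the priorities.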

`InitialImportanceSamplingExponent` — Initial value of importance sampling exponent
`0.4` (default) | nonnegative scalar less than or equal to 1

Initial value of the importance sampling exponent, specified as a nonnegative scalar less than or equal to 1.

Example: InitialImportanceSamplingExponent=0.5

`NumAnnealingSteps` — Number of annealing steps
`1000000` (default) | positive integer

Number of annealing steps for updating the importance sampling exponent, specified as a positive integer.

Example: NumAnnealingSteps=1e5

`ImportanceSamplingExponent` — Current value of importance sampling exponent
Read-only: `0.4` (default) | nonnegative scalar less than or equal to 1

This property is read-only.

Current value of the importance sampling exponent, specified as a nonnegative scalar less than or equal to 1.

During training, ImportanceSamplingExponent is linearly increased from InitialImportanceSamplingExponent to 1 over NumAnnealingSteps steps.

Example: ImportanceSamplingExponent=0.5
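
The linear schedule can be sketched in plain MATLAB as follows. The step counts are hypothetical, and the clamp at 1 reflects the description above that annealing stops once the exponent reaches 1; this is not a call into the toolbox.

```matlab
beta0 = 0.4;          % InitialImportanceSamplingExponent (default)
numSteps = 1e6;       % NumAnnealingSteps (default)

% Hypothetical numbers of annealing updates performed so far.
k = [0 2.5e5 5e5 1e6 2e6];

% Each update adds (1 - beta0)/numSteps; the value is clamped at 1.
beta = min(1, beta0 + k.*(1 - beta0)/numSteps);
disp(beta)
```

For these inputs the exponent takes the values 0.4, 0.55, 0.7, 1, and 1.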

Object Functions

`append`	Append experiences to replay memory buffer
`sample`	Sample experiences from replay memory buffer
`resize`	Resize replay memory experience buffer
`reset`	Reset environment, agent, experience buffer, or policy object
`allExperiences`	Return all experiences in replay memory buffer
`validateExperience`	Validate experiences for replay memory
`getActionInfo`	Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
`getObservationInfo`	Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer
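
The following sketch shows how append and sample might be used together on a standalone buffer. It assumes that experiences are passed as a structure array with the fields Observation, Action, Reward, NextObservation, and IsDone, with observations and actions wrapped in cell arrays; the specifications and random data are hypothetical.

```matlab
% Hypothetical specifications, chosen only for illustration.
obsInfo = rlNumericSpec([2 1]);
actInfo = rlNumericSpec([1 1]);
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,1000);

% Build 20 random experiences (assumed structure field layout).
for i = 1:20
    expBatch(i).Observation = {rand(2,1)};
    expBatch(i).Action = {rand(1,1)};
    expBatch(i).Reward = rand;
    expBatch(i).NextObservation = {rand(2,1)};
    expBatch(i).IsDone = 0;
end

% Store the experiences, then draw a mini-batch of 8 from the buffer.
append(buffer,expBatch);
miniBatch = sample(buffer,8);
disp(buffer.Length)
```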

Examples

Create Default DQN Agent and Set Prioritized Replay Memory

Create an environment for training the agent. For this example, load a predefined environment.

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Extract the observation and action specifications from the environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a DQN agent from the environment specifications.

agent = rlDQNAgent(obsInfo,actInfo);

By default, the agent uses a replay memory experience buffer with uniform sampling.

Replace the default experience buffer with a prioritized replay memory buffer.

agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo);

Configure the prioritized replay memory options. For example, set the priority exponent and the initial importance sampling exponent to 0.5, and set the number of annealing steps for updating the importance sampling exponent during training to 1e4.

agent.ExperienceBuffer.NumAnnealingSteps = 1e4;
agent.ExperienceBuffer.PriorityExponent = 0.5;
agent.ExperienceBuffer.InitialImportanceSamplingExponent = 0.5;

Limitations

Prioritized experience replay does not support agents that use recurrent neural networks.

Algorithms

Prioritized Sampling

Prioritized replay memory samples experiences according to experience priorities. For a given experience, the priority is defined as the absolute value of the associated temporal difference (TD) error. A larger TD error indicates that the critic network is not well-trained for the corresponding experience. Therefore, sampling such experiences during critic updates can help efficiently improve the critic performance, which often improves the sample efficiency of agent training.

When using prioritized replay memory, agents use the following process when sampling a mini-batch of experiences and updating a critic.

1. Compute the sampling probability P for each experience in the buffer based on the experience priority.
   $P(j) = \frac{p(j)^{\alpha}}{\sum_{i=1}^{N} p(i)^{\alpha}}$
   Here:
   - N is the number of experiences in the replay memory buffer.
   - p is the experience priority.
   - α is the priority exponent. To set α, use the PriorityExponent property.
2. Sample a mini-batch of experiences according to the computed probabilities.
3. Compute the importance sampling weights w for the sampled experiences.
   $\begin{array}{l} w'(j) = (N \cdot P(j))^{-\beta} \\ w(j) \leftarrow \frac{w'(j)}{\max_{i \in \text{mini-batch}} w'(i)} \end{array}$
   Here, β is the importance sampling exponent. The read-only ImportanceSamplingExponent property contains the current value of β. To control β, set the InitialImportanceSamplingExponent and NumAnnealingSteps properties.
4. Compute the weighted loss using the importance sampling weights w and the TD error δ to update a critic.
5. Update the priorities of the sampled experiences based on the TD error.
   $p(j) = |\delta|$
6. Update the importance sampling exponent β by linearly annealing the exponent value until it reaches 1.
   $\beta \leftarrow \beta + \frac{1 - \beta_{0}}{N_{S}}$
   Here:
   - β₀ is the initial importance sampling exponent. To specify β₀, use the InitialImportanceSamplingExponent property.
   - N_S is the number of annealing steps. To specify N_S, use the NumAnnealingSteps property.
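
The process above can be sketched in plain MATLAB, independent of the toolbox. This is a minimal illustration under stated assumptions, not the internal implementation: the priorities and TD errors are made-up numbers, the mini-batch is drawn with replacement by inverse-CDF sampling, and the weighted loss is assumed to be a weighted mean of squared TD errors.

```matlab
rng(0);                               % reproducible illustration

% Hypothetical buffer state: one priority per stored experience.
p = [0.05 0.10 0.20 0.40 0.80 1.60 0.30 0.15];
N = numel(p);

alpha = 0.6;                          % PriorityExponent
beta0 = 0.4;                          % InitialImportanceSamplingExponent
numAnnealingSteps = 1e4;              % NumAnnealingSteps
beta = beta0;                         % current importance sampling exponent
batchSize = 4;

% Step 1: sampling probabilities P(j) = p(j)^alpha / sum_i p(i)^alpha.
P = p.^alpha ./ sum(p.^alpha);

% Step 2: draw a mini-batch by inverse-CDF sampling (with replacement).
edges = [0 cumsum(P)];
edges(end) = 1;                       % guard against round-off in cumsum
idx = discretize(rand(1,batchSize),edges);

% Step 3: importance sampling weights, normalized by the batch maximum.
wRaw = (N .* P(idx)).^(-beta);
w = wRaw ./ max(wRaw);

% Step 4: weighted loss from hypothetical TD errors of the mini-batch
% (assumed form: weighted mean of squared TD errors).
delta = [0.5 -0.2 1.1 -0.7];
loss = mean(w .* delta.^2);

% Step 5: set the priorities of the sampled experiences to |delta|.
p(idx) = abs(delta);

% Step 6: anneal beta linearly toward 1.
beta = min(1, beta + (1 - beta0)/numAnnealingSteps);
```

Because the weights are divided by their maximum over the mini-batch, the largest weight is always 1 and the weights only scale updates downward. The experiences with the highest sampling probability receive the smallest weights, which compensates for their being sampled more often.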

References

[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay." arXiv:1511.05952 [cs], February 25, 2016. https://arxiv.org/abs/1511.05952.

Version History

Introduced in R2022b
