rlVectorQValueFunction

Vector Q-value function approximator for reinforcement learning agents

Description

This object implements a vector Q-value function approximator that you can use as a critic with a discrete action space for a reinforcement learning agent. A vector Q-value function maps an environment state to a vector in which each element represents the predicted discounted cumulative long-term reward when the agent starts from the given state and executes the action corresponding to the element number. A vector Q-value function critic therefore needs only the environment state as input. After you create an rlVectorQValueFunction critic, use it to create an agent such as rlQAgent, rlDQNAgent, or rlSARSAAgent. For more information on creating value functions, see Create Policies and Value Functions.

Creation

Description

critic = rlVectorQValueFunction(net,observationInfo,actionInfo) creates the multi-output Q-value function critic with a discrete action space. Here, net is the deep neural network used as an approximator, and must have only the observations as input and a single output layer having as many elements as the number of possible discrete actions. The network input layers are automatically associated with the environment observation channels according to the dimension specifications in observationInfo. This function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.

critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input layers to be associated with the environment observation channels. The function assigns, in sequential order, each environment observation channel specified in observationInfo to the layer specified by the corresponding name in the string array netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation channels, as ordered in observationInfo.

critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo) creates the multi-output Q-value function critic with a discrete action space using a custom basis function as underlying approximator. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must have as many columns as the number of possible actions. The function sets the ObservationInfo and ActionInfo properties of critic to the input arguments observationInfo and actionInfo, respectively.

critic = rlVectorQValueFunction(___,UseDevice=useDevice) specifies the device used to perform computations for the critic object, and sets the UseDevice property of critic to the useDevice input argument. You can use this syntax with any of the previous input-argument combinations.

Input Arguments

Deep neural network used as the underlying approximator within the critic, specified as a dlnetwork object or as another Deep Learning Toolbox™ neural network object, such as an array of layer objects.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using them to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any Deep Learning Toolbox™ neural network object. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice gives you more insight and control in cases where the conversion is not straightforward and might require additional specifications.

The network must have only the observation channels as inputs and a single output layer having as many elements as the number of possible discrete actions. Each element of the output vector approximates the value of executing the corresponding action starting from the currently observed state.

rlVectorQValueFunction objects support recurrent deep neural networks.
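As a minimal sketch of a recurrent approximator, the following builds a small network with a sequence input layer and an LSTM layer. The layer sizes and the specification objects are illustrative assumptions, not requirements.

```matlab
% Illustrative specifications: 4-element observation, 3 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

% A recurrent network uses a sequence input layer and an LSTM layer.
% The hidden size (8) is an arbitrary choice for this sketch.
net = dlnetwork([sequenceInputLayer(4)
                 lstmLayer(8)
                 fullyConnectedLayer(3)]);

% Create the critic and evaluate it for one random observation.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
v = getValue(critic,{rand(obsInfo.Dimension)});
```

The output layer still has one element per possible action, exactly as for a nonrecurrent network.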

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use this argument after 'ObservationInputNames', the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer specified by the corresponding name in netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo.

Note

Of the information specified in observationInfo, the function uses only the data type and dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is the vector c = W'*B, where W is a matrix containing the learnable parameters, and B is the column vector returned by the custom basis function. Each element of c approximates the value of executing the corresponding action from the observed state.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the channels defined in observationInfo.

Example: @(obs1,obs2) [obs1(1)^2; abs(obs2(5))]

Initial value of the basis function weights W, specified as a matrix having as many rows as the length of the basis function output vector and as many columns as the number of possible actions.
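To make the relation c = W'*B and the required size of W0 concrete, here is a small worked sketch in plain MATLAB. The basis function, observation values, and weights are all made up for illustration and assume two observation channels and two possible actions.

```matlab
% Hypothetical basis function of two observation channels.
basisFcn = @(obs1,obs2) [obs1(1)^2; abs(obs2(2)); 1];

% Example observations (arbitrary values).
obs1 = [2; 0];
obs2 = [0; -3];
B = basisFcn(obs1,obs2);   % 3-by-1 column vector: [4; 3; 1]

% Three basis outputs and two actions, so W0 is 3-by-2:
% one row per basis element, one column per action.
W0 = [1 0;
      0 1;
      1 1];

% One value per action.
c = W0'*B;                 % 2-by-1 vector: [5; 4]
```

Each column of W0 holds the weights that combine the basis outputs into the value of one action.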

Properties

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name. Note that only the data type and dimension of a channel are used by the software to create actors or critics, but not its (optional) name.

rlVectorQValueFunction sets the ObservationInfo property of critic to the input argument observationInfo.

You can extract ObservationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually.

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name. Note that the function does not use the name of the action channel specified in actionInfo.

Note

Only one action channel is allowed.

rlVectorQValueFunction sets the ActionInfo property of critic to the input argument actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can also construct the specifications manually.
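As a sketch of extracting both specifications from an environment, the following assumes the predefined "CartPole-Discrete" environment and sizes a placeholder network from the extracted objects. The single-layer network is for illustration only.

```matlab
% Assumption: use a predefined environment with a discrete action space.
env = rlPredefinedEnv("CartPole-Discrete");

% Extract the observation and action specifications.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Size the network from the specifications rather than hard-coding numbers.
numObs = prod(obsInfo.Dimension);    % number of observation elements
numAct = numel(actInfo.Elements);    % number of possible actions

net = dlnetwork([featureInputLayer(numObs)
                 fullyConnectedLayer(numAct)]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
```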

Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs see GPU Computing Requirements (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this argument. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: "gpu"
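The following sketch shows one way to pick the device at creation time. It assumes a small placeholder network and uses canUseGPU to fall back to the CPU when no supported GPU is available. The last line shows the separate UseParallel training option mentioned in the note above.

```matlab
% Illustrative specifications: 4-element observation, 3 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

% Placeholder network with one output per action.
net = dlnetwork([featureInputLayer(4)
                 fullyConnectedLayer(3)]);

% Use the GPU only if a supported one is available.
if canUseGPU
    device = "gpu";
else
    device = "cpu";
end

critic = rlVectorQValueFunction(net,obsInfo,actInfo,UseDevice=device);

% Parallel training over multiple cores is configured separately,
% through the training options rather than through the critic.
trainOpts = rlTrainingOptions(UseParallel=true);
```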

Object Functions

rlDQNAgent               Deep Q-network (DQN) reinforcement learning agent
rlQAgent                 Q-learning reinforcement learning agent
rlSARSAAgent             SARSA reinforcement learning agent
getValue                 Obtain estimated value from a critic given environment observations and actions
getMaxQValue             Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations
evaluate                 Evaluate function approximator object given observation (or observation-action) input data
gradient                 Evaluate gradient of function approximator object given observation and action input data
accelerate               Option to accelerate computation of gradient for approximator object based on neural network
getLearnableParameters   Obtain learnable parameter values from agent, function approximator, or policy object
setLearnableParameters   Set learnable parameter values of agent, function approximator, or policy object
setModel                 Set function approximation model for actor or critic
getModel                 Get function approximator model from actor or critic
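As a brief sketch of how some of these functions fit together, the following evaluates a placeholder critic, finds the greedy action, and reads back its learnable parameters. The network and specification objects are illustrative assumptions.

```matlab
% Illustrative critic: 4-element observation, 3 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);
net = dlnetwork([featureInputLayer(4)
                 fullyConnectedLayer(3)]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% Observations are passed as a cell array, one cell per channel.
obs = {rand(obsInfo.Dimension)};

% Values of all actions, then the largest value and its index.
v = getValue(critic,obs);
[maxQ,maxIdx] = getMaxQValue(critic,obs);

% Map the index back to the action value it refers to.
bestAction = actInfo.Elements(maxIdx);

% Read the learnable parameters and write them back unchanged.
params = getLearnableParameters(critic);
critic = setLearnableParameters(critic,params);
```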

Examples

This example shows how to create a vector Q-value function critic for a discrete action space using a deep neural network approximator.

This critic takes only the observation as input and produces as output a vector with as many elements as the possible actions. Each element represents the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

To approximate the Q-value function within the critic, use a deep neural network. The input of the network must accept a four-element vector, as defined by obsInfo. The output must be a single output layer having as many elements as the number of possible discrete actions (three in this case, as defined by actInfo).

Create a neural network as a row vector of layer objects.

net = [featureInputLayer(4) 
       fullyConnectedLayer(3)];

Convert the network to a dlnetwork object.

net = dlnetwork(net);

Summarize the network properties.

summary(net)
   Initialized: true

   Number of learnables: 15

   Inputs:
      1   'input'   4 features
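The learnable count reported by summary can be checked by hand. This short sketch assumes only the layer sizes used in this example: a fully connected layer with 4 inputs and 3 outputs.

```matlab
numObs = 4;   % inputs to the fully connected layer
numAct = 3;   % outputs, one per possible action

numWeights = numAct*numObs;              % 3-by-4 weight matrix
numBiases  = numAct;                     % one bias per output
numLearnables = numWeights + numBiases   % 12 + 3 = 15
```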

Create the critic using the network, as well as the observation and action specification objects. The network input layers are automatically associated with the components of the observation signals according to the dimension specifications in obsInfo.

critic = rlVectorQValueFunction(net,obsInfo,actInfo)
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use getValue to return the values of a random observation, using the current network weights. There is one value for each of the three possible actions.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3x1 single column vector

    0.7232
    0.8177
   -0.2212

You can now use the critic to create a discrete action space agent that relies on a vector Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
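As a sketch of that last step, the following passes a critic to rlDQNAgent with default agent options and queries the agent for an action. The critic is rebuilt here so that the snippet runs on its own.

```matlab
% Rebuild the critic from this example.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);
net = dlnetwork([featureInputLayer(4)
                 fullyConnectedLayer(3)]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% Create a DQN agent from the critic, using default options.
agent = rlDQNAgent(critic);

% Ask the agent for an action given a random observation.
% The result is a cell array holding one element of the action set.
act = getAction(agent,{rand(obsInfo.Dimension)});
```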

A vector Q-value function critic takes only the observation as input and produces as output a vector with as many elements as the possible actions. Each element represents the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

To approximate the Q-value function within the critic, use a deep neural network. The input of the network must accept a four-element vector, as defined by obsInfo. The output must be a single output layer having as many elements as the number of possible discrete actions (three in this case, as defined by actInfo).

Create the network as an array of layer objects. Name the network input netObsIn (so you can later explicitly associate it with the observation input channel).

net = [featureInputLayer(4,Name="netObsIn") 
       fullyConnectedLayer(3,Name="value")];

Convert the network to a dlnetwork object and display the number of its learnable parameters.

net = dlnetwork(net)
net = 
  dlnetwork with properties:

         Layers: [2x1 nnet.cnn.layer.Layer]
    Connections: [1x2 table]
     Learnables: [2x3 table]
          State: [0x3 table]
     InputNames: {'netObsIn'}
    OutputNames: {'value'}
    Initialized: 1

  View summary with summary.

summary(net)
   Initialized: true

   Number of learnables: 15

   Inputs:
      1   'netObsIn'   4 features

Create the critic using the network, the observation and action specification objects, and the name of the network input layer. The specified network input layer, netObsIn, is associated with the environment observation, and therefore must have the same data type and dimension as the observation channel specified in obsInfo.

critic = rlVectorQValueFunction(net, ...
    obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the values of a random observation, using the current network weights. The function returns one value for each of the three possible actions.

v = getValue(critic, ...
    {rand(obsInfo.Dimension)})
v = 3x1 single column vector

    0.7232
    0.8177
   -0.2212

You can now use the critic to create a discrete action space agent that relies on a vector Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

This critic takes only the observation as input and produces as output a vector with as many elements as the possible actions. Each element represents the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as consisting of two channels: the first is a two-by-two continuous matrix, and the second is a scalar that can assume only two values, 0 and 1.

obsInfo = [rlNumericSpec([2 2]) 
           rlFiniteSetSpec([0 1])];

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible vectors, [1 2], [3 4], and [5 6].

actInfo = rlFiniteSetSpec({[1 2],[3 4],[5 6]});

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations defined by obsInfo.

myBasisFcn = @(obsA,obsB) [obsA(1,1)+obsB(1)^2;
                           obsA(2,1)-obsB(1)^2;
                           obsA(1,2)^2+obsB(1);
                           obsA(2,2)^2-obsB(1)];

The output of the critic is the vector c = W'*myBasisFcn(obsA,obsB), where W is a weight matrix which must have as many rows as the length of the basis function output and as many columns as the number of possible actions.

Each element of c is the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the considered element. The elements of W are the learnable parameters.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial parameter matrix. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlVectorQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [2x1 rl.util.RLDataSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the values of a random observation, using the current parameter matrix. The function returns one value for each of the three possible actions.

v = getValue(critic,{rand(2,2),0})
v = 3×1

    1.3192
    0.8420
    1.5053

Note that the critic does not enforce the set constraint for the discrete set elements.

v = getValue(critic,{rand(2,2),-1})
v = 3×1

    2.7890
    1.8375
    3.0855
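Because the critic does not validate the discrete channel, you might check membership yourself before evaluating. The following is a minimal sketch that compares a candidate value against the Elements property of the finite-set specification. The variable names are illustrative.

```matlab
% Same observation specifications as in this example.
obsInfo = [rlNumericSpec([2 2])
           rlFiniteSetSpec([0 1])];

% Candidate value for the second (discrete) observation channel.
obsB = -1;

% True only if obsB is one of the allowed elements (0 or 1).
isValid = ismember(obsB,obsInfo(2).Elements);
```

Here isValid is false, since -1 is not in the set defined for the second channel.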

You can now use the critic to create a discrete action space agent that relies on a vector Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

Version History

Introduced in R2022a