lbfgsState

State of limited-memory BFGS (L-BFGS) solver

Since R2023a

Description

An lbfgsState object stores information about steps in the L-BFGS algorithm.

The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Use the L-BFGS algorithm for small networks and data sets that you can process in a single batch.

Use lbfgsState objects in conjunction with the lbfgsupdate function to train a neural network using the L-BFGS algorithm.

Creation

Syntax

solverState = lbfgsState

solverState = lbfgsState(Name=Value)

Description

solverState = lbfgsState creates an L-BFGS state object with a history size of 10 and an initial inverse Hessian factor of 1.

example

solverState = lbfgsState(Name=Value) sets the HistorySize and InitialInverseHessianFactor properties using one or more name-value arguments.

example

Properties

expand all

L-BFGS State

`HistorySize` — Number of state updates to store
`10` (default) | positive integer

Number of state updates to store, specified as a positive integer. Values between 3 and 20 suit most tasks.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

After creating the lbfgsState object, this property is read-only.

`InitialInverseHessianFactor` — Initial value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

This property is read-only.

Initial value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.

To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix B. Instead, the algorithm uses the approximation $B_{k - m}^{- 1} \approx λ_{k} I$ , where m is the history size, the inverse Hessian factor $λ_{k}$ is a scalar, and I is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.

The initial inverse hessian factor is the value of $λ_{0}$ .

For more information, see Limited-Memory BFGS.

After creating the lbfgsState object, this property is read-only.

`InverseHessianFactor` — Value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

Value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.

For more information, see Limited-Memory BFGS.

`InitialGradientsNorm` — Norm of initial gradients
`[]` (default) | `dlarray` scalar

Since R2023b

This property is read-only.

Norm of the initial gradients, specified as a dlarray scalar or [].

If the state object is the output of the lbfgsupdate function, then InitialGradientsNorm is the first value that the GradientsNorm property takes. Otherwise, InitialGradientsNorm is [].

`InitialStepSize` — Initial step size
`[]` (default) | `"auto"` | real finite scalar

Since R2024b

Initial step size, specified as one of these values:

[] — Do not use an initial step size to determine the initial Hessian approximation.
"auto" — Determine the initial step size automatically. The software uses an initial step size of $‖ s_{0} ‖_{\infty} = \frac{1}{2} ‖ W_{0} ‖_{\infty} + 0.1$ , where W₀ are the initial learnable parameters of the network.
Positive real scalar — Use the specified value as the initial step size $‖ s_{0} ‖_{\infty}$ .

If InitialStepSize is "auto" or a positive real scalar, then the software approximates the initial inverse Hessian using $λ_{0} = \frac{‖ s_{0} ‖_{\infty}}{‖ \nabla J (W_{0}) ‖_{\infty}}$ , where λ₀ is the initial inverse Hessian factor and $\nabla J (W_{0})$ denotes the gradients of the loss with respect to the initial learnable parameters. For more information, see Limited-Memory BFGS.

`StepHistory` — Step history
`{}` (default) | cell array

Step history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: cell

`GradientsDifferenceHistory` — Gradients difference history
`{}` (default) | cell array

Gradients difference history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: cell

`HistoryIndices` — History indices
0-by-1 vector (default) | row vector

History indices, specified as a row vector.

HistoryIndices is a 1-by-HistorySize vector, where StepHistory(i) and GradientsDifferenceHistory(i) correspond to iteration HistoryIndices(i).

For more information, see Limited-Memory BFGS.

Data Types: double

Iteration Information

`Loss` — Loss
`[]` (default) | `dlarray` scalar | numeric scalar

This property is read-only.

Loss, specified as a dlarray scalar, a numeric scalar, or [].

If the state object is the output of the lbfgsupdate function, then Loss is the first output of the loss function that you pass to the lbfgsupdate function. Otherwise, Loss is [].

`Gradients` — Gradients
`[]` (default) | `dlarray` object | numeric array | cell array | structure | table

This property is read-only.

Gradients, specified as a dlarray object, a numeric array, a cell array, a structure, a table, or [].

If the state object is the output of the lbfgsupdate function, then Gradients is the second output of the loss function that you pass to the lbfgsupdate function. Otherwise, Gradients is [].

`AdditionalLossFunctionOutputs` — Additional loss function outputs
1-by-0 cell array (default) | cell array

This property is read-only.

Additional loss function outputs, specified as a cell array.

If the state object is the output of the lbfgsupdate function, then AdditionalLossFunctionOutputs is a cell array containing additional outputs of the loss function that you pass to the lbfgsupdate function. Otherwise, AdditionalLossFunctionOutputs is a 1-by-0 cell array.

Data Types: cell

`StepNorm` — Norm of step
`[]` (default) | `dlarray` scalar | numeric scalar

This property is read-only.

Norm of the step, specified as a dlarray scalar, numeric scalar, or [].

If the state object is the output of the lbfgsupdate function, then StepNorm is the norm of the step that the lbfgsupdate function calculates. Otherwise, StepNorm is [].

`GradientsNorm` — Norm of gradients
`[]` (default) | `dlarray` scalar | numeric scalar

This property is read-only.

Norm of the gradients, specified as a dlarray scalar, a numeric scalar, or [].

If the state object is the output of the lbfgsupdate function, then GradientsNorm is the norm of the second output of the loss function that you pass to the lbfgsupdate function. Otherwise, GradientsNorm is [].

`LineSearchStatus` — Status of line search algorithm
`""` (default) | `"completed"` | `"failed"`

This property is read-only.

Status of the line search algorithm, specified as "", "completed", or "failed".

If the state object is the output of the lbfgsupdate function, then LineSearchStatus is one of these values:

"completed" — The algorithm finds a learning rate that satisfies the LineSearchMethod and MaxNumLineSearchIterations options that the lbfgsupdate function uses.
"failed" — The algorithm fails to find a learning rate that satisfies the LineSearchMethod and MaxNumLineSearchIterations options that the lbfgsupdate function uses.

Otherwise, LineSearchStatus is "".

`LineSearchMethod` — Method solver uses to find suitable learning rate
`""` (default) | `"weak-wolfe"` | `"strong-wolfe"` | `"backtracking"`

This property is read-only.

Method solver uses to find a suitable learning rate, specified as "weak-wolfe", "strong-wolfe", "backtracking", or "".

If the state object is the output of the lbfgsupdate function, then LineSearchMethod is the line search method that the lbfgsupdate function uses. Otherwise, LineSearchMethod is "".

`MaxNumLineSearchIterations` — Maximum number of line search iterations
`0` (default) | nonnegative integer

This property is read-only.

Maximum number of line search iterations, specified as a nonnegative integer.

If the state object is the output of the lbfgsupdate function, then MaxNumLineSearchIterations is the maximum number of line search iterations that the lbfgsupdate function uses. Otherwise, MaxNumLineSearchIterations is 0.

Data Types: double

Examples

collapse all

Create L-BFGS Solver State Object

Open Live Script

Create an L-BFGS solver state object.

solverState = lbfgsState

solverState = 
  LBFGSState with properties:

             InverseHessianFactor: 1
                      StepHistory: {}
       GradientsDifferenceHistory: {}
                   HistoryIndices: [1x0 double]

   Iteration Information
                             Loss: []
                        Gradients: []
    AdditionalLossFunctionOutputs: {1x0 cell}
                    GradientsNorm: []
                         StepNorm: []
                 LineSearchStatus: ""

  Show all properties

Update Learnable Parameters in Neural Network

Open Live Script

Read the transmission casing data from the CSV file "transmissionCasingData.csv".

filename = "transmissionCasingData.csv";
tbl = readtable(filename,TextType="String");

Convert the labels for prediction to categorical using the convertvars function.

labelName = "GearToothCondition";
tbl = convertvars(tbl,labelName,"categorical");

To train a network using categorical features, convert the categorical predictors to categorical using the convertvars function by specifying a string array containing the names of all the categorical input variables.

categoricalPredictorNames = ["SensorCondition" "ShaftCondition"];
tbl = convertvars(tbl,categoricalPredictorNames,"categorical");

Loop over the categorical input variables. For each variable, convert the categorical values to one-hot encoded vectors using the onehotencode function.

for i = 1:numel(categoricalPredictorNames)
    name = categoricalPredictorNames(i);
    tbl.(name) = onehotencode(tbl.(name),2);
end

View the first few rows of the table.

head(tbl)

    SigMean     SigMedian    SigRMS    SigVar     SigPeak    SigPeak2Peak    SigSkewness    SigKurtosis    SigCrestFactor    SigMAD     SigRangeCumSum    SigCorrDimension    SigApproxEntropy    SigLyapExponent    PeakFreq    HighFreqPower    EnvPower    PeakSpecKurtosis    SensorCondition    ShaftCondition    GearToothCondition
    ________    _________    ______    _______    _______    ____________    ___________    ___________    ______________    _______    ______________    ________________    ________________    _______________    ________    _____________    ________    ________________    _______________    ______________    __________________

    -0.94876     -0.9722     1.3726    0.98387    0.81571       3.6314        -0.041525       2.2666           2.0514         0.8081        28562              1.1429             0.031581            79.931            0          6.75e-06       3.23e-07         162.13             0    1             1    0          No Tooth Fault  
    -0.97537    -0.98958     1.3937    0.99105    0.81571       3.6314        -0.023777       2.2598           2.0203        0.81017        29418              1.1362             0.037835            70.325            0          5.08e-08       9.16e-08         226.12             0    1             1    0          No Tooth Fault  
      1.0502      1.0267     1.4449    0.98491     2.8157       3.6314         -0.04162       2.2658           1.9487        0.80853        31710              1.1479             0.031565            125.19            0          6.74e-06       2.85e-07         162.13             0    1             0    1          No Tooth Fault  
      1.0227      1.0045     1.4288    0.99553     2.8157       3.6314        -0.016356       2.2483           1.9707        0.81324        30984              1.1472             0.032088             112.5            0          4.99e-06        2.4e-07         162.13             0    1             0    1          No Tooth Fault  
      1.0123      1.0024     1.4202    0.99233     2.8157       3.6314        -0.014701       2.2542           1.9826        0.81156        30661              1.1469              0.03287            108.86            0          3.62e-06       2.28e-07         230.39             0    1             0    1          No Tooth Fault  
      1.0275      1.0102     1.4338     1.0001     2.8157       3.6314         -0.02659       2.2439           1.9638        0.81589        31102              1.0985             0.033427            64.576            0          2.55e-06       1.65e-07         230.39             0    1             0    1          No Tooth Fault  
      1.0464      1.0275     1.4477     1.0011     2.8157       3.6314        -0.042849       2.2455           1.9449        0.81595        31665              1.1417             0.034159            98.838            0          1.73e-06       1.55e-07         230.39             0    1             0    1          No Tooth Fault  
      1.0459      1.0257     1.4402    0.98047     2.8157       3.6314        -0.035405       2.2757            1.955        0.80583        31554              1.1345               0.0353            44.223            0          1.11e-06       1.39e-07         230.39             0    1             0    1          No Tooth Fault

Extract the training data.

predictorNames = ["SigMean" "SigMedian" "SigRMS" "SigVar" "SigPeak" "SigPeak2Peak" ...
    "SigSkewness" "SigKurtosis" "SigCrestFactor" "SigMAD" "SigRangeCumSum" ...
    "SigCorrDimension" "SigApproxEntropy" "SigLyapExponent" "PeakFreq" ...
    "HighFreqPower" "EnvPower" "PeakSpecKurtosis" "SensorCondition" "ShaftCondition"];
XTrain = table2array(tbl(:,predictorNames));
numInputFeatures = size(XTrain,2);

Extract the targets and convert them to one-hot encoded vectors.

TTrain = tbl.(labelName);
TTrain = onehotencode(TTrain,2);
numClasses = size(TTrain,2);

Convert the predictors and targets to dlarray objects with format "BC" (batch, channel).

XTrain = dlarray(XTrain,"BC");
TTrain = dlarray(TTrain,"BC");

Define the network architecture.

numHiddenUnits = 32;

layers = [
    featureInputLayer(numInputFeatures)
    fullyConnectedLayer(16)
    layerNormalizationLayer
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];

net = dlnetwork(layers);

Define the modelLoss function, listed in the Model Loss Function section of the example. This function takes as input a neural network, input data, and targets. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

The lbfgsupdate function requires a loss function with the syntax [loss,gradients] = f(net). Create a variable that parameterizes the evaluated modelLoss function to take a single input argument.

lossFcn = @(net) dlfeval(@modelLoss,net,XTrain,TTrain);

Initialize an L-BFGS solver state object with a maximum history size of 3 and an initial inverse Hessian approximation factor of 1.1.

solverState = lbfgsState( ...
    HistorySize=3, ...
    InitialInverseHessianFactor=1.1);

Train the network a maximum of 200 iterations. Stop training early when the norm of the gradients or steps are smaller than 0.00001. Print the training loss every 10 iterations.

maxIterations = 200;
gradientTolerance = 1e-5;
stepTolerance = 1e-5;

iteration = 0;

while iteration < maxIterations
    iteration = iteration + 1;
    [net, solverState] = lbfgsupdate(net,lossFcn,solverState);

    if iteration==1 || mod(iteration,10)==0
        fprintf("Iteration %d: Loss: %d\n",iteration,solverState.Loss);
    end

    if solverState.GradientsNorm < gradientTolerance || ...
            solverState.StepNorm < stepTolerance || ...
            solverState.LineSearchStatus == "failed"
        break
    end
end

Iteration 1: Loss: 9.343236e-01
Iteration 10: Loss: 4.721475e-01
Iteration 20: Loss: 4.678575e-01
Iteration 30: Loss: 4.666964e-01
Iteration 40: Loss: 4.665921e-01
Iteration 50: Loss: 4.663871e-01
Iteration 60: Loss: 4.662519e-01
Iteration 70: Loss: 4.660451e-01
Iteration 80: Loss: 4.645303e-01
Iteration 90: Loss: 4.591753e-01
Iteration 100: Loss: 4.562556e-01
Iteration 110: Loss: 4.531167e-01
Iteration 120: Loss: 4.489444e-01
Iteration 130: Loss: 4.392228e-01
Iteration 140: Loss: 4.347853e-01
Iteration 150: Loss: 4.341757e-01
Iteration 160: Loss: 4.325102e-01
Iteration 170: Loss: 4.321948e-01
Iteration 180: Loss: 4.318990e-01
Iteration 190: Loss: 4.313784e-01
Iteration 200: Loss: 4.311314e-01

Model Loss Function

The modelLoss function takes as input a neural network net, input data X, and targets T. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

function [loss, gradients] = modelLoss(net, X, T)

Y = forward(net,X);
loss = crossentropy(Y,T);
gradients = dlgradient(loss,net.Learnables);

end

Algorithms

expand all

Limited-Memory BFGS

The algorithm updates learnable parameters W at iteration k+1 using the update step given by

$W_{k + 1} = W_{k} - η_{k} B_{k}^{- 1} \nabla J (W_{k}),$

where W_k denotes the weights at iteration k, $η_{k}$ is the learning rate at iteration k, B_k is an approximation of the Hessian matrix at iteration k, and $\nabla J (W_{k})$ denotes the gradients of the loss with respect to the learnable parameters at iteration k.

The L-BFGS algorithm computes the matrix-vector product $B_{k}^{- 1} \nabla J (W_{k})$ directly. The algorithm does not require computing the inverse of B_k.

To compute the matrix-vector product $B_{k}^{- 1} \nabla J (W_{k})$ directly, the L-BFGS algorithm uses this recursive algorithm:

Set $r = B_{k - m}^{- 1} \nabla J (W_{k})$ , where m is the history size.
For $i = m, \dots, 1$ :
1. Let $β = \frac{1}{s_{k - i}^{⊤} y_{k - i}} y_{k - i}^{⊤} r$ , where $s_{k - i}$ and $y_{k - i}$ are the step and gradient differences for iteration $k - i$ , respectively.
2. Set $r = r + s_{k - i} (a_{k - i} - β)$ , where $a$ is derived from $s$ , $y$ , and the gradients of the loss with respect to the loss function. For more information, see [1].
Return $B_{k}^{- 1} \nabla J (W_{k}) = r$ .

References

[1] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical programming 45, no. 1 (August 1989): 503-528. https://doi.org/10.1007/BF01589116.

Version History

Introduced in R2023a

expand all

R2024b: Specify initial step size for L-BFGS solver

Specify the initial step size for the L-BFGS solver using the InitialStepSize argument.

R2023b: Inspect norm of initial gradients

Inspect the norm of the initial gradients using the InitialGradientsNorm property.

lbfgsState

Description

Creation

Syntax

Description

Properties

L-BFGS State

`HistorySize` — Number of state updates to store
`10` (default) | positive integer

`InitialInverseHessianFactor` — Initial value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

`InverseHessianFactor` — Value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

`InitialGradientsNorm` — Norm of initial gradients
`[]` (default) | `dlarray` scalar

`InitialStepSize` — Initial step size
`[]` (default) | `"auto"` | real finite scalar

`StepHistory` — Step history
`{}` (default) | cell array

`GradientsDifferenceHistory` — Gradients difference history
`{}` (default) | cell array

`HistoryIndices` — History indices
0-by-1 vector (default) | row vector

Iteration Information

`Loss` — Loss
`[]` (default) | `dlarray` scalar | numeric scalar

`Gradients` — Gradients
`[]` (default) | `dlarray` object | numeric array | cell array | structure | table

`AdditionalLossFunctionOutputs` — Additional loss function outputs
1-by-0 cell array (default) | cell array

`StepNorm` — Norm of step
`[]` (default) | `dlarray` scalar | numeric scalar

`GradientsNorm` — Norm of gradients
`[]` (default) | `dlarray` scalar | numeric scalar

`LineSearchStatus` — Status of line search algorithm
`""` (default) | `"completed"` | `"failed"`

`LineSearchMethod` — Method solver uses to find suitable learning rate
`""` (default) | `"weak-wolfe"` | `"strong-wolfe"` | `"backtracking"`

`MaxNumLineSearchIterations` — Maximum number of line search iterations
`0` (default) | nonnegative integer

Examples

Create L-BFGS Solver State Object

Update Learnable Parameters in Neural Network

Algorithms

Limited-Memory BFGS

References

Version History

R2024b: Specify initial step size for L-BFGS solver

R2023b: Inspect norm of initial gradients

See Also

Topics

lbfgsState

Description

Creation

Syntax

Description

Properties

L-BFGS State

HistorySize — Number of state updates to store 10 (default) | positive integer

InitialInverseHessianFactor — Initial value that characterizes approximate inverse Hessian matrix 1 (default) | positive scalar

InverseHessianFactor — Value that characterizes approximate inverse Hessian matrix 1 (default) | positive scalar

InitialGradientsNorm — Norm of initial gradients [] (default) | dlarray scalar

InitialStepSize — Initial step size [] (default) | "auto" | real finite scalar

StepHistory — Step history {} (default) | cell array

GradientsDifferenceHistory — Gradients difference history {} (default) | cell array

HistoryIndices — History indices 0-by-1 vector (default) | row vector

Iteration Information

Loss — Loss [] (default) | dlarray scalar | numeric scalar

Gradients — Gradients [] (default) | dlarray object | numeric array | cell array | structure | table

AdditionalLossFunctionOutputs — Additional loss function outputs 1-by-0 cell array (default) | cell array

StepNorm — Norm of step [] (default) | dlarray scalar | numeric scalar

GradientsNorm — Norm of gradients [] (default) | dlarray scalar | numeric scalar

LineSearchStatus — Status of line search algorithm "" (default) | "completed" | "failed"

LineSearchMethod — Method solver uses to find suitable learning rate "" (default) | "weak-wolfe" | "strong-wolfe" | "backtracking"

MaxNumLineSearchIterations — Maximum number of line search iterations 0 (default) | nonnegative integer

Examples

Create L-BFGS Solver State Object

Update Learnable Parameters in Neural Network

Algorithms

Limited-Memory BFGS

References

Version History

R2024b: Specify initial step size for L-BFGS solver

R2023b: Inspect norm of initial gradients

See Also

Topics

`HistorySize` — Number of state updates to store
`10` (default) | positive integer

`InitialInverseHessianFactor` — Initial value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

`InverseHessianFactor` — Value that characterizes approximate inverse Hessian matrix
`1` (default) | positive scalar

`InitialGradientsNorm` — Norm of initial gradients
`[]` (default) | `dlarray` scalar

`InitialStepSize` — Initial step size
`[]` (default) | `"auto"` | real finite scalar

`StepHistory` — Step history
`{}` (default) | cell array

`GradientsDifferenceHistory` — Gradients difference history
`{}` (default) | cell array

`HistoryIndices` — History indices
0-by-1 vector (default) | row vector

`Loss` — Loss
`[]` (default) | `dlarray` scalar | numeric scalar

`Gradients` — Gradients
`[]` (default) | `dlarray` object | numeric array | cell array | structure | table

`AdditionalLossFunctionOutputs` — Additional loss function outputs
1-by-0 cell array (default) | cell array

`StepNorm` — Norm of step
`[]` (default) | `dlarray` scalar | numeric scalar

`GradientsNorm` — Norm of gradients
`[]` (default) | `dlarray` scalar | numeric scalar

`LineSearchStatus` — Status of line search algorithm
`""` (default) | `"completed"` | `"failed"`

`LineSearchMethod` — Method solver uses to find suitable learning rate
`""` (default) | `"weak-wolfe"` | `"strong-wolfe"` | `"backtracking"`

`MaxNumLineSearchIterations` — Maximum number of line search iterations
`0` (default) | nonnegative integer