# ppca

Probabilistic principal component analysis

## Syntax

``````[coeff,score,pcvar] = ppca(Y,K)``````
``````[coeff,score,pcvar] = ppca(Y,K,Name,Value)``````
``````[coeff,score,pcvar,mu] = ppca(___)``````
``````[coeff,score,pcvar,mu,v,S] = ppca(___)``````

## Description

example

``````[coeff,score,pcvar] = ppca(Y,K)``` returns the principal component coefficients for the n-by-p data matrix `Y` based on a probabilistic principal component analysis (PPCA). It also returns the principal component scores, which are the representations of `Y` in the principal component space, and the principal component variances, which are the eigenvalues of the covariance matrix of `Y`, in `pcvar`.Each column of `coeff` contains coefficients for one principal component, and the columns are in descending order of component variance. Rows of `score` correspond to observations, and columns correspond to components. Rows of `Y` correspond to observations and columns correspond to variables.Probabilistic principal component analysis might be preferable to other algorithms that handle missing data, such as the alternating least squares algorithm when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. An expectation-maximization algorithm is used for both complete and missing data.```

example

``````[coeff,score,pcvar] = ppca(Y,K,Name,Value)``` returns the principal component coefficients, scores, and variances using additional options for computation and handling of special data types, specified by one or more `Name,Value` pair arguments.For example, you can introduce initial values for the residual variance, `v`, or change the termination criteria.```

example

``````[coeff,score,pcvar,mu] = ppca(___)``` also returns the estimated mean of each variable in `Y`. You can use any of the input arguments in the previous syntaxes.```

example

``````[coeff,score,pcvar,mu,v,S] = ppca(___)``` also returns the isotropic residual variance in `v` and the final results at convergence in structure `S`.```

## Examples

collapse all

`load fisheriris`

The double matrix `meas` consists of four types of measurements on the flowers, which, respectively, are the length and width of sepals and petals.

Introduce missing values randomly.

```y = meas; rng('default'); % for reproducibility ix = random('unif',0,1,size(y))<0.20; y(ix) = NaN;```

Now, approximately 20% of the data is missing, indicated by `NaN`.

Perform probabilistic principal component analysis and request the component coefficients and variances.

```[coeff,score,pcvar,mu] = ppca(y,3); coeff```
```coeff = 4×3 0.3562 0.6709 -0.5518 -0.0765 0.7120 0.6332 0.8592 -0.1597 0.0596 0.3592 -0.1318 0.5395 ```
`pcvar`
```pcvar = 3×1 4.0914 0.2125 0.0617 ```

Perform principal component analysis using the alternating least squares algorithm and request the component coefficients and variances.

```[coeff2,score2,pcvar2,mu2] = pca(y,'algorithm','als',... 'NumComponents',3); coeff2```
```coeff2 = 4×3 0.3376 0.4952 0.7406 -0.0731 0.8609 -0.4476 0.8657 -0.1168 -0.1233 0.3623 -0.0086 -0.4857 ```
`pcvar2`
```pcvar2 = 3×1 4.0733 0.2652 0.1222 ```

The coefficients and the variances of the first two principal components are similar.

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors.

`subspace(coeff,coeff2)`
```ans = 0.0884 ```

The angle between the two spaces is pretty small. This indicates that these two results are close to each other.

`load imports-85`

Data matrix `X` has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.

Perform probabilistic principal component analysis and display the first three principal components.

`[coeff,score,pcvar] = ppca(X(:,3:15),3);`
```Warning: Maximum number of iterations 1000 reached. ```

Change the termination tolerance for the cost function to 0.01.

```opt = statset('ppca'); opt.TolFun = 0.01;```

Perform probabilistic principal component analysis.

`[coeff,score,pcvar] = ppca(X(:,3:15),3,'Options',opt);`
```Warning: Maximum number of iterations 1000 reached. ```

`ppca` now terminates before the maximum number of iterations is reached because it meets the tolerance for the cost function.

```load hald y = ingredients;```

The ingredients data has 13 observations for 4 variables.

Introduce missing values to the data.

`y(2:16:end) = NaN;`

Every 16th value is `NaN`. This corresponds to 7.69% of the data.

Find the first three principal components of data using PPCA and display the reconstructed observations.

`[coeff,score,pcvar,mu,v,S] = ppca(y,3);`
```Warning: Maximum number of iterations 1000 reached. ```
`S.Recon`
```ans = 13×4 6.8536 25.8700 5.8389 59.8730 1.0433 28.9710 14.9654 51.9738 11.5770 56.5067 8.6352 20.5076 11.0835 31.0722 8.0920 47.0748 7.0679 52.2556 6.0748 33.0598 11.0486 55.0430 9.0534 22.0423 2.8493 70.8691 16.8339 5.8656 1.0333 31.0281 19.6907 44.0306 2.0400 54.0354 18.0440 22.0349 20.7822 46.8091 3.7603 25.8081 ⋮ ```

You can also reconstruct the observations using the principal components and the estimated mean.

`t = score*coeff' + repmat(mu,13,1);`

`load hald`

Here, `ingredients` is a real-valued matrix of predictor variables.

Perform the probabilistic principal components analysis and display coefficients.

`[coeff,score,pcvariance,mu,v,S] = ppca(ingredients,3);`
```Warning: Maximum number of iterations 1000 reached. ```
`coeff`
```coeff = 4×3 -0.0693 -0.6459 0.5673 -0.6786 -0.0184 -0.5440 0.0308 0.7552 0.4036 0.7306 -0.1102 -0.4684 ```

Display the algorithm results at convergence of the PPCA.

`S`
```S = struct with fields: W: [4x3 double] Xexp: [13x3 double] Recon: [13x4 double] v: 0.2372 NumIter: 1000 RMSResid: 0.2340 nloglk: 149.3388 ```

Display the matrix `W`.

`S.W`
```ans = 4×3 0.5624 2.0279 5.4075 4.8320 -10.3894 5.9202 -3.7521 -3.0555 -4.1552 -1.5144 11.7122 -7.2564 ```

Orthogonalizing `W` recovers the coefficients.

`orth(S.W)`
```ans = 4×3 -0.0693 0.6459 0.5673 -0.6786 0.0184 -0.5440 0.0308 -0.7552 0.4036 0.7306 0.1102 -0.4684 ```

## Input Arguments

collapse all

Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of `Y` correspond to observations and columns correspond to variables.

Data Types: `single` | `double`

Number of principal components to return, specified as an integer value less than the rank of data. The maximum possible rank is min(n,p), where n is the number of observations and p is the number of variables. However, if the data is correlated, the rank might be smaller than min(n,p).

`ppca` orders the components based on their variance.

If `K` is min(n,p), `ppca` sets `K` equal to min(n,p) – 1, and `'W0'` is truncated to min(p,n) – 1 columns if you specify a p-by-p `W0` matrix.

For example, you can request only the first three components, based on the component variance as follows.

Example: `coeff = ppca(Y,3)`

Data Types: `single` | `double`

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'W0',init,'Options',opt` specifies that the initial values for `'W0'` are in matrix `init` and `ppca` uses the options defined by `opt`.

Initial value of W in the probabilistic principal component analysis algorithm, specified as a comma-separated pair consisting of `'W0'` and a p-by-k matrix.

Data Types: `single` | `double`

Initial value of residual variance, specified as the comma-separated pair consisting of `'v0'` and a positive scalar value.

Data Types: `single` | `double`

Options for the iterations, specified as a comma-separated pair `'Options'` and a structure created by the `statset` function. `ppca` uses the following fields in the options structure.

 `'Display'` Level of display output. Choices are `'off'`, `'final'`, and `'iter'`. `'MaxIter'` Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the `MaxIter` value is regarded as convergence. `'TolFun'` Positive integer stating the termination tolerance for the cost function. The default is 1e-6. `'TolX'` Positive integer stating the convergence threshold for the relative change in the elements of W. The default is 1e-6.

You can change the values of these fields and specify the new structure in `ppca` using the `'Options'` name-value pair argument.

Example: ```opt = statset('ppca'); opt.MaxIter = 2000; coeff = ppca(Y,3,'Options',opt);```

Data Types: `struct`

## Output Arguments

collapse all

Principal component coefficients, returned as a p-by-k matrix. Each column of `coeff` contains coefficients for one principal component. The columns are in the order of descending component variance, `pcvar`.

Principal component scores, returned as an n-by-k matrix. Rows of `score` correspond to observations, and columns correspond to components.

Principal component variances, which are the eigenvalues of the covariance matrix of `Y`, returned as a column vector.

Estimated mean of each variable in `Y`, returned as a row vector.

Isotropic residual variance, returned as a scalar value.

Final results at convergence, returned as a structure containing the following fields.

 `W` W at convergence. `Xexp` Conditional expectation of the estimated latent variable x. `Recon` Reconstructed observations using k principal components. This is a low dimension approximation of the input data `Y`, and is equal to `mu` + `score`*`coeff'`. `v` Residual variance. `RMSResid` Root mean square of residuals. `NumIter` Number of iteration counts. `nloglk` Negative loglikelihood function value.

collapse all

### Probabilistic Principal Component Analysis

Probabilistic principal component analysis (PPCA) is a method to estimate the principal axes when any data vector has one or more missing values.

PPCA is based on an isotropic error model. It seeks to relate a p-dimensional observation vector y to a corresponding k-dimensional vector of latent (or unobserved) variable x, which is normal with mean zero and covariance I(k). The relationship is

`${y}^{T}=W\ast {x}^{T}+\mu +\epsilon ,$`

where y is the row vector of observed variable, x is the row vector of latent variables, and ε is the isotropic error term. ε is Gaussian with mean zero and covariance of v*I(k), where v is the residual variance. Here, k needs to be smaller than the rank for the residual variance to be greater than 0 (v>0). Standard principal component analysis, where the residual variance is zero, is the limiting case of PPCA. The observed variables, y, are conditionally independent given the values of the latent variables, x. So, the latent variables explain the correlations between the observation variables and the error explains the variability unique to a particular yi. The p-by-k matrix W relates the latent and observation variables, and the vector μ permits the model to have a nonzero mean. PPCA assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values.

Under this model,

`$y~N\left(\mu ,W*{W}^{T}+v*I\left(k\right)\right).$`

There is no closed-form analytical solution for W and v, so their estimates are determined by iterative maximization of the corresponding loglikelihood using an expectation-maximization (EM) algorithm. This EM algorithm handles missing values by treating them as additional latent variables. At convergence, the columns of W spans the subspace, but they are not orthonormal. `ppca` obtains the orthonormal coefficients, `coeff`, for the components by orthogonalization of W.

## References

[1] Tipping, M. E., and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 61, No.3, 1999, pp. 611–622.

[2] Roweis, S. “EM Algorithms for PCA and SPCA.” In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Vol.10 (NIPS 1997), Cambridge, MA, USA: MIT Press, 1998, pp. 626–632.

[3] Ilin, A., and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values.” J. Mach. Learn. Res.. Vol. 11, August, 2010, pp. 1957–2000.

## Version History

Introduced in R2013a