Set Up Multivariate Regression Problems

To fit a multivariate linear regression model using mvregress, you must set up your response matrix and design matrices in a particular way. Given properly formatted inputs, mvregress can handle a variety of multivariate regression problems.

Response Matrix

mvregress expects the n observations of potentially correlated d-dimensional responses to be in an n-by-d matrix, named Y, for example. That is, set up your responses so that the dependency structure is between observations in the same row. If you specify Y as a vector of length n (either a row or column vector), then mvregress assumes that d = 1, and treats the elements as n independent observations. It does not model the vector as one realization of a correlated series (such as a time series).

To illustrate how to set up a response matrix, suppose that your multivariate responses are repeated measurements made on subjects at multiple time points, as in the following figure.

Plot of repeated measurements, where each line corresponds to one subject. The x-axis shows the time points at which the measurements are made.

Suppose that observations within a subject are correlated.

Plot of repeated measurements, where the dark blue points indicate within subject correlation

In this case, set up the response matrix Y such that each row corresponds to a subject, and each column corresponds to a time point.

Response matrix with subjects in rows and time points in columns

Then again, suppose that observations made on subjects at the same time are correlated (concurrent correlation).

Plot of repeated measurements, where the dark blue points indicate between subject correlation

In this case, set up the response matrix Y such that each row corresponds to a time point, and each column corresponds to a subject.

Response matrix with time points in rows and subjects in columns
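As a small sketch of the two layouts, assume the measurements arrive in long format, with one row per measurement. The variable names (subject, timeIdx, y) and the values are hypothetical placeholders, not part of mvregress.

```matlab
% Hypothetical long-format data: 3 subjects, each measured at 4 time points.
subject = repelem((1:3)', 4);     % subject index of each measurement
timeIdx = repmat((1:4)', 3, 1);   % time-point index of each measurement
y       = (1:12)';                % placeholder response values

% Within-subject correlation: rows are subjects, columns are time points.
Ysubj = accumarray([subject timeIdx], y);   % 3-by-4

% Concurrent correlation: rows are time points, columns are subjects.
Ytime = Ysubj';                             % 4-by-3
```

Which layout to pass to mvregress depends only on which observations you assume are correlated; the data themselves are the same.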

Design Matrices

In the multivariate linear regression model, each d-dimensional response has a corresponding design matrix. Depending on the model, the design matrix might consist of exogenous predictor variables, dummy variables, lagged responses, or a combination of these and other covariate terms.

If d > 1 and all d dimensions have the same design matrix, then specify one n-by-p design matrix, where p is the number of predictor variables. To determine an intercept for each dimension, add a column of ones to the design matrix. In this case, mvregress applies the design matrix to all d dimensions.
If d > 1 and not all d dimensions have the same design matrix, then specify the design matrices using a length-n cell array of d-by-K arrays, named X, for example. K is the total number of regression coefficients in the model. Note that the rows of the arrays in X correspond to the columns of the response matrix, Y.

If all n observations have the same design matrix, you can specify a cell array containing one d-by-K design matrix. In this case, mvregress applies the design matrix to all n observations. For example, this situation might arise if the predictors are functions of time, and all observations were measured at the same time points.
In the special case that d = 1, you can specify one n-by-K design matrix (not in a cell array). However, you should consider using fitlm to fit regression models to univariate, continuous responses.
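For the case where all n observations share one design matrix, the following is a minimal sketch on simulated data. It assumes the predictors are an intercept and a linear function of common measurement times; the names t, Xc, and the simulated coefficients are illustrative only.

```matlab
rng(0);                           % reproducible simulated data (illustrative only)
n = 30;  d = 4;
t  = (1:d)';                      % common measurement times for every observation
Xc = [ones(d,1) t];               % one d-by-K design matrix, here K = 2

% Simulated n-by-d responses around a common linear trend in time.
Y = repmat(1 + 0.5*t', n, 1) + randn(n,d);

% A cell array with a single d-by-K matrix is applied to all n observations.
beta = mvregress({Xc}, Y);        % K-by-1 vector: common intercept and slope
```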

The following sections illustrate how to set up some common multivariate regression problems for estimation using mvregress.

Multivariate General Linear Model

The multivariate general linear model is of the form

$Y_{n \times d} = X_{n \times (p + 1)} B_{(p + 1) \times d} + E_{n \times d} .$

In expanded form,

$\begin{bmatrix} y_{11} & y_{12} & \dots & y_{1 d} \\ y_{21} & y_{22} & \dots & y_{2 d} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n 1} & y_{n 2} & \dots & y_{n d} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1 p} \\ 1 & x_{21} & x_{22} & \dots & x_{2 p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n 1} & x_{n 2} & \dots & x_{n p} \end{bmatrix} \begin{bmatrix} β_{01} & β_{02} & \dots & β_{0 d} \\ β_{11} & β_{12} & \dots & β_{1 d} \\ \vdots & \vdots & \ddots & \vdots \\ β_{p 1} & β_{p 2} & \dots & β_{p d} \end{bmatrix} + \begin{bmatrix} ε_{11} & ε_{12} & \dots & ε_{1 d} \\ ε_{21} & ε_{22} & \dots & ε_{2 d} \\ \vdots & \vdots & \ddots & \vdots \\ ε_{n 1} & ε_{n 2} & \dots & ε_{n d} \end{bmatrix} .$

That is, each d-dimensional response has an intercept and p predictor variables, and each dimension has its own set of regression coefficients. In this form, the least squares solution is B = X\Y. To estimate this model using mvregress, use the n-by-d matrix of responses, as above.

If all d dimensions have the same design matrix, use the n-by-(p + 1) design matrix, as above. The column of ones added to the p predictor variables gives each dimension its own intercept.
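A minimal sketch of the shared-design case on simulated data follows. The coefficient values in Btrue and the variable names are assumptions made for illustration. Because every dimension uses the same regressors, the mvregress estimates should agree with the least squares solution X\Y up to numerical tolerance.

```matlab
rng(1);                                % reproducible simulated data (illustrative only)
n = 50;  d = 3;  p = 2;
x = randn(n,p);                        % p predictor variables
X = [ones(n,1) x];                     % n-by-(p+1) design matrix with intercept column

Btrue = [1 2 3; 0.5 -1 0; 2 0 1];      % hypothetical (p+1)-by-d coefficients
Y = X*Btrue + randn(n,d);              % n-by-d response matrix

B    = mvregress(X, Y);                % (p+1)-by-d matrix of estimates
Bols = X\Y;                            % least squares solution for comparison
```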

If not all d dimensions have the same design matrix, reformat the n-by-(p + 1) design matrix into a length-n cell array of d-by-K matrices. Here, K = (p + 1)d for an intercept and slopes for each dimension.

For example, suppose n = 4, d = 3, and p = 2 (two predictor terms in addition to an intercept). This figure shows how to format the ith element in the cell array.

$\begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \end{bmatrix} \begin{bmatrix} β_{01} & β_{02} & β_{03} \\ β_{11} & β_{12} & β_{13} \\ β_{21} & β_{22} & β_{23} \end{bmatrix}}_{\begin{matrix} \downarrow \\ \underbrace{\begin{bmatrix} 1 & 0 & 0 & x_{i 1} & 0 & 0 & x_{i 2} & 0 & 0 \\ 0 & 1 & 0 & 0 & x_{i 1} & 0 & 0 & x_{i 2} & 0 \\ 0 & 0 & 1 & 0 & 0 & x_{i 1} & 0 & 0 & x_{i 2} \end{bmatrix}}_{X\{i\}} \begin{bmatrix} β_{01} \\ β_{02} \\ β_{03} \\ β_{11} \\ β_{12} \\ β_{13} \\ β_{21} \\ β_{22} \\ β_{23} \end{bmatrix} \end{matrix}} + \begin{bmatrix} ε_{11} & ε_{12} & ε_{13} \\ ε_{21} & ε_{22} & ε_{23} \\ ε_{31} & ε_{32} & ε_{33} \\ ε_{41} & ε_{42} & ε_{43} \end{bmatrix}$

If you prefer, you can reshape the K-by-1 vector of coefficients back into a (p + 1)-by-d matrix after estimation.
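The cell array layout and the reshaping step can be sketched as follows. This is an illustration on simulated data, using n = 40 rather than the n = 4 of the figure so that the fit is well determined. The Kronecker product is one convenient way to build each d-by-K block; the variable names and simulated coefficients are hypothetical.

```matlab
rng(2);                                    % reproducible simulated data (illustrative only)
n = 40;  d = 3;  p = 2;
x = randn(n,p);
Y = [ones(n,1) x]*[1 2 3; 0.5 -1 0; 2 0 1] + randn(n,d);

% Element i is [I, x_i1*I, x_i2*I], a d-by-K matrix with K = (p+1)*d.
X = cell(n,1);
for i = 1:n
    X{i} = kron([1 x(i,:)], eye(d));
end

beta = mvregress(X, Y);                    % K-by-1 vector, ordered as in the figure

% Reshape into a (p+1)-by-d matrix: row k holds the coefficients of term k-1.
B = reshape(beta, d, p+1)';
```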

To put constraints on the model parameters, adjust the design matrix accordingly. For example, suppose that the three dimensions in the previous example share common slopes. That is, $β_{11} = β_{12} = β_{13} = β_{1}$ and $β_{21} = β_{22} = β_{23} = β_{2}.$ In this case, each design matrix is 3-by-5, as shown in the following figure.

$\underbrace{\begin{bmatrix} 1 & 0 & 0 & x_{i 1} & x_{i 2} \\ 0 & 1 & 0 & x_{i 1} & x_{i 2} \\ 0 & 0 & 1 & x_{i 1} & x_{i 2} \end{bmatrix}}_{X\{i\}} \begin{bmatrix} β_{01} \\ β_{02} \\ β_{03} \\ β_{1} \\ β_{2} \end{bmatrix}$
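A sketch of the common-slope design on simulated data follows. The intercepts and slopes used to simulate Y are hypothetical values chosen for illustration.

```matlab
rng(3);                                    % reproducible simulated data (illustrative only)
n = 40;  d = 3;  p = 2;
x = randn(n,p);

% Dimension-specific intercepts, slopes shared across the d dimensions.
Y = repmat([1 2 3], n, 1) + repmat(x*[0.5; -1], 1, d) + randn(n,d);

% Element i stacks the identity (intercepts) next to d copies of the predictor row.
X = cell(n,1);
for i = 1:n
    X{i} = [eye(d) repmat(x(i,:), d, 1)];  % 3-by-5
end

beta = mvregress(X, Y);                    % 5-by-1: three intercepts, then two common slopes
```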

Longitudinal Analysis

In a longitudinal analysis, you might measure responses on n subjects at d time points, with correlation between observations made on the same subject. For example, suppose that you measure responses y_ij at times t_ij, i = 1,...,n and j = 1,...,d. In addition, suppose that each subject is in one of two groups (such as male or female), specified by the indicator variable G_i. You could model y_ij as a function of G_i and t_ij, with group-specific intercepts and slopes, as follows:

$y_{i j} = β_{0} + β_{1} G_{i} + β_{2} t_{i j} + β_{3} G_{i} \times t_{i j} + ε_{i j}, i = 1, \dots, n; j = 1, \dots, d,$

where

$ε_{i} = (ε_{i 1}, \dots, ε_{i d})' \sim \mathrm{MVN}(0, Σ) .$

Most longitudinal models include time as an explicit predictor.

To fit this model using mvregress, arrange the responses in an n-by-d matrix, where n is the number of subjects and d is the number of time points. Specify the design matrices in an n-length cell array of d-by-K matrices, where here K = 4 for the four regression coefficients.

For example, suppose d = 5 (five observations per subject). The ith design matrix and corresponding parameter vector for the specified model are shown in the following figure.

$\underbrace{\begin{bmatrix} 1 & G_{i} & t_{i 1} & G_{i} \times t_{i 1} \\ 1 & G_{i} & t_{i 2} & G_{i} \times t_{i 2} \\ 1 & G_{i} & t_{i 3} & G_{i} \times t_{i 3} \\ 1 & G_{i} & t_{i 4} & G_{i} \times t_{i 4} \\ 1 & G_{i} & t_{i 5} & G_{i} \times t_{i 5} \end{bmatrix}}_{X\{i\}} \begin{bmatrix} β_{0} \\ β_{1} \\ β_{2} \\ β_{3} \end{bmatrix}$
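The longitudinal design can be sketched on simulated data as follows. For simplicity, the sketch assumes every subject is measured on the same schedule, although t may differ by subject. The group assignment and coefficient values are hypothetical.

```matlab
rng(4);                                    % reproducible simulated data (illustrative only)
n = 40;  d = 5;
G = double(rand(n,1) > 0.5);               % hypothetical group indicator, one per subject
t = repmat(1:d, n, 1);                     % n-by-d measurement times

Gmat = repmat(G, 1, d);
Y = 1 + 2*Gmat + 0.5*t + 0.3*Gmat.*t + randn(n,d);   % n-by-d responses

% Element i has columns: intercept, group, time, group-by-time interaction.
X = cell(n,1);
for i = 1:n
    ti = t(i,:)';
    X{i} = [ones(d,1), G(i)*ones(d,1), ti, G(i)*ti]; % d-by-4
end

beta = mvregress(X, Y);                    % 4-by-1 vector of regression coefficients
```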

Panel Analysis

In a panel analysis, you might measure responses and covariates on d subjects (such as individuals or countries) at n time points. For example, suppose you measure responses y_tj and covariates x_tj on subjects j = 1,...,d at times t = 1,...,n. A fixed effects panel model with subject-specific fixed effects and concurrent correlation might look like:

$y_{t j} = α_{j} + β x_{t j} + ε_{t j},$

where

$ε_{t} = (ε_{t 1}, \dots, ε_{t d})' \sim \mathrm{MVN}(0, Σ) .$

In contrast to longitudinal models, the panel analysis model typically includes covariates measured at each time point, instead of using time as an explicit predictor.

To fit this model using mvregress, arrange the responses in an n-by-d matrix, such that each column corresponds to a subject. Specify the design matrices in an n-length cell array of d-by-K matrices, where here K = d + 1 for the d intercepts and a slope term.

For example, suppose d = 4 (four subjects). The tth design matrix and corresponding parameter vector are shown in the following figure.

$\underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 & x_{t 1} \\ 0 & 1 & 0 & 0 & x_{t 2} \\ 0 & 0 & 1 & 0 & x_{t 3} \\ 0 & 0 & 0 & 1 & x_{t 4} \end{bmatrix}}_{X\{t\}} \begin{bmatrix} α_{1} \\ α_{2} \\ α_{3} \\ α_{4} \\ β \end{bmatrix}$
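A sketch of the panel design on simulated data follows. The fixed effects in alpha and the common slope of 0.7 are hypothetical values used only to generate the data.

```matlab
rng(5);                                    % reproducible simulated data (illustrative only)
n = 60;  d = 4;
x = randn(n,d);                            % covariate for each time point (row) and subject (column)
alpha = [1 2 3 4];                         % hypothetical subject-specific fixed effects
Y = repmat(alpha, n, 1) + 0.7*x + randn(n,d);

% Element t pairs the identity (one intercept per subject) with the covariates at time t.
X = cell(n,1);
for t = 1:n
    X{t} = [eye(d) x(t,:)'];               % d-by-(d+1)
end

beta = mvregress(X, Y);                    % (d+1)-by-1: d intercepts, then the common slope
```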

Seemingly Unrelated Regression

In a seemingly unrelated regression (SUR), you model d separate regressions, each with its own intercept and slope, but a common error variance-covariance matrix. For example, suppose you measure responses y_ij and covariates x_ij for regression models j = 1,...,d, with i = 1,...,n observations to fit each regression. The SUR model might look like:

$y_{i j} = β_{0 j} + β_{j} x_{i j} + ε_{i j},$

where

$ε_{i} = (ε_{i 1}, \dots, ε_{i d})' \sim \mathrm{MVN}(0, Σ) .$

This model is very similar to the multivariate general linear model, except that it has different covariates for each dimension.

To fit this model using mvregress, arrange the responses in an n-by-d matrix, such that column j has the data for the jth regression model. Specify the design matrices in an n-length cell array of d-by-K matrices, where here K = 2d for d intercepts and d slopes.

For example, suppose d = 3 (three regressions). The ith design matrix and corresponding parameter vector are shown in the following figure.

$\underbrace{\begin{bmatrix} 1 & 0 & 0 & x_{i 1} & 0 & 0 \\ 0 & 1 & 0 & 0 & x_{i 2} & 0 \\ 0 & 0 & 1 & 0 & 0 & x_{i 3} \end{bmatrix}}_{X\{i\}} \begin{bmatrix} β_{01} \\ β_{02} \\ β_{03} \\ β_{1} \\ β_{2} \\ β_{3} \end{bmatrix}$
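A sketch of the SUR design on simulated data follows. The intercepts and slopes used to simulate Y are hypothetical.

```matlab
rng(6);                                    % reproducible simulated data (illustrative only)
n = 60;  d = 3;
x = randn(n,d);                            % column j holds the covariate for regression j
Y = repmat([1 2 3], n, 1) + x.*repmat([0.5 -1 2], n, 1) + randn(n,d);

% Element i pairs the identity (intercepts) with a diagonal block of covariates.
X = cell(n,1);
for i = 1:n
    X{i} = [eye(d) diag(x(i,:))];          % d-by-2d
end

beta = mvregress(X, Y);                    % 2d-by-1: d intercepts, then d slopes
```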

Vector Autoregressive Model

The VAR(p) vector autoregressive model expresses d-dimensional time series responses as a linear function of p lagged d-dimensional responses from previous times. For example, suppose you measure responses y_tj for time series j = 1,...,d at times t = 1,...,n. The VAR(p) model might look like:

$\begin{bmatrix} y_{t 1} \\ y_{t 2} \\ \vdots \\ y_{t d} \end{bmatrix} = \begin{bmatrix} c_{1} \\ c_{2} \\ \vdots \\ c_{d} \end{bmatrix} + \begin{bmatrix} φ_{11}^{(1)} & φ_{12}^{(1)} & \dots & φ_{1 d}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ φ_{d 1}^{(1)} & φ_{d 2}^{(1)} & \dots & φ_{d d}^{(1)} \end{bmatrix} \begin{bmatrix} y_{t - 1, 1} \\ y_{t - 1, 2} \\ \vdots \\ y_{t - 1, d} \end{bmatrix} + \dots + \begin{bmatrix} φ_{11}^{(p)} & φ_{12}^{(p)} & \dots & φ_{1 d}^{(p)} \\ \vdots & \vdots & \ddots & \vdots \\ φ_{d 1}^{(p)} & φ_{d 2}^{(p)} & \dots & φ_{d d}^{(p)} \end{bmatrix} \begin{bmatrix} y_{t - p, 1} \\ y_{t - p, 2} \\ \vdots \\ y_{t - p, d} \end{bmatrix} + \begin{bmatrix} ε_{t 1} \\ ε_{t 2} \\ \vdots \\ ε_{t d} \end{bmatrix},$

where

$ε_{t} = (ε_{t 1}, \dots, ε_{t d})' \sim \mathrm{MVN}(0, Σ) .$

When estimating vector autoregressive models, you typically need to use the first p observations to initialize the model, or provide some other presample response values.

To fit this model using mvregress, arrange the responses in an n-by-d matrix, such that each column corresponds to a time series. Specify the design matrices in an n-length cell array of d-by-K matrices, where here K = d + pd².

For example, suppose d = 2 (two time series) and p = 1 (one lag). The tth design matrix and corresponding parameter vector are shown in the following figure.

$\underbrace{\begin{bmatrix} 1 & 0 & y_{t - 1, 1} & 0 & y_{t - 1, 2} & 0 \\ 0 & 1 & 0 & y_{t - 1, 1} & 0 & y_{t - 1, 2} \end{bmatrix}}_{X\{t\}} \begin{bmatrix} c_{1} \\ c_{2} \\ φ_{11}^{(1)} \\ φ_{21}^{(1)} \\ φ_{12}^{(1)} \\ φ_{22}^{(1)} \end{bmatrix}$
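A sketch of the VAR(1) setup on a simulated series follows. It uses the first observation as the presample value, so the regression has n - 1 rows. The constants in c and Phi are hypothetical, and the Kronecker product is one convenient way to build each block.

```matlab
rng(7);                                    % reproducible simulated data (illustrative only)
n = 100;  d = 2;
c   = [0.5; -0.2];                         % hypothetical constants
Phi = [0.5 0.1; 0.2 0.3];                  % hypothetical lag-1 coefficient matrix (stable)

% Simulate the series, starting from zero.
Yall = zeros(n,d);
for t = 2:n
    Yall(t,:) = (c + Phi*Yall(t-1,:)' + 0.5*randn(d,1))';
end

% Responses are observations 2..n; observation 1 serves as the presample value.
Y = Yall(2:end,:);
X = cell(n-1,1);
for t = 2:n
    X{t-1} = [eye(d) kron(Yall(t-1,:), eye(d))];   % d-by-K, K = d + d^2
end

beta   = mvregress(X, Y);                  % ordered [c1; c2; phi11; phi21; phi12; phi22]
cHat   = beta(1:d);
PhiHat = reshape(beta(d+1:end), d, d);     % column-major order recovers the d-by-d lag matrix
```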

Alternatively, Econometrics Toolbox™ has functions for fitting and forecasting VAR(p) models, including the option to specify exogenous predictor variables.