Part Twenty Four: Time series - Vector Autoregression and multivariate models

life expectancy
time series
vector autoregression
multivariate regression
statistics
Author

Jon Minton

Published

May 26, 2024

Time Series recap

So far in this short series on time series, we’ve looked at time series modelling from some first principles, learning how the types of data and challenges involved in time series analysis are both similar to and different from those of statistical modelling more generally. We started by looking at the concept of autoregression, then differencing and integration, and then the moving average model specification, before combining these three components - AR, I, and MA - to produce the ARIMA model specification common in time series analysis. We then extended the ARIMA specification slightly to deal with seasonally varying data, the ARIMA specification begetting the Seasonal ARIMA modelling framework, or SARIMA. As part of the post on seasonality, we also looked at time series decomposition, using the STL decomposition framework.

Aim of this post

In this post, we’ll take time series in a different direction, to show an application of multivariate regression common in time series, called vector autoregression (VAR). VAR is simpler in some ways, and more complex in others, than SARIMA modelling. It’s simpler in that, as the name suggests, moving average (MA) terms tend not to be part of VAR models, and we won’t be considering seasonality either. But it’s more complicated in the sense that we are jointly modelling two outcomes at the same time.

Model family tree

The following figure aims to show the family resemblances between model specifications and challenges:

flowchart TB 
    uvm(univariate models)
    mvm(multivariate models)

    ar("AR(p)")
    i("I(d)")
    ma("MA(q)")
    arima("ARIMA(p, d, q)")
    sarima("ARIMA(p, d, q)[P, D, Q]_s")
    var(VAR)

    ar --> var
    mvm --> var

    i -.-> var

    uvm -- autoregression --> ar
    uvm -- differencing --> i
    uvm -- moving average --> ma
    ar & i & ma --> arima

    arima -- seasonality -->  sarima

So, the VAR model extends the autoregressive component of a standard, univariate AR(p) specification to multivariate models. It can also include both predictor and response variables that are differenced, hence the dashed line from I(d) to VAR.

So what is a multivariate model?

You might have seen the term multivariate model before, and think you’re familiar with what it means.

In particular, you might have been taught that whereas a univariate regression model looks something like this:

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

A multivariate regression model looks more like this:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]

i.e. You might have been taught that, if the predictors include one term for the intercept (the \(\beta_0\) term) and one term for the slope (the \(\beta_1\) term), then this is a univariate model. But if there are two or more terms that can claim to be ‘the slope’ then this is a multivariate model.

However, this isn’t the real distinction between a univariate model and a multivariate model. To see this distinction we have to return, for the umpteenth time, to the ‘grandmother model’ specification first introduced at the start of the very first post:

Stochastic Component

\[ Y \sim f(\theta, \alpha) \]

Systematic Component

\[ \theta = g(X, \beta) \]

Now, the response data, \(Y\), and the predictor data, \(X\), are both taken from the same rectangular dataset, \(D\). Let’s say this dataset, \(D\), has six rows and five columns. As a matrix it would look something like this:

\[ D = \begin{pmatrix} d_{1,1} & d_{1,2} & d_{1,3} & d_{1, 4} & d_{1,5} \\ d_{2,1} & d_{2,2} & d_{2,3} & d_{2, 4} & d_{2,5} \\ d_{3,1} & d_{3,2} & d_{3,3} & d_{3, 4} & d_{3,5} \\ d_{4,1} & d_{4,2} & d_{4,3} & d_{4, 4} & d_{4,5} \\ d_{5,1} & d_{5,2} & d_{5,3} & d_{5, 4} & d_{5,5} \\ d_{6,1} & d_{6,2} & d_{6,3} & d_{6, 4} & d_{6,5} \end{pmatrix} \]

Here the dataset \(D\) is made up of a whole series of elements \(d_{i,j}\), where the first subscript indicates the row number \(i\) and the second subscript indicates the column number \(j\). So, for example, \(d_{5, 2}\) indicates the value in the 5th row and 2nd column, whereas \(d_{2, 5}\) indicates the value in the 2nd row and 5th column.

Fundamentally, the first challenge in building a model is deciding which columns from \(D\) we put into the predictor matrix \(X\), and which we put into the response matrix \(Y\). For example, if we wanted to predict the third column \(j=3\) given the fifth column \(j=5\), our predictor and response matrices would look as follows:

\[ Y = \begin{pmatrix} d_{1,3} \\ d_{2,3} \\ d_{3,3} \\ d_{4,3} \\ d_{5,3} \\ d_{6,3} \end{pmatrix} \]

\[ X = \begin{pmatrix} 1 & d_{1,5} \\ 1 & d_{2,5} \\ 1 & d_{3,5} \\ 1 & d_{4,5} \\ 1 & d_{5,5} \\ 1 & d_{6,5} \end{pmatrix} \]

Where does the column of 1s come from? This is how we specify, in matrix notation, that we want an intercept term to be calculated. Models don’t have to have intercept terms, but in almost all cases we’re likely to be familiar with, they tend to.
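To make this concrete, R’s model.matrix() function builds exactly this kind of predictor matrix from a formula, including the column of 1s. A minimal sketch using made-up data (the column name x5 here is just illustrative):

Code
# A minimal sketch with simulated data: model.matrix() constructs the X matrix
# for a formula, adding the column of 1s for the intercept automatically.
set.seed(5)
toy <- data.frame(x5 = rnorm(6))

model.matrix(~ x5, data = toy)
# This returns a 6 x 2 matrix: an `(Intercept)` column of 1s, then the x5 column.
# Dropping the intercept is possible too: the formula ~ x5 - 1 removes that column.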

Let’s say we now want to include two columns, 2 and 5, from \(D\) in the predictor matrix, leading to what’s commonly (and wrongly) called a ‘multivariate regression’. This means that \(Y\) stays the same, but \(X\) is now as follows:

\[ X = \begin{pmatrix} 1 & d_{1,2} & d_{1,5}\\ 1 & d_{2,2} & d_{2,5}\\ 1 & d_{3,2} & d_{3,5}\\ 1 & d_{4,2} & d_{4,5}\\ 1 & d_{5,2} & d_{5,5}\\ 1 & d_{6,2} & d_{6,5} \end{pmatrix} \]

No matter how many columns we include in the predictor matrix, \(X\), however, we still don’t have a real multivariate regression model specification. Even if \(X\) had a hundred columns, or a thousand, it would still not be a multivariate regression in the more technical sense of the term.

Instead, here’s an example of a multivariate regression model:

\[ Y = \begin{pmatrix} d_{1,1} & d_{1,3} \\ d_{2,1} & d_{2,3} \\ d_{3,1} & d_{3,3} \\ d_{4,1} & d_{4,3} \\ d_{5,1} & d_{5,3} \\ d_{6,1} & d_{6,3} \end{pmatrix} \]

\[ X = \begin{pmatrix} 1 & d_{1,5} \\ 1 & d_{2,5} \\ 1 & d_{3,5} \\ 1 & d_{4,5} \\ 1 & d_{5,5} \\ 1 & d_{6,5} \end{pmatrix} \]

This is an example of a multivariate regression model. We encountered it before when we used the multivariate normal distribution in post 13, and when we drew from the posterior distribution of Bayesian models in post 14, but this is the first time we’ve considered multivariate modelling in the context of trying to represent something we suspect to be true about the world, rather than our uncertainty about the world. And it’s the first example of multivariate regression we’ve encountered in this series. Every previous model, no matter how apparently disparate, complicated or exotic it may have appeared, has been a univariate regression model in the sense that the response component \(Y\) has only ever contained a single column.
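As a minimal sketch with simulated data (we’ll do this properly with real data below, and all names here are just illustrative), fitting a multivariate regression in R amounts to handing lm() a response with more than one column:

Code
# Toy multivariate regression: cbind() on the left-hand side makes Y a
# two-column matrix, so lm() fits both responses against the same X.
set.seed(5)
toy <- data.frame(y1 = rnorm(6), y3 = rnorm(6), x5 = rnorm(6))

toy_fit <- lm(cbind(y1, y3) ~ x5, data = toy)
coefficients(toy_fit)
# A 2 x 2 coefficient matrix: one column of estimates per response variable.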

So, with this definition of multivariate regression, let’s now look at VAR as a particular application of multivariate regression used in time series.

Vector Autoregression

Let’s start with a semi-technical definition:

In vector autoregression (VAR) the values of two or more outcomes, \(\{Y_1(T), Y_2(T)\}\), are predicted based on previous values of those same outcomes \(\{Y_1(T-k), Y_2(T-k)\}\), for various lag periods \(k\).

Where \(Y\) has two columns and we use an AR(1) specification (i.e. \(k\) is just 1), how is this different from simply running two separate AR(1) regression models, one for \(Y_1\) and the other for \(Y_2\)?

Well, graphically, two separate AR(1) models propose the following paths of influence:

flowchart LR
Y1_T["Y1(T)"]
Y2_T["Y2(T)"]

Y1_T1["Y1(T-1)"]
Y2_T1["Y2(T-1)"]

Y1_T1 --> Y1_T
Y2_T1 --> Y2_T

By contrast, the paths implied and allowed in the corresponding VAR(1) model look more like the following:

flowchart LR
Y1_T["Y1(T)"]
Y2_T["Y2(T)"]

Y1_T1["Y1(T-1)"]
Y2_T1["Y2(T-1)"]

Y1_T1 & Y2_T1 --> Y1_T & Y2_T

So, each of the two outcomes at time T is influenced not only by its own previous value, but also by the previous value of the other outcome. This cross-influence is what is represented in the figure above by the diagonal lines: from Y2(T-1) to Y1(T), and from Y1(T-1) to Y2(T).

Expressed verbally, if we imagine two entities - self and other - tracked through time, self is influenced both by self’s history and by other’s history.
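Written out as equations, the VAR(1) sketched above amounts to the following pair (the coefficient labels here are just illustrative):

\[ Y_1(T) = \beta_{1,0} + \beta_{1,1} Y_1(T-1) + \beta_{1,2} Y_2(T-1) + \epsilon_1 \]

\[ Y_2(T) = \beta_{2,0} + \beta_{2,1} Y_1(T-1) + \beta_{2,2} Y_2(T-1) + \epsilon_2 \]

The cross terms \(\beta_{1,2}\) and \(\beta_{2,1}\) carry the diagonal paths in the figure above; setting them both to zero recovers the two separate AR(1) models.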

Example and application in R

In a more substantively focused post, I discussed how I suspect economic growth and longevity growth trends are correlated. What I proposed doesn’t lend itself exactly to the simplest kind of VAR(1) model specification, because I suggested a longer lag for the influence of economic growth on longevity growth, as well as a change in the fundamentals of growth in both cases. However, as an example of VAR I will ignore these complexities, and use the data I prepared for that post:

Code
library(tidyverse)

gdp_growth_pct_series <- read_csv("../../../still-the-economy/both_series.csv") 

gdp_growth_pct_series
# A tibble: 147 × 5
    ...1  year series            pct_change period
   <dbl> <dbl> <chr>                  <dbl> <chr> 
 1     1  1949 1. Per Capita GDP     NA     Old   
 2     2  1950 1. Per Capita GDP      2.20  Old   
 3     3  1951 1. Per Capita GDP      2.92  Old   
 4     4  1952 1. Per Capita GDP      1.36  Old   
 5     5  1953 1. Per Capita GDP      5.07  Old   
 6     6  1954 1. Per Capita GDP      3.87  Old   
 7     7  1955 1. Per Capita GDP      3.55  Old   
 8     8  1956 1. Per Capita GDP      1.28  Old   
 9     9  1957 1. Per Capita GDP      1.49  Old   
10    10  1958 1. Per Capita GDP      0.815 Old   
# ℹ 137 more rows

We need to do a certain amount of reformatting to bring this into a useful format:

Code
wide_ts_series <- 
gdp_growth_pct_series |>
    select(-c(`...1`, period)) |>
    mutate(
        short_series = case_when(
            series == "1. Per Capita GDP" ~ 'gdp',
            series == "2. Life Expectancy at Birth" ~ 'e0',
            TRUE ~ NA_character_
        )
    ) |>
    select(-series) |>
    pivot_wider(names_from = short_series, values_from = pct_change) |>
    arrange(year) |>
    mutate(
        lag_gdp = lag(gdp),
        lag_e0 = lag(e0)
    ) %>%
    filter(complete.cases(.))


wide_ts_series
# A tibble: 71 × 5
    year   gdp      e0 lag_gdp  lag_e0
   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
 1  1951 2.92  -0.568    2.20   0.763 
 2  1952 1.36   1.89     2.92  -0.568 
 3  1953 5.07   0.388    1.36   1.89  
 4  1954 3.87   0.530    5.07   0.388 
 5  1955 3.55  -0.0570   3.87   0.530 
 6  1956 1.28   0.385    3.55  -0.0570
 7  1957 1.49   0.170    1.28   0.385 
 8  1958 0.815  0.255    1.49   0.170 
 9  1959 3.70   0.184    0.815  0.255 
10  1960 5.72   0.268    3.70   0.184 
# ℹ 61 more rows

So, we can map columns to parts of the VAR specification as follows:

  • Y1: gdp
  • Y2: e0 (life expectancy at birth)
  • period T: gdp and e0
  • period T-1: lag_gdp and lag_e0

To include two or more variables in the response part, \(Y\), of a linear model, we can use the cbind() function to combine them on the left-hand side of the regression formula passed to lm or glm:

Code
var_model <- lm(
    cbind(gdp, e0) ~ lag_gdp + lag_e0,
    data = wide_ts_series
)

var_model

Call:
lm(formula = cbind(gdp, e0) ~ lag_gdp + lag_e0, data = wide_ts_series)

Coefficients:
             gdp       e0      
(Intercept)   1.99440   0.28108
lag_gdp       0.06173   0.01179
lag_e0       -0.48525  -0.33063

We can see here that the model reports a small matrix of coefficients: three rows (one for each coefficient term) and two columns (one for each of the response variables). This is as we should expect.
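As an aside, fitting the model via lm() is not the only option: the dedicated vars package, for example, provides a VAR() function which takes just the outcome columns and constructs the lagged predictors internally. A minimal sketch, assuming the package is installed (it isn’t used elsewhere in this post):

Code
# Alternative fit using the vars package (assumed installed; not run here).
# VAR() takes the outcome columns only and builds the lags itself.
library(vars)
var_fit <- VAR(as.data.frame(wide_ts_series[, c("gdp", "e0")]), p = 1, type = "const")
summary(var_fit)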

Back in part 12 of the series, we saw we could extract the coefficients, variance-covariance matrix, and error terms of a linear regression model using the functions coefficients, vcov, and sigma respectively.1 Let’s use those functions here too:

First, the coefficients:

Code
coefficients(var_model)
                    gdp          e0
(Intercept)  1.99439587  0.28108028
lag_gdp      0.06172721  0.01178781
lag_e0      -0.48524808 -0.33062640

And now the variance-covariance matrix:

Code
vcov(var_model)
                gdp:(Intercept)   gdp:lag_gdp    gdp:lag_e0 e0:(Intercept)
gdp:(Intercept)    0.1804533467 -0.0272190933 -0.1129646084   0.0039007790
gdp:lag_gdp       -0.0272190933  0.0164590434 -0.0180517467  -0.0005883829
gdp:lag_e0        -0.1129646084 -0.0180517467  0.6290771233  -0.0024419053
e0:(Intercept)     0.0039007790 -0.0005883829 -0.0024419053   0.0037904364
e0:lag_gdp        -0.0005883829  0.0003557878 -0.0003902165  -0.0005717391
e0:lag_e0         -0.0024419053 -0.0003902165  0.0135984779  -0.0023728303
                   e0:lag_gdp     e0:lag_e0
gdp:(Intercept) -0.0005883829 -0.0024419053
gdp:lag_gdp      0.0003557878 -0.0003902165
gdp:lag_e0      -0.0003902165  0.0135984779
e0:(Intercept)  -0.0005717391 -0.0023728303
e0:lag_gdp       0.0003457235 -0.0003791783
e0:lag_e0       -0.0003791783  0.0132138133

And finally, the error terms:

Code
sigma(var_model)
      gdp        e0 
2.6906051 0.3899529 

The coefficients output is the same kind of 3x2 matrix we saw previously: two models run simultaneously. The error terms now form a vector of length 2: one for each of these models. The variance-covariance matrix is a square matrix of dimension 6, i.e. 6 rows and 6 columns. This is the number of predictor coefficients in each model (the number of columns of \(X\), i.e. 3) times the number of models simultaneously run, i.e. 2.

\(6^2\) is 36, which is the number of elements in the variance-covariance matrix of this VAR model. By contrast, if we had run two independent models - one for gdp and the other for e0 - we would have two 3x3 variance-covariance matrices, giving a total of 18 terms.2 This should provide some reassurance that, when we run a multivariate regression model of two outcomes, we’re not just doing the equivalent of running separate regression models for each outcome in slightly fewer lines.
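To check this claim quickly, we can fit the two outcomes separately and compare the dimensions of their variance-covariance matrices with that of the joint model (a sketch using the same data as above):

Code
# Fitting the two outcomes separately for comparison: each model has its own
# 3 x 3 variance-covariance matrix (9 + 9 = 18 terms), with no covariances
# estimated between the gdp and e0 equations.
gdp_only <- lm(gdp ~ lag_gdp + lag_e0, data = wide_ts_series)
e0_only  <- lm(e0  ~ lag_gdp + lag_e0, data = wide_ts_series)

dim(vcov(gdp_only))
dim(vcov(e0_only))
dim(vcov(var_model))  # 6 x 6 for the joint (VAR) model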

Now, let’s look at the model summary:

Code
summary(var_model)
Response gdp :

Call:
lm(formula = gdp ~ lag_gdp + lag_e0, data = wide_ts_series)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.679  -1.061   0.143   1.483   6.478 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.99440    0.42480   4.695 1.34e-05 ***
lag_gdp      0.06173    0.12829   0.481    0.632    
lag_e0      -0.48525    0.79314  -0.612    0.543    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.691 on 68 degrees of freedom
Multiple R-squared:  0.007555,  Adjusted R-squared:  -0.02163 
F-statistic: 0.2588 on 2 and 68 DF,  p-value: 0.7727


Response e0 :

Call:
lm(formula = e0 ~ lag_gdp + lag_e0, data = wide_ts_series)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.45680 -0.19230 -0.00385  0.24379  1.38706 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.28108    0.06157   4.565 2.15e-05 ***
lag_gdp      0.01179    0.01859   0.634  0.52823    
lag_e0      -0.33063    0.11495  -2.876  0.00537 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.39 on 68 degrees of freedom
Multiple R-squared:  0.1086,    Adjusted R-squared:  0.08243 
F-statistic: 4.144 on 2 and 68 DF,  p-value: 0.02003

The summary is now reported for each of the two outcomes: first gdp, then e0.

Remember that each outcome is the annual percentage change in the quantity of interest from the previous year, i.e. both series have already been ‘differenced’ to produce approximately stationary series. It also means that the intercept terms are especially important, as they indicate the long-term trends observed in each series.

In this case the intercepts for both series are positive and statistically significant: over the long term, GDP has grown on average around 2% each year, and life expectancy by around 0.28%. As the post this relates to makes clear, however, these long-term trends may no longer apply.

Of the four lag (AR(1)) terms in the model(s), three are not statistically significant; not even close. The exception is the lag_e0 term in the e0 response model, which is statistically significant and negative. Its coefficient is also of similar magnitude to the intercept.

What does this mean in practice? In effect, that annual mortality improvement trends have a tendency to oscillate: a better-than-average year tends to be followed by a worse-than-average year, and a worse-than-average year to be followed by a better-than-average year, in both cases at higher-than-chance rates.
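To get a feel for what a negative AR(1) coefficient implies, here is a minimal simulation sketch, using a coefficient roughly matching the lag_e0 estimate above:

Code
# Simulate a series with a negative AR(1) coefficient (about -0.33, roughly the
# lag_e0 estimate above): high values tend to be followed by low values and
# vice versa, producing the oscillation described in the text.
set.seed(24)
sim_oscillation <- arima.sim(model = list(ar = -0.33), n = 100)
plot(sim_oscillation)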

What could be the cause of this oscillatory phenomenon? When it comes to longevity, the phenomenon is somewhat well understood (though perhaps not widely enough), and referred to as either ‘forward mortality displacement’ or, more chillingly, ‘harvesting’. This outcome likely comes about because, if there is an exceptionally bad year in terms of (say) influenza mortality, the most frail and vulnerable are likely to be those who die disproportionately from this additional mortality event. This means that the ‘stock’ of people remaining the following year has been selected, on average, to be slightly less frail and vulnerable than those who started the previous year. Similarly, an exceptionally ‘good’ year can mean that the average ‘stock’ of the population in the following year is slightly more frail than in an average year, so more susceptible to mortality. And so, by this means, comparatively-bad-years tend to be followed by comparatively-good-years, and comparatively-good-years by comparatively-bad-years.

Though this general process is not pleasant to think about or reason through, statistical signals such as the negative AR(1) coefficient identified here tend to keep appearing, whether we are looking for them or not.
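Finally, as a quick sketch of how the fitted model can be used, we can produce a one-step-ahead forecast by plugging the most recent observed values in as the lags; predict() on a model with a multi-column response should return one prediction per outcome:

Code
# One-step-ahead forecast sketch: use the latest observed gdp and e0 changes
# as next year's lagged predictors.
latest <- tail(wide_ts_series, 1)
predict(
    var_model,
    newdata = tibble(lag_gdp = latest$gdp, lag_e0 = latest$e0)
)
# One predicted % change per outcome (gdp and e0) for the following year.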

Conclusion

In this post we’ve both concluded the time series subseries, and returned to and expanded on a few posts earlier in the series. This includes the very first post, where we were first introduced to the grandmother formulae, the posts on statistical modelling using both frequentist and Bayesian methods, and a substantive post linking life expectancy with economic growth.

Although we’ve first encountered multivariate regression models in the context of time series, they are a much more general phenomenon. Pretty much any type of model we can think of and apply in a univariate fashion - where \(Y\) has just a single column - can conceivably be expanded to two or more columns, leading to more complicated multivariate regression variants.

Footnotes

  1. A primary aim of extracting these components from a linear regression in this way is to allow something approximating a Bayesian posterior distribution of coefficients to be generated, using a multivariate normal distribution (the first place we actually encountered a multivariate regression), without using a Bayesian modelling approach. This allows for the estimation and propagation of ‘honest uncertainty’ in predicted and expected outcomes. However, as we saw in part 13, it can sometimes be as or more straightforward to just use a Bayesian modelling approach.↩︎

  2. i.e. two times three squared.↩︎