---
title: "Bayesian Autoregressions"
author:
  - name: Tomasz Woźniak
    url: https://github.com/donotdespair
    orcid: 0000-0003-2212-2378
  - name: Jonas Loopers Davidsen
    url: https://github.com/jonasld23
  - name: Filippo Dell'Andrea
    url: https://github.com/IoTiFeelo
  - name: Corwin Getty
    url: https://github.com/Wintersend
  - name: Ray Gomez
    url: https://github.com/rayccg
  - name: Xiaoman Guo
    url: https://github.com/mandyxmg
  - name: Manting Huang
    url: https://github.com/mantihuang
  - name: Nathan Huynh
    url: https://github.com/nathanh93
  - name: Eungwon Lee
    url: https://github.com/eung1
  - name: Thomas Kronholm Møller
    url: https://github.com/ThomasKronhol
  - name: Victoria Sonnemans
    url: https://github.com/vsonnemans
  - name: Yobin Timilsena
    url: https://github.com/yobin-tim
  - name: Django Trueman-Greinke
    url: https://github.com/DjangoTG
  - name: Hanwen Zhang
    url: https://github.com/hanwenzhang0317
execute:
  echo: false
citation:
  issued: 2023-05-25
  url: https://donotdespair.github.io/Bayesian-Autoregressions/
  doi: 10.26188/23255657
bibliography: references.bib
editor:
  markdown:
    wrap: 72
---
> **Abstract.** We present the basics of Bayesian estimation and
> inference for autoregressive models. The range of topics includes the
> natural conjugate analysis using normal-inverted-gamma 2 prior
> distribution and its extensions focusing on hierarchical modelling,
> conditional heteroskedasticity, and Student-t error terms. We focus on
> forecasting and sampling from the predictive density.
>
> **Keywords.** Autoregressions, Bayesian Inference, Forecasting,
> Heteroskedasticity, Hierarchical Modelling, Natural Conjugacy,
> Shrinkage Prior
# Autoregressions
Autoregressions are a popular class of linear models that are
particularly useful for analysing the persistence of a time series and
forecasting a random variable's unknown future values. The simplicity of
their formulation and estimation, and the range of applications in which
they prove useful, explain their continued popularity.
## The AR($p$) model
The model is set for a univariate time series whose observation at time
$t$ is denoted by $y_t$. It includes a $d$-vector $d_t$ of deterministic
terms and $p$ lags of the dependent variable on the right-hand side of
the model equation. It is complemented by an error term $u_t$ that, in
this note, is zero-mean normally distributed with variance $\sigma^2$.
Then the model equations are: \begin{align}
y_t &= \alpha_d' d_t + \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p} + u_t\\
u_t\mid d_t, y_{t-1}, \dots, y_{t-p} &\sim\mathcal{N}\left(0, \sigma^2\right)
\end{align} where $\alpha_d$ is a $d$-vector of coefficients on
deterministic terms, and parameters $\alpha_1,\dots,\alpha_p$ are
autoregressive slopes.
## Matrix notation for the model
To simplify the notation and the derivations, introduce matrix notation
for the model. Let $T$ be the available sample size for the variable
$y$. Define a $T$-vector of zeros, $\mathbf{0}_T$, the identity matrix
of order $T$, $\mathbf{I}_T$, and the $T\times1$ vectors: \begin{align}
\mathbf{y} = \begin{bmatrix} y_1\\ \vdots \\ y_T\end{bmatrix}, \quad
\text{ and }\quad
\mathbf{u} = \begin{bmatrix} u_1\\ \vdots \\ u_T\end{bmatrix},
\end{align} a $k\times1$ vector
$\mathbf{x}_t = \begin{bmatrix}d_t' & y_{t-1}&\dots& y_{t-p} \end{bmatrix}'$,
where $k=d+p$, and a $T\times k$ matrix collecting the explanatory
variables: \begin{align}
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1'\\ \vdots \\ \mathbf{x}_T'\end{bmatrix}.
\end{align} Collect the parameters of the conditional mean equation in a
$k$-vector: \begin{align}
\boldsymbol\alpha = \begin{bmatrix} \alpha_d'& \alpha_1 & \dots & \alpha_p\end{bmatrix}'.
\end{align}
Then the model can be written in a concise notation as: \begin{align}
\mathbf{y} &= \mathbf{X} \boldsymbol\alpha + \mathbf{u}\\
\mathbf{u}\mid \mathbf{X} &\sim\mathcal{N}_T\left(\mathbf{0}_T, \sigma^2\mathbf{I}_T\right).
\end{align}
## Likelihood function
The model equations imply the predictive density of the data vector
$\mathbf{y}$. To see this, consider the model equation as a linear
transformation of a normal vector $\mathbf{u}$. Therefore, the data
vector follows a multivariate normal distribution given by:
\begin{align}
\mathbf{y}\mid \mathbf{X}, \boldsymbol\alpha, \sigma^2 &\sim\mathcal{N}_T\left(\mathbf{X} \boldsymbol\alpha, \sigma^2\mathbf{I}_T\right).
\end{align}
This distribution determines the shape of the likelihood function that
is defined as the sampling data density: \begin{align}
L(\boldsymbol\alpha,\sigma^2|\mathbf{y}, \mathbf{X})\equiv p\left(\mathbf{y}\mid \mathbf{X}, \boldsymbol\alpha, \sigma^2 \right).
\end{align}
The likelihood function, treated for the purpose of estimation as a
function of the parameters $\boldsymbol\alpha$ and $\sigma^2$ after the
data are plugged in for $\mathbf{y}$ and $\mathbf{X}$, is given by:
\begin{align}
L(\boldsymbol\alpha,\sigma^2|\mathbf{y}, \mathbf{X}) =
(2\pi)^{-\frac{T}{2}}\left(\sigma^2\right)^{-\frac{T}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol\alpha)'(\mathbf{y} - \mathbf{X}\boldsymbol\alpha)\right\}.
\end{align}
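To make the likelihood concrete, the following short sketch, which is not part of the original derivations, evaluates this Gaussian log-likelihood for an AR(1) with a constant at given parameter values; the function name `ar.loglik` and the temporary objects `y.tmp`, `Y.tmp`, and `X.tmp` are illustrative only.
```{r}
#| echo: true
# Illustrative sketch: evaluate the Gaussian log-likelihood of an AR(1) with a constant
ar.loglik = function(alpha, sigma2, Y, X) {
  u = Y - X %*% alpha          # residuals y - X alpha
  as.numeric(-0.5 * nrow(Y) * log(2 * pi * sigma2) - 0.5 * crossprod(u) / sigma2)
}
# temporary data: a short simulated random walk
set.seed(1)
y.tmp = cumsum(rnorm(50))
Y.tmp = as.matrix(y.tmp[2:50])
X.tmp = cbind(1, y.tmp[1:49])
ar.loglik(alpha = c(0, 1), sigma2 = 1, Y = Y.tmp, X = X.tmp)
```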
# Natural-Conjugate Analysis
> **Authors:** Thomas Kronholm Møller & Jonas Loopers Davidsen
## Likelihood as normal-inverted gamma 2
In order to facilitate deriving the posterior distribution, the likelihood function can be rewritten in the form of a $\mathcal{NIG}2$ distribution kernel given by:
```{=tex}
\begin{align}
L(\boldsymbol\alpha,\sigma^2|\mathbf{y}, \mathbf{X}) &\propto \left(\sigma^2\right)^{-\frac{T-(k+2)+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}\left(\boldsymbol\alpha-\boldsymbol{\hat{\alpha}}\right)'\mathbf{X}'\mathbf{X}\left(\boldsymbol\alpha-\boldsymbol{\hat{\alpha}}\right) \right\}\\
&\qquad\times\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right)'\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right) \right\},
\end{align}
```
where $\boldsymbol{\hat{\alpha}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is the maximum likelihood estimator of $\boldsymbol\alpha$. It is now quite straightforward to identify the $\mathcal{NIG}2$ kernel. Recalling that the $\mathcal{NIG}2$ distribution is characterized by its four parameters, we obtain the following outcome:
```{=tex}
\begin{align}
L(\boldsymbol\alpha,\sigma^2|\mathbf{y}, \mathbf{X}) \propto \mathcal{NIG}2\left(\mu=\boldsymbol{\hat{\alpha}},\Sigma=\left(\mathbf{X}'\mathbf{X}\right)^{-1},s=\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right)'\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right),\nu=T-k-2\right)
\end{align}
```
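As an illustration of these quantities, and assuming the temporary objects `Y.tmp` and `X.tmp` from the sketch above, the following code computes the maximum likelihood estimator and the implied $\mathcal{NIG}2$ characterisation of the likelihood; it is a sketch, not part of the original analysis.
```{r}
#| echo: true
# Illustrative: likelihood-implied NIG2 parameters (mu = MLE, Sigma = (X'X)^{-1}, s = SSR, nu = T - k - 2)
alpha.hat = solve(crossprod(X.tmp), crossprod(X.tmp, Y.tmp))   # MLE of alpha
ssr       = as.numeric(crossprod(Y.tmp - X.tmp %*% alpha.hat)) # sum of squared residuals
list(mu = alpha.hat, Sigma = solve(crossprod(X.tmp)),
     s = ssr, nu = nrow(Y.tmp) - ncol(X.tmp) - 2)
```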
## Normal-inverted gamma 2 prior
The natural-conjugate prior distribution takes the form for the parameters implied by the likelihood function discussed above. The normal-inverted gamma 2 prior can thus be written as:
```{=tex}
\begin{align}
p(\boldsymbol\alpha,\sigma^2) &= p(\boldsymbol\alpha|\sigma^2)p(\sigma^2)\\
p(\boldsymbol\alpha|\sigma^2) &= \mathcal{N}(\underline{\boldsymbol\alpha},\sigma^2\underline{\mathbf{V}}_{\boldsymbol\alpha})\\
p(\sigma^2) & = \mathcal{IG}2(\underline{s},\underline{\nu})
\end{align}
```
Using the distributions above, we can write the kernel of the $\mathcal{NIG}2$ prior as:
```{=tex}
\begin{align}
\mathcal{NIG}2_{k}(\underline{\boldsymbol\alpha},\underline{\mathbf{V}}_{\boldsymbol\alpha}, \underline{s}, \underline{\nu}) \propto (\sigma^2)^{-\frac{\underline{\nu}+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}(\boldsymbol\alpha-\underline{\boldsymbol\alpha})'\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}(\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\}\exp\left\{-\frac{1}{2}\frac{\underline{s}}{\sigma^2}\right\}
\end{align}
```
## Normal-inverted gamma 2 posterior
The product of the prior distribution and the likelihood function as introduced above gives the posterior distribution given by:
```{=tex}
\begin{align}
p(\boldsymbol\alpha,\sigma^2|\mathbf{y},\mathbf{X}) \propto L(\mathbf{y}|\mathbf{X}, \boldsymbol\alpha, \sigma^2)p(\boldsymbol\alpha,\sigma^2) = L(\mathbf{y}|\mathbf{X}, \boldsymbol\alpha, \sigma^2)p( \boldsymbol\alpha| \sigma^2)p(\sigma^2)
\end{align}
```
Now plugging in the expressions for the likelihood and the prior distribution:
```{=tex}
\begin{align}
p(\boldsymbol\alpha,\sigma^2|\mathbf{y},\mathbf{X}) &\propto \left(\sigma^2\right)^{-\frac{T-(k+2)+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}\left(\boldsymbol\alpha-\boldsymbol{\hat{\alpha}}\right)'\mathbf{X}'\mathbf{X}\left(\boldsymbol\alpha-\boldsymbol{\hat{\alpha}}\right) \right\}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right)'\left(\mathbf{y}-\mathbf{X}\boldsymbol{\hat{\alpha}}\right) \right\} \\
&\times (\sigma^2)^{-\frac{\underline{\nu}+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}(\boldsymbol\alpha-\underline{\boldsymbol\alpha})'\underline{\mathbf{V}}_{\boldsymbol\alpha}^{-1}(\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\}\exp\left\{-\frac{1}{2}\frac{\underline{s}}{\sigma^2}\right\}
\end{align}
```
By collecting all terms and simplifying the expressions, the joint posterior distribution can be derived to be:
```{=tex}
\begin{align}
p(\boldsymbol\alpha,\sigma^2|\mathbf{y},\mathbf{X}) &\propto (\sigma^2)^{-\frac{T+\underline{\nu}+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2} (\boldsymbol\alpha-\overline{\boldsymbol\alpha})'\overline{\mathbf{V}}_{\boldsymbol\alpha}^{-1}(\boldsymbol\alpha-\overline{\boldsymbol\alpha})\right\} \times \exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}\left(\underline{s}-\overline{\boldsymbol\alpha}'\overline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\overline{\boldsymbol\alpha}+\underline{\boldsymbol\alpha}'\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\underline{\boldsymbol\alpha}+\mathbf{y}'\mathbf{y}\right)\right\}
\\&= (\sigma^2)^{-\frac{\overline{\nu}+k+2}{2}}\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2} (\boldsymbol\alpha-\overline{\boldsymbol\alpha})'\overline{\mathbf{V}}_{\boldsymbol\alpha}^{-1}(\boldsymbol\alpha-\overline{\boldsymbol\alpha})\right\} \times \exp\left\{-\frac{1}{2}\frac{\overline{s}}{\sigma^2}\right\}
\end{align}
```
This fully defines the joint posterior distribution, as it is in normal-inverted gamma 2 form with the corresponding four parameters:
```{=tex}
\begin{align}
p(\boldsymbol\alpha,\sigma^2|\mathbf{y},\mathbf{X}) &= \mathcal{NIG}2_{k}\left(\overline{\boldsymbol\alpha},\overline{\mathbf{V}}_{\boldsymbol\alpha},\overline{s},\overline{\nu}\right)\\
\overline{\boldsymbol\alpha} &= \overline{\mathbf{V}}_{\boldsymbol\alpha}( \underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\underline{\boldsymbol\alpha}+\mathbf{X}'\mathbf{y})\\
\overline{\mathbf{V}}_{\boldsymbol\alpha} &=\left(\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}+\mathbf{X}'\mathbf{X}\right)^{-1} \\
\overline{s} &= \underline{s}-\overline{\boldsymbol\alpha}'\overline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\overline{\boldsymbol\alpha}+\underline{\boldsymbol\alpha}'\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\underline{\boldsymbol\alpha}+\mathbf{y}'\mathbf{y}\\
\overline{\nu} &= \underline{\nu}+T
\end{align}
```
## Sampling draws from the posterior
We start by generating a random walk, which can be used to validate that our estimation is indeed correct.
```{r}
#| echo: true
#| message: false
#| warning: false
T = 100                                     # sample size
N = 1                                       # number of variables
y = apply(matrix(rnorm(T * N), ncol = N), 2, cumsum)   # simulate a random walk
y = as.matrix(y)
p = 1                                       # autoregressive lag order
d = 1                                       # number of deterministic terms (constant)
k = p + d                                   # number of conditional mean parameters
T = nrow(y)
Y = as.matrix(y[(p+1):T,])                  # left-hand-side vector y
X = cbind(rep(1, T - p), y[1:(T - p),])     # regressors: constant and first lag
```
Now, defining our priors for $\underline{\boldsymbol\alpha}$, $\underline{\mathbf{V}}_{\boldsymbol\alpha}$, $\underline{s}$ and $\underline{\nu}$:
```{r}
#| echo: true
#| message: false
#| warning: false
priors = list(
  alpha = as.matrix(rep(0, k)),  # prior mean of alpha
  Sigma = diag(2),               # prior covariance V_alpha (k x k)
  S     = 1,                     # prior scale s
  nu    = 3                      # prior degrees of freedom nu
)
```
and computing the function for the posterior parameters:
```{r}
#| echo: true
#| message: false
#| warning: false
posterior = function(y, priors){
  # NIG2 posterior parameters (uses Y and X defined above)
  Sigma.inv = t(X) %*% X + solve(priors$Sigma)    # V.bar^{-1} = V^{-1} + X'X
  Sigma     = solve(Sigma.inv)                    # V.bar
  alpha     = Sigma %*% (t(X) %*% Y + solve(priors$Sigma) %*% priors$alpha)   # alpha.bar
  S         = as.numeric(t(Y) %*% Y + priors$S + t(priors$alpha) %*% solve(priors$Sigma) %*% priors$alpha
                         - t(alpha) %*% Sigma.inv %*% alpha)                  # s.bar
  nu        = nrow(Y) + priors$nu                 # nu.bar = nu + T
  return(list(
    Sigma = Sigma,
    alpha = alpha,
    S     = S,
    nu    = nu
  ))
}
post = posterior(y = y, priors = priors)
```
We can then sample from the joint posterior of the parameters by direct Monte Carlo sampling from the $\mathcal{NIG}2$ distribution, as implemented below: draw $\sigma^2$ from its inverted gamma 2 marginal and then $\boldsymbol\alpha$ from the conditional normal distribution.
```{r}
#| echo: true
#| message: false
#| warning: false
posterior.draws = function(S, posterior){
  # direct sampling from the NIG2 posterior:
  # sigma^2 ~ IG2(s.bar, nu.bar), then alpha | sigma^2 ~ N(alpha.bar, sigma^2 V.bar)
  Sigma2.posterior = as.matrix(posterior$S / rchisq(S, posterior$nu))
  alpha.posterior  = simplify2array(
    lapply(1:S, function(i){
      mvtnorm::rmvnorm(1, mean = posterior$alpha, sigma = Sigma2.posterior[i,] * posterior$Sigma)
    })
  )
  output = cbind(t(alpha.posterior[1,,]), Sigma2.posterior)  # columns: alpha draws, sigma^2 draws
  return(output)
}
draws = posterior.draws(S = 1000, posterior = post)
```
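As a quick check, which is not part of the original text, the posterior means of the draws can be compared with the random-walk data-generating process, for which $\boldsymbol\alpha \approx (0, 1)'$ and $\sigma^2 \approx 1$:
```{r}
#| echo: true
# Illustrative check: posterior means of (constant, AR(1) slope, sigma^2)
round(colMeans(draws), 2)
```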
# Hierarchical Prior Analysis
## Estimating autoregressive prior shrinkage
### Inverted gamma 2 scale mixture of normal
Suppose the scalar scale $\kappa_{\boldsymbol\alpha}$ follows an inverted gamma 2 distribution:
\begin{align}
\kappa_{\boldsymbol\alpha} \sim \mathcal{IG}2 (\underline{s}_{\boldsymbol\alpha},\underline{\nu}_{\boldsymbol\alpha})
\end{align}
and the priors are:
\begin{align}
\boldsymbol\alpha | \sigma^2, \kappa_{\boldsymbol\alpha} &\sim \mathcal{N}_{k}(\underline{\boldsymbol\alpha}, \sigma^2 \kappa_{\boldsymbol\alpha} \underline{\mathbf{V}} _{\boldsymbol\alpha}) \\
\sigma^2 &\sim \mathcal{IG}2(\underline{s},\underline{\nu}) \\
p(\boldsymbol\alpha, \sigma^2, \kappa_{\boldsymbol\alpha}) &= p(\boldsymbol\alpha | \sigma^2, \kappa_{\boldsymbol\alpha}) p(\sigma^2) p(\kappa_{\boldsymbol\alpha})
\end{align}
Then, the joint distribution of $(\boldsymbol\alpha,\sigma^2)$ given $\kappa_{\boldsymbol\alpha}$ follows a normal-inverted gamma 2 distribution:
\begin{align*}
p\left(\boldsymbol\alpha,\sigma^2 \mid \kappa_{\boldsymbol\alpha}\right) = \mathcal{NIG}2_k(\underline{\boldsymbol\alpha},\kappa_{\boldsymbol\alpha} \underline{\mathbf{V}} _{\boldsymbol\alpha}, \underline{s},\underline{\nu})
\end{align*}
A **Gibbs Sampler** can be applied using the following steps:
Initialize $\kappa_{\boldsymbol\alpha}$ at $\kappa_{\boldsymbol\alpha}^{(0)}$.
| At each iteration s:
| Step 1. Draw $(\boldsymbol\alpha,\sigma^2)^{(s)} \sim p(\boldsymbol\alpha,\sigma^2|\mathbf{X},\mathbf{y},\kappa_{\boldsymbol\alpha}^{(s-1)}) = \mathcal{NIG}2(\overline{\boldsymbol\alpha},\overline{\mathbf{V}},\overline{s},\overline{\nu})$.
| Step 2. Draw $\kappa_{\boldsymbol\alpha}^{(s)} \sim p(\kappa_{\boldsymbol\alpha}|\mathbf{X},\mathbf{y},(\boldsymbol\alpha,\sigma^2)^{(s)}) = \mathcal{IG}2(\overline{s}_{\boldsymbol\alpha},\overline{\nu}_{\boldsymbol\alpha})$.
Repeat steps 1 and 2 $(S_1+S_2)$ times and discard the first $S_1$ draws.
The **full conditional posterior** of $\kappa_{\boldsymbol\alpha}$ is:
\begin{gather*}
p(\kappa_{\boldsymbol\alpha}|\mathbf{X},\mathbf{y},\boldsymbol\alpha,\sigma^2) \propto L(\mathbf{y}|\mathbf{X},\boldsymbol\alpha,\sigma^2) p(\kappa_{\boldsymbol\alpha}) p(\boldsymbol\alpha|\sigma^2,\kappa_{\boldsymbol\alpha}) p(\sigma^2) \\
\propto p(\kappa_{\boldsymbol\alpha}) p(\boldsymbol\alpha|\sigma^2,\kappa_{\boldsymbol\alpha}) \\
\propto (\kappa_{\boldsymbol\alpha})^{-\frac{\underline{\nu}_{\boldsymbol\alpha}+2}{2}} \exp \left\{ -\frac{1}{2}\frac{\underline{s}_{\boldsymbol\alpha}}{\kappa_{\boldsymbol\alpha}} \right\}
\times \det(\sigma^2 \kappa_{\boldsymbol\alpha} \underline{\mathbf{V}} _{\boldsymbol\alpha})^{-\frac{1}{2}} \exp \left\{ -\frac{1}{2} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})'(\sigma^2 \kappa_{\boldsymbol\alpha} \underline{\mathbf{V}} _{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\} \\
\propto (\kappa_{\boldsymbol\alpha})^{-\frac{\underline{\nu}_{\boldsymbol\alpha}+2+k}{2}} \exp \left\{ -\frac{1}{2}\frac{\underline{s}_{\boldsymbol\alpha}}{\kappa_{\boldsymbol\alpha}} \right\} \exp \left\{ -\frac{1}{2}\frac{1}{\kappa_{\boldsymbol\alpha}} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})' (\sigma^2 \underline{\mathbf{V}}_{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\}
\end{gather*}
in which we recognise the kernel of an inverted gamma 2 distribution with parameters:
\begin{align}
\overline{s}_{\boldsymbol\alpha} &= \underline{s}_{\boldsymbol\alpha} + (\boldsymbol\alpha-\underline{\boldsymbol\alpha})' \sigma^{-2} \underline{\mathbf{V}}_{\boldsymbol\alpha}^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha}) \\
\overline{\nu}_{\boldsymbol\alpha} &= \underline{\nu}_{\boldsymbol\alpha} + k
\end{align}
To sample from the $\mathcal{IG}2(s,\nu)$ distribution, the following code can be used:
```{r}
#| echo: true
# Set parameters of IG2 distribution
s <- 1
nu <- 1
# Draw one sample from IG2 distribution
sample <- s / rchisq(1, df = nu)
```
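The Gibbs sampler for this hierarchical prior is not written out in code above; the sketch below is one possible implementation of the stated algorithm, reusing `Y`, `X`, `k`, and `priors` from the earlier chunks. The function name `IG2.scale.sampler` and the hyper-parameters `s.kappa` and `nu.kappa` (mirroring the values `s` and `nu` above) are illustrative choices rather than part of the original text.
```{r}
#| echo: true
# Sketch of the Gibbs sampler for the IG2 scale mixture (illustrative only)
IG2.scale.sampler = function(S, priors, s.kappa = 1, nu.kappa = 1){
  kappa.draw   = 1                               # initial value kappa^(0)
  alpha.draws  = matrix(NA, k, S)
  sigma2.draws = rep(NA, S)
  kappa.draws  = rep(NA, S)
  for (s in 1:S){
    # Step 1: (alpha, sigma^2) | kappa ~ NIG2 with prior covariance kappa * V
    V.prior.inv = solve(kappa.draw * priors$Sigma)
    V.bar       = solve(t(X) %*% X + V.prior.inv)
    alpha.bar   = V.bar %*% (t(X) %*% Y + V.prior.inv %*% priors$alpha)
    s.bar       = as.numeric(t(Y) %*% Y + priors$S + t(priors$alpha) %*% V.prior.inv %*% priors$alpha
                             - t(alpha.bar) %*% solve(V.bar) %*% alpha.bar)
    nu.bar      = nrow(Y) + priors$nu
    sigma2.draws[s] = s.bar / rchisq(1, nu.bar)
    alpha.draws[,s] = mvtnorm::rmvnorm(1, mean = alpha.bar, sigma = sigma2.draws[s] * V.bar)
    # Step 2: kappa | alpha, sigma^2 ~ IG2(s.bar.alpha, nu.bar.alpha)
    quad        = t(alpha.draws[,s] - priors$alpha) %*% solve(sigma2.draws[s] * priors$Sigma) %*% (alpha.draws[,s] - priors$alpha)
    kappa.draw  = (s.kappa + as.numeric(quad)) / rchisq(1, nu.kappa + k)
    kappa.draws[s] = kappa.draw
  }
  list(alpha = alpha.draws, sigma2 = sigma2.draws, kappa = kappa.draws)
}
ig2.mixture.draws = IG2.scale.sampler(S = 1000, priors = priors)
```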
### Gamma scale mixture of normal
> Contributor: Yobin Timilsena
Alternatively, the scalar scale $\kappa_{\boldsymbol\alpha}$ that premultiplies the covariance matrix of the normal prior for vector $\boldsymbol\alpha$ can be assumed to have a hierarchical prior with a gamma distribution:
$$
\kappa_{\boldsymbol\alpha}|\underline s_\kappa, \underline a_\kappa \sim \mathcal{G}(\underline s_\kappa, \underline a_\kappa)
$$
The following code creates a list `kappa.priors` to store the prior hyper-parameters of $\kappa_{\boldsymbol\alpha}$.
```{r kappa priors for gamma scale mixture of normal, echo=TRUE}
kappa.priors = list(
a = 1,
s = .1
)
```
Given the likelihood and priors above, we can obtain the **full conditional posterior** of $\kappa_{\boldsymbol\alpha}$ as
$$
\begin{align*}
p(\kappa_{\boldsymbol\alpha} | \mathbf y, \mathbf X, \boldsymbol\alpha, \sigma^2) & \propto p(\kappa_{\boldsymbol\alpha} | \underline s_{\kappa}, \underline a_{\kappa}) \cdot p(\boldsymbol\alpha | \kappa_{\boldsymbol\alpha}, \sigma^2)\\
&\propto (\kappa_{\boldsymbol\alpha})^{\underline a_{\kappa} - 1} \exp \left\{ - \frac{\kappa_{\boldsymbol\alpha}}{\underline s_{\kappa}} \right\} \cdot \det(\sigma^2 \kappa_{\boldsymbol\alpha} \underline{\mathbf V}_{\boldsymbol\alpha})^{-\frac{1}{2}} \exp \left\{ -\frac{1}{2} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})'(\sigma^2 \kappa_{\boldsymbol\alpha} \underline{\mathbf{V}}_{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\} \\
& \propto (\kappa_{\boldsymbol\alpha})^{\underline a_{\kappa} - 1} \exp \left\{ - \frac{\kappa_{\boldsymbol\alpha}}{\underline s_{\kappa}} \right\} \cdot (\sigma^2)^{-\frac{k}{2}} (\kappa_{\boldsymbol\alpha})^{-\frac{k}{2}} \det(\underline{\mathbf V}_{\boldsymbol\alpha})^{-\frac{1}{2}} \exp \left\{ -\frac{1}{2 \kappa_{\boldsymbol\alpha}} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})' (\sigma^2 \underline{\mathbf{V}}_{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha})\right\} \\
& \propto (\kappa_{\boldsymbol\alpha})^{\underline a_{\kappa} - \frac{k}{2} - 1} \exp \left\{ -\frac{1}{2} \left( (\boldsymbol\alpha-\underline{\boldsymbol\alpha})' (\sigma^2 \underline{\mathbf{V}}_{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha}) \cdot \frac{1}{\kappa_{\boldsymbol\alpha}} + \frac{2}{\underline s_{\kappa}} \kappa_{\boldsymbol\alpha} \right) \right\}
\end{align*}
$$
which is the kernel of a Generalised Inverse Gaussian distribution with parameters:
$$
\begin{align*}
& \overline\lambda = \underline a_{\kappa} - \frac{k}{2}\\
& \overline\chi = (\boldsymbol\alpha-\underline{\boldsymbol\alpha})' (\sigma^2 \underline{\mathbf{V}}_{\boldsymbol\alpha})^{-1} (\boldsymbol\alpha-\underline{\boldsymbol\alpha}) \\
& \overline\Psi = \frac{2}{\underline s_{\kappa}}
\end{align*}
$$
Hence, the **full-conditional posterior** distribution of $\kappa_{\boldsymbol\alpha}$ follows a Generalised Inverse Gaussian distribution:
$$
\kappa_{\boldsymbol\alpha} | \mathbf y, \mathbf X, {\boldsymbol\alpha}, \sigma^2 \sim \mathcal{GIG}(\overline\lambda, \overline\chi, \overline\Psi)
$$
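To sample from a $\mathcal{GIG}(\lambda,\chi,\Psi)$ distribution, the `GIGrvg` package, also used in the sampler below, can be called as follows; the parameter values here are purely illustrative:
```{r}
#| echo: true
# Draw one sample from a GIG distribution with illustrative parameter values
GIGrvg::rgig(n = 1, lambda = 0.5, chi = 1, psi = 2)
```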
Given the full conditional posteriors of both $\kappa_{\boldsymbol\alpha}$ and $(\boldsymbol\alpha, \sigma^2)$, we can implement the **Gibbs Sampler** with the following steps:
- Initialise $\kappa_{\boldsymbol\alpha}^{(0)}$.
- At each iteration $s$:
    - Step 1. Draw $(\boldsymbol\alpha,\sigma^2)^{(s)} \sim p(\boldsymbol\alpha,\sigma^2|\mathbf{X},\mathbf{y},\kappa_{\boldsymbol\alpha}^{(s-1)}) = \mathcal{NIG}2(\overline{\boldsymbol\alpha},\overline{\mathbf{V}},\overline{s},\overline{\nu})$.
    - Step 2. Draw $\kappa_{\boldsymbol\alpha}^{(s)} \sim p(\kappa_{\boldsymbol\alpha}|\mathbf{X},\mathbf{y},(\boldsymbol\alpha,\sigma^2)^{(s)}) = \mathcal{GIG}(\overline\lambda, \overline\chi, \overline\Psi)$.
The `Gamma.sampler` function below implements the Gibbs Sampler using the aforementioned algorithm.
```{r sample from a gamma distributed scalar scale, echo=TRUE}
Gamma.sampler = function(S, posterior, priors, kappa.priors){
  set.seed(12345)
  S.burnin = 100
  S.total  = S + S.burnin
  # create objects to store posterior draws
  alpha.posterior  = array(NA, c(k, 1, S.total))
  Sigma2.posterior = rep(NA, S.total)
  kappa.posterior  = array(NA, c(1, S.total))
  # initial value for kappa
  kappa.posterior[1] = 10
  for (i in 1:S.total) {
    # Step 1: plug in the current kappa to update the NIG2 posterior parameters
    prior.Sigma.inv = solve(kappa.posterior[i] * priors$Sigma)
    Sigma.inv       = t(X) %*% X + prior.Sigma.inv
    Sigma           = solve(Sigma.inv)
    alpha.bar       = Sigma %*% (t(X) %*% Y + prior.Sigma.inv %*% priors$alpha)
    s.bar           = as.numeric(t(Y) %*% Y + priors$S + t(priors$alpha) %*% prior.Sigma.inv %*% priors$alpha
                                 - t(alpha.bar) %*% Sigma.inv %*% alpha.bar)
    # draw sigma^2 from its IG2 full conditional and alpha from the conditional normal
    Sigma2.posterior[i]  = s.bar / rchisq(n = 1, df = posterior$nu)
    alpha.posterior[,,i] = t(mvtnorm::rmvnorm(n = 1, mean = alpha.bar, sigma = Sigma2.posterior[i] * Sigma))
    # Step 2: update parameters for the kappa full conditional (GIG)
    lambda = kappa.priors$a - k/2
    chi    = t(alpha.posterior[,,i] - priors$alpha) %*% solve(Sigma2.posterior[i] * priors$Sigma) %*% (alpha.posterior[,,i] - priors$alpha)
    Psi    = 2 / kappa.priors$s
    # draw the next value for kappa from the GIG distribution
    if (i != S.total){
      kappa.posterior[i+1] = GIGrvg::rgig(n = 1, lambda = lambda, chi = chi, psi = Psi)
    }
  }
  # save output as a list, discarding the burn-in draws
  list(
    alpha.posterior  = alpha.posterior[,,((S.burnin+1):S.total)],
    Sigma2.posterior = Sigma2.posterior[((S.burnin+1):S.total)],
    kappa.posterior  = kappa.posterior[((S.burnin+1):S.total)]
  )
}
GammaScale.draws = Gamma.sampler(S = 1000, post, priors, kappa.priors)
```
We compute and display the posterior means of $(\boldsymbol\alpha, \sigma^2, \kappa_{\boldsymbol\alpha})$.
```{r compute and display alpha posterior mean, echo=TRUE}
alpha.posterior.mean = rowMeans(GammaScale.draws$alpha.posterior)
alpha.posterior.mean
```
```{r compute and display sigmasq posterior mean, echo=TRUE}
Sigma2.posterior.mean = mean(GammaScale.draws$Sigma2.posterior)
Sigma2.posterior.mean
```
```{r compute and display kappa posterior mean, echo=TRUE}
kappa.posterior.mean = mean(GammaScale.draws$kappa.posterior)
kappa.posterior.mean
```
## Estimating error term variance prior scale
In this section, we estimate the error term variance prior scale
$\underline{s}$, assuming it follows a Gamma distribution $\mathcal{G}(\underline{s}_s,\underline{a}_s)$ with scale $\underline{s}_s$ and shape $\underline{a}_s$, whose probability density function is:
$$p(\underline{s})=\Gamma(\underline{a}_s)^{-1}\underline{s}_s^{-\underline{a}_s}\underline{s}^{\underline{a}_s-1}\exp\left\{-\frac{\underline{s}}{\underline{s}_s}\right\}$$
In order to find the full conditional posterior of $\underline{s}$, write out its kernel as:
\begin{align}
p(\underline{s}|y,X,\alpha,\sigma^2) &\propto
L(y|X,\alpha,\sigma^2) \times p(\alpha|\sigma^2) \times p(\sigma^2|\underline{s}) \times p(\underline{s})\\
&\propto p(\sigma^2|\underline{s}) \times p(\underline{s})\\
&\propto \underline{s}^{\frac{\underline{\nu}}{2}}
\exp\left\{-\frac{1}{2}\frac{\underline{s}}{\sigma^2}\right\}
\underline{s}^{\underline{a}_s-1}\exp\left\{-\frac{\underline{s}}{\underline{s}_s}\right\}\\
&\propto
\underline{s}^{\frac{\underline{\nu}}{2}+\underline{a}_s-1}\exp\left\{-\frac{\underline{s}}{
\left[\left(2\sigma^2\right)^{-1} + \underline{s}_s^{-1}\right]^{-1}
} \right\}
\end{align}
from which we recognise a Gamma distribution $\mathcal{G}(\overline{s}_s, \overline{a}_s)$ with parameters:
\begin{align}
\overline{a}_s&=\frac{\underline{\nu}}{2}+\underline{a}_s\\
\overline{s}_s&=\left[\left(2\sigma^2\right)^{-1} + \underline{s}_s^{-1}\right]^{-1}
\end{align}
In order to obtain a sample from the posterior distribution we use a Gibbs sampler. We generate
random draws from the joint posterior distribution and we update them at
each iteration. In this case we exploit the following procedure:
Initialize $\underline{s}$ at $\underline{s}^{(0)}$.
At each iteration $s$:
1. Draw $(\boldsymbol\alpha,\sigma^2)^{(s)} \sim p\left(\boldsymbol\alpha,\sigma^2 \mid \mathbf{y},\mathbf{X}, \underline{s}^{(s-1)}\right)$
2. Draw $\underline{s}^{(s)} \sim p\left(\underline{s}\mid\mathbf{y},\mathbf{X},\boldsymbol\alpha^{(s)},\sigma^{2(s)}\right)$
Repeat steps 1 and 2 $(S_1 + S_2)$ times. Discard the first $S_1$
repetitions. Return the output as
$\left \{ \boldsymbol\alpha^{(s)}, \sigma^{2(s)}, \underline{s}^{(s)}\right \}^{S_1+S_2}_{s=S_1+1}$.
The following script illustrates sampling from the full conditional posterior distribution of $\underline{s}$.
```{r}
#| echo: true
#| message: false
#| warning: false
nu     = 1      # prior degrees of freedom of sigma^2
sigma2 = 10     # current draw of sigma^2
# define the prior hyper-parameters
s.prior.s <- 0.01
a.prior.s <- 1
# compute the posterior parameters
a.bar.s <- (nu / 2) + a.prior.s
s.bar.s <- 1 / (1/(2 * sigma2) + (1 / s.prior.s))
# sample from the gamma distribution using the rgamma function
s.sigma <- rgamma(1, shape = a.bar.s, scale = s.bar.s)
```
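A complete Gibbs sampler combining this step with the $\mathcal{NIG}2$ draws is not written out above; the sketch below is one possible implementation of the stated algorithm, reusing `Y`, `X`, `k`, and `priors` from earlier chunks. The function name `s.prior.sampler` and the hyper-parameter values are illustrative.
```{r}
#| echo: true
# Sketch of a Gibbs sampler for the hierarchical prior scale s (illustrative only)
s.prior.sampler = function(S, priors, s.prior.s = 0.01, a.prior.s = 1){
  s.draw       = priors$S                       # initial value of the prior scale
  alpha.draws  = matrix(NA, k, S)
  sigma2.draws = rep(NA, S)
  s.draws      = rep(NA, S)
  for (i in 1:S){
    # Step 1: (alpha, sigma^2) | s from the NIG2 posterior with prior scale s.draw
    V.bar     = solve(t(X) %*% X + solve(priors$Sigma))
    alpha.bar = V.bar %*% (t(X) %*% Y + solve(priors$Sigma) %*% priors$alpha)
    s.bar     = as.numeric(t(Y) %*% Y + s.draw + t(priors$alpha) %*% solve(priors$Sigma) %*% priors$alpha
                           - t(alpha.bar) %*% solve(V.bar) %*% alpha.bar)
    nu.bar    = nrow(Y) + priors$nu
    sigma2.draws[i] = s.bar / rchisq(1, nu.bar)
    alpha.draws[,i] = mvtnorm::rmvnorm(1, mean = alpha.bar, sigma = sigma2.draws[i] * V.bar)
    # Step 2: s | sigma^2 from its Gamma full conditional
    a.bar.s   = priors$nu / 2 + a.prior.s
    s.bar.s   = 1 / (1 / (2 * sigma2.draws[i]) + 1 / s.prior.s)
    s.draw    = rgamma(1, shape = a.bar.s, scale = s.bar.s)
    s.draws[i] = s.draw
  }
  list(alpha = alpha.draws, sigma2 = sigma2.draws, s = s.draws)
}
s.hier.draws = s.prior.sampler(S = 1000, priors = priors)
```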
## Dummy observation prior
In this extended model, the system is augmented using the *sum of coefficients* and *dummy initial observation* prior. The idea for this way of setting a prior is to generate artificial new rows that augment data matrices $\bf y$ and $\bf X$. It is equivalent to assuming a normal-inverted gamma 2 prior whose parameters are determined by dummy observations augmenting the system.
The *sum of coefficients* prior takes the average of the lagged values and adds them to the basic equation. This is because these values are assumed to be a good forecast of future observations. The *dummy initial observation* prior adds a single dummy observation such that all values are set equal to the averages of initial conditions, up to a scaling factor.
In order to practically generate the additional rows the following steps should be taken.
Firstly, the average of the initial observations, `y.bar`, needs to be calculated by taking the mean of the first $p$ values in the data. Next, values of the scaling factors need to be selected; values equal to 1 are typically chosen.
Once this has been done, the rows can be created. In the univariate case this is simple: two rows are added to vector $\bf y$, the first equal to `y.bar` divided by the first scaling factor, $\lambda_3$, the second equal to the same value divided by the second scaling factor, $\lambda_4$.
Adding rows to $\bf X$ is slightly more complicated. The additional rows are as follows. The values in the first row, corresponding to the sum of coefficients, are all equal to `y.bar` divided by $\lambda_3$, except the last column, which is equal to zero. The values in the second row, corresponding to the dummy initial observation, are all equal to `y.bar`, except the last column, which is equal to one, all of which are divided by $\lambda_4$.
The following **R** function `prior.ex()` generates vectors and matrices by which $\bf y$ and $\bf X$ should be extended.
```{r}
#| echo: true
prior.ex <- function(data, p = 1, lambda_3 = 1, lambda_4 = 1){
  N = 1                                      # univariate case
  y.bar = mean(data[1:p])                    # average of the p initial observations
  # two additional rows for y: sum-of-coefficients and dummy-initial-observation
  Y_star = rbind(y.bar/lambda_3, y.bar/lambda_4)
  # last column of the additional X rows: constant term (0 and 1/lambda_4)
  X_star = as.matrix(c(rep(0, N), 1/lambda_4))
  # prepend one column per lag, scaled by lambda_3 and lambda_4 respectively
  for (i in 1:p) {
    X_star = cbind(Y_star, X_star)
  }
  ext.data <- list(YN = Y_star, XN = X_star)
  return(ext.data)
}
```
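A minimal usage sketch, not part of the original text, shows how the rows returned by `prior.ex()` would be appended to the data matrices before computing the posterior. Note that `prior.ex()` places the constant in the last column, whereas the matrix `X` built earlier has the constant first, so the columns are reordered before binding.
```{r}
#| echo: true
# Illustrative: augment Y and X with the dummy-observation rows (p = 1)
ext   = prior.ex(y, p = 1, lambda_3 = 1, lambda_4 = 1)
Y.aug = rbind(Y, ext$YN)
X.aug = rbind(X, ext$XN[, c(2, 1)])   # reorder: constant first, lag second
dim(Y.aug); dim(X.aug)
```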
# Model Extensions
## Student-$t$ error term
A possible extension to the model is to relax the assumption of normally distributed errors. By assuming a t-distribution for the errors we can better account for excess kurtosis in the data, also referred to as 'fat tails'. Such a specification lends itself well to estimating models on data commonly found to be leptokurtic, such as financial time series.
$$ u_t \sim t(0,\sigma^2,\nu) $$
#### Scale Mixture of Normals Representation of the $t$ Distribution
Implementation of t-distributed errors can be achieved by representing the t-distribution as a 'scale mixture of normal distributions'.
If,
$$
\begin{align*}
x|\lambda &\sim \mathcal{N}(0, \lambda\sigma^2) \\
\lambda &\sim \mathcal{IG2}(s, \nu)
\end{align*}
$$
then
$$ x \sim t(0, s, \sigma^2, \nu) $$
This result can be derived by expressing the joint distribution of $x$ and $\lambda$ as
$$ p(x,\lambda) = p(x|\lambda) p(\lambda) $$
and then integrating with respect to $\lambda$
$$ p(x) = \int_0^\infty p(x|\lambda) p(\lambda) d\lambda $$
This integration produces a density that is proportional to the density of a Student-t distribution with $\nu$ degrees of freedom. Hence, $x\sim t_\nu$.
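The following short simulation, which is not part of the original text, illustrates this result numerically by comparing quantiles of draws generated through the scale mixture with those of a Student-t distribution.
```{r}
#| echo: true
# Illustrative simulation: the IG2 scale mixture of normals yields Student-t draws
set.seed(123)
nu.sim     = 5
lambda.sim = nu.sim / rchisq(1e5, df = nu.sim)            # lambda ~ IG2(s = nu, nu)
x.mix      = rnorm(1e5, mean = 0, sd = sqrt(lambda.sim))  # x | lambda ~ N(0, lambda)
# compare selected quantiles with the t distribution with nu degrees of freedom
round(rbind(mixture   = quantile(x.mix, c(.05, .25, .5, .75, .95)),
            student.t = qt(c(.05, .25, .5, .75, .95), df = nu.sim)), 2)
```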
#### Likelihood and Posteriors under Student-t errors
Implementing the scale mixture specification, the conditional distribution of $y$ now takes the following form
$$ \boldsymbol{y}|\boldsymbol{X},\boldsymbol{\alpha},\sigma^2,\lambda \sim \mathcal{N_T}(\boldsymbol{X}\boldsymbol{\alpha}, \sigma^2\lambda I_T) $$
The likelihood under this specification now becomes
$$ L(\boldsymbol{\alpha}, \sigma^2, \lambda | \boldsymbol{y}, \boldsymbol{X}) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \det(\lambda I_T)^{-\frac{1}{2}}
\exp\left\{-\frac{1}{2}\frac{1}{\sigma^2}(\boldsymbol{y - X\alpha})'(\lambda I_T)^{-1}(\boldsymbol{y - X\alpha})\right\}$$
As before, we can derive the full conditional posteriors of the parameters by multiplying the likelihood by the priors. The natural conjugacy of $\boldsymbol\alpha$ and $\sigma^2$ is preserved under this specification and therefore the joint posterior is given by
$$ p(\boldsymbol{\alpha}, \sigma^2|\boldsymbol{y, X}, \lambda) \propto L(\boldsymbol{\alpha}, \sigma^2, \lambda | \boldsymbol{y}, \boldsymbol{X}) p(\boldsymbol{\alpha}, \sigma^2) $$
Expanding and rearranging the resulting expression allows the posterior to be expressed in the form of a Normal Inverse Gamma 2 with the following moments
$$\begin{align*}
p(\boldsymbol{\alpha}, \sigma^2|\boldsymbol{\mathbf{y}, \mathbf{X}}, \lambda) &= \mathcal{NIG2}(\boldsymbol{\overline\alpha, \overline V_\alpha}, \overline s, \overline \nu) \\
\overline{\boldsymbol\alpha} &= \overline{\mathbf{V}}_{\boldsymbol\alpha}( \underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\underline{\boldsymbol\alpha}+\mathbf{X}'(\lambda I_T)^{-1}\mathbf{y})\\
\overline{\mathbf{V}}_{\boldsymbol\alpha} &=\left(\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}+\mathbf{X}'(\lambda I_T)^{-1}\mathbf{X}\right)^{-1} \\
\overline{s} &= \underline{s}-\overline{\boldsymbol\alpha}'\overline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\overline{\boldsymbol\alpha}+\underline{\boldsymbol\alpha}'\underline{\mathbf{V}}^{-1}_{\boldsymbol\alpha}\underline{\boldsymbol\alpha}+\mathbf{y}'(\lambda I_T)^{-1}\mathbf{y}\\
\overline{\nu} &= \underline{\nu}+T
\end{align*}$$
The full conditional for $\lambda$ is also derived in a similar manner, with the posterior distribution taking the form of an Inverse Gamma 2
$$\begin{align*}
p(\lambda|\mathbf{y,X},\boldsymbol{\alpha},\sigma^2) &\propto L(\boldsymbol{\alpha}, \sigma^2, \lambda | \mathbf{y, X})p(\lambda) \\
&\propto (\sigma^2)^{-\frac{T}{2}}\det(\lambda I_T)^{-\frac{1}{2}}\exp(-\frac{1}{2}\frac{1}{\sigma^2}(\mathbf{y - X\alpha})'(\lambda I_T)^{-1}(\mathbf{y - X\alpha})) \\
&\times \lambda^{-\frac{\underline \nu_\lambda +2}{2}}\exp(-\frac{1}{2}\frac{\underline s_\lambda}{\lambda})
\end{align*}$$
Rearranging the above to express in the form of $\mathcal{IG2}$
$$\begin{align*}
p(\lambda|\mathbf{y,X},\boldsymbol{\alpha},\sigma^2) &= \mathcal{IG2}(\overline s_\lambda, \overline \nu_\lambda) \\
\overline s_\lambda &= \frac{1}{\sigma^2}(\mathbf{y - X\boldsymbol\alpha})'(\mathbf{y - X\boldsymbol\alpha}) + \underline s_\lambda \\
\overline \nu_\lambda &= T + \underline \nu_\lambda
\end{align*}$$
#### Gibbs Sampler Code for t-distributed errors
Now that we have the full conditionals for all of the parameters we can incorporate them into the Gibbs Sampler routine in order to draw a sample from the posterior densities.
The routine proceeds as follows:
Initialize $\lambda$ based on the priors for $\underline s_\lambda$ and $\underline \nu_\lambda$.
At each iteration $s$:
1. Draw $\sigma^{2(s)}$ from the $\mathcal{IG2}(\overline s, \overline\nu)$ distribution
2. Draw $\boldsymbol\alpha^{(s)}$ from the $\mathcal{N}(\overline{\boldsymbol\alpha}, \sigma^{2(s)} \overline{\mathbf{V}}_{\boldsymbol\alpha})$ distribution
3. Draw $\lambda^{(s)}$ from $\mathcal{IG2}(\overline s_\lambda, \overline\nu_\lambda)$
The sampling routine is implemented via the following R function, and the results of applying the sampler to the artificial random-walk data are displayed. For the purposes of this simulation a relatively large prior value for the degrees-of-freedom parameter is chosen, which concentrates $\lambda$ around one and thus communicates the prior belief that the errors of the simulated random walk are close to Gaussian.
```{r}
#| echo: true
#| message: false
#| warning: false
TDist.Gibbs.sampler = function(S, Y, X, priors){
  # A function to sample from the full conditionals of alpha, sigma^2 and lambda
  #
  # priors: a list containing the usual priors for alpha, Sigma, S and nu
  #         as well as additional priors for lambda hyper-parameters lambda_s and lambda_nu.
  A.prior        = priors$alpha
  A_V.prior      = priors$Sigma
  Sigma_s.prior  = priors$S
  Sigma_v.prior  = priors$nu
  lambda_s.prior = priors$lambda_s
  lambda_v.prior = priors$lambda_nu
  # initialize lambda with a draw from its IG2 prior
  lambda.draw    = lambda_s.prior/rchisq(1, lambda_v.prior)
  Sigma.posterior.draws  = array(NA, c(1,1,S))
  A.posterior.draws      = array(NA, c(k,1,S))
  lambda.posterior.draws = rep(NA,S)
  for (s in 1:S){
    lambda.inv.diag = diag(1/lambda.draw, nrow(Y))
    # NIG2 posterior parameters given the current lambda
    A_V.posterior     = solve(t(X)%*%lambda.inv.diag%*%X + solve(A_V.prior))
    A.posterior       = A_V.posterior%*%(t(X)%*%lambda.inv.diag%*%Y + solve(A_V.prior)%*%A.prior)
    Sigma_s.posterior = t(Y)%*%lambda.inv.diag%*%Y + t(A.prior)%*%solve(A_V.prior)%*%A.prior + Sigma_s.prior - t(A.posterior)%*%solve(A_V.posterior)%*%A.posterior
    Sigma_v.posterior = nrow(Y) + Sigma_v.prior
    # draw sigma^2 from IG2 and alpha from the conditional normal
    Sigma.posterior.draws[,,s] = Sigma_s.posterior / rchisq(1, df=Sigma_v.posterior)
    A.posterior.draws[,,s]     = mvtnorm::rmvnorm(1, mean=as.vector(A.posterior), sigma=as.numeric(Sigma.posterior.draws[,,s])*A_V.posterior)
    # draw lambda from its IG2 full conditional
    resid = Y - X%*%A.posterior.draws[,,s]
    lambda_s.posterior = as.numeric(t(resid)%*%resid) / as.numeric(Sigma.posterior.draws[,,s]) + lambda_s.prior
    lambda_v.posterior = nrow(Y) + lambda_v.prior
    lambda.draw = lambda_s.posterior / rchisq(1, lambda_v.posterior)
    lambda.posterior.draws[s] = lambda.draw
  }
  return(list(A.posterior.draws, Sigma.posterior.draws, lambda.posterior.draws))
}
tdist.res = TDist.Gibbs.sampler(1000, Y, X, c(priors, list(lambda_s=30, lambda_nu=30)))
#Alpha posterior mean:
round(apply(tdist.res[[1]], 1:2, mean),2)
#Sigma posterior mean:
round(mean(tdist.res[[2]]),2)
#Lambda posterior mean:
round(mean(tdist.res[[3]]),2)
```
## Estimating autoregressions after 2020
The 2020 COVID-19 pandemic significantly altered the global economic
landscape. It may be argued that the pre- and post-COVID periods are not
easily comparable, or that the most severe COVID-era observations would
best be discounted due to their warping effect on the overall data.
However, @lenza2022estimate disagree. In their paper *How to estimate a
vector autoregression after March 2020*, rather than discounting data,
they suggest that the effects of COVID be modelled as a temporary spike
in volatility. They find that volatility is most unpredictable in the
first three periods of the pandemic, and this finding holds regardless
of whether monthly or quarterly data are used. For higher-frequency data
the exact effect is unknown, though it should cover at least the first
three months of the pandemic. Thus, they propose the following change to
the standard autoregression formula:
$$y_t = \mathbf{x}_t'\boldsymbol\alpha + c_tu_t,$$
where $c_t$ is a standard deviation multiplier. That is, for every
period before COVID it takes a value of 1, for the first 3 periods of
COVID it takes values $\bar{c}_0$, $\bar{c}_1$, $\bar{c}_2$, and then in
all future periods a value of
$$c_{t*+j} = 1 + (\bar{c}_2 - 1)\rho^{j-2}, $$ where $\rho\in(0,1)$
captures the exponential decay in the value of the conditional standard
deviation towards the value one. This creates a vector that leaves the
error term unchanged before COVID, has a surge in volatility during
COVID, and then decays geometrically after COVID. This structure
approximates the observed shocks to volatility and facilitates
straightforward estimation.
By dividing both sides by $c_t$ the model equation can be rewritten as
$$\tilde{y}_t = \tilde{\mathbf{x}}_t'\boldsymbol\alpha + u_t$$ where
$\tilde{y}_t = y_t/c_t$ and $\tilde{\mathbf{x}}_t = \mathbf{x}_t/c_t$. These
simple transformations then lend themselves to analysis using whatever
estimation method is preferred.
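As a small illustration, not part of the original text, the rescaling can be carried out in R by dividing each row of the data by the corresponding element of $\mathbf{c}$; the vector `c.vec` below is a hypothetical placeholder, and the matrices `Y` and `X` are reused from the earlier chunks.
```{r}
#| echo: true
# Illustrative rescaling of the data by a given volatility vector c
c.vec   = rep(1, nrow(Y))   # hypothetical c: no volatility adjustment
y.tilde = Y / c.vec         # y.tilde_t = y_t / c_t
X.tilde = X / c.vec         # each row of X divided by c_t
```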
The main difficulty arises in the estimation of $c_t$ as it is an
unknown parameter in many cases.
### Methodology
To estimate the values for $c_0$, $c_1$, and $c_2$ define a $T$-vector
$\mathbf{c}$ of COVID volatility variables:
$$\mathbf{c}=\begin{bmatrix}1&\dots& c_0 & c_1 & c_2 & 1+(c_2-1)\rho & 1+(c_2-1)\rho^2&\dots\end{bmatrix}'$$
Intuitively, the volatility variable for the period before COVID is set
to unity. Heightened volatility during the first three quarters of COVID
is parameterized, and it is then assumed to decay at a geometric rate
beginning in the fourth quarter after the pandemic's onset.
These COVID volatility parameters are collected in a vector
$\boldsymbol\theta=(c_0\quad c_1\quad c_2\quad\rho)$ and are estimated
from their own marginal posterior as proposed by @lenza2022estimate:
$$p(\boldsymbol\theta\mid\mathbf{y},\mathbf{X})\propto p(\mathbf{y}\mid\mathbf{X},\boldsymbol\theta)p(\boldsymbol\theta)$$
where the likelihood function is given as:
$$\begin{align}p(\mathbf{y}\mid\mathbf{X},\boldsymbol\theta) &\propto \Bigg(\prod^T_{t=1}c_t^{-N}\Bigg)\underline{\mathbf{s}}^{\frac{\underline{\nu}}{2}}\det\left(\underline{\mathbf{V}}\right)^{-\frac{1}{2}}\det\left(\tilde{\mathbf{X}}'\tilde{\mathbf{X}}+\underline{\mathbf{V}}^{-1}\right)^{-\frac{1}{2}}\\
&\qquad\times\det\left(\underline{\mathbf{s}}+\hat{\tilde{\mathbf{u}}}'\hat{\tilde{\mathbf{u}}}+(\hat{\tilde{\boldsymbol\alpha}}-\underline{\boldsymbol\alpha})'\underline{\mathbf{V}}^{-1}(\hat{\tilde{\boldsymbol\alpha}}-\underline{\boldsymbol\alpha})\right)^{-\frac{T-p+\underline{\nu}}{2}},\end{align}$$
where $\tilde{\mathbf{X}} = \begin{bmatrix}\tilde{\mathbf{x}}_1' & \dots & \tilde{\mathbf{x}}_T' \end{bmatrix}'$, $\tilde{\mathbf{y}} = \begin{bmatrix}\tilde{y}_1 & \dots & \tilde{y}_T \end{bmatrix}'$, $\hat{\tilde{\mathbf{u}}} = \tilde{\mathbf{y}} - \tilde{\mathbf{X}}\hat{\tilde{\boldsymbol\alpha}}$, and
$\hat{\tilde{\boldsymbol\alpha}}=(\tilde{\mathbf{X}}'\tilde{\mathbf{X}}+\underline{\mathbf{V}}^{-1})^{-1}(\tilde{\mathbf{X}}'\tilde{\mathbf{y}}+\underline{\mathbf{V}}^{-1}\underline{\boldsymbol\alpha})$,
and the priors are assumed to be:
$$\begin{align}
\bar{c}_0,\bar{c}_1,\bar{c}_2 &\sim \mathcal{P}areto(1,1)\\
\rho&\sim \mathcal{B}eta(3,1.5)
\end{align}$$
### Algorithm with code
The first step is to sample draws of the vector of COVID volatility variables $\textbf{c}$, which is done by estimating the parameters in $\boldsymbol\theta=(\bar{c}_0,\bar{c}_1,\bar{c}_2,\rho)$, sampling from their marginal posterior via a Metropolis MCMC algorithm:
1. Initialize $\boldsymbol\theta$ at the posterior mode, which can be located
via numerical optimization. We execute this in `R` through the
following function, `c.posterior.mode`. This function takes in data
representing matrix $\textbf{y}$ and returns the values of $\boldsymbol\theta$
that minimize the negative log-posterior, as well as the
corresponding Hessian.
```{r COVID parameters posterior maximization, echo = TRUE}
c.posterior.mode <- function(data, p=4, k1=1, k2=100, start_date=c(1991,1)){
v.neglogPost <- function(theta){
N = ncol(data)
K = 1 + p*N
Y = ts(data[(p+1):nrow(data),], start=start_date, frequency=4)
T = nrow(Y)
X = matrix(1,T,1)
# nrow(X)
for (i in 1:p){
X = cbind(X,data[(p+1):nrow(data)-i,])
}
# Calculate MLE for prior
############################################################
A.hat = solve(t(X)%*%X)%*%t(X)%*%Y
Sigma.hat = t(Y-X%*%A.hat)%*%(Y-X%*%A.hat)/nrow(Y)
# Specify prior distribution
############################################################
kappa.1 = k1
kappa.2 = k2
kappa.3 = 1
A.prior = matrix(0,nrow(A.hat),ncol(A.hat))
A.prior[2:(N+1),] = kappa.3*diag(N)
V.prior = diag(c(kappa.2,kappa.1*((1:p)^(-2))%x%rep(1,N)))
S.prior = diag(diag(Sigma.hat))
nu.prior = N+1
vec <- theta[1:3]
for (i in 4:12){
vec <- c(vec, 1 + (theta[3]-1)*theta[4]^(i-3))
}
V <- c(ts(rep(1, nrow(Y)-12), c(1991,1), frequency = 4) , vec)
Y.tilde <- diag(1/V)%*%Y
X.tilde <- diag(1/V)%*%X
A.tilde.hat <- solve((t(X.tilde)%*%X.tilde+solve(V.prior)))%*%(t(X.tilde)%*%Y.tilde+solve(V.prior)%*%A.prior)
epsilon.tilde <-Y.tilde - X.tilde%*%A.tilde.hat
# Log-likelihood
logL <- log(prod(V^(-N)))+(-N/2)*log(det(t(X.tilde)%*%X.tilde+solve(V.prior)))+
(-(T-p+nu.prior)/2)*log(det(S.prior +t(epsilon.tilde)%*%epsilon.tilde +
t(A.tilde.hat-A.prior)%*%solve(V.prior)%*%(A.tilde.hat-A.prior)))
# Pareto(1,1) and Beta(3,1.5) priors
pareto.a=1
pareto.b=1
beta.a=3
beta.b=1.5
beta.cons <- 1/beta(beta.a,beta.b)
# Log-prior
logP <- log((pareto.a*pareto.b^pareto.a)/(theta[1]^(pareto.a+1))*
(pareto.a*pareto.b^pareto.a)/(theta[2]^(pareto.a+1))*
(pareto.a*pareto.b^pareto.a)/(theta[3]^(pareto.a+1))*
beta.cons*theta[4]^(beta.a-1)*(1-theta[4])^(beta.b-1))
# negative log-posterior
neglogPost <- -(logL+logP)
return(neglogPost)
}
  # numerically minimize the negative log-posterior
post.maximizer <- optim(par=c(50, 50, 50, 0.5), fn=v.neglogPost, method="L-BFGS-B",
lower=c(1, 1, 1, 0.0001),
upper=c(100,100,100,0.99999), hessian = TRUE)
return(list(maximizer=post.maximizer$par, hessian=post.maximizer$hessian))
}
```
2. Draw candidate $\boldsymbol\theta^{*}$ from $N_4(\boldsymbol\theta^{(s-1)},cW)$, where $W$
is the inverse Hessian of the negative log posterior of $\boldsymbol\theta$ at
the mode, which is also calculated computationally, and $c$ is a
scaling factor.
3. Set:
$$\boldsymbol\theta^{(s)}= \begin{cases}
\boldsymbol\theta^* & \text{with pr.} \quad \alpha^{(s)}
\\
\\
\boldsymbol\theta^{(s-1)} & \text{with pr.} \quad 1-\alpha^{(s)}
\end{cases}$$
$$\alpha^{(s)} =\text{min}\Big[1,\frac{p(\boldsymbol\theta^*|Y,X,\underline{\gamma})}{p(\boldsymbol\theta^{(s-1)}|Y,X,\underline{\gamma})}\Big]$$
4. Define $\textbf{c}^{(s)}$ matrix using $\boldsymbol\theta^{(s)}$:
$$\textbf{c}^{(s)}=[1\quad...\quad1 \quad \bar{c}_0^{(s)}\quad \bar{c}_1^{(s)}\quad \bar{c}_2^{(s)}\quad 1+(\bar{c}_2^{(s)}-1)\rho^{(s)}\quad 1+(\bar{c}_2^{(s)}-1)\rho^{(s)2}\quad...]'$$
Steps 2 to 4 are implemented via the function `mh.mcmc`, which takes in
data, the posterior mode of $\boldsymbol\theta$, and the inverse Hessian from
`c.posterior.mode`, and returns the draws of $\boldsymbol\theta$.
```{r Metropolis-Hastings function, echo = TRUE}
library(MASS)
library(coda)
mh.mcmc <- function(data, p=1, S.mh = 1000, c, W = diag(4), theta.init,
k1 = 1, k2 = 100, start_date = c(1991,1)){
# N = no. of variables
N = ncol(data)
# p = no. of lags
K = 1 + p*N
# forecast horizon
# h = 8
Y = ts(data[(p+1):nrow(data),], start=start_date, frequency=4)
T = nrow(Y)
X = matrix(1,T,1)
# nrow(X)
for (i in 1:p){
X = cbind(X,data[(p+1):nrow(data)-i,])
}
# Calculate MLE for prior
############################################################
A.hat = solve(t(X)%*%X)%*%t(X)%*%Y
Sigma.hat = t(Y-X%*%A.hat)%*%(Y-X%*%A.hat)/nrow(Y)
# Specify prior distribution
############################################################
kappa.1 = k1
kappa.2 = k2
kappa.3 = 1
A.prior = matrix(0,nrow(A.hat),ncol(A.hat))
A.prior[2:(N+1),] = kappa.3*diag(N)
V.prior = diag(c(kappa.2,kappa.1*((1:p)^(-2))%x%rep(1,N)))
S.prior = diag(diag(Sigma.hat))
nu.prior = N+1
# Metropolis-Hastings
###########################################################
# v0, v1, v2, rho
Theta <- matrix(NA,S.mh,4)
theta_old <- theta.init
set.seed(1)
for (s in 1:S.mh){
covid.vec <- function(theta){
vec <- theta[1:3]
for (i in 4:12){
vec <- c(vec, 1 + (theta[3]-1)*theta[4]^(i-3))
}
return(vec)
}
# Covid volatility likelihood kernel
v.logL <- function(V){
Y.tilde <- diag(1/V)%*%Y
X.tilde <- diag(1/V)%*%X
A.tilde.hat <- solve((t(X.tilde)%*%X.tilde+solve(V.prior)))%*%(t(X.tilde)%*%Y.tilde+solve(V.prior)%*%A.prior)
epsilon.tilde <-Y.tilde - X.tilde%*%A.tilde.hat
logL <- log(prod(V^(-N)))+(-N/2)*log(det(t(X.tilde)%*%X.tilde+solve(V.prior)))+
(-(T-p+nu.prior)/2)*log(det(S.prior +t(epsilon.tilde)%*%epsilon.tilde +
t(A.tilde.hat-A.prior)%*%solve(V.prior)%*%(A.tilde.hat-A.prior)))
return(logL)
}
# Covid volatility prior
v.logP <- function(theta, pareto.a=1, pareto.b=1, beta.a=3, beta.b=1.5){
beta.cons <- 1/beta(beta.a,beta.b)
logP <- log((pareto.a*pareto.b^pareto.a)/(theta[1]^(pareto.a+1))*
(pareto.a*pareto.b^pareto.a)/(theta[2]^(pareto.a+1))*
(pareto.a*pareto.b^pareto.a)/(theta[3]^(pareto.a+1))*
beta.cons*theta[4]^(beta.a-1)*(1-theta[4])^(beta.b-1))
return(logP)
}
v_ones <- ts(rep(1, nrow(Y)-12), c(1991,1), frequency = 4)
V.old <- c(v_ones, covid.vec(theta_old))
# New candidate parameters values
theta_new <- mvrnorm(1, theta_old, c*W)
V.new <- c(v_ones, covid.vec(theta_new))
# Calculate posteriors
v.logpost_old <- v.logL(V.old)+v.logP(theta_old)
v.logpost_new <- v.logL(V.new)+v.logP(theta_new)
# Posterior ratio
post.ratio <- exp(v.logpost_new-v.logpost_old)
# Acceptance/rejection alpha
alpha <- min(1, post.ratio)
u_star <- runif(1)
if (alpha > u_star){
Theta[s,] <- theta_new
} else {Theta[s,] <- theta_old}
theta_old <- Theta[s,]
}