# LAN for Linear Processes

Consider an m-vector linear process

$\displaystyle \mathbf{X}(t) = \sum\limits_{j=0}^{\infty} A_{\theta}(j)\mathbf{U}(t-j), \qquad t \in \mathbb{Z}$

where the ${\mathbf{U}(t)}$ are i.i.d. m-vector random variables with p.d.f. ${p(\mathbf{u})>0}$ on ${\mathbf{R}^m}$, and the ${A_{\theta} (j)}$ are ${m \times m}$ matrices depending on a parameter vector ${ \mathbf{\theta} = (\theta_1,\ldots,\theta_q) \in \Theta \subset \mathbf{R}^q}$.

Set

$\displaystyle A_{\theta}(z) = \sum\limits_{j=0}^{\infty} A_{\theta}(j)z^j, \qquad |z| \leq 1.$

Assume the following conditions are satisfied:

A1 i) For some ${D}$ ${(0 < D < 1/2)}$, the coefficient matrices ${A_{\theta}(j)}$ satisfy

$\displaystyle \pmb{|} A_{\theta}(j) \pmb{|} = O(j^{-1+D}), \qquad j \in \mathbb{N},$

where ${ \pmb{|} A_{\theta}(j) \pmb{|}}$ denotes the sum of the absolute values of the entries of ${ A_{\theta}(j)}$.

ii) Every ${ A_{\theta}(j)}$ is twice continuously differentiable with respect to ${\theta}$, and the derivatives satisfy

$\displaystyle |\partial_{i_1} \partial_{i_2} \cdots \partial_{i_k} A_{\theta, ab}(j)| = O \{j^{-1+D}(\log j)^k\}, \qquad k=0,1,2$

for ${a,b=1,...,m,}$ where ${\partial_i = \partial/ \partial\theta_i}$.

iii) ${\det A_{\theta}(z) \neq 0}$ for ${|z| \leq 1}$ and ${A_{\theta}(z)^{-1}}$ can be expanded as follows:

$\displaystyle A_{\theta}(z)^{-1} = I_m + B_{\theta}(1)z + B_{\theta}(2)z^2 + ...,$

where ${ B_{\theta}(j)}$, ${j=1,2,...,}$ satisfy

$\displaystyle \pmb{|} B_{\theta}(j) \pmb{|} = O(j^{-1-D}).$

iv) Every ${ B_{\theta}(j)}$ is twice continuously differentiable with respect to ${\theta}$, and the derivatives satisfy

$\displaystyle |\partial_{i_1} \partial_{i_2} \cdots \partial_{i_k} B_{\theta, ab}(j)| = O \{j^{-1-D}(\log j)^k\}, \qquad k=0,1,2$

for ${a,b=1,...,m.}$

A2 ${p(\cdot)}$ satisfies

$\displaystyle \lim\limits_{\| \mathbf{u} \| \rightarrow \infty} p(\mathbf{u})=0, \qquad \int \mathbf{u} p(\mathbf{u}) d \mathbf{u} =0, \qquad \text{and} \qquad \int \mathbf{uu'}p(\mathbf{u}) d \mathbf{u}=I_m.$

A3 The continuous derivative ${Dp}$ of ${p(\cdot)}$ exists on ${\mathbf{R}^m}$.

A4

$\displaystyle \int \pmb{|} \phi(\mathbf{u}) \pmb{|}^4 p (\mathbf{u}) d \mathbf{u} < \infty,$

where ${\phi(\mathbf{u}) = p^{-1}Dp}$.
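To get a feel for the decay rates in A1 i) and iii), here is a minimal scalar sketch (m = 1). It assumes, purely for illustration, the filter pair $A(z) = (1-z)^{-D}$ and $B(z) = (1-z)^{D}$ familiar from long-memory models. The value D = 0.3 and the helper name `binomial_series` are hypothetical choices, not part of the text above. The sketch only checks the two rates and the formal identity $A(z)B(z) = 1$; it does not verify the remaining parts of A1.

```python
import math

def binomial_series(alpha, n):
    """First n power-series coefficients of (1 - z)**(-alpha)."""
    c = [1.0]
    for j in range(1, n):
        # Ratio of consecutive coefficients: c_j / c_{j-1} = (j - 1 + alpha) / j
        c.append(c[-1] * (j - 1 + alpha) / j)
    return c

D = 0.3                      # assumed value of the memory parameter D
n = 2001
a = binomial_series(D, n)    # coefficients of A(z) = (1 - z)^(-D)
b = binomial_series(-D, n)   # coefficients of B(z) = (1 - z)^(+D)

j = n - 1
# Rescaled coefficients settle at finite constants, i.e.
# a_j = O(j^(-1+D)) and b_j = O(j^(-1-D)).
a_scaled = a[j] * j ** (1 - D)     # tends to 1 / Gamma(D)
b_scaled = b[j] * j ** (1 + D)     # tends to 1 / Gamma(-D), which is negative

# Cauchy product of the two series: 1 at lag 0 and 0 at every other lag.
product = [sum(b[k] * a[i - k] for k in range(i + 1)) for i in range(50)]

print(a_scaled, 1 / math.gamma(D))
print(b_scaled, 1 / math.gamma(-D))
print(max(abs(v) for v in product[1:]))
```

The two printed pairs should nearly agree, and the last number should be at rounding-error level.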

From A1, the linear process can be expressed as

$\displaystyle \sum\limits_{j=0}^{\infty} B_{\theta}(j) \mathbf{X}(t-j) = \mathbf{U}(t), \qquad B_{\theta} (0) = I_m,$

and hence

$\displaystyle \mathbf{U}(t) = \sum\limits_{j=0}^{t-1}B_{\theta}(j)\mathbf{X}(t-j)+\sum\limits_{r=0}^{\infty}C_{\theta}(r,t)\mathbf{U}(-r),$

where

$\displaystyle C_{\theta}(r,t)= \sum\limits_{r'=0}^{r}B_{\theta}(r'+t)A_{\theta}(r-r').$
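The decomposition of ${\mathbf{U}(t)}$ into an observed part and a part driven by the pre-sample innovations can be checked numerically. The sketch below is a scalar toy case under stated assumptions: a finite moving average with hypothetical coefficients `a = [1, 0.5, -0.3]`, so that ${A_{\theta}(j) = 0}$ for ${j > 2}$ and only finitely many ${C_{\theta}(r,t)}$ are non-zero. The helper names `invert_series`, `C` and `reconstruct` are illustrative.

```python
import random

def invert_series(a, n):
    """First n coefficients of B(z) = 1 / A(z), assuming a[0] = 1."""
    b = [1.0]
    for j in range(1, n):
        # From sum_{k=0}^{j} a_k b_{j-k} = 0 for j >= 1.
        b.append(-sum(a[k] * b[j - k] for k in range(1, min(j, len(a) - 1) + 1)))
    return b

random.seed(0)
a = [1.0, 0.5, -0.3]     # hypothetical A(0), A(1), A(2); A(j) = 0 for j > 2
q = len(a) - 1
T = 30
b = invert_series(a, T + q)

# Innovations U(s) for s = -(q-1), ..., T, keyed by time index.
U = {s: random.gauss(0.0, 1.0) for s in range(-(q - 1), T + 1)}
# Observations X(t) = sum_k A(k) U(t - k) for t = 1, ..., T.
X = {t: sum(a[k] * U[t - k] for k in range(q + 1)) for t in range(1, T + 1)}

def C(r, t):
    """C(r, t) = sum_{r'=0}^{r} B(r' + t) A(r - r'), with A(j) = 0 for j > q."""
    return sum(b[rp + t] * a[r - rp] for rp in range(r + 1) if r - rp <= q)

def reconstruct(t):
    from_data = sum(b[j] * X[t - j] for j in range(t))    # j = 0, ..., t-1
    from_past = sum(C(r, t) * U[-r] for r in range(q))    # C(r, t) = 0 for r >= q
    return from_data + from_past

max_err = max(abs(reconstruct(t) - U[t]) for t in range(1, T + 1))
print(max_err)
```

In this finite-order case the identity is purely algebraic, so the reconstruction error should be at rounding-error level for every ${t}$.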

# Local Asymptotic Normality

The concept of Local Asymptotic Normality (LAN), introduced by Lucien Le Cam, is one of the most important and fundamental ideas of general asymptotic statistical theory. The LAN property is of particular importance in the asymptotic theory of testing, estimation and discriminant analysis. Many statistical models have likelihood ratios that are locally asymptotically normal; that is, the likelihood ratio processes of those models are asymptotically similar to those for a normal location parameter.

Let ${P_{0,n}}$ and ${P_{1,n}}$ be two sequences of probability measures on ${( \Omega_n, \mathcal{F}_n )}$. Suppose there is a sequence ${\mathcal{F}_{n,k}}$, ${k=1,\ldots,k_n}$, of sub-${\sigma}$-algebras of ${\mathcal{F}_n}$ such that ${\mathcal{F}_{n,k} \subset \mathcal{F}_{n,k+1}}$ and ${\mathcal{F}_{n,k_n} = \mathcal{F}_n}$. Let ${P_{i,n,k}}$ be the restriction of ${P_{i,n}}$ to ${\mathcal{F}_{n,k}}$ and let ${\gamma_{n,k}}$ be the Radon-Nikodym density taken on ${\mathcal{F}_{n,k}}$ of the part of ${P_{1,n,k}}$ that is dominated by ${P_{0,n,k}}$. Put

$\displaystyle Y_{n,k} = (\gamma_{n,k}/\gamma_{n,k-1})^{1/2} -1$

where ${\gamma_{n,0}=1}$ and ${n=1,2,...}$.

The log-likelihood ratio

$\displaystyle \Lambda_n = \log \frac{dP_{1,n}}{dP_{0,n}}$

taken on ${\mathcal{F}_n}$ is then

$\displaystyle \Lambda_n = 2 \sum_k \log (Y_{n,k}+1)$

since ${ \log (\gamma_{n,k}/\gamma_{n,k-1}) = 2 \log (Y_{n,k}+1) }$.

Theorem (Le Cam 1986). Suppose that under ${P_{0,n}}$ the following conditions are satisfied:

• L1: ${\max_k |Y_{n,k}| \xrightarrow{p} 0}$,
• L2: ${\sum_{k}Y^2_{n,k} \xrightarrow{p} \tau^2/4 }$,
• L3: ${\sum_{k}E(Y^2_{n,k}+2Y_{n,k}| \mathcal{F}_{n,k-1}) \xrightarrow{p} 0}$, and
• L4: ${\sum_k E\{ Y^2_{n,k} \mathbb{I}(|Y_{n,k}|> \delta)| \mathcal{F}_{n,k-1} \} \xrightarrow{p} 0}$ for some ${\delta > 0}$.

Then

$\displaystyle \boxed{ \Lambda_n \xrightarrow{d} N(-\tau^2/2,\tau^2)}.$
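A small simulation can make the statement concrete. The sketch below assumes the simplest possible setting, which is not the linear-process model of the previous section: i.i.d. observations with ${P_{0,n} = N(0,1)^n}$ and ${P_{1,n} = N(h/\sqrt{n},1)^n}$, so that ${\tau^2 = h^2}$ and the conditional expectations in L3 and L4 reduce to ordinary expectations. The value `h = 1` and the function name `y_increments` are illustrative choices.

```python
import math
import random
import statistics

random.seed(1)
h = 1.0   # local alternative: mean h / sqrt(n) instead of 0, so tau^2 = h^2

def y_increments(n):
    """Y_{n,k} for n i.i.d. N(0,1) draws under P0 = N(0,1)^n vs P1 = N(h/sqrt(n),1)^n."""
    d = h / math.sqrt(n)
    ys = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        log_ratio = d * x - 0.5 * d * d      # log of gamma_{n,k} / gamma_{n,k-1}
        ys.append(math.exp(0.5 * log_ratio) - 1.0)
    return ys

# Conditions L1 and L2, inspected on one long sample.
ys = y_increments(20000)
max_abs_y = max(abs(y) for y in ys)          # should be small
sum_y2 = sum(y * y for y in ys)              # should be near tau^2 / 4 = 0.25

# Lambda_n = 2 * sum_k log(1 + Y_{n,k}) over many independent replications.
lams = [2.0 * sum(math.log1p(y) for y in y_increments(200)) for _ in range(2000)]
lam_mean = statistics.mean(lams)             # compare with -tau^2 / 2 = -0.5
lam_var = statistics.variance(lams)          # compare with  tau^2     =  1.0

print(max_abs_y, sum_y2)
print(lam_mean, lam_var)
```

In this Gaussian location model ${\Lambda_n}$ is exactly ${N(-h^2/2, h^2)}$ for every ${n}$, so the simulated mean and variance differ from ${-0.5}$ and ${1}$ only by Monte Carlo error.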

# Whittle’s Approximate Likelihood

The Whittle likelihood is a frequency-domain approximation (up to a constant) to the Gaussian likelihood. The resulting Whittle estimate is asymptotically efficient and can be interpreted as a minimum distance estimate, where the distance is taken between the parametric spectral density and the (nonparametric) periodogram. It also minimises the asymptotic Kullback-Leibler divergence and, for autoregressive processes, is identical to the Yule-Walker estimate. The Whittle likelihood can be evaluated very quickly by computing the periodogram via the FFT in only ${O(N \log N)}$ operations.
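The FFT-based evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the model is an AR(1) process with known unit innovation variance, and the objective is taken to be the average of ${\log f_{\theta}(\lambda_j) + I(\lambda_j)/f_{\theta}(\lambda_j)}$ over the positive Fourier frequencies, one common form of the Whittle criterion that is not spelled out in the text. The function names and the grid search are hypothetical choices.

```python
import numpy as np

def ar1_spectral_density(phi, lam):
    """Spectral density of an AR(1) process with unit innovation variance."""
    return 1.0 / (2.0 * np.pi * np.abs(1.0 - phi * np.exp(-1j * lam)) ** 2)

def periodogram(x):
    """Periodogram at the Fourier frequencies 2*pi*j/T, j = 1, ..., floor((T-1)/2)."""
    T = len(x)
    j = np.arange(1, (T - 1) // 2 + 1)
    dft = np.fft.fft(x)[j]                       # a single FFT
    return 2.0 * np.pi * j / T, np.abs(dft) ** 2 / (2.0 * np.pi * T)

def whittle_objective(phi, lam, I):
    """Assumed Whittle criterion: mean of log f + I / f over the frequencies."""
    f = ar1_spectral_density(phi, lam)
    return np.mean(np.log(f) + I / f)

# Simulate an AR(1) path with a burn-in period (all values are illustrative).
rng = np.random.default_rng(0)
T, burn, phi_true = 4096, 500, 0.6
u = rng.standard_normal(T + burn)
x = np.empty(T + burn)
x[0] = u[0]
for t in range(1, T + burn):
    x[t] = phi_true * x[t - 1] + u[t]
x = x[burn:]

lam, I = periodogram(x)
grid = np.linspace(-0.95, 0.95, 381)             # crude grid search over phi
phi_hat = grid[np.argmin([whittle_objective(p, lam, I) for p in grid])]
print(phi_hat)
```

The periodogram is computed once and reused for every candidate parameter, so each evaluation of the objective costs only a pass over the frequencies. The grid search is used here only to keep the sketch dependency-free; a numerical optimiser would normally replace it.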

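Suppose that a stationary, zero-mean Gaussian process ${\{X_t \}}$ is observed at times ${t=1,2,\ldots,T}$. Assume ${\{X_t \}}$ has spectral density ${f_{\theta}(\lambda)}$, ${\lambda \in \Pi := (-\pi, \pi]}$, depending on a vector of unknown parameters ${\theta \in \Theta \subset \mathbb{R}^p}$. A natural approach to estimating the parameter ${\theta}$ from the sample ${\mathbf{X}_T = (X_1,\ldots,X_T)'}$ is to maximise the likelihood function or, equivalently, to minimise ${-1/T}$ times the log-likelihood. The latter takes the form

$\displaystyle \mathcal{L}_T(\theta) = \frac{1}{2} \log 2\pi + \frac{1}{2T} \log \det \Gamma_T(\theta) + \frac{1}{2T} \mathbf{X}_T' \Gamma_T(\theta)^{-1} \mathbf{X}_T,$

where ${\Gamma_T(\theta)}$ denotes the ${T \times T}$ covariance matrix of ${\mathbf{X}_T}$ implied by ${f_{\theta}}$.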

# Kullback-Leibler Information and the Consistency of the Hellinger Metric

Suppose $p_{\theta_0}$ is the true density of a random sample $X_1, \ldots, X_n$, while $p_{\theta}$ is the assumed model. The Kullback-Leibler distance is defined as

$K(p_{\theta}, p_{\theta_0})= E_{\theta_0} \log \frac{p_{\theta_0}(X)}{p_{\theta} (X)}=\int \log\left( \frac{p_{\theta_0}}{p_{\theta}}\right)p_{\theta_0}d\mu.$

As we will show below the Kullback-Leibler information has a very useful property.

We know that $\frac{1}{2}\log(w) \leq\sqrt{w}-1$ for all $w>0$. Hence

$\frac{1}{2}\log\frac{p_{\theta}(x)}{p_{\theta_0}(x)} \leq \sqrt{\frac{p_{\theta}(x)}{p_{\theta_0}(x)}}-1$

and, taking expectations under $p_{\theta_0}$ and changing signs,

$\frac{1}{2}K(p_{\theta}, p_{\theta_0})\geq 1- E_{\theta_0}\left( \sqrt{\frac{p_{\theta}(X)}{p_{\theta_0}(X)}}\right).$

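Notice that the right-hand side of the inequality can be rewritten as $1- \int p_{\theta}^{1/2} p_{\theta_0}^{1/2}d \mu$, which (since a density integrates to one) is equal to

$\frac{1}{2}\int p_{\theta}d \mu +\frac{1}{2}\int p_{\theta_0}d \mu -\int p_{\theta}^{1/2} p_{\theta_0}^{1/2} d \mu= \frac{1}{2}\int (p_{\theta}^{1/2}-p_{\theta_0}^{1/2})^2 d \mu=h^2(p_{\theta},p_{\theta_0}),$

the squared Hellinger distance. Hence $K(p_{\theta}, p_{\theta_0}) \geq 2h^2(p_{\theta},p_{\theta_0})$: whenever the Kullback-Leibler information tends to zero, so does the Hellinger distance.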
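The chain of inequalities above is easy to check numerically. The following sketch uses two hypothetical distributions on three points, with counting measure playing the role of $\mu$. It computes the Kullback-Leibler distance, the squared Hellinger distance in its integral form, and the equivalent expression $1 - \int p_{\theta}^{1/2} p_{\theta_0}^{1/2} d\mu$.

```python
import math

p0 = [0.5, 0.3, 0.2]   # hypothetical true distribution p_{theta_0}
p = [0.2, 0.5, 0.3]    # hypothetical model distribution p_theta

# K(p_theta, p_theta0) = E_{theta0} log(p_theta0 / p_theta)
kl = sum(a * math.log(a / b) for a, b in zip(p0, p))

# Hellinger affinity and squared Hellinger distance
affinity = sum(math.sqrt(a * b) for a, b in zip(p0, p))
h2 = 0.5 * sum((math.sqrt(b) - math.sqrt(a)) ** 2 for a, b in zip(p0, p))

print(kl, 2 * h2, 1 - affinity)
```

The first printed number should be at least as large as the second, and the third should coincide with half of the second.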

# Akaike Information Criterion Statistics

Consider a distribution ${(q_1, q_2, ...,q_k)}$ with ${q_i >0}$ and ${ q_1 + q_2 + ...+ q_k=1}$. Suppose ${N }$ independent drawings are made from the distribution and the resulting frequencies are given by ${ (N_1,N_2,...,N_k)}$, where ${N_1+N_2+...+N_k=N}$. Then the probability of getting the same frequencies by sampling from ${(q_1, q_2, ...,q_k)}$ is given by

$\displaystyle W = \frac{N!}{N_1!...N_k!} q_1^{N_1} q_2^{N_2}... q_k^{N_k}$

and thus

$\displaystyle \ln W \approx - N \sum\limits_{i=1}^{k}\frac{N_i}{N} \ln \left( \frac{N_i}{N q_i} \right)$

since ${\ln N! \approx N \ln N - N}$. Set ${p_i = N_i/N}$. Then

$\displaystyle \begin{array}{rcl} \ln W &\approx& - N \sum\limits_{i=1}^{k} p_i \ln (p_i / q_i) \\ &=& NB(p;q) \end{array}$

where ${B(p;q)}$ is the entropy of the distribution ${\{p_i \}}$ w.r.t. the distribution ${\{q_i \}}$. The entropy here can be interpreted as the logarithm of the probability of getting the distribution ${\{ p_i \}}$ (which could asymptotically be the true distribution) by sampling from a hypothetical distribution ${\{q_i\}}$.
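The quality of the Stirling approximation behind ${\ln W \approx NB(p;q)}$ can be inspected directly. The sketch below compares the exact ${\ln W}$, computed with the log-gamma function, against ${NB(p;q)}$ on a per-observation scale. The distributions `p` and `q` and the two sample sizes are hypothetical values chosen for illustration.

```python
import math

def log_multinomial_prob(counts, q):
    """Exact ln W for the multinomial probability, via log-gamma."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(qi) for c, qi in zip(counts, q))

def entropy_B(p, q):
    """B(p; q) = -sum_i p_i ln(p_i / q_i)."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.4, 0.4, 0.2]    # hypothetical sampling distribution
p = [0.5, 0.3, 0.2]    # hypothetical observed relative frequencies

gaps = {}
for n in (100, 10000):
    counts = [round(n * pi) for pi in p]
    # Per-observation gap between the exact value and the approximation.
    gaps[n] = abs(log_multinomial_prob(counts, q) / n - entropy_B(p, q))

print(gaps)
```

The gap comes from the terms dropped in ${\ln N! \approx N \ln N - N}$, which grow only logarithmically in ${N}$, so it shrinks once divided by ${N}$.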

Based on Sanov’s result (1961), the above discussion may be extended to more general distributions. Let ${f}$ and ${g}$ be the pdfs of the true and hypothetical distributions respectively, and let ${f_N}$ be the pdf estimate based on a random sample of ${N}$ observations from ${g}$. Then

$\displaystyle B(f;g) = - \int f(z) \ln \left( \frac{f(z)}{g(z)} \right) dz$

is obtained as the limit ${ \lim\limits_{\epsilon \downarrow 0} \lim\limits_{N \rightarrow \infty} N^{-1} \ln P(\sup_x |f_N(x)- f(x)| < \epsilon)}$. Note that ${- B(f;g) }$ equals ${E_f [ \ln (f(z)/g(z))] }$, which is the Kullback-Leibler divergence between ${f}$ and ${g}$. Note also that ${B(f;g) \leq 0 }$. This follows from Jensen's inequality, because

$\displaystyle \begin{array}{rcl} - \mathbb{E}_f \left[ \ln \frac{f(z)}{g(z)}\right] &=& \mathbb{E}_f \left[ \ln \frac{g(z)}{f(z)} \right] \\ &\leq& \ln \mathbb{E}_f \left[\frac{g(z)}{f(z)}\right] = \ln \int \frac{g(z)}{f(z)}f(z) dz = 0 \end{array}$

Suppose that we observe a data set ${\mathbf{x}}$ of N elements. We could predict the future observations ${\mathbf{y}}$ whose distribution is identical to that of ${\mathbf{x}}$ by specifying a predictive distribution ${ g(\mathbf{y} | \mathbf{x}) }$ which is a function of the given dataset ${ \mathbf{x}}$. The “closeness” of ${ g(\mathbf{y} | \mathbf{x}) }$ to the true distribution of the future observations ${f(\mathbf{y})}$ is measured by the entropy

$\displaystyle \begin{array}{rcl} B(f(\cdot); g(\cdot| \mathbf{x})) &=& -\int \left( \frac{f(\mathbf{y})}{ g(\mathbf{y} | \mathbf{x})} \right) \ln \left( \frac{f(\mathbf{y})}{ g(\mathbf{y} | \mathbf{x})} \right) g(\mathbf{y} | \mathbf{x}) d \mathbf{y}\\ &=& \int f(\mathbf{y}) \ln g(\mathbf{y} | \mathbf{x}) d \mathbf{y} - \int f(\mathbf{y}) \ln f(\mathbf{y}) d \mathbf{y} \\ &=& \mathbb{E}_y \ln g(\mathbf{y} | \mathbf{x}) - c \end{array}$

Hence, apart from the constant ${c = \int f(\mathbf{y}) \ln f(\mathbf{y}) d \mathbf{y}}$, which does not depend on the model, the entropy is equivalent to the expected log-likelihood with respect to a future observation. The goodness of the estimation procedure specified by ${ g(\mathbf{y} | \mathbf{x}) }$ is measured by ${\mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y} | \mathbf{x})}$, which is the average over the observed data of the expected log-likelihood of the model ${ g(\mathbf{y} | \mathbf{x}) }$ w.r.t. a future observation.

Suppose ${\mathbf{x}}$ and ${\mathbf{y}}$ are independent and that the distribution ${g(\cdot|\mathbf{x})}$ is specified by a fixed parameter vector ${\mathbf{\theta}}$, i.e. ${ g(\cdot|\mathbf{x}) = g(\cdot|\mathbf{\theta})}$. Then ${\ln g(\mathbf{x}|\mathbf{x})=\ln g(\mathbf{x}|\mathbf{\theta})}$ and hence the conventional ML estimation procedure is justified, as

$\displaystyle \mathbb{E}_x \ln g(\mathbf{x}|\mathbf{\theta}) = \mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y}|\mathbf{x})$

However, in general, for instance when ${\mathbf{\theta}}$ is estimated from ${\mathbf{x}}$,

$\displaystyle \mathbb{E}_x \ln g(\mathbf{x}|\mathbf{x}) \neq \mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y}|\mathbf{x}).$
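The size of this discrepancy can be seen in a toy simulation. The sketch below assumes a Gaussian location model with known unit variance, which is not a model discussed above: ${\mathbf{x}}$ and ${\mathbf{y}}$ are independent samples of size ${N}$ from ${N(0,1)}$, and ${g(\cdot|\mathbf{x})}$ is the ${N(\hat{\mu},1)}$ density with ${\hat{\mu}}$ the sample mean of ${\mathbf{x}}$. The sample size, the replication count and the function name `loglik` are illustrative choices.

```python
import math
import random

random.seed(2)
N, R = 10, 20000     # sample size and number of Monte Carlo replications

def loglik(data, mu):
    """Log-likelihood of i.i.d. N(mu, 1) data."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (v - mu) ** 2 for v in data)

in_sample = 0.0
out_of_sample = 0.0
for _ in range(R):
    x = [random.gauss(0.0, 1.0) for _ in range(N)]   # observed data
    y = [random.gauss(0.0, 1.0) for _ in range(N)]   # independent future data
    mu_hat = sum(x) / N                              # ML estimate based on x only
    in_sample += loglik(x, mu_hat) / R               # estimates E_x ln g(x|x)
    out_of_sample += loglik(y, mu_hat) / R           # estimates E_x E_y ln g(y|x)

gap = in_sample - out_of_sample
print(in_sample, out_of_sample, gap)
```

In this particular model the two expectations can be computed by hand and differ by exactly one, the number of fitted parameters, so the simulated gap should be close to 1. The in-sample log-likelihood is systematically too optimistic, which is the bias that an information criterion has to correct.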