## Akaike Information Criterion Statistics

Consider a distribution ${(q_1, q_2, ...,q_k)}$ with ${q_i >0}$ and ${ q_1 + q_2 + ...+ q_k=1}$. Suppose ${N }$ independent drawings are made from the distribution and the resulting frequencies are given by ${ (N_1,N_2,...,N_k)}$, where ${N_1+N_2+...+N_k=N}$. Then the probability of getting the same frequencies by sampling from ${(q_1, q_2, ...,q_k)}$ is given by

$\displaystyle W = \frac{N!}{N_1!...N_k!} q_1^{N_1} q_2^{N_2}... q_k^{N_k}$

and thus

$\displaystyle \ln W \approx - N \sum\limits_{i=1}^{k}\frac{N_i}{N} \ln \left( \frac{N_i}{N q_i} \right)$

since ${\ln N! \approx N \ln N - N}$. Set ${p_i = N_i/N}$. Then

$\displaystyle \begin{array}{rcl} \ln W &\approx& - N \sum\limits_{i=1}^{k} p_i \ln (p_i / q_i) \\ &=& NB(p;q) \end{array}$

where ${B(p;q)}$ is the entropy of the distribution ${\{p_i \}}$ w.r.t. the distribution ${\{q_i \}}$. The entropy here can be interpreted as the logarithm of the probability of getting the distribution ${\{ p_i \}}$ (which could asymptotically be the true distribution) by sampling from a hypothetical distribution ${\{q_i\}}$.
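This relation between ${\ln W}$ and the entropy can be checked numerically. The sketch below uses made-up counts and a made-up hypothetical distribution (none of these numbers come from the text): the exact ${\ln W}$ is computed via the log-Gamma function and compared with ${N B(p;q)}$, which agrees with it to leading order in ${N}$.

```python
import math

# Hypothetical sampling distribution q and observed frequencies N_i
# (illustrative numbers only, not taken from the text).
q = [0.2, 0.3, 0.5]
counts = [180, 320, 500]   # N_i, with N = 1000
N = sum(counts)

# Exact log-probability ln W, using lgamma: ln N! = lgamma(N + 1).
lnW = math.lgamma(N + 1)
for Ni, qi in zip(counts, q):
    lnW += -math.lgamma(Ni + 1) + Ni * math.log(qi)

# Entropy approximation: ln W ~ N * B(p; q) with p_i = N_i / N.
B = -sum((Ni / N) * math.log((Ni / N) / qi) for Ni, qi in zip(counts, q))
approx = N * B

print(lnW, approx)
```

The residual gap between the two values is the ${O(\ln N)}$ correction dropped by Stirling's formula, which is negligible relative to ${N}$.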

Based on Sanov's result (1961), the above discussion may be extended to more general distributions. Let ${f}$ and ${g}$ be the pdfs of the true and hypothetical distributions respectively, and ${f_N}$ the pdf estimate based on the random sampling of ${N}$ observations from ${g}$. Then

$\displaystyle B(f;g) = - \int f(z) \ln \left( \frac{f(z)}{g(z)} \right) dz = \lim\limits_{\epsilon \downarrow 0} \lim\limits_{N \rightarrow \infty} N^{-1} \ln P \left( \sup_x |f_N(x)- f(x)| < \epsilon \right).$

Note that ${- B(f;g) }$ equals ${\mathbb{E}_f [ \ln (f(z)/g(z))] }$, which is the Kullback-Leibler divergence between ${f}$ and ${g}$. Note also that ${B(f;g) \leq 0 }$: by Jensen's inequality,

$\displaystyle \begin{array}{rcl} - \mathbb{E}_f \left[ \ln \frac{f(z)}{g(z)}\right] &=& \mathbb{E}_f \left[ \ln \frac{g(z)}{f(z)} \right] \\ &\leq& \ln \mathbb{E}_f \left[\frac{g(z)}{f(z)}\right] = \ln \int \frac{g(z)}{f(z)}f(z) dz = 0 \end{array}$
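These two facts can be illustrated numerically with two arbitrary discrete distributions (the weights below are invented for the example): the Kullback-Leibler divergence ${\mathbb{E}_f[\ln(f/g)]}$ is non-negative, hence ${B(f;g) \leq 0}$, with equality exactly when ${f = g}$.

```python
import math
import random

random.seed(0)

def normalize(w):
    """Scale positive weights so they sum to one."""
    s = sum(w)
    return [x / s for x in w]

# Two hypothetical discrete distributions f and g on the same support.
f = normalize([random.random() for _ in range(6)])
g = normalize([random.random() for _ in range(6)])

def kl(f, g):
    """Kullback-Leibler divergence E_f[ln(f/g)]; B(f; g) is its negative."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

print(kl(f, g), kl(f, f))  # KL(f||g) >= 0; KL(f||f) = 0
```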

Suppose that we observe a data set ${\mathbf{x}}$ of ${N}$ elements. We could predict the future observations ${\mathbf{y}}$, whose distribution is identical to that of ${\mathbf{x}}$, by specifying a predictive distribution ${ g(\mathbf{y} | \mathbf{x}) }$, which is a function of the given data set ${ \mathbf{x}}$. The “closeness” of ${ g(\mathbf{y} | \mathbf{x}) }$ to the true distribution of the future observations ${f(\mathbf{y})}$ is measured by the entropy

$\displaystyle \begin{array}{rcl} B(f(.); g(.| \mathbf{x})) &=& -\int \left( \frac{f(\mathbf{y})}{ g(\mathbf{y} | \mathbf{x})} \right) \ln \left( \frac{f(\mathbf{y})}{ g(\mathbf{y} | \mathbf{x})} \right) g(\mathbf{y} | \mathbf{x}) d \mathbf{y}\\ &=& \int f(\mathbf{y}) \ln g(\mathbf{y} | \mathbf{x}) d \mathbf{y} - \int f(\mathbf{y}) \ln f(\mathbf{y}) d \mathbf{y} \\ &=& \mathbb{E}_y \ln g(\mathbf{y} | \mathbf{x}) - c \end{array}$

Hence the entropy is equivalent, apart from a constant, to the expected log-likelihood with respect to a future observation. The goodness of the estimation procedure specified by ${ g(\mathbf{y} | \mathbf{x}) }$ is measured by ${\mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y} | \mathbf{x})}$, which is the average over the observed data of the expected log-likelihood of the model ${ g(\mathbf{y} | \mathbf{x}) }$ w.r.t. a future observation.

Suppose ${\mathbf{x}}$ and ${\mathbf{y}}$ are independent and that the distribution ${g(.|\mathbf{x})}$ is specified by a fixed parameter vector ${\mathbf{\theta}}$ (i.e. ${ g(.|\mathbf{x}) = g(.|\mathbf{\theta})}$). Then ${\ln g(\mathbf{x}|\mathbf{x})=\ln g(\mathbf{x}|\mathbf{\theta})}$, and since ${\mathbf{x}}$ and ${\mathbf{y}}$ are identically distributed, the conventional ML estimation procedure is justified as

$\displaystyle \mathbb{E}_x \ln g(\mathbf{x}|\mathbf{\theta}) = \mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y}|\mathbf{x})$

However generally

$\displaystyle \mathbb{E}_x \ln g(\mathbf{x}|\mathbf{x}) \neq \mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y}|\mathbf{x})$

Akaike proposes that the log-likelihood of the data-dependent model, as distinct from the log-likelihood of the parameter ${\theta}$, be defined by

$\displaystyle l(g(.|\mathbf{x})) = \ln (g(\mathbf{x}| \mathbf{x})) +C$

where ${C}$ is a constant such that

$\displaystyle \mathbb{E}_x l(g(.|\mathbf{x})) = \mathbb{E}_x \mathbb{E}_y \ln g(\mathbf{y}| \mathbf{x}).$

For the above definition to be operational, the constant ${C}$ must be the same for all members of the family of possible models. That can be achieved by restricting ${g(\mathbf{y}|\mathbf{x})}$ to be of the form ${g(\mathbf{y}|\mathbf{\theta}(\mathbf{x}))}$.

Let ${g_m (\mathbf{y}|_m \theta(\mathbf{x}))}$, ${m=1,...,M}$ denote ${M}$ competing models. Assume that the true distribution belongs to each of these models. Use the notation ${g(\mathbf{y}| \mathbf{\theta}_0)}$ for ${ f (\mathbf{y})}$ and assume that the usual regularity conditions hold. Let ${ _m \hat{\mathbf{\theta}} (\mathbf{x}) }$ denote the ML estimate of ${ _m \mathbf{\theta}}$.

1. As ${ N \rightarrow \infty }$, the LR statistic

$\displaystyle 2 \ln g (\mathbf{x} |_m \mathbf{\hat{\theta}(x)}) - 2 \ln g(\mathbf{x} | \mathbf{\theta}_0) \sim \chi^2_r$

asymptotically, where ${r=\dim {}_m \mathbf{\theta}}$.

2. Expanding ${\ln g (\mathbf{y} |_m \mathbf{\hat{\theta}(x)})}$ about ${\mathbf{\theta}_0}$ (the first-order term is dropped since its expectation over ${\mathbf{y}}$ is zero) we get

$\displaystyle \begin{array}{rcl} 2 \ln g(\mathbf{y} | \mathbf{\theta}_0) - 2 \ln g (\mathbf{y} |_m \mathbf{\hat{\theta}(x)}) &=& 2 \ln g(\mathbf{y} | \mathbf{\theta}_0) \\ &-& 2[\ln g(\mathbf{y} | \mathbf{\theta}_0) + \frac{1}{2} (_m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 )^T D^2_{\theta} (\theta_0)(_m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 ) \\ &+& \text{terms of higher order in } (_m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 )] \end{array}$

Ignoring the higher order terms we have

$\displaystyle \begin{array}{rcl} 2 \ln g(\mathbf{y} | \mathbf{\theta}_0) - 2 \ln g (\mathbf{y} |_m \mathbf{\hat{\theta}(x)}) &=& (_m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 )^T (- D^2_{\theta} (\theta_0))(_m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 ) \end{array}$

Note that the property of the best asymptotic normality of ${_m \mathbf{\hat{\theta}(x)} }$ implies that

$\displaystyle \sqrt{N} \{ _m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 \} \xrightarrow{D} N(0, \mathbf{I}^{-1}_{\theta_0})$

where ${\mathbf{I}_{\theta_0} \equiv - \frac{1}{N} \mathbb{E}_y [D^2_{\theta} (\theta_0) ]}$. Hence

$\displaystyle \begin{array}{rcl} 2 \mathbb{E}_y [ \ln g(\mathbf{y} | \mathbf{\theta}_0) - \ln g (\mathbf{y} |_m \mathbf{\hat{\theta}(x)})]= N \{ _m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 \}^T \mathbf{I}_{\theta_0} \{ _m \mathbf{\hat{\theta}(x)} - \mathbf{\theta}_0 \} \xrightarrow{D} \chi^2_r \end{array}$

as ${N \rightarrow \infty}$.

Combining 1 and 2 we thus get

$\displaystyle 2 \mathbb{E}_x [ \ln g (\mathbf{x} |_m \mathbf{\hat{\theta}(x)}) - \ln g(\mathbf{x} | \mathbf{\theta}_0) ]= r$

and

$\displaystyle 2 \mathbb{E}_x \mathbb{E}_y [ \ln g(\mathbf{y} | \mathbf{\theta}_0) - \ln g (\mathbf{y} |_m \mathbf{\hat{\theta}(x)})]= r$

and thus it follows that

$\displaystyle C=-r$
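Result 1, on which the identification ${C=-r}$ rests, can be illustrated by simulation in the simplest possible setting: ${x_i \sim N(\theta, 1)}$ with the mean estimated by ML, so ${r=1}$ and the LR statistic reduces to ${N(\bar{x} - \theta_0)^2}$. The sample size and number of replications below are arbitrary choices for the demonstration.

```python
import random

random.seed(1)

# Monte Carlo check that 2*E_x[ln g(x|theta_hat) - ln g(x|theta_0)] = r
# for x_i ~ N(theta, 1), ML estimate = sample mean, r = 1.
# For this model 2*(lnL(theta_hat) - lnL(theta_0)) = N*(xbar - theta_0)^2.
theta0, N, reps = 0.0, 50, 20000
stats = []
for _ in range(reps):
    x = [random.gauss(theta0, 1.0) for _ in range(N)]
    xbar = sum(x) / N
    stats.append(N * (xbar - theta0) ** 2)

mean_lr = sum(stats) / reps
print(mean_lr)  # should be close to r = 1, the mean of a chi^2_1 variate
```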

Hence a model which could be adopted is the one which maximizes ${\ln g(\mathbf{x}|_m \hat{\theta}(\mathbf{x}))-r }$ over ${m=1,...,M}$. The basic principle underlying this procedure is **entropy maximization**, as the maximization of ${\ln g(\mathbf{x}|_m \hat{\theta}(\mathbf{x}))-r }$ is equivalent to the maximization of an estimate of ${B\{g(.|\theta_0); g(.|_m \hat{\theta}(\mathbf{x}))\}}$. This maximization problem is more commonly replaced by the equivalent problem of minimization of ${-2\ln g(\mathbf{x}|_m \hat{\theta}(\mathbf{x}))+2r }$, which can be generally expressed as

$\displaystyle \boxed{AIC(m) = - 2 \ln(\text{maximized likelihood})+2(\# \text{ of independently adjusted parameters})}$

On the RHS, the first term reflects the lack of fit while the second penalizes model complexity. The optimum model, which minimizes the AIC, reflects the trade-off between the two terms.
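A minimal sketch of the criterion in use, with invented data: two competing Gaussian models with known unit variance, one fixing the mean at the true value (${r=0}$) and one estimating it by ML (${r=1}$). The extra parameter buys at most a small increase in the maximized log-likelihood, which the ${2r}$ penalty has to offset.

```python
import math
import random

random.seed(2)

# Hypothetical data from N(0, 1); model 1 fixes the mean at 0 (no adjusted
# parameters), model 2 estimates the mean by ML (one adjusted parameter).
x = [random.gauss(0.0, 1.0) for _ in range(100)]
N = len(x)

def gauss_loglik(x, mu):
    """Log-likelihood of x under N(mu, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - mu) ** 2 for xi in x)

xbar = sum(x) / N                              # ML estimate of the mean
aic1 = -2 * gauss_loglik(x, 0.0) + 2 * 0       # r = 0
aic2 = -2 * gauss_loglik(x, xbar) + 2 * 1      # r = 1

print(aic1, aic2)  # the model with the smaller AIC is selected
```

Since ${\bar{x}}$ maximizes the likelihood, the second model's fit term is never worse; the comparison turns entirely on whether the improvement exceeds the penalty of ${2}$.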

——

References

Akaike, H. (1978a). On the Likelihood of a Time Series Model. Journal of the Royal Statistical Society. Series D (The Statistician). Vol. 27, No. 3/4, Partial Proceedings of the 1978 I.O.S. Annual Conference on Time Series Analysis (and Forecasting) (Sep. – Dec., 1978), pp. 217-235

Akaike, H. (1985). Prediction and entropy. Pages 1-24 in A. C. Atkinson, and S. E. Fienberg (Eds.) A celebration of statistics. Springer, New York, NY.

Sanov, I. (1961). On the probability of large deviations of random variables. IMS and AMS Selected Translations in Mathematical Statistics and Probability. 1, 213-244.

Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike information criterion statistics. KTK Scientific Publishers, Tokyo.

Tong, H. (1990). Non-linear Time Series: A Dynamical System Approach. Oxford University Press.

Tong, H. (1994). Akaike’s approach can yield consistent order determination. Pages 93-103 in H. Bozdogan (Ed.) Engineering and Scientific Applications. Vol. 1, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach. Kluwer Academic Publishers, Dordrecht, Netherlands.