STAT3017 Final Project Page 1 of 6
Big Data Statistics - Final Project (v1)
Total of 100 Marks
due Monday 29 October 2018 at 17:00
In this project we consider how to test hypotheses about covariance matrices and, since everything
“eighties” is trendy again, we will look at some multivariate time series papers from that epoch and
make them fresh again. There are interesting connections between these two topics. The aim of the
project is to take the modern viewpoint and understand what happens in both situations when the
dimensionality p of the observations becomes large.
Testing Covariance Matrices
Question 1 [5 marks]
As a warm-up, read Section 6.6 in [C] and reproduce the calculations of Example 6.12 in R. In
this example, Box’s M-test is used to study nursing home data from Wisconsin (data found in
Example 6.10). If you have slightly different results to the book, briefly explain why.
Question 2 [10 marks]
Box’s M-test (a.k.a. Box’s χ² approximation) is a classic result that is based on a likelihood ratio
test (LRT). The general philosophy behind an LRT is to maximise the likelihood under the null
hypothesis H0 and also to maximise the likelihood under the alternative hypothesis H1.
Definition 1. If the distribution of the random sample x = (x1, . . . , xn)′ depends upon a parameter
vector θ, and if H0 : θ ∈ Ω0 and H1 : θ ∈ Ω1 are any two hypotheses, then the likelihood ratio
statistic for testing H0 against H1 is defined as

    λ1 = L*0 / L*1,

where L*i = sup{L(θ) : θ ∈ Ωi} is the largest value which the likelihood function takes in the
region Ωi, i = 0, 1.
At this point it is good to remember that a multivariate Normal distribution is completely characterised
by the parameter vector θ = (μ, Σ), i.e., only the mean vector and the covariance matrix
are needed to know the distribution.
The LRT has the following important asymptotic property as n → ∞, which Box leverages to obtain
his χ² approximation.
Theorem 1. If Ω1 is a q-dimensional region and Ω0 is an r-dimensional subregion of Ω1, then
(under some technical assumptions) for each ω ∈ Ω0, −2 log(λ1) has an asymptotic χ² distribution
with q − r degrees of freedom as n → ∞.
The explanation of why Theorem 1 is true starts in Section 10.2 of [B], where the LRT is derived,
culminating in a critical region for λ1 given by eq. (9). At this point, no assumptions are made about
the distribution of the population covariance matrices Σ1, . . . , Σq (so we don’t know how λ1 is
distributed). Assumptions are made in Section 10.4: the sample covariances are assumed Wishart
distributed, which occurs when the random samples x1, . . . , xn are multivariate Normal. Box’s
asymptotic χ² approximation is obtained in Section 10.5 thanks to a formula for the h-th moment
of λ1. As E[λ1^h] has a specific form (given in terms of ratios of Gamma functions), Theorem 8.5.1
of [B] can be applied to get an approximation of P(−2ρ log(λ1) ≤ z) in terms of the χ² distribution.
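To fix ideas, here is a minimal Python sketch of Box’s M-test as described in [C] (the function name and structure are ours; in R, for example, boxM from the heplots package computes the same quantity, and for the project you would work in R):

```python
import numpy as np
from scipy import stats

def box_m_test(samples):
    """Box's M-test for equality of covariance matrices across g groups.

    samples: list of (n_l x p) data matrices, one per group.
    Returns Box's chi-squared statistic C and its approximate p-value.
    """
    g = len(samples)
    p = samples[0].shape[1]
    ns = np.array([x.shape[0] for x in samples])       # group sizes n_l
    covs = [np.cov(x, rowvar=False) for x in samples]  # unbiased S_l
    n_tot = ns.sum()
    # pooled covariance, weighted by n_l - 1
    S_pool = sum((n - 1) * S for n, S in zip(ns, covs)) / (n_tot - g)
    # M = (sum (n_l - 1)) ln|S_pool| - sum (n_l - 1) ln|S_l|
    M = (n_tot - g) * np.linalg.slogdet(S_pool)[1] \
        - sum((n - 1) * np.linalg.slogdet(S)[1] for n, S in zip(ns, covs))
    # Box's scale factor and chi-squared degrees of freedom
    u = (np.sum(1.0 / (ns - 1)) - 1.0 / (n_tot - g)) * \
        (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))
    C = (1 - u) * M
    df = 0.5 * p * (p + 1) * (g - 1)
    return C, 1 - stats.chi2.cdf(C, df)
```

For instance, splitting the iris measurements by species and passing the three 50 × 4 matrices to this function reproduces the kind of calculation asked for in Question 2.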
Dale Roberts - Australian National University
Last updated: September 21, 2018
Now that you understand some of the theory, study the classic “iris” dataset (available in R in the
iris variable). The populations are Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each
sample consists of 50 observations. Use Box’s M-test (or otherwise) to:
(a)[5] Test the hypothesis Σ1 = Σ2 at the 5% significance level.
(b)[5] Test the hypothesis Σ1 = Σ2 = Σ3 at the 5% significance level.
Note: this is Problem 10.1 from [B].
Question 3 [10 marks]
On page 311 in [C], just above Example 6.12, the authors make the comment that “Box’s χ²
approximation works well if each nℓ exceeds 20 and if p and g do not exceed 5”. Your task is to
perform a simulation study (see [J]) to show what happens to Box’s χ² approximation when p
exceeds 5 while holding g fixed, e.g., g = 2. This means you have to design an experiment to
show how badly Box’s test performs for large p by choosing appropriate Σ1 and Σ2, simulating
sample data, etc. Present your results in a clear manner (see [J] for presentation tips).
Question 4 [10 marks]
We are now going to look at the problem of testing whether a covariance matrix is equal to a given
matrix. If observations x1, . . . , xn are multivariate Normal Np(ν, Ψ), we wish to test the hypothesis
H0 : Ψ = Ψ0, where Ψ0 is a given positive definite matrix. Let Q be the matrix such that

    Q Ψ0 Q′ = I,

then set μ := Qν and Σ := Q Ψ Q′. If we define yi := Q xi, it follows that y1, . . . , yn are observations
from Np(μ, Σ) and the hypothesis H0 is transformed to H0 : Σ = I. Using the LRT approach, we
can find the test statistic

    λ1 = (e/n)^{pn/2} |A|^{n/2} exp(−tr(A)/2),  where A := Σ_{i=1}^n (yi − ȳ)(yi − ȳ)′.

Unfortunately λ1 is a biased statistic. The following unbiased test statistic was proposed in [A]:

    λ1* = (e/N)^{pN/2} |A|^{N/2} exp(−tr(A)/2),

where N := n − 1 and S := A/N is the sample covariance matrix. The distribution of λ1* has the
following χ² approximation:

    P(−2ρ log λ1* ≤ z) = P(Cf ≤ z) + (γ2/(ρ²(n − 1)²)) (P(Cf+4 ≤ z) − P(Cf ≤ z)) + O(n⁻³),   (1)

where Ck ~ χ²k (i.e., χ² distributed with k degrees of freedom), f := p(p + 1)/2,
ρ := 1 − (2p² + 3p − 1)/[6(n − 1)(p + 1)], and γ2 := p(2p⁴ + 6p³ + p² − 12p − 13)/[288(p + 1)].
All the details can be found in [B] Section 10.8.1, [B] around Eq. (19) on p. 441, and [A].
Perform a simulation study to understand the performance (type I error and power) of (1) for
n = 500 and p = 5, 10, 50, 100, 300; see [K].
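A minimal Python sketch of the corrected statistic −2ρ log λ1* and approximation (1), under our reading of the formulas above (helper name ours), can serve as a starting point for the simulation study:

```python
import numpy as np
from scipy import stats

def nagao_test(x):
    """Test H0: Sigma = I via -2*rho*log(lambda1*) and approximation (1).

    x: (n x p) data matrix; rows are assumed N_p(mu, Sigma).
    Returns the test statistic and its approximate p-value.
    """
    n, p = x.shape
    N = n - 1
    xc = x - x.mean(axis=0)
    A = xc.T @ xc                                    # scatter matrix
    # log(lambda1*) = (pN/2)(1 - log N) + (N/2) log|A| - tr(A)/2
    log_lam = 0.5 * p * N * (1 - np.log(N)) \
              + 0.5 * N * np.linalg.slogdet(A)[1] - 0.5 * np.trace(A)
    f = 0.5 * p * (p + 1)
    rho = 1 - (2 * p**2 + 3 * p - 1) / (6.0 * (n - 1) * (p + 1))
    gamma2 = p * (2 * p**4 + 6 * p**3 + p**2 - 12 * p - 13) / (288.0 * (p + 1))
    z = -2 * rho * log_lam
    # approximation (1): P(-2 rho log lambda1* <= z)
    cdf = stats.chi2.cdf(z, f) + gamma2 / (rho**2 * (n - 1)**2) * \
          (stats.chi2.cdf(z, f + 4) - stats.chi2.cdf(z, f))
    return z, 1 - cdf
```

Repeating this over many samples drawn under H0 (and under alternatives) gives empirical type I error and power estimates for each (n, p) pair.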
Question 5 [10 marks]
Continuing the previous question (and its notation), notice that

    −(2/N) log λ1* = tr(S) − log|S| − p.

Setting T1 := tr(S) − log|S| − p, prove the following theorem.

Theorem 2. Assume that n → ∞, p → ∞, and p/n → y ∈ (0, 1). Then

    T1 − p d1(yN) → N(μ1, σ1²) in distribution,

where N := n − 1, yN := p/N, and

    d1(y) := 1 + ((1 − y)/y) log(1 − y),
    μ1 := −(1/2) log(1 − y),
    σ1² := −2 log(1 − y) − 2y.

Hint: Apply the Theorem in Lecture 6 on page 7 with (1/p) T1 := F^S(f), where
f(x) = x − log x − 1. Also see [D].
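Before attempting the proof, it can help to check the theorem numerically. The following Python sketch (function name ours) computes the centred and scaled statistic, which should look approximately standard normal under H0 when both n and p are large:

```python
import numpy as np

def t1_centered(x):
    """Centred and scaled T1 = tr(S) - log|S| - p, as in Theorem 2."""
    n, p = x.shape
    N = n - 1
    y = p / N                                       # y_N := p/N
    xc = x - x.mean(axis=0)
    S = xc.T @ xc / N                               # sample covariance
    T1 = np.trace(S) - np.linalg.slogdet(S)[1] - p
    d1 = 1 + (1 - y) / y * np.log(1 - y)            # Marchenko-Pastur centering
    mu1 = -0.5 * np.log(1 - y)
    sigma1 = np.sqrt(-2 * np.log(1 - y) - 2 * y)
    return (T1 - p * d1 - mu1) / sigma1             # approx N(0,1) under H0
```

Replicating this over many simulated Np(0, I) samples, the empirical mean and variance of the output should be close to 0 and 1 respectively.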
Question 6 [10 marks]
Continuing the previous question and notation, use Theorem 2 to construct an algorithm that
tests H0 : Σ = I and perform a simulation study to understand its performance (type I error and
power) for p = 5, 10, 50, 100, 300. Comment on how it performs compared to (1).
Multivariate Time Series
Let ℤ denote the set of integers. A sequence of random vector observations (xt : t = 1, . . . , T)
with values in ℝp is called a p-dimensional (vector) time series. We denote the sample mean and
sample covariance matrix by

    x̄ := (1/T) Σ_{t=1}^T xt  and  Ĉ0 := (1/T) Σ_{t=1}^T (xt − x̄)(xt − x̄)′.

The lag-τ sample cross-covariance (aka. autocovariance) matrix is defined as

    Ĉτ := (1/T) Σ_{t=τ+1}^T (xt − x̄)(xt−τ − x̄)′.

The lag-τ cross-correlation is given by

    ρ̂τ = D Ĉτ D

where D = diag(1/√s11, 1/√s22, . . . , 1/√spp) and the values sij come from Ĉ0 = [sij]. Assuming
E[xt] = 0, some authors (e.g., [H], [I]) omit x̄ and consider the symmetrised lag-τ sample
cross-covariance given by

    C̃τ := (1/(2T)) Σt (xt x′t+τ + xt+τ x′t).
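These sample quantities are straightforward to compute directly; a short Python sketch (helper names ours, and an R translation for the project is one-for-one):

```python
import numpy as np

def cross_cov(x, tau):
    """Lag-tau sample cross-covariance matrix of a (T x p) series."""
    T = x.shape[0]
    xc = x - x.mean(axis=0)                 # subtract the sample mean
    # pairs (x_t, x_{t-tau}) for t = tau+1, ..., T, normalised by T
    return xc[tau:].T @ xc[:T - tau] / T

def cross_corr(x, tau):
    """Lag-tau sample cross-correlation rho_tau = D C_tau D."""
    D = np.diag(1.0 / np.sqrt(np.diag(cross_cov(x, 0))))
    return D @ cross_cov(x, tau) @ D
```

At lag 0 the cross-correlation matrix has unit diagonal by construction, which is a useful sanity check.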
Question 7 [12 marks]
Simulation is a helpful way to learn about vector time series. Define the matrices
Generate 300 observations from the “vector autoregressive” VAR(1) model

    xt = A xt−1 + εt    (2)

where εt ~ N2(0, Σ), i.e., they are i.i.d. bivariate normal random variables with mean zero and
covariance Σ. Note that when simulating it is customary to omit the first 100 or more observations
(the burn-in) and you can start with x0 = (0, 0)′.

Also generate 300 observations from the “vector moving average” VMA(1) model

    xt = εt + A εt−1.    (3)
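A minimal Python sketch of the two recursions follows; the A and Σ below are placeholders, not the assignment’s matrices (which you should take from the original sheet), and the function names are ours:

```python
import numpy as np

# Placeholder parameters, for illustration only -- substitute the
# matrices given in the assignment.
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

def simulate_var1(A, Sigma, T=300, burn=100, seed=0):
    """Simulate T observations of x_t = A x_{t-1} + eps_t after a burn-in."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)            # eps_t = L z_t, z_t ~ N(0, I)
    x = np.zeros(2)                          # start at x_0 = (0, 0)'
    out = []
    for t in range(T + burn):
        x = A @ x + L @ rng.standard_normal(2)
        if t >= burn:                        # discard the burn-in
            out.append(x)
    return np.array(out)

def simulate_vma1(A, Sigma, T=300, seed=0):
    """Simulate T observations of x_t = eps_t + A eps_{t-1}."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    eps = (L @ rng.standard_normal((2, T + 1))).T   # eps_0, ..., eps_T
    return eps[1:] + eps[:-1] @ A.T
```

The VMA(1) needs no burn-in since each xt depends on only two innovations.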
(a)[1] Plot the time series xt for the VAR(1) model given by (2).
(b)[1] Obtain the first five lags of sample cross-correlations of xt for the VAR(1) model, i.e.,
ρ̂1, . . . , ρ̂5.
(c)[1] Plot the time series xt for the VMA(1) model given by (3).
(d)[1] Obtain the first two lags of sample cross-correlations of xt for the VMA(1) model.
(e)[5] Implement the test from [F] and reproduce the simulation experiment given in Section 5.
This means you need to generate Table 1 from [F].
(f)[3] The file q-fdebt.txt contains the U.S. quarterly federal debts held by (i) foreign and
international investors, (ii) federal reserve banks, and (iii) the public. The data are from
the Federal Reserve Bank of St. Louis, from 1970 to 2012 for 171 observations, and are not
seasonally adjusted. The debts are in billions of dollars. Take the log transformation and then the
first difference of each time series. Let (xt) be the differenced log series.
Test H0 : ρ1 = · · · = ρ10 = 0 vs Ha : ρτ ≠ 0 for some τ ∈ {1, . . . , 10} using the test from
[F]. Draw your conclusion at the 5% significance level.
Question 8 [13 marks]
More generally, a p-dimensional time series xt follows a VAR model of order ℓ, VAR(ℓ), if

    xt = a0 + Σ_{i=1}^ℓ Ai xt−i + εt    (4)

where a0 is a p-dimensional constant vector and Ai are p × p (non-zero) matrices for i > 0, and
the εt ~ Np(0, Σ) are i.i.d. for all t with p × p covariance matrix Σ.
One day you might want to “build a model” using the VAR(ℓ) framework. One of the first things
you need to do is to determine the optimal order ℓ. Tiao and Box (1981) suggest using sequential
likelihood ratio tests; see Section 4 in [G]. Their approach is to compare a VAR(ℓ) model with a
VAR(ℓ − 1) model and amounts to considering the hypothesis testing problem

    H0 : Aℓ = 0 vs. H1 : Aℓ ≠ 0.
We can do this by determining model parameters using a least-squares approach. We rewrite (4)
as

    xt = Φ Xt + εt,

where Xt := (1, x′t−1, . . . , x′t−ℓ)′ is a (pℓ + 1)-dimensional vector and Φ := [a0, A1, . . . , Aℓ] is a
p × 1 + ℓ × (p × p) = p × (pℓ + 1) matrix. With observations at times t = ℓ + 1, . . . , T, we write
the data as

    Y = X Φ′ + E    (5)

where Y is a (T − ℓ) × p matrix with the ith row being x′ℓ+i, X is a (T − ℓ) × (pℓ + 1) design
matrix with the ith row being X′ℓ+i, and E is a (T − ℓ) × p matrix with the ith row being ε′ℓ+i.
The matrix Φ contains the coefficient parameters of the VAR(ℓ) model; let Σℓ be the
corresponding innovation covariance matrix. Under a normality assumption, the likelihood ratio for
the testing problem is

    Λ = (|Σ̂ℓ| / |Σ̂ℓ−1|)^{(T−ℓ)/2},

and the likelihood ratio test of H0 is equivalent to rejecting H0 for large values of
−log(|Σ̂ℓ| / |Σ̂ℓ−1|). A commonly used statistic is Bartlett’s approximation given by

    M(ℓ) = −(T − ℓ − 1.5 − pℓ) log(|Σ̂ℓ| / |Σ̂ℓ−1|),

which follows asymptotically (as n → ∞ and p fixed) a χ² distribution with p² degrees of freedom.
The following methodology is suggested for selecting the order ℓ:
1. Select a positive integer P, which is the maximum VAR order that we would like to consider.
2. Set up the regression framework (5) for the VAR(P) model. That is, there are T − P
observations (i.e., rows) in the X matrix.
3. For ℓ = 0, . . . , P, compute the least-squares estimate Φ̂ of the AR coefficient matrix. For
ℓ = 0, we have Φ̂ = â0. Then compute the ML estimate of Σℓ given by

    Σ̂ℓ := (1/(T − P)) R′ℓ Rℓ,

where Rℓ := Y − X Φ̂′ is the residual matrix of the fitted VAR(ℓ) model.
4. For ℓ = 1, . . . , P, compute the test statistic M(ℓ) and its p-value, which is based on the
asymptotic χ² distribution with p² degrees of freedom.
5. Examine the test statistics sequentially starting with ℓ = 1. If all the p-values of the M(ℓ)
test statistics are greater than the specified type I error for ℓ > m, then a VAR(m) model is
specified. This is so because the test rejects the null hypothesis Am = 0, but fails to reject
Aℓ = 0 for ℓ > m.
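Steps 1–5 above can be sketched as follows in Python (helper name ours; the lag-stacking convention is one reasonable reading of (5), using the common sample of T − P rows for every order):

```python
import numpy as np
from scipy import stats

def select_var_order(x, P=5, alpha=0.05):
    """Sequential LRT order selection for a VAR model (steps 1-5).

    x: (T x p) data matrix. Returns the selected order m and the p-values
    of M(1), ..., M(P).
    """
    T, p = x.shape
    n_obs = T - P                                   # common sample: rows P+1..T
    Y = x[P:]
    logdets = []                                    # log|Sigma_hat_l|, l = 0..P
    for l in range(P + 1):
        # design matrix: intercept plus l lagged observation vectors
        X = np.hstack([np.ones((n_obs, 1))] +
                      [x[P - i:T - i] for i in range(1, l + 1)])
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares coefficients
        R = Y - X @ B                               # residual matrix R_l
        S = R.T @ R / n_obs                         # ML innovation covariance
        logdets.append(np.linalg.slogdet(S)[1])
    pvals = []
    for l in range(1, P + 1):
        M = -(T - l - 1.5 - p * l) * (logdets[l] - logdets[l - 1])
        pvals.append(1 - stats.chi2.cdf(M, p * p))
    m = 0                                           # largest significant lag
    for l in range(1, P + 1):
        if pvals[l - 1] < alpha:
            m = l
    return m, pvals
```

On data simulated from a VAR(1) with a clearly non-zero A1, the first test should reject decisively while higher lags are usually insignificant.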
Consider a bivariate time series xt whose first component is the change in the monthly U.S.
treasury bill rate with maturity 3 months and whose second component, x^CPI_t, is the inflation
rate, in percentage, of the U.S. monthly consumer price index (CPI). The data are from the
Federal Reserve Bank of St. Louis. The CPI rate is 100 times the difference of the log CPI index.
The sample period is from January 1947 to December 2012. The data are in the file m-cpib3m.txt.
(a)[1] Plot the time series xt.
(b)[6] Select a VAR order for xt using the methodology described above.
(c)[6] Drawing on your results obtained in this project and the theory discussed in class, explain
and demonstrate (e.g., simulation study) what might happen with this methodology if the
dimensionality p of the time series becomes large.
Question 9 [20 marks]
The recent paper [H] is concerned with extensions of the classical Marchenko–Pastur law to the
time series case. Reproduce their simulation study, which is found in Section 5 and Figure 1.
References
[A] Sugiura, Nagao (1968). Unbiasedness of some test criteria for the equality of one or two covariance matrices.
Annals of Mathematical Statistics Vol. 39, No. 5, 1686–1692.
[B] Anderson (2003). An introduction to Multivariate Statistical Analysis. Wiley.
[C] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
[D] Bai, Jiang, Yao, Zheng (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of
Statistics Vol 37, No. 6B, 3822–3840.
[E] Zheng, Bai, Yao (2017). CLT for eigenvalue statistics of large-dimensional general Fisher matrices with applications.
Bernoulli 23(2), 1130–1178.
[F] Li, McLeod (1981). Distribution of the Residual Autocorrelations in Multivariate ARMA Time Series Models, J.R.
Stat. Soc. B 43, No. 2, 231–239.
[G] Tiao, Box (1981). Modelling multiple time series with applications. Journal of the American Statistical
Association 76, 802–816.
[H] Liu, Aue, Paul (2015). On the Marchenko-Pastur Law for Linear Time Series. Annals of Statistics Vol. 43, No. 2,
675–712.
[I] Liu, Aue, Paul (2017). Spectral analysis of sample autocovariance matrices of a class of linear time series in
moderately high dimensions. Bernoulli 23(4A), 2181–2209.
[J] http://www4.stat.ncsu.edu/~davidian/st810a/simulation_handout.pdf
[K] https://stats.stackexchange.com/a/40874