
STAT3017 Big Data Statistics - Final Project (v1)

Total of 100 Marks

due Monday 29 October 2018 at 17:00

In this project we consider how to test hypotheses about covariance matrices and, since everything "eighties" is trendy again, we will look at some multivariate time series papers from that epoch and make them fresh again. There are interesting connections between these two topics. The aim of the project is to take the modern viewpoint and understand what happens in both situations when the dimensionality p of the observations becomes large.

Testing Covariance Matrices

Question 1 [5 marks]

As a warm-up, read Section 6.6 in [C] and reproduce the calculations of Example 6.12 in R. In this example, Box's M-test is used to study nursing home data from Wisconsin (data found in Example 6.10). If you have slightly different results to the book, briefly explain why.
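Since the nursing home data of Example 6.10 are not built into R, they have to be entered by hand; the following minimal sketch of Box's M-test then works from the group sample covariance matrices and sample sizes (the function name box_m and its interface are illustrative, not part of [C]):

## Box's M-test for equality of g covariance matrices (chi-square approximation).
## S_list: list of p x p sample covariance matrices; n_vec: group sample sizes.
box_m <- function(S_list, n_vec) {
  g  <- length(S_list)
  p  <- nrow(S_list[[1]])
  df <- n_vec - 1                                     # within-group degrees of freedom
  Sp <- Reduce(`+`, Map(`*`, S_list, df)) / sum(df)   # pooled covariance matrix
  M  <- sum(df) * log(det(Sp)) - sum(df * sapply(S_list, function(S) log(det(S))))
  u  <- (sum(1 / df) - 1 / sum(df)) * (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  stat <- (1 - u) * M                                 # Box's chi-square approximation
  nu   <- p * (p + 1) * (g - 1) / 2                   # degrees of freedom
  list(statistic = stat, df = nu, p.value = pchisq(stat, nu, lower.tail = FALSE))
}

Comparing the resulting statistic, degrees of freedom and p-value with Example 6.12 should reveal whether any differences come from rounding of the book's intermediate quantities.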

Question 2 [10 marks]

Box's M-test (aka. Box's χ² approximation) is a classic result that is based on a likelihood ratio test (LRT). The general philosophy behind a LRT is to maximise the likelihood under the null hypothesis H0 and also to maximise the likelihood under the alternative hypothesis H1.

Definition 1. If the distribution of the random sample X = (x1, . . . , xn)′ depends upon a parameter vector θ, and if H0 : θ ∈ Ω0 and H1 : θ ∈ Ω1 are any two hypotheses, then the likelihood ratio statistic for testing H0 against H1 is defined as

λ1 := L*0 / L*1,

where L*i is the largest value which the likelihood function takes in the region Ωi, i = 0, 1.

At this point it is good to remember that a multivariate Normal distribution is completely characterised by the parameter vector θ = (μ, Σ), i.e., only the mean vector and the covariance matrix are needed to know the distribution.

The LRT has the following important asymptotic property as n → ∞ that Box leverages to obtain his χ² approximation.

Theorem 1. If Ω1 is a q-dimensional region and Ω0 is an r-dimensional subregion of Ω1, then (under some technical assumptions) for each ω ∈ Ω0, -2 log(λ1) has an asymptotic χ² distribution with q - r degrees of freedom as n → ∞.

The explanation why Theorem 1 is true starts in Section 10.2 of [B] where the LRT is derived, culminating in the critical region for λ1 given by eq. (9). At this point, no assumptions are made about the distribution of the population covariance matrices Σ1, . . . , Σq (so we don't know how λ1 is distributed). Assumptions are made in Section 10.4: covariances are assumed Wishart distributed, which occurs when the random samples x1, . . . , xn are multivariate Normal. Box's χ² asymptotic approximation is obtained in Section 10.5 thanks to a formula for the h-th moment of λ1. As E[λ1^h] has a specific form (given in terms of ratios of Gamma functions), Theorem 8.5.1 of [B] can be applied to get an approximation of P(-2ρ log(λ1) ≤ z) in terms of the χ² distribution.


Now that you understand some of the theory, study the classic "iris" dataset (available in R in the iris variable). The populations are Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each sample consists of 50 observations. Use Box's M-test (or otherwise) to:

(a)[5] Test the hypothesis Σ1 = Σ2 at the 5% significance level.

(b)[5] Test the hypothesis Σ1 = Σ2 = Σ3 at the 5% significance level.

Note: this is Problem 10.1 from [B].
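A minimal sketch for (a) and (b) in R, reusing the hypothetical box_m() helper from Question 1 (the species ordering follows the numbering in the question, so population 1 is versicolor and population 2 is setosa):

## Split the iris measurements by species and apply Box's M-test.
by_species <- split(iris[, 1:4], iris$Species)
S_list <- lapply(by_species, cov)           # sample covariance of each group
n_vec  <- sapply(by_species, nrow)          # 50 observations per group

## (a) H0: Sigma_1 = Sigma_2 (versicolor vs setosa)
box_m(S_list[c("versicolor", "setosa")], n_vec[c("versicolor", "setosa")])

## (b) H0: Sigma_1 = Sigma_2 = Sigma_3 (all three species)
box_m(S_list, n_vec)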

Question 3 [10 marks]

On page 311 in [C], just above Example 6.12, the authors make the comment that "Box's χ² approximation works well if each nℓ exceeds 20 and if p and g do not exceed 5". Your task is to perform a simulation study (see [J]) to show what happens to Box's χ² approximation when p exceeds 5 while holding g fixed, e.g., g = 2. This means you have to design an experiment to show how badly Box's test performs for large p by choosing appropriate Σ1 and Σ2, simulating sample data, etc. Present your results in a clear manner (see [J] for presentation tips).
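One possible design, as a hedged starting point: simulate both groups from the same covariance matrix, here Σ1 = Σ2 = I with nℓ = 25 and g = 2 (these and the number of replications are illustrative choices, not requirements), and record how often the nominal 5% test rejects as p grows. The sketch reuses the box_m() helper from Question 1:

## Empirical size of Box's M-test under H0 (Sigma_1 = Sigma_2 = I) as p grows.
set.seed(1)
empirical_size <- function(p, n = 25, g = 2, reps = 2000, alpha = 0.05) {
  rejections <- replicate(reps, {
    groups <- lapply(seq_len(g), function(i) matrix(rnorm(n * p), n, p))  # N_p(0, I) samples
    out <- box_m(lapply(groups, cov), rep(n, g))
    out$p.value < alpha
  })
  mean(rejections)
}
sizes <- sapply(c(2, 5, 10, 15, 20), empirical_size)
print(sizes)   # values drifting away from 0.05 indicate the approximation breaking down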

Question 4 [10 marks]

We are now going to look at the problem of testing that a covariance matrix is equal to a given matrix. If observations y1, . . . , yn are multivariate Normal Np(ν, Ψ), we wish to test the hypothesis H0 : Ψ = Ψ0 where Ψ0 is a given positive definite matrix. Let Q be the matrix such that

QΨ0Q′ = I,

then set μ := Qν and Σ := QΨQ′. If we define xi := Qyi, it follows that x1, . . . , xn are observations from Np(μ, Σ) and the hypothesis H0 is transformed to H0 : Σ = I. Using the LRT approach, we can find the test statistic

λ1 = (e/n)^{pn/2} |A|^{n/2} exp(-tr(A)/2), where A := Σ_{i=1}^n (xi - x̄)(xi - x̄)′.

Unfortunately λ1 is a biased statistic. The following modified statistic, which yields an unbiased test, was proposed:

λ*1 = (e/N)^{pN/2} |A|^{N/2} exp(-tr(A)/2),

where N := n - 1 and x̄ := (1/n) Σ_{i=1}^n xi. The distribution of λ*1 has the following χ² approximation:

P(-2ρ log λ*1 ≤ z) = P(C_f ≤ z) + [γ2 / (ρ²(n - 1)²)] (P(C_{f+4} ≤ z) - P(C_f ≤ z)) + O(n^{-3}),   (1)

where C_k ~ χ²_k (i.e., χ² distributed with k degrees of freedom), f := p(p + 1)/2, ρ := 1 - (2p² + 3p - 1)/[6(n - 1)(p + 1)], and γ2 := p(2p⁴ + 6p³ + p² - 12p - 13)/[288(p + 1)]. All the details can be found in [B] Section 10.8.1, [B] around Eq. (19) on p. 441, and [A].

Perform a simulation study to understand the performance (type I error and power) of (1) for n = 500 and p = 5, 10, 50, 100, 300; see [K].
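A minimal R sketch of the test based on (1), assuming λ*1 has the form written above (so its logarithm can be computed from A, N = n - 1 and a log-determinant); the data-generating settings in the final line are only an illustration of the type I error check:

## Test H0: Sigma = I using the chi-square approximation (1).
## x: an n-by-p data matrix.
sigma_eq_I_test <- function(x) {
  n <- nrow(x); p <- ncol(x); N <- n - 1
  A <- crossprod(scale(x, center = TRUE, scale = FALSE))    # sum (x_i - xbar)(x_i - xbar)'
  logdetA <- as.numeric(determinant(A, logarithm = TRUE)$modulus)
  log_lambda_star <- (p * N / 2) * (1 - log(N)) + (N / 2) * logdetA - sum(diag(A)) / 2
  rho  <- 1 - (2 * p^2 + 3 * p - 1) / (6 * (n - 1) * (p + 1))
  f    <- p * (p + 1) / 2
  gam2 <- p * (2 * p^4 + 6 * p^3 + p^2 - 12 * p - 13) / (288 * (p + 1))
  z    <- -2 * rho * log_lambda_star
  ## Approximate P(-2 rho log lambda* <= z) from (1); the p-value is one minus this.
  cdf  <- pchisq(z, f) + gam2 / (rho^2 * (n - 1)^2) * (pchisq(z, f + 4) - pchisq(z, f))
  list(statistic = z, p.value = 1 - cdf)
}

## Type I error at p = 5 (for power, generate data from a non-identity covariance instead).
set.seed(1)
mean(replicate(500, sigma_eq_I_test(matrix(rnorm(500 * 5), 500, 5))$p.value < 0.05))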


Question 5 [10 marks]

Continuing the previous question (and its notation), notice that

-(2/N) log λ*1 = tr(Ŝ) - log|Ŝ| - p, where Ŝ := A/N.

Setting T1 := tr(Ŝ) - log|Ŝ| - p, prove the following theorem.

Theorem 2. Assume that n → ∞, p → ∞, and p/n → y ∈ (0, 1). Then

T1 - p·d1(yN) →d N(μ1, σ1²)

where N := n - 1, yN := p/N, and

d1(y) := 1 + ((1 - y)/y) log(1 - y),
μ1 := -(1/2) log(1 - y),
σ1² := -2 log(1 - y) - 2y.

Hint: Apply the Theorem in Lecture 6 on page 7 with (1/p) T1 := F^{Ŝ}(f), the linear spectral statistic of Ŝ, where f(x) = x - log x - 1. Also see [D].
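One way to see where the centering d1 comes from, assuming the standard Marchenko-Pastur integrals ∫ x dF_y(x) = 1 and ∫ log x dF_y(x) = ((y - 1)/y) log(1 - y) - 1 for y ∈ (0, 1) (check these against the statement of the Lecture 6 theorem), is the calculation

∫ (x - log x - 1) dF_y(x) = 1 - [((y - 1)/y) log(1 - y) - 1] - 1 = 1 + ((1 - y)/y) log(1 - y) = d1(y),

which suggests why p·d1(yN) is the natural centering term for T1 = Σi f(λi(Ŝ)).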

Question 6 [10 marks]

Continuing the previous question and notation, use Theorem 2 to construct an algorithm that tests H0 : Σ = I, and perform a simulation study to understand its performance (type I error and power) for p = 5, 10, 50, 100, 300. Comment on how it performs compared to (1).
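A minimal R sketch of one such algorithm, assuming the constants of Theorem 2 as stated above: compute T1 from the eigenvalues of Ŝ = A/N, centre by p·d1(yN), shift by μ1, scale by σ1, and reject for large values of the standardised statistic. The n = 500 and the number of replications in the size check are illustrative choices:

## One-sided test of H0: Sigma = I based on the CLT for T1 (Theorem 2).
clt_identity_test <- function(x) {
  n <- nrow(x); p <- ncol(x); N <- n - 1
  S  <- cov(x)                                       # S = A / N
  ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
  T1 <- sum(ev - log(ev) - 1)                        # tr(S) - log|S| - p
  yN <- p / N
  d1     <- 1 + (1 - yN) / yN * log(1 - yN)
  mu1    <- -0.5 * log(1 - yN)
  sigma1 <- sqrt(-2 * log(1 - yN) - 2 * yN)
  z <- (T1 - p * d1 - mu1) / sigma1
  list(statistic = z, p.value = pnorm(z, lower.tail = FALSE))
}

## Empirical size at p = 100, n = 500 (for power, generate from a non-identity Sigma).
set.seed(1)
mean(replicate(200, clt_identity_test(matrix(rnorm(500 * 100), 500, 100))$p.value < 0.05))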

Multivariate Time Series

Let ℤ denote the set of integers. A sequence of random vector observations (xt : t = 1, . . . , T) with values in ℝ^p is called a p-dimensional (vector) time series. We denote the sample mean and sample covariance matrix by

x̄ := (1/T) Σ_{t=1}^T xt   and   Γ̂0 := (1/(T - 1)) Σ_{t=1}^T (xt - x̄)(xt - x̄)′.

The lag-τ sample cross-covariance (aka. autocovariance) matrix is defined as

Γ̂τ := (1/(T - 1)) Σ_{t=τ+1}^T (xt - x̄)(x_{t-τ} - x̄)′.

The lag-τ cross-correlation is given by

ρτ = D Γ̂τ D

where D = diag(1/√s11, 1/√s22, . . . , 1/√spp) and the values sij come from Γ̂0 = [sij]. Assuming E[xt] = 0, some authors (e.g., [H], [I]) omit x̄ and consider the symmetrised lag-τ sample cross-covariance given by

(1/(2T)) Σ_{t=τ+1}^T (xt x′_{t-τ} + x_{t-τ} x′t).
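A minimal R sketch of the centred quantities above (the 1/(T - 1) normalisation is an assumption; check it against the convention used in lectures):

## Lag-tau sample cross-covariance and cross-correlation matrices of a T-by-p series x.
cross_cov <- function(x, tau) {
  Tn <- nrow(x)
  xc <- scale(x, center = TRUE, scale = FALSE)               # subtract the sample mean
  crossprod(xc[(tau + 1):Tn, , drop = FALSE],
            xc[1:(Tn - tau), , drop = FALSE]) / (Tn - 1)     # sum_t (x_t - xbar)(x_{t-tau} - xbar)'
}
cross_cor <- function(x, tau) {
  D <- diag(1 / sqrt(diag(cross_cov(x, 0))))                 # D = diag(1/sqrt(s_11), ..., 1/sqrt(s_pp))
  D %*% cross_cov(x, tau) %*% D
}

For example, cross_cor(x, 1), . . . , cross_cor(x, 5) give ρ1, . . . , ρ5 for a series stored as a T-by-p matrix x.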


Question 7 [12 marks]

Simulation is a helpful way to learn about vector time series. Define the matrices

Generate 300 observations from the "vector autoregressive" VAR(1) model

xt = A x_{t-1} + εt   (2)

where εt ~ N2(0, Σ), i.e., they are i.i.d. bivariate normal random variables with mean zero and covariance Σ. Note that when simulating it is customary to omit the first 100 or more observations, and you can start with x0 = (0, 0)′.

Also generate 300 observations from the "vector moving average" VMA(1) model

xt = εt + A ε_{t-1}.   (3)
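A minimal R sketch of the simulation step. The matrices A and Sigma below are hypothetical placeholders only and must be replaced by the matrices defined in the project sheet; a burn-in of 100 observations is discarded as suggested:

set.seed(1)
## Placeholder parameter matrices -- replace with the matrices A and Sigma defined above.
A     <- matrix(c(0.2, -0.6, 0.3, 1.1), 2, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)

n_obs <- 300; burn <- 100; n_tot <- n_obs + burn
L   <- t(chol(Sigma))                                  # so that L %*% t(L) = Sigma
eps <- t(L %*% matrix(rnorm(2 * n_tot), 2, n_tot))     # n_tot x 2 matrix of N_2(0, Sigma) innovations

## VAR(1): x_t = A x_{t-1} + eps_t; row 1 plays the role of x_0 = (0, 0)'.
x_var <- matrix(0, n_tot, 2)
for (tt in 2:n_tot) x_var[tt, ] <- A %*% x_var[tt - 1, ] + eps[tt, ]
x_var <- x_var[-(1:burn), ]                            # keep the last 300 observations

## VMA(1): x_t = eps_t + A eps_{t-1}
x_vma <- matrix(0, n_tot, 2)
for (tt in 2:n_tot) x_vma[tt, ] <- eps[tt, ] + A %*% eps[tt - 1, ]
x_vma <- x_vma[-(1:burn), ]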

(a)[1] Plot the time series xt for the VAR(1) model given by (2).

(b)[1] Obtain the first five lags of sample cross-correlations of xt for the VAR(1) model, i.e., ρ1, . . . , ρ5.

(c)[1] Plot the time series xt for the VMA(1) model given by (3).

(d)[1] Obtain the first two lags of sample cross-correlations of xt for the VMA(1) model.

(e)[5] Implement the test from [F] and reproduce the simulation experiment given in Section 5. This means you need to generate Table 1 from [F].

(f)[3] The file q-fdebt.txt contains the U.S. quarterly federal debts held by (i) foreign and international investors, (ii) federal reserve banks, and (iii) the public. The data are from the Federal Reserve Bank of St. Louis, from 1970 to 2012 for 171 observations, and not seasonally adjusted. The debts are in billions of dollars. Take the log transformation and the first difference for each time series. Let (xt) be the differenced log series. Test H0 : ρ1 = . . . = ρ10 = 0 vs Ha : ρτ ≠ 0 for some τ ∈ {1, . . . , 10} using the test from [F]. Draw the conclusion using the 5% significance level.
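For parts (e) and (f) a multivariate portmanteau statistic is needed. The sketch below implements the commonly used Hosking/Ljung-Box form, which is closely related to, but should be checked carefully against, the exact statistic and degrees of freedom in [F]; it reuses the cross_cov() helper sketched earlier:

## Multivariate portmanteau statistic
## Q(m) = T^2 * sum_{tau=1}^m tr(G_tau' G_0^{-1} G_tau G_0^{-1}) / (T - tau),
## compared with a chi-square distribution on p^2 * m degrees of freedom.
portmanteau_test <- function(x, m) {
  Tn <- nrow(x); p <- ncol(x)
  G0_inv <- solve(cross_cov(x, 0))
  Q <- 0
  for (tau in 1:m) {
    G <- cross_cov(x, tau)
    Q <- Q + sum(diag(t(G) %*% G0_inv %*% G %*% G0_inv)) / (Tn - tau)
  }
  Q <- Tn^2 * Q
  list(statistic = Q, df = p^2 * m,
       p.value = pchisq(Q, p^2 * m, lower.tail = FALSE))
}

For (f), apply it with m = 10 to the differenced log series read from q-fdebt.txt.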

Question 8 [13 marks]

More generally, a p-dimensional time series xt follows a VAR model of order ℓ, VAR(ℓ), if

xt = a0 + Σ_{i=1}^ℓ Ai x_{t-i} + εt   (4)

where a0 is a p-dimensional constant vector and the Ai are p × p (non-zero) matrices for i > 0, and the εt ~ Np(0, Σ) are i.i.d. for all t with p × p covariance matrix Σ.

One day you might want to "build a model" using the VAR(ℓ) framework. One of the first things you need to do is to determine the optimal order ℓ. Tiao and Box (1981) suggest using sequential likelihood ratio tests; see Section 4 in [G]. Their approach is to compare a VAR(ℓ) model with a VAR(ℓ - 1) model and amounts to considering the hypothesis testing problem

H0 : Aℓ = 0 vs. H1 : Aℓ ≠ 0.


We can do this by determining model parameters using a least-squares approach. We rewrite (4) as

xt = Φ Xt + εt,

where Xt := (1, x′_{t-1}, . . . , x′_{t-ℓ})′ is a (pℓ + 1)-dimensional vector and Φ := [a0, A1, . . . , Aℓ] is a p × (1 + ℓp) = p × (pℓ + 1) matrix. With observations at times t = ℓ + 1, . . . , T, we write the data as

Y = X Φ′ + E   (5)

where Y is a (T - ℓ) × p matrix with the ith row being x′_{ℓ+i}, X is a (T - ℓ) × (pℓ + 1) design matrix with the ith row being X′_{ℓ+i}, and E is a (T - ℓ) × p matrix with the ith row being ε′_{ℓ+i}.

The matrix Φ contains the coefficient parameters of the VAR(ℓ) model; let Σℓ be the corresponding innovation covariance matrix. Under a normality assumption, the likelihood ratio for the testing problem is a function of the determinant ratio |Σ̂ℓ| / |Σ̂_{ℓ-1}|, and the likelihood ratio test of H0 is equivalent to rejecting H0 for small values of this ratio. A commonly used statistic is Bartlett's approximation given by

M(ℓ) = -(T - ℓ - 1.5 - pℓ) log( |Σ̂ℓ| / |Σ̂_{ℓ-1}| ),

which follows asymptotically (as T → ∞ with p fixed) a χ² distribution with p² degrees of freedom.

The following methodology is suggested for selecting the order ℓ (a sketch in R implementing these steps follows the list):

1. Select a positive integer P, which is the maximum VAR order that we would like to consider.

2. Set up the regression framework (5) for the VAR(P) model. That is, there are T - P observations (i.e., rows) in the X matrix.

3. For ℓ = 0, . . . , P compute the least-squares estimate of the AR coefficient matrix Φ. For ℓ = 0, we have Φ = a0. Then compute the ML estimate of Σℓ given by

Σ̂ℓ := (1/(T - P)) R′ℓ Rℓ,

where Rℓ = Y - X Φ̂′ is the residual matrix of the fitted VAR(ℓ) model.

4. For ℓ = 1, . . . , P, compute the test statistic M(ℓ) and its p-value, which is based on the asymptotic χ² distribution with p² degrees of freedom.

5. Examine the test statistics sequentially starting with ℓ = 1. If all the p-values of the M(ℓ) test statistics are greater than the specified type I error for ℓ > m, then a VAR(m) model is specified. This is so because the test rejects the null hypothesis Am = 0, but fails to reject Aℓ = 0 for ℓ > m.
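A minimal R sketch of steps 1-5, where x is a T-by-p data matrix and P the maximum order considered (the use of qr.solve for the least-squares fit and the handling of the ℓ = 0 case are implementation choices, not part of the project sheet):

## Sequential likelihood-ratio (Bartlett M) selection of the VAR order.
var_order_select <- function(x, P, alpha = 0.05) {
  Tn <- nrow(x); p <- ncol(x); n_eff <- Tn - P
  y <- x[(P + 1):Tn, , drop = FALSE]                   # responses, common to all fits
  logdet <- numeric(P + 1)                             # log |Sigma_hat_l| for l = 0, ..., P
  for (l in 0:P) {
    X <- matrix(1, n_eff, 1)                           # constant term a_0
    if (l > 0)
      for (i in 1:l) X <- cbind(X, x[(P + 1 - i):(Tn - i), , drop = FALSE])
    B <- qr.solve(X, y)                                # least-squares coefficient estimates
    R <- y - X %*% B                                   # residual matrix R_l
    logdet[l + 1] <- as.numeric(determinant(crossprod(R) / n_eff, logarithm = TRUE)$modulus)
  }
  M  <- -(Tn - (1:P) - 1.5 - p * (1:P)) * (logdet[2:(P + 1)] - logdet[1:P])
  pv <- pchisq(M, df = p^2, lower.tail = FALSE)
  m  <- if (any(pv < alpha)) max(which(pv < alpha)) else 0   # largest lag at which H0 is rejected
  list(M = M, p.value = pv, order = m)
}

For part (b) below, this can be applied to the bivariate series built from m-cpib3m.txt with, say, P = 10 (an illustrative choice).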


Consider a bivariate time series xt whose first component is the change in the monthly U.S. Treasury bill rate with maturity 3 months and whose second component, x^CPI_t, is the inflation rate, in percentage, of the U.S. monthly consumer price index (CPI). These data are from the Federal Reserve Bank of St. Louis. The CPI rate is 100 times the difference of the log CPI index. The sample period is from January 1947 to December 2012. The data are in the file m-cpib3m.txt.

(a)[1] Plot the time series xt.

(b)[6] Select a VAR order for xt using the methodology described above.

(c)[6] Drawing on your results obtained in this project and the theory discussed in class, explain and demonstrate (e.g., via a simulation study) what might happen with this methodology if the dimensionality p of the time series becomes large.

Question 9 [20 marks]

The recent paper [H] is concerned with extensions of the classical Marchenko-Pastur law to the time series case. Reproduce their simulation study, which is found in Section 5 and Figure 1.
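As a hedged starting point only (the linear-filter coefficients, the dimensions p and T, the lag τ and the plotting layout must all be taken from Section 5 and Figure 1 of [H]; the MA(1)-type filter with coefficient 0.5 below is a placeholder), one way to set up such an experiment is:

## Eigenvalue histogram of the symmetrised lag-tau sample autocovariance of a linear time series.
set.seed(1)
p <- 200; Tn <- 400; tau <- 1
## Placeholder linear process: each coordinate is an MA(1) with coefficient 0.5.
eps <- matrix(rnorm(p * (Tn + 1)), p, Tn + 1)
x   <- eps[, 2:(Tn + 1)] + 0.5 * eps[, 1:Tn]           # p x Tn data matrix
## Symmetrised lag-tau sample autocovariance (uncentred).
C_tau <- (x[, (tau + 1):Tn] %*% t(x[, 1:(Tn - tau)]) +
          x[, 1:(Tn - tau)] %*% t(x[, (tau + 1):Tn])) / (2 * Tn)
hist(eigen(C_tau, symmetric = TRUE, only.values = TRUE)$values,
     breaks = 50, main = "Eigenvalues of the symmetrised lag-1 autocovariance")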

References

[A] Sugiura, Nagao (1968). Unbiasedness of some test criteria for the equality of one or two covariance matrices. Annals of Mathematical Statistics, Vol. 39, No. 5, 1686–1692.

[B] Anderson (2003). An Introduction to Multivariate Statistical Analysis. Wiley.

[C] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.

[D] Bai, Jiang, Yao, Zheng (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of Statistics, Vol. 37, No. 6B, 3822–3840.

[E] Zheng, Bai, Yao (2017). CLT for eigenvalue statistics of large-dimensional general Fisher matrices with applications. Bernoulli, 23(2), 1130–1178.

[F] Li, McLeod (1981). Distribution of the residual autocorrelations in multivariate ARMA time series models. J. R. Stat. Soc. B, 43, No. 2, 231–239.

[G] Tiao, Box (1981). Modelling multiple time series with applications. Journal of the American Statistical Association, 76, 802–816.

[H] Liu, Aue, Paul (2015). On the Marchenko-Pastur law for linear time series. Annals of Statistics, Vol. 43, No. 2, 675–712.

[I] Liu, Aue, Paul (2017). Spectral analysis of sample autocovariance matrices of a class of linear time series in moderately high dimensions. Bernoulli, 23(4A), 2181–2209.

[J] http://www4.stat.ncsu.edu/~davidian/st810a/simulation_handout.pdf

[K] https://stats.stackexchange.com/a/40874

Dale Roberts - Australian National University

Last updated: September 21, 2018

