1. (Wine Data Set) These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents (Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline) found in each of the three types of wine. The sample size is 178. The dataset is available on the course site. The main interest of this dataset is the multiclass classification of the three types of wine. Let $\hat{y}$ denote the predicted class of an observation.

(a) Use the nominal logistic regression of Section 2.3 to carry out the multiclass classification. The R function is multinom. In addition, report the confusion table for $y$ and $\hat{y}$, and use the macro-averaged recall, precision, and F-measure to evaluate the classification performance.
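The macro-averaged metrics requested here can be computed directly from a confusion table. The following is a minimal sketch, not the course's reference solution. It assumes rows index the true class and columns the predicted class, and it takes the macro F-measure to be the average of the per-class F-measures (some texts instead use the harmonic mean of macro precision and macro recall). The function name `macro_metrics` is hypothetical.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall and F-measure.

    confusion[i][j] = number of observations whose true class is i
    and whose predicted class is j (assumed layout).
    """
    k = len(confusion)
    precisions, recalls, f_scores = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted_c = sum(confusion[i][c] for i in range(k))  # column total
        actual_c = sum(confusion[c])                          # row total
        p = tp / predicted_c if predicted_c else 0.0
        r = tp / actual_c if actual_c else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f_scores.append(f)
    # Macro averaging: every class gets equal weight, whatever its size.
    return {
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f_measure": sum(f_scores) / k,
    }
```

The same function applies unchanged to the confusion tables produced in parts (b) and (c), which makes the three classifiers directly comparable.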

(b) Use linear discriminant analysis and quadratic discriminant analysis to obtain $\hat{y}$. In addition, report the confusion table for $y$ and $\hat{y}$, and use the macro-averaged recall, precision, and F-measure to evaluate the classification performance.

(c) Use the support vector machine method to obtain $\hat{y}$. In addition, report the confusion table for $y$ and $\hat{y}$, and use the macro-averaged recall, precision, and F-measure to evaluate the classification performance.

(d) Summarize your findings in (a)-(c).

2. (Simulation studies) Consider the following linear model:

$$y = X_1\beta_1 + X_2\beta_2 + X_3\beta_3 + X_4\beta_4 - 4\sqrt{\rho}\,X_5\beta_5 + \varepsilon, \tag{1}$$

where $X = (X_1, \dots, X_p)$ is a $p$-dimensional vector of covariates and each $X_k$ is generated from $N(0, 1)$. The correlation between any two covariates other than $X_5$ is $\rho$, while $X_5$ has correlation $\sqrt{\rho}$ with each of the other $p - 1$ variables. Suppose that the sample size is $n = 200$.

(a) Show that $X_5$ is marginally independent of $y$.
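One possible line of argument is sketched below. It rests on assumptions the problem does not state explicitly: the error term $\varepsilon$ has mean zero and is independent of $X$, and $(X, \varepsilon)$ is jointly normal, so that zero covariance implies independence. The coefficient of $X_5$ in (1) is taken to be $-4\sqrt{\rho}\,\beta_5$, and $\operatorname{Cov}(X_5, X_k) = \sqrt{\rho}$ for $k \neq 5$ because all variances equal 1.

```latex
\begin{aligned}
\operatorname{Cov}(X_5, y)
  &= \sum_{k=1}^{4} \beta_k \operatorname{Cov}(X_5, X_k)
     - 4\sqrt{\rho}\,\beta_5 \operatorname{Var}(X_5)
     + \operatorname{Cov}(X_5, \varepsilon) \\
  &= \sqrt{\rho}\,(\beta_1 + \beta_2 + \beta_3 + \beta_4)
     - 4\sqrt{\rho}\,\beta_5 .
\end{aligned}
```

This covariance vanishes whenever $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 4\beta_5$, for instance when every $\beta_i = 1$. Under the joint normality assumption, $(X_5, y)$ is bivariate normal, so zero covariance gives marginal independence.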

(b) Now consider $p = 1500$ and generate artificial data from model (1) for 1000 repetitions. Specifically, let $\beta_i = 1$ for every $i = 1, \dots, 5$ and set $\rho = 0.7$. Then use the SIS and iterated SIS methods to select variables and estimate the parameters associated with the selected covariates. Finally, summarize the estimators in the following table:

Table 1: Simulation result for (b)

                $\|\Delta\beta\|_1$    $\|\Delta\beta\|_2$    #S    #FN
SIS
Iterated SIS
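The data-generating step and the screening step of plain SIS can be sketched as follows. This is an illustration under stated assumptions, not the required solution. The error term is assumed to be $N(0, 1)$ and independent of $X$, which the problem does not specify. The correlation structure is obtained from a one-factor construction chosen here for convenience. The function names are hypothetical, and iterated SIS and the post-screening parameter estimation are not shown.

```python
import math
import random


def generate_data(n, p, rho, beta, rng):
    """Draw n observations from model (1); requires p >= 5.

    One-factor construction: X5 = Z and, for k != 5,
    X_k = sqrt(rho) * Z + sqrt(1 - rho) * W_k with Z, W_k iid N(0, 1).
    This gives Corr(X_j, X_k) = rho and Corr(X_k, X5) = sqrt(rho).
    Assumption: the error term is N(0, 1), independent of X.
    """
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        row = [math.sqrt(rho) * z + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(p)]
        row[4] = z  # X5 (index 4) is the common factor itself
        signal = sum(beta[k] * row[k] for k in range(4))
        signal -= 4.0 * math.sqrt(rho) * beta[4] * row[4]
        X.append(row)
        y.append(signal + rng.gauss(0.0, 1.0))
    return X, y


def marginal_correlations(X, y):
    """Sample correlation between each column of X and y."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    y_ss = sum((v - y_mean) ** 2 for v in y)
    corrs = []
    for j in range(p):
        col = [row[j] for row in X]
        m = sum(col) / n
        ss = sum((v - m) ** 2 for v in col)
        cross = sum((col[i] - m) * (y[i] - y_mean) for i in range(n))
        corrs.append(cross / math.sqrt(ss * y_ss))
    return corrs


def sis_screen(X, y, d):
    """Indices of the d covariates with the largest |marginal correlation|."""
    corrs = marginal_correlations(X, y)
    ranked = sorted(range(len(corrs)), key=lambda j: abs(corrs[j]), reverse=True)
    return sorted(ranked[:d])


# Small illustrative run (p is reduced from 1500 to keep it fast).
rng = random.Random(0)
X, y = generate_data(n=200, p=50, rho=0.7, beta=[1.0] * 5, rng=rng)
selected = sis_screen(X, y, d=int(200 / math.log(200)))
```

The screening size `d = int(n / log(n))` is a common choice in the SIS literature and is used here only as an example. Because $X_5$ is marginally uncorrelated with $y$ in this design, marginal screening has no systematic reason to rank it highly, which is the situation iterated SIS is meant to address.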

(c) Here we consider a scenario different from (b). Let $p = 40$ and $X \sim N(0, \Sigma_X)$, where entry $(j, k)$ of $\Sigma_X$ is $0.5^{|j-k|}$ for $j, k = 1, \dots, p$. Generate artificial data from model (1) for 1000 repetitions with $\beta_i = 1$ for every $i = 1, \dots, 5$. Then use the lasso, adaptive lasso, and elastic net (with $\alpha = 0.5$) methods to estimate the parameters. Finally, summarize the numerical results in the following table.

Table 2: Simulation result for (c)

                           $\|\Delta\beta\|_1$    $\|\Delta\beta\|_2$    #S    #FN
lasso
adaptive lasso
Elastic net (α = 0.5)
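The covariance structure in (c), with entry $(j, k)$ equal to $0.5^{|j-k|}$, is that of a stationary AR(1) sequence, so the covariates can be drawn with a simple recursion rather than a matrix factorization. The sketch below illustrates only this data-generating step. The function names are hypothetical, and fitting the lasso, adaptive lasso, and elastic net is not shown.

```python
import math
import random


def ar1_row(p, r, rng):
    """One draw of X ~ N(0, Sigma) with Sigma[j][k] = r ** |j - k|.

    Recursion: X_1 = Z_1 and X_j = r * X_{j-1} + sqrt(1 - r^2) * Z_j,
    with Z_j iid N(0, 1). Each X_j has variance 1 and
    Cov(X_j, X_k) = r ** |j - k|.
    """
    scale = math.sqrt(1.0 - r * r)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, p):
        x.append(r * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def ar1_design(n, p, r, rng):
    """n independent rows, each drawn with ar1_row."""
    return [ar1_row(p, r, rng) for _ in range(n)]


def sample_corr(X, j, k):
    """Sample correlation between columns j and k of X."""
    n = len(X)
    a = [row[j] for row in X]
    b = [row[k] for row in X]
    ma, mb = sum(a) / n, sum(b) / n
    cross = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ssa = sum((u - ma) ** 2 for u in a)
    ssb = sum((v - mb) ** 2 for v in b)
    return cross / math.sqrt(ssa * ssb)


# Design matrix for one replication of scenario (c).
X_c = ar1_design(n=200, p=40, r=0.5, rng=random.Random(1))
```

Checking a few sample correlations with `sample_corr` against $0.5^{|j-k|}$ on a large draw is a quick sanity test before running the 1000 repetitions.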

(d) Summarize your findings for parts (b) and (c), respectively.

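Note: Let $\hat{\beta}$ be the estimator. Then $\Delta\beta$ is defined as $\Delta\beta = \hat{\beta} - \beta$, with $i$th component $\hat{\beta}_i - \beta_i$. Therefore, $\|\Delta\beta\|_1$ and $\|\Delta\beta\|_2$ are defined as the usual $\ell_1$ and $\ell_2$ norms of this difference:

$$\|\Delta\beta\|_1 = \sum_{i=1}^{p} |\hat{\beta}_i - \beta_i|, \qquad \|\Delta\beta\|_2 = \Big( \sum_{i=1}^{p} (\hat{\beta}_i - \beta_i)^2 \Big)^{1/2}.$$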

Hint: Regarding simulation studies with 1000 repetitions.

In Question 2, you are asked to use simulation studies with 1000 repetitions to estimate the parameters. Specifically, from the $k$th independently generated artificial dataset you can obtain an estimator, denoted by $\hat{\beta}^{(k)}$. As a result, with 1000 repetitions.
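The table entries can be computed per repetition and then summarized across repetitions. The sketch below makes several assumptions that the text above does not confirm: the two norms are the usual $\ell_1$ and $\ell_2$ norms of the difference between estimate and truth, #S is the number of selected (nonzero) estimated coefficients, #FN is the number of truly nonzero coefficients that were not selected, and the summary over repetitions is a simple average. The function names are hypothetical.

```python
import math


def replication_metrics(beta_hat, beta, tol=0.0):
    """Error norms and selection counts for one repetition.

    Assumptions: #S counts coefficients with |beta_hat_i| > tol, and
    #FN counts truly nonzero coefficients that were not selected.
    """
    diff = [bh - b for bh, b in zip(beta_hat, beta)]
    selected = [abs(bh) > tol for bh in beta_hat]
    return {
        "l1": sum(abs(d) for d in diff),
        "l2": math.sqrt(sum(d * d for d in diff)),
        "S": sum(selected),
        "FN": sum(1 for s, b in zip(selected, beta) if b != 0 and not s),
    }


def average_metrics(per_repetition):
    """Average each metric over a list of per-repetition dictionaries."""
    m = len(per_repetition)
    return {key: sum(r[key] for r in per_repetition) / m
            for key in per_repetition[0]}
```

In use, `replication_metrics` would be called once for each estimate from the 1000 independently generated datasets, and `average_metrics` would produce one row of Table 1 or Table 2.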
