SIT114 2020.T1: Task 6.4HD

Contents

1 To Do

2 Hint

3 Further Reading

4 Artefacts

5 Intended Learning Outcomes

1 To Do

Create a single RMarkdown report in which you carry out the following steps.

1. Just as in task 6.3D, load the Wine Quality dataset.

    wines <- read.csv("winequality-all.csv",
        comment.char="#", stringsAsFactors=FALSE)

2. Then, as before, add a new 0/1 column named quality (quality equals 1 if
and only if a wine is rated 7 or higher).

3. Perform a random 60-40% train-test split: create the matrices
X_train and X_test and the corresponding label vectors Y_train and
Y_test that provide the information on the wines' quality.
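Steps 2 and 3 might look roughly as follows. This is only a sketch on a small synthetic data frame: the column name response for the original rating, the two feature names and the seed are assumptions made for illustration, so adjust them to what your copy of the dataset actually contains.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Synthetic stand-in for the wines data frame (assumed layout:
# numeric feature columns plus a rating column called `response`).
toy <- data.frame(
    alcohol  = runif(50, 8, 14),
    pH       = runif(50, 2.8, 3.8),
    response = sample(3:9, 50, replace=TRUE)
)

# Step 2: 1 if and only if the wine is rated 7 or higher, otherwise 0.
toy$quality <- as.integer(toy$response >= 7)

# Step 3: random 60-40% split of the row indices.
n <- nrow(toy)
idx_train <- sample(n, round(0.6*n))

features <- c("alcohol", "pH")  # in the real task: all 11 features
X_train <- as.matrix(toy[idx_train, features])
X_test  <- as.matrix(toy[-idx_train, features])
Y_train <- toy$quality[idx_train]
Y_test  <- toy$quality[-idx_train]
```

Sampling row indices once and reusing them for both the matrices and the label vectors keeps the features and labels aligned.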

4. Your task is to determine the best (see below) parameter setting for the
K-nearest neighbour classification of the quality variable based on the 11
physicochemical features. Perform a so-called grid (exhaustive) search
over all possible combinations of the following parameters:

a. K: 1, 3, 5, 7 or 9

b. preprocessing: none (raw input data), standardised variables or robustly standardised variables

c. metric: L2 (Euclidean) or L1 (Manhattan)

In other words, there are 5*3*2=30 combinations of parameters in total,
and hence 30 different scenarios to consider.

By robust standardisation we mean: from each column, subtract
its median and then divide by its median absolute deviation
(MAD, i.e., median(abs(x-median(x)))). This data preprocessing
scheme is less sensitive to outliers than classic standardisation.
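A minimal sketch of this preprocessing step is given below. The helper names robust_params and robust_apply are hypothetical, and estimating the medians and MADs on the training part only and then reusing them elsewhere is an assumption about how you may want to organise the computations, not a requirement stated above.

```r
# Column-wise medians and MADs, estimated on a given (e.g., training) matrix.
robust_params <- function(X) {
    med <- apply(X, 2, median)
    # MAD exactly as defined in the text: median(abs(x - median(x))).
    # Note that stats::mad() multiplies by 1.4826 unless constant=1 is passed.
    mad_ <- apply(X, 2, function(x) median(abs(x - median(x))))
    list(med=med, mad=mad_)
}

# Apply previously estimated parameters to any matrix with the same columns.
robust_apply <- function(X, params) {
    X <- sweep(X, 2, params$med, "-")
    sweep(X, 2, params$mad, "/")
}

# Tiny example with one obvious outlier in column a.
X <- cbind(a=c(1, 2, 3, 4, 100), b=c(10, 20, 30, 40, 50))
p <- robust_params(X)
Z <- robust_apply(X, p)
print(Z)
```

The outlier in column a barely affects the median and the MAD, which is why the remaining values end up on a scale comparable to that of column b. Columns with a MAD of zero would need special treatment, which this sketch does not handle.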

Note that the K-nearest neighbour method based on the L1 metric is
not implemented in the FNN package. You need to implement it
on your own (see Chapter 3 of LMLCR).
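One possible shape of such an implementation is sketched below. The function knn_l1 is a hypothetical helper written for illustration; it is not part of FNN and is not necessarily the approach taken in LMLCR. It assumes 0/1 labels and numeric matrices with matching columns.

```r
# Hypothetical helper: K-NN classification of 0/1 labels
# under the L1 (Manhattan) metric.
knn_l1 <- function(X_train, Y_train, X_test, K) {
    stopifnot(is.matrix(X_train), is.matrix(X_test),
        ncol(X_train) == ncol(X_test), K >= 1, K <= nrow(X_train))
    Y_pred <- integer(nrow(X_test))
    for (i in seq_len(nrow(X_test))) {
        # Manhattan distances between the i-th test point and all training
        # points; t(X_train) has one training point per column.
        d <- colSums(abs(t(X_train) - X_test[i, ]))
        nn <- order(d)[1:K]  # indices of the K closest training points
        # Majority vote; K is odd in this task, so 0 and 1 cannot tie.
        Y_pred[i] <- as.integer(mean(Y_train[nn]) > 0.5)
    }
    Y_pred
}

# Two well-separated groups of points.
X_tr <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
Y_tr <- c(0, 0, 0, 1, 1, 1)
X_te <- rbind(c(0.2, 0.2), c(5.5, 5.5))
pred <- knn_l1(X_tr, Y_tr, X_te, K=3)
print(pred)
```

The explicit loop over test points is easy to read but slow; for the full dataset and 30 scenarios you may want to vectorise the distance computation.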

By the best classifier we mean the one that maximises the F-measure
obtained by so-called 5-fold cross-validation.

In Chapter 3 we discussed that it would not be fair to use the test set for
choosing the optimal parameters (we would be overfitting to the test
set). We know that one possible way to ensure a transparent evaluation
of a classifier is to perform a train-validate-test split and use the validation
set for parameter tuning.

Here we will use a different technique, one that estimates the methods’
“true” predictive performance more accurately, yet at the cost of significantly
increased run-time. Namely, in 5-fold cross-validation, we split the original
train set randomly into 5 disjoint parts: A, B, C, D, E (each with more or less
the same number of observations). We use each combination of 4 parts
as a training set and the remaining part as the validation set, on which we
compute the F-measure:

train set     validation set   F-measure
B, C, D, E    A                F_A
A, C, D, E    B                F_B
A, B, D, E    C                F_C
A, B, C, E    D                F_D
A, B, C, D    E                F_E

Finally, we report the average F-measure, (F_A + F_B + F_C + F_D + F_E)/5.
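The procedure above can be sketched as follows. All function names here (f_measure, cv_f_measure, nn1) are hypothetical, and the plain 1-nearest-neighbour classifier only serves as a placeholder so that the example runs on its own; in your report you would plug in the classifier for the scenario under consideration.

```r
# F-measure for 0/1 labels (class 1 is the positive one):
# 2*precision*recall/(precision+recall) = 2*TP/(2*TP+FP+FN).
f_measure <- function(Y_true, Y_pred) {
    TP <- sum(Y_true == 1 & Y_pred == 1)
    FP <- sum(Y_true == 0 & Y_pred == 1)
    FN <- sum(Y_true == 1 & Y_pred == 0)
    if (2*TP + FP + FN == 0) return(NA_real_)  # undefined: no positives at all
    2*TP / (2*TP + FP + FN)
}

# classify(X_train, Y_train, X_val) must return predicted labels for X_val.
cv_f_measure <- function(X, Y, classify, folds=5) {
    n <- nrow(X)
    # Random assignment of rows to folds; fold sizes differ by at most 1.
    fold_id <- sample(rep_len(1:folds, n))
    Fs <- numeric(folds)
    for (f in 1:folds) {
        val <- which(fold_id == f)  # current validation part
        Y_pred <- classify(X[-val, , drop=FALSE], Y[-val], X[val, , drop=FALSE])
        Fs[f] <- f_measure(Y[val], Y_pred)
    }
    mean(Fs)  # (F_A + F_B + F_C + F_D + F_E)/5 for folds=5
}

# Placeholder classifier: 1-nearest neighbour with the Euclidean metric.
nn1 <- function(X_train, Y_train, X_val) {
    apply(X_val, 1, function(x) {
        Y_train[which.min(colSums((t(X_train) - x)^2))]
    })
}

set.seed(1)  # arbitrary seed
X <- rbind(matrix(rnorm(100, mean=0), ncol=2),
           matrix(rnorm(100, mean=10), ncol=2))
Y <- rep(c(0, 1), each=50)
cv_f <- cv_f_measure(X, Y, nn1)
print(cv_f)
```

If a scenario involves standardisation, one option is to estimate the centring and scaling parameters inside the loop, on the 4 training parts only, so that the validation part does not influence the preprocessing.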

5. Report the best scenario (out of 30) together with the corresponding
classifier’s accuracy, precision, recall and F-measure on the test set.
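The four measures in step 5 can all be derived from the confusion matrix counts. A small sketch, assuming 0/1 labels with 1 as the positive class; classification_metrics is a hypothetical helper name and the label vectors below are made up for illustration:

```r
# Hypothetical helper: basic binary classification metrics (1 = positive).
classification_metrics <- function(Y_true, Y_pred) {
    TP <- sum(Y_true == 1 & Y_pred == 1)  # true positives
    TN <- sum(Y_true == 0 & Y_pred == 0)  # true negatives
    FP <- sum(Y_true == 0 & Y_pred == 1)  # false positives
    FN <- sum(Y_true == 1 & Y_pred == 0)  # false negatives
    c(
        accuracy  = (TP + TN) / (TP + TN + FP + FN),
        precision = TP / (TP + FP),   # NaN if nothing is predicted positive
        recall    = TP / (TP + FN),   # NaN if there are no actual positives
        F         = 2*TP / (2*TP + FP + FN)
    )
}

# Made-up labels: 2 true positives, 1 false negative, 1 false positive.
Y_true <- c(1, 1, 1, 0, 0, 0, 0, 0)
Y_pred <- c(1, 1, 0, 1, 0, 0, 0, 0)
m <- classification_metrics(Y_true, Y_pred)
print(round(m, 3))
```

Because high-quality wines are likely the minority class, accuracy alone can look good even for a classifier that rarely predicts 1, which is one reason to report precision, recall and the F-measure alongside it.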

Make sure that the report has a readable structure. Divide the document into
sections. Before each code chunk, explain what purpose it serves.

Side note: If you want a real challenge (this is definitely not obligatory),
you can add another level of complexity: select the best
combination of the input variables, e.g., amongst all the possible
pairs or triples of columns in the dataset.
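For the optional challenge, base R's combn can enumerate the column subsets. The feature names below are an illustrative subset chosen for the sketch, not necessarily the exact column names in the dataset.

```r
# Illustrative subset of feature names (assumed, for demonstration only).
features <- c("alcohol", "pH", "density", "chlorides")

# All pairs and all triples of the listed features, each as a character vector.
pairs   <- combn(features, 2, simplify=FALSE)
triples <- combn(features, 3, simplify=FALSE)
print(pairs[[1]])
print(length(pairs))
print(length(triples))

# With all 11 features, the number of subsets to try grows quickly:
n_pairs   <- choose(11, 2)
n_triples <- choose(11, 3)
print(c(n_pairs, n_triples))
```

Each subset would be combined with every one of the 30 parameter scenarios, so the run-time grows accordingly.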

2 Hint

A grid search can be implemented based on a triply-nested for loop:

    Ks <- c(1, 3, 5, 7, 9)
    Ps <- c("none", "standardised", "robstandardised")
    Ms <- c("l2", "l1")

    for (K in Ks) {
        for (preprocessing in Ps) {
            for (metric in Ms) {
                if (preprocessing == "standardised") {
                    # ...
                } else if (preprocessing == "robstandardised") {
                    # ...
                } else {
                    # ...
                }
                if (metric == "l2") {
                    # ...
                } else {
                    # ...
                }
            }
        }
    }

Alternatively, you can go through every row of the following data frame and
process each scenario defined in this way:

    expand.grid(Ks, Ps, Ms)
    ##    Var1            Var2 Var3
    ## 1     1            none   l2
    ## 2     3            none   l2
    ## 3     5            none   l2
    ## 4     7            none   l2
    ## 5     9            none   l2
    ## 6     1    standardised   l2
    ## 7     3    standardised   l2
    ## 8     5    standardised   l2
    ## 9     7    standardised   l2
    ## 10    9    standardised   l2
    ## 11    1 robstandardised   l2
    ## 12    3 robstandardised   l2
    ## 13    5 robstandardised   l2
    ## 14    7 robstandardised   l2
    ## 15    9 robstandardised   l2
    ## 16    1            none   l1
    ## 17    3            none   l1
    ## 18    5            none   l1
    ## 19    7            none   l1
    ## 20    9            none   l1
    ## 21    1    standardised   l1
    ## 22    3    standardised   l1
    ## 23    5    standardised   l1
    ## 24    7    standardised   l1
    ## 25    9    standardised   l1
    ## 26    1 robstandardised   l1
    ## 27    3 robstandardised   l1
    ## 28    5 robstandardised   l1
    ## 29    7 robstandardised   l1
    ## 30    9 robstandardised   l1
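Going through the rows of such a data frame could look as follows. This is a sketch only: the column names K, preprocessing, metric and F are chosen here for readability, and the line that fills in F is a placeholder for your own cross-validation code.

```r
Ks <- c(1, 3, 5, 7, 9)
Ps <- c("none", "standardised", "robstandardised")
Ms <- c("l2", "l1")

# Named columns; stringsAsFactors=FALSE keeps the labels as plain strings.
scenarios <- expand.grid(K=Ks, preprocessing=Ps, metric=Ms,
    stringsAsFactors=FALSE)
scenarios$F <- NA_real_  # to be filled with the cross-validated F-measure

for (i in seq_len(nrow(scenarios))) {
    K             <- scenarios$K[i]
    preprocessing <- scenarios$preprocessing[i]
    metric        <- scenarios$metric[i]
    # Placeholder: replace NA_real_ with the 5-fold cross-validated
    # F-measure for this (K, preprocessing, metric) combination.
    scenarios$F[i] <- NA_real_
}

# Once F is filled in, the best scenario is the row with the largest value:
# scenarios[which.max(scenarios$F), ]
print(head(scenarios, 3))
```

Storing the results next to the parameters makes it straightforward to include the full table of 30 scenarios in the report.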

3 Further Reading

See Section 5.1 of the book by James G et al. 2017. An introduction to statistical
learning with applications in R. Springer-Verlag.
http://faculty.marshall.usc.edu/gareth-james/ISL

4 Artefacts

Submit two files via OnTrack:

1. the Rmd file (RMarkdown report),

2. the resulting PDF file that is generated by clicking Knit Document to PDF
in RStudio; if you are unable to generate the PDF file directly, convert the
report to HTML or Word, and manually export the resulting file to PDF.

5 Intended Learning Outcomes

ULO                                     Related
ULO1 (Methods)                          YES
ULO2 (Problems)                         YES
ULO3 (Implementation and Evaluation)    YES
ULO4 (Communication)                    YES
ULO5 (Impact)                           YES
