
COMP90051 Statistical Machine Learning

Project 2 Description

Due date: 12:00pm Friday, 12th October 2018. Weight: 25%¹

Multi-armed bandits (MABs) are a powerful tool in statistical machine learning: they bridge decision making, control, optimisation and learning; they address practical problems of sequential decision making while being backed by elegant theoretical guarantees; they are often easily implemented, efficient to run, and are used in many industrial applications; and, more subtly, they are neither fully supervised nor unsupervised, being partially labelled by indirect rewards. Exploitation behaviour in MABs optimises short-term rewards by acting greedily based on current knowledge; but this must be balanced against imprecision in knowledge by exploration; and when effectively balanced, MABs optimise for long-term reward. In this project, you will work individually to implement several MAB learners. Some will be directly from class, while others will be more advanced and come out of papers that you will have to read and understand yourself.

By the end of the project you should have developed

ILO1. A deeper understanding of the MAB setting and common MAB approaches;

ILO2. An appreciation of how MABs are applied;

ILO3. Demonstrable ability to implement ML approaches in code; and

ILO4. An ability to pick up recent machine learning publications in the literature, and understand their focus, contributions, and algorithms enough to be able to implement and apply them.

Overview

Through the 2000s Yahoo! Research led the way in applying MABs to problems in online advertising, information retrieval, and media recommendation. One of their many applications was to Yahoo! News, in deciding what news items to recommend to users based on article content, user profile, and the historical engagement of the user with articles. Given that decision making in this setting is sequential (what do we show next?) and feedback is only available for articles shown, Yahoo! researchers observed a perfect formulation for MABs like those (ε-Greedy and UCB) learned about in class. Going further, however, they realised that incorporating some element of user-article state requires contextual bandits: articles are arms; the context per round incorporates information about both user and article (arm); and {0, 1}-valued rewards represent clicks. The per-round cumulative reward therefore represents click-through rate (CTR), which is exactly what services like Yahoo! News want to maximise to drive user engagement and advertising revenue. You will be implementing these approaches, noting that you need not complete the entire project in order to pass.

Required Resources The LMS page for project 2 comprises

proj2.ipynb A Jupyter notebook: a skeleton in Python that you must (1) rename as username.ipynb, (2) flesh out with your project solutions, (3) run on your local machine, and then (4) submit in LMS after graphs/results are produced within.

dataset.txt A text-file dataset (see below for details).

You will implement code in Python Jupyter notebooks, which, after running on your machine, you will submit via LMS. Further detailed rules about what is expected of your code are available towards the end of this spec. We appreciate that, while some have entered COMP90051 with little or no prior Python experience, 8/9 workshops so far have exercised and built up basic Python and Jupyter knowledge.

¹ Forming a combined hurdle with project 1.


Part 1: Implement ε-Greedy and UCB [10 marks total]

Begin by implementing Python classes EpsGreedy and UCB for the ε-Greedy and UCB learners, just as covered in class. You should use inheritance: make your classes sub-classes of the abstract MAB base class. These classes should have the following components:

• All necessary properties for storing MAB state;

• __init__ constructor methods for initialising MAB state, with respective signatures:

  – def __init__(self, narms, epsilon, Q0) for positive integer narms, floating-point probability epsilon, and real-valued Q0 taking by default numpy.inf; and

  – def __init__(self, narms, rho, Q0) for positive integer narms, positive real rho, and real-valued Q0 taking by default numpy.inf.

• Additional methods (where in your implementations context will go unused):

  – def play(self, tround, context) for positive integer tround, and unused (for now) context. This should return an arm integer in {1, ..., self.narms}; and

  – def update(self, arm, reward, context) for positive integer arm no larger than property self.narms, floating-point reward, and unused (for now) context. This method should not return anything.

Tie-breaking in play() should be performed uniformly at random among the value-maximising arms.
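For concreteness, here is a minimal sketch of how the EpsGreedy class might be structured (UCB follows the same pattern, with an upper-confidence-bound term added in play()). The MAB abstract base class shown here is an assumed stand-in; use the one provided in proj2.ipynb.

import numpy as np
from abc import ABC, abstractmethod

class MAB(ABC):
    """Assumed stand-in for the abstract base class in proj2.ipynb."""
    @abstractmethod
    def play(self, tround, context):
        pass
    @abstractmethod
    def update(self, arm, reward, context):
        pass

class EpsGreedy(MAB):
    """Epsilon-greedy bandit with optimistic initial value estimates Q0."""
    def __init__(self, narms, epsilon, Q0=np.inf):
        self.narms = narms
        self.epsilon = epsilon
        self.Q = np.full(narms, Q0, dtype=float)    # per-arm value estimates
        self.counts = np.zeros(narms, dtype=int)    # per-arm pull counts

    def play(self, tround, context):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.narms) + 1       # explore: uniform arm
        best = np.flatnonzero(self.Q == self.Q.max())      # value-maximising arms
        return np.random.choice(best) + 1                  # exploit: break ties uniformly

    def update(self, arm, reward, context):
        i = arm - 1
        self.counts[i] += 1
        if np.isinf(self.Q[i]):
            self.Q[i] = reward                             # replace optimistic initial value
        else:
            self.Q[i] += (reward - self.Q[i]) / self.counts[i]   # incremental mean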

Marks: graders will perform quick code reviews of your implementations. A maximum of 6 marks will be available for correctness; a maximum of 4 marks will be available for code structure and style (primarily the former). Code should include the commenting necessary to understand interfaces and the major points of its inner workings, basic checks of well-formed input, clear variable names, and readable statements.

Part 2: Off-Policy Evaluation [5 marks total]

A major practical challenge for industry deployments of MAB learners has been the requirement to let the learner loose on real data. Inevitably bandits begin with little knowledge about the arm reward structure, and so a bandit must necessarily suffer poor rewards in the beginning rounds. For a company trying out and evaluating dozens of bandits in its data science groups, this is potentially very expensive.

A breakthrough was made when it was realised that MABs can be evaluated offline, or off policy. The idea is that you collect, just once, a dataset of uniformly-random arm pulls and the resulting rewards. You can then evaluate any possible future bandit learner of interest on that one historical dataset; there is no need to run bandits online in order to evaluate them! In this part you are to implement a Python function for offline/off-policy evaluation.

You must implement the algorithm first described as Algorithm 3 “Policy Evaluator” from the paper:

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, ‘A Contextual-Bandit Approach to Personalized News Article Recommendation’, in Proceedings of the Nineteenth International Conference on World Wide Web (WWW 2010), Raleigh, NC, USA, 2010. https://arxiv.org/pdf/1003.0146.pdf

You should begin by reading Section 4 of the WWW 2010 paper, which describes the algorithm. You may find it helpful to read the rest of the paper up to this point for background (skipping Sec 3.2), as this also relates to Part 3. If you require further detail of the algorithm you may find the follow-up paper useful (particularly Sec 3.1):


Lihong Li, Wei Chu, John Langford, and Xuanhui Wang, ‘Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms’, in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM 2011), pp. 297-306. ACM, 2011. https://arxiv.org/pdf/1003.5956.pdf

Note that what is not made clear in the pseudo-code of Algorithm 3 is that, after the bandit plays (written as function π) an arm that matches a given log entry, you should not only note down the reward as if the bandit really received this reward, but you should also update the bandit with the played arm a, the reward ra, and (later in the project) the context x1, ..., xK over the K arms. Bandits that do not make use of context, such as your Part 1 bandits, can still take context as an argument even if it goes unused.

A second point that is implied in the pseudo-code, but may be missed, is that when asking the bandit to play an arm, the supplied round number should not be the current round in the log file, but instead the length of the history recorded so far, plus one. That is, after playing a matching arm on the first logged event, a bandit may play different arms for events 2, 3, and 4, and on the 5th event may for a second time play a matching arm. For the calls to play() for each of events 2, 3, 4, 5 you would pass the value 2 as the tround argument. You would then increment tround to 3 from the 6th event.

Implement your function (nominally outside any Python class) with signature

def offlineEvaluate(mab, arms, rewards, contexts, nrounds=None)

for a MAB class object mab such as EpsGreedy or UCB (and the classes implemented in later project Parts); a (numpy) array arms of values in {1, ..., mab.narms}; an array of scalar numeric rewards of the same length as arms; a numeric 2D array contexts with number of rows equal to the length of arms and number of columns equal to a positive multiple of mab.narms; and a positive integer nrounds with default value None.

Here arms corresponds to the arms played by a uniformly-random policy as recorded in a dataset of, say, M events, while rewards corresponds to the resulting M observed rewards. In the next Part we will consider contextual bandits, in which each arm may have a feature vector representing its state/context (potentially also factoring in the context of the user). So if each of the K arms has d features, each row of contexts will contain these feature vectors flattened: the d features of arm 1, followed by the d features of arm 2, and so on up to arm K, giving d × K features (a multiple of K).

Finally, nrounds is the desired number of matching events on which we would like to evaluate the bandit mab. Once your function finds this many matching arm plays, it should stop and return the per-round rewards (not their sum as in the WWW 2010 Algorithm 3). If it reaches the end of the logged dataset without reaching the required number (or in the case of the default None), it should return the per-round rewards recorded so far.
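To make the above concrete, the following is a minimal sketch of how offlineEvaluate might be organised under the tround convention just described; treat it as an illustration rather than a reference implementation.

import numpy as np

def offlineEvaluate(mab, arms, rewards, contexts, nrounds=None):
    """Off-policy evaluation of mab on logged uniformly-random plays.
    Returns the per-round rewards observed on matching events."""
    matched_rewards = []
    tround = 1                                   # length of matched history, plus one
    for logged_arm, logged_reward, context in zip(arms, rewards, contexts):
        chosen = mab.play(tround, context)
        if chosen == logged_arm:                 # event matches the bandit's choice
            matched_rewards.append(logged_reward)
            mab.update(chosen, logged_reward, context)
            tround += 1
            if nrounds is not None and len(matched_rewards) >= nrounds:
                break
    return np.array(matched_rewards)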

Dataset: The LMS page for project 2 contains a 2 MB file dataset.txt suitable for validating MAB implementations. You may download this file and familiarise yourself with its format:

• 10,000 lines (i.e., rows) corresponding to distinct site visits by users (events in the language of this part);

• Each row comprises 102 space-delimited columns of integers:

  – Column 1: The arm played by a uniformly-random policy out of 10 arms (news articles);

  – Column 2: The reward received from the arm played: 1 if the user clicked, 0 otherwise; and

  – Columns 3–102: The 100-dimensional flattened context: 10 features per arm (incorporating the content of the article and its match with the visiting user), first the features for arm 1, then arm 2, and so on up to arm 10.
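One way to load this file into the arrays that offlineEvaluate expects (a sketch; it assumes dataset.txt sits alongside your notebook):

import numpy as np

data = np.loadtxt('dataset.txt')        # 10,000 x 102 whitespace-delimited matrix
arms = data[:, 0].astype(int)           # column 1: arm played (1..10)
rewards = data[:, 1]                    # column 2: click indicator (0/1)
contexts = data[:, 2:]                  # columns 3-102: flattened 100-dim context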

Your function should be able to run on this file, where column 1 forms arms, column 2 forms rewards, and columns 3–102 form contexts, with both classes you have implemented thus far. You should output the result of running:


mab = EpsGreedy(10, 0.05)
results_EpsGreedy = offlineEvaluate(mab, arms, rewards, contexts, 800)
print("EpsGreedy average reward ", np.mean(results_EpsGreedy))

mab = UCB(10, 1.0)
results_UCB = offlineEvaluate(mab, arms, rewards, contexts, 800)
print("UCB average reward ", np.mean(results_UCB))

Marks: this part will be marked on the basis of a review of your code (as above).

Part 3: Contextual Bandits—LinUCB [5 marks total]

In this part you are to implement a third MAB learner as a third Python class. This time you are to read up to and including Section 3.1 of the WWW 2010 paper to understand, and then implement, the LinUCB learner with disjoint linear models (Algorithm 1). This is a contextual bandit (likely the first you have seen); however, its workings are a direct mashup of UCB and ridge regression, both of which you have seen in class. Practising reading and implementing papers is the best way to turbo-charge your ML skills. Your class LinUCB should have methods:

• def __init__(self, narms, ndims, alpha) constructor for positive integer narms the number of arms, positive integer ndims the number of dimensions for each arm’s context, and positive real-valued alpha a hyperparameter balancing exploration vs. exploitation;

• def play(self, tround, context) as for your other classes, for positive integer tround, and context being a numeric array of length self.ndims * self.narms; and

• def update(self, arm, reward, context) as for your other classes, for positive integer arm no larger than property self.narms, floating-point reward, and context as previously.

While the idea of understanding LinUCB well enough to implement it correctly may seem daunting, the WWW 2010 paper is written for a non-ML audience and is complete in its description, and the pseudo-code is detailed. There is one unfortunate typo, however: pg. 3, column 2, line 3 of the linked arXiv version should read ca rather than ba. The pseudo-code uses the latter (correctly) as shorthand for the former times the contexts. Note also one piece of language you may not have encountered: “design matrix” means a feature matrix in the statistics literature.
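As a rough guide to the shape of the class (using the same assumed MAB base as the Part 1 sketch, and keeping one ridge-regression model per arm as in the disjoint-models algorithm), a possible sketch is:

import numpy as np

class LinUCB(MAB):
    """Sketch of LinUCB with disjoint linear models (one model per arm)."""
    def __init__(self, narms, ndims, alpha):
        self.narms = narms
        self.ndims = ndims
        self.alpha = alpha
        self.A = [np.eye(ndims) for _ in range(narms)]      # per-arm d x d matrices
        self.b = [np.zeros(ndims) for _ in range(narms)]    # per-arm d-vectors

    def play(self, tround, context):
        x = np.asarray(context).reshape(self.narms, self.ndims)   # row a = features of arm a+1
        p = np.empty(self.narms)
        for a in range(self.narms):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                              # ridge-regression coefficients
            p[a] = theta @ x[a] + self.alpha * np.sqrt(x[a] @ A_inv @ x[a])
        best = np.flatnonzero(p == p.max())
        return np.random.choice(best) + 1                          # uniform tie-breaking, 1-indexed arms

    def update(self, arm, reward, context):
        x = np.asarray(context).reshape(self.narms, self.ndims)[arm - 1]
        self.A[arm - 1] += np.outer(x, x)
        self.b[arm - 1] += reward * x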

After you have implemented your class, include and run an evaluation on the given dataset with

mab = LinUCB(10, 10, 1.0)
results_LinUCB = offlineEvaluate(mab, arms, rewards, contexts, 800)
print("LinUCB average reward ", np.mean(results_LinUCB))

Marks: this part will be marked on the basis of a review of your code (as above).

Part 4: Evaluation [3 marks total]

In this part you are to delve deeper into the performance of your implemented bandit learners. This part’s first sub-part does not necessarily require completion of Part 3.

Part 4(a) [1.5 marks]: Run offlineEvaluate on each of your Python classes with hyperparameters as above. This time plot the running per-round cumulative reward, i.e. $T^{-1} \sum_{t=1}^{T} r_{t,a}$ for T = 1, ..., 800, as a function of the round T, all on one overlaid plot. Your plot will have up to 3 curves, clearly labelled.
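A minimal plotting sketch, assuming the results_* arrays from the earlier offlineEvaluate runs are in scope:

import numpy as np
import matplotlib.pyplot as plt

for label, results in [('EpsGreedy', results_EpsGreedy),
                       ('UCB', results_UCB),
                       ('LinUCB', results_LinUCB)]:
    T = np.arange(1, len(results) + 1)
    plt.plot(T, np.cumsum(results) / T, label=label)   # running per-round cumulative reward
plt.xlabel('Round T')
plt.ylabel('Per-round cumulative reward')
plt.legend()
plt.show()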


Part 4(b) [1.5 marks]: How can you optimise hyperparameters? Devise a grid-search based strategy to select the α hyperparameter in LinUCB, as Python code in your notebook. Output the result of this strategy, which could be a graph, a number, etc.
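One possible shape for such a strategy (the grid values here are purely illustrative, and arms, rewards, contexts are assumed to be loaded as above):

import numpy as np

alpha_grid = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]      # illustrative grid of alpha values
mean_rewards = []
for alpha in alpha_grid:
    mab = LinUCB(10, 10, alpha)
    results = offlineEvaluate(mab, arms, rewards, contexts, 800)
    mean_rewards.append(np.mean(results))
best_alpha = alpha_grid[int(np.argmax(mean_rewards))]
print("best alpha:", best_alpha, "with mean reward", max(mean_rewards))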

Part 5: KernelUCB [2 marks total]

You may skip this final part if short on time. So far you have built on knowledge of bandits and ridge regression. In this part you will make use of the kernel methods part of the subject, in a mashup of all three concepts, by implementing as a fourth Python class the KernelUCB learner (with online updates) given as Algorithm 1 of this paper:

Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini, ‘Finite-time analysis of kernelised contextual bandits’, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI’13), pp. 654-663. AUAI Press, 2013. https://arxiv.org/ftp/arxiv/papers/1309/1309.6869.pdf

You will need to judiciously skim the more theoretical parts of the introduction, which are not relevant to understanding the crux of the algorithm. To help: you will find that Sec 2.2 sets up the MAB problem (and therefore the notation in the paper), Sec 2.3 reviews LinUCB, while the first two columns of Sec 3, plus Algorithm 1, explain KernelUCB.

Note that the notation [K11, K12; K21, K22] is block-matrix notation describing the first rows of the matrix by submatrices, then the final rows. That is, the algorithm involves building up a matrix by first creating submatrices. Also, the ‘RKHS H’ can be thought of as the feature space.

Implement sub-class KernelUCB with the following methods:

• def __init__(self, narms, ndims, gamma, eta, kern) constructor for positive integer narms the number of arms, positive integer ndims the number of dimensions for each arm’s context, positive real-valued gamma and eta hyperparameters for regularisation and balancing exploration vs. exploitation, and kern a kernel function from sklearn.metrics.pairwise;

• def play(self, tround, context) as for LinUCB; and

• def update(self, arm, reward, context) as for LinUCB.

Once done, demonstrate a MAB hyperparameter setting for which, with the Gaussian/RBF kernel, a plot like that in Part 4(a) shows competitive performance relative to LinUCB. Note that this kernel is imported for you as rbf_kernel in the given notebook. Note also that you may try different kernel hyperparameters, and beware that while the RBF kernel’s parameter is also called ‘gamma’, it is distinct from the ‘gamma’ in KernelUCB.
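As a starting point only, here is a simplified sketch of the class: it recomputes the regularised kernel-matrix inverse each round rather than performing the paper’s online block-matrix update, and you should check the exact scaling of the exploration term against Algorithm 1 before relying on it. The MAB base class is the same assumed stand-in as in the earlier sketches.

import numpy as np

class KernelUCB(MAB):
    """Simplified KernelUCB sketch (no incremental kernel-matrix update)."""
    def __init__(self, narms, ndims, gamma, eta, kern):
        self.narms = narms
        self.ndims = ndims
        self.gamma = gamma       # regularisation
        self.eta = eta           # exploration-exploitation balance
        self.kern = kern         # e.g. sklearn.metrics.pairwise.rbf_kernel
        self.X = []              # contexts of arms played so far
        self.y = []              # corresponding rewards

    def play(self, tround, context):
        x = np.asarray(context).reshape(self.narms, self.ndims)
        if not self.X:
            return np.random.randint(self.narms) + 1         # no history yet: arbitrary arm
        X_hist = np.vstack(self.X)
        y_hist = np.array(self.y)
        K_inv = np.linalg.inv(self.kern(X_hist, X_hist) + self.gamma * np.eye(len(self.X)))
        u = np.empty(self.narms)
        for a in range(self.narms):
            k_x = self.kern(X_hist, x[a:a + 1]).ravel()      # kernel between history and arm a
            mean = k_x @ K_inv @ y_hist                      # kernel ridge regression estimate
            width = self.kern(x[a:a + 1], x[a:a + 1])[0, 0] - k_x @ K_inv @ k_x
            u[a] = mean + (self.eta / np.sqrt(self.gamma)) * np.sqrt(max(width, 0.0))
        best = np.flatnonzero(u == u.max())
        return np.random.choice(best) + 1

    def update(self, arm, reward, context):
        x = np.asarray(context).reshape(self.narms, self.ndims)[arm - 1]
        self.X.append(x)
        self.y.append(reward)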

Marking: this will be based on a code review of your implemented class, and on the empirical results.

Project Submission

Submit via LMS your Jupyter notebook named username.ipynb, based on proj2.ipynb (with its structure preserved). This notebook should have its cells run prior to submission so that outputs and plots are preserved. (We suggest opening your notebook again prior to upload to double check. We may not run your notebook; given that your environment might subtly differ from ours, it is your responsibility to ensure the results are contained within it.)

Rules. You may discuss the bandit learning deck or Python at a high level with others, but do not collaborate on solutions. You may consult resources to understand bandits conceptually, but do not make any use of online code whatsoever. (We will run code comparisons against online partial implementations to enforce these rules.) You must use the environment (Anaconda3, Python 3.5 or higher) as used in labs. (You may of course use your own machine, but we strongly recommend you check code operation on lab machines prior to submission, in case we run your code.) You may only use the packages already imported in the provided proj2.ipynb notebook. You should use matplotlib for plotting. You must also preserve the overall structure of the provided notebook. Late submissions will be accepted up to 6 days late, with a penalty of -2 per day.

