ADM / 2019 / Homework_2 / README.MD

tlancian [upd] RQ4 - Clarification

a9dc986 4 days ago

4 contributors

Branch: master Find file Copy path

130 lines (73 sloc) 11.3 KB

Homework 2 - Soccer analytics

Soccer analytics is attracting an increasing interest of academia

and industry, thanks to the availability of sensing technologies

that provide high-fidelity data streams extracted from every

match.

The goal of this assignment is to perform an analysis on the

largest open collection of soccer-logs ever released, collected

by [Wyscout] (https://wyscout.com/) containing all the spatiotemporal

events (passes, shots, fouls, etc.) that occur during all

matches of the entire season 2017-2018 of seven competitions

(La Liga, Serie A, Bundesliga, Premier League, Ligue 1, FIFA

World Cup 2018, UEFA Euro Cup 2016). A match event contains

information about its position, time, outcome, player and

characteristics.

In particular, we are curious to answer to some specific

research questions (RQs) that may help us discover and

interpret meaningful patterns in data.

Raw Blame History

Before starting

Among all numerous things and good practises a data scientist

needs to do before running any analysis, there is one the is of

uttermost importance: get data and understand it!

Here you find the list of tasks you need to perform before

digging into the rich world of soccer.

Get your data! Go to this website and download the files

related to Coaches, Players, Events, Teams and Matches.

Throughout the analysis we focus only in club teams

information. So, there is no need to download/use the

files relative to the European Cup and the World Cup.

Understand your data. Read the legend of each column to

understand what it refers to. Additional information about

the labels can be found here: Coaches, Players, Events,

Teams, Matches. Please, be sure that you've understood

the data before start coding.

Handling data. The data are provided in multiple .json

files, with some of the columns present in more than one

file. For this reason, in order to answer the RQs, we kindly

suggest you to import the .json files as pandas

DataFrame object and then, based on what you want to

analyze, perform joins among the DataFrames. Here you

can find a quick useful guide. Remember, Google is your

best friend!

VERY VERY IMPORTANT

]. !!! Read the entire homework before coding anything!!!

^. My solution it's not better than yours and yours is not

better than mine. In any data analysis task, there is not a

unique way to answer to RQs. For this reason it is crucial

(necessary and mandatory) that you describe any single

decision you take and all the steps you do.

_. Once performed any exercise, comments about the

obtained results are mandatory. We are not always explicit

where to focus your comments, but we will always want

some brief sentences about your discoveries.

Research questions

Exploratory Data Analysis

General Setup: All the analysis requested from RQ1 to RQ5,

must be performed only over the Premier League dataset.

]. [RQ1] Who wants to be a Champion? During a season could

happen that a team has bad periods. For example, more

than three consecutive games lost, or it could have a

positive trend where it seems to be unbeatable. Let's

visualize this trends!

Create a plot where each point (x,y) represents the number

of points obtained by team x at game week y. In order to

show the trends, points related to the same team must be

connected to each other. Remind: in soccer each team gets

3 points for a win, 1 point for a tied game, and 0 for a loss.

Highlight the two teams that got the longest winning streak

(# of consecutive wins), and the two teams that got the

longest losing streak (# of consecutive losses).

Below you can see a similar example of what we would like

you to show us. Keep in mind that you must create this plot

for all the entire season (38 game weeks).

^. [RQ2] Is there a home-field advantage? It is generally

believed that there is an underlying home field advantage in

sport, i.e. an highest probability of winning of the home

team. Let's check for this, and see whether the outcome of

the game (win, draw, lose) is correlated to the playing side

(home or away). For 5 different teams of Premier League,

show the contingency table (outcome x side). Therefore,

perform an "overall" Chi-squared test in the following way:

build a unique contingency table, that contains all the

matches in which only one of the 5 teams previously

selected is involved, to see whether there is home field

advantage. State clearly the tested hypothesis and whether

it is accepted or rejected.

_. [RQ3] Which teams have the youngest coaches? Rank all

the teams by the age of their coach and show the 10 teams

with the youngest coaches. Remember that during a

season a team could have more coaches, in that case pick

the younger of them. Additionally, show the distirbutions of

the ages of all coaches in Premier League, using a boxplot.

(Hint: There's an attribute birthDate).

f. [RQ4] Find the top 10 players with the highest ratio

between completed passes and attempted passes. For this

task, consider all the different types of passes, and as

specified in the website, a completed pass has tag 1801

(accurate event).

In order to avoid meaningless results (e.g. players who

played few minutes, and completed 2 passes over 2,

achieving 100% ratio), select an arbitrary threshold of

minimum attempted passes, in order to consider only the

subset of players that played enough. Justify the choices

you make.

i. [RQ5] Does being a tall player mean winning more air

duels? Soccer is a physical game, and it happens often in a

match that players are involved in air duels (i.e. when two

players are contending for the ball while it is not on the

ground). Make a plot that shows the dependency between

height of the player and the ratio of air duels won with air

duels attempted. The visualization should be a scatterplot,

where each point (x,y) represent a player whose height is

equal to x, and that has a ratio of winning air duels equal to

y. Furthermore, color any point according an arbitrary

selection of categories of height (e.g. yellow: 160-165cm,

orange: 165-170cm, etc.)

Remember that the "Air Duel" is a subevent of the event

"Duel" and that an air duel is said to be won if it has the tag

"1801". Same as in RQ4, choose a threshold of minimum air

duels attempted, in order filter your data, get reliable

results, and justify your choice.

j. [RQ6] Free your mind! Go further with the EDA (Exploratory

Data Analysis) showing a new interesting result about the

dataset that you found.

Core Research Questions

[CRQ1] What are the time slots of the match with more

goals? Let's analyse and visualise the goals distribution into

9-minutes sets for all the matches. I.e., let's transform the

minute of a goal from a continuous variable in a discrete

variable (e.g. A goal scored in 5th minute, will end up in the

interval [0-9)). Remind that every match goes usually from

minute 0, to minute 90, but in football it is always added an

arbitary amount of extra-time to every half of the match,

thus consider also the intervals "45+" and "90+".

i. Make a barplot with the absolute frequency of goals in

all the time slots.

ii. Find the top 10 teams that score the most in the

interval "81-90".

iii. Show if there are players that were able to score at

least one goal in 8 different intervals.

[CRQ2] Visualize movements and passes on the pitch! Here

we try to focus our attention on the zones that a player

covers during a match. For each event, we have a pair of

coordinates, that are respectively the starting and ending

point of that event. It can be helpful to follow this link.

Knowing all the different positions where events happen, let us

be able to create different types of visualizations:

]. Considering only the match Barcelona - Real Madrid

played on the 6 May 2018:

visualize with a heatmap the zones where Cristiano

Ronaldo was more active. The events to be considered

are: passes, shoots, duels, free kicks.

compare his map with the one of Lionel Messi.

Comment the results and point out the main

differences (we are not looking for deep and technique

analysis, just show us if there are some clear

differences between the 2 plots).

Here's an example of heatmap where are shown all the starting

positions of the goals of Arsenal during the entire season.

^. Considering only the match Juventus - Napoli played on

the 22 April 2018:

visualize with arrows the starting point and ending

point of each pass done during the match by Jorginho

and Miralem Pjanic. Is there a huge difference

between the map with all the passes done and the one

with only accurate passes? Comment the results and

point out the main differences.

Here there's an example of a map with arrows.

Theoretical Question

You are given the recursive function splitSwap, which accepts

an array a, an index i, and a length n.

function splitSwap(a, l, n):

if n <= 1:

return

splitSwap(a, l, n/2)

splitSwap(a, l+ n /2, n/2)

swapList(a, l, n)

The subroutine swapList is described here:

function swapList(a, l, n):

for i = 1 to n/2:

tmp = a[l + i]

a[l + i] = a[l + n/2 + i]

a[l + n/2 + i] = tmp

]. How much running time does it take to execute splitSwap(a,

0, n)? (We want a Big O analysis.)

^. What does this algorithm do? Is it optimal? Describe the

mechanism of the algorithm in details, we do not want to

know only its final result.

Bonus

]. Repeat the entire analysis for other leagues (La Liga, Serie

A, Bundesliga and Ligue 1), aggregating the results and

highlighting the differences you find among the leagues.

^. Make nice visualization using libraries like Bokeh and

Seaborn.

版权所有：留学生作业网 2018 All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。