CSCI 3022: Intro to Data Science - Spring 2020 Practicum 1

This practicum is due on Canvas by 11:59 PM on Monday March 2. Your solutions to theoretical questions should be done in Markdown/MathJax directly below the associated question. Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.

Here are the rules:

All work, code and analysis, must be your own.

You may use your course notes, posted lecture slides, textbooks, in-class notebooks, and homework solutions as resources. You may also search online for answers to general knowledge questions like the form of a probability distribution function or how to perform a particular operation in Python/Pandas.

This is meant to be like a coding portion of your midterm exam. So, the instructional team will be much less helpful than we typically are with homework. For example, we will not check answers, help debug your code, and so on.

If something is left open-ended, it is because we want to see how you approach the kinds of problems you will encounter in the wild, where it will not always be clear what sort of tests/methods should be applied. Feel free to ask clarifying questions though.

You may NOT post to message boards or other online resources asking for help.

You may NOT copy-paste solutions from anywhere.

You may NOT collaborate with classmates or anyone else.

In short, your work must be your own. It really is that simple.

Violation of the above rules will result in an immediate academic sanction (at the very least, you will receive a 0 on this practicum or an F in the course, depending on severity), and a trip to the Honor Code Council.

By submitting this assignment, you agree to abide by the rules given above.

Name:

NOTES:

You may not use late days on the practicums nor can you drop your practicum grades.

If you have a question for us, post it as a PRIVATE message on Piazza. If we decide that the question is appropriate for the entire class, then we will add it to a Practicum clarifications thread.

Do NOT load or use any Python packages that are not available in Anaconda 3.6.

Some problems with code may be autograded. If we provide a function API do not change it. If we do not provide a function API then you're free to structure your code however you like.

Submit only this Jupyter notebook to Canvas. Do not compress it using tar, rar, zip, etc.

This should go without saying, but... For any question that asks you to calculate something, you must show all work to receive credit. Sparse or nonexistent work will receive sparse or nonexistent credit.

In [1]:

from scipy import stats

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

[30 points] Problem 1: Gambling With A Peg Legged Pirate:

You're a time traveling data scientist, and have traveled way back to the year 1654. Immediately upon arriving you're picked up by a bunch of pirates, and made to join a pirate crew. When the pirates realize that you're immensely knowledgeable about probabilities and statistics, they promote you to be their captain! You rename the ship to be the "Certain Probability of Death", and set out upon the high seas. After a few days of sailing you come upon another band of buccaneers in their ship. Their captain, Peg Leg Pascal Fermat, challenges you to a gambling game, but you're not sure if you should play it. The rules for the game are below:

You and Peg Leg Pascal Fermat will take turns repeatedly rolling a 20 sided die. The die has values 1 - 20 on it.

On your turn, the rules are as follows:

If you roll the same value as what Peg Leg Pascal Fermat rolled on his last turn, you have to give him 5 dubloons. Do not perform any additional actions from the below list of rules if you rolled the same value as what Peg Leg Pascal Fermat rolled on his last turn. Otherwise:

If you roll an 8, Peg Leg Pascal Fermat will give you two gold dubloons.

If you roll a 7, Peg Leg Pascal Fermat will give you four gold dubloons.

If you roll a 15, you have to give Peg Leg Pascal Fermat one dubloon.

If you roll a 1, the game ends.

If you roll any other value, nothing happens.

On Peg Leg Pascal Fermat's turn the rules are as follows:

If Peg Leg Pascal Fermat rolls the same number as you did on your last turn, he then rolls a different 19 sided die. When he rolls again, if he rolls a 19, you must pay him 100 dubloons. If he rolls anything other than a 19, he must pay you the same number of dubloons as the value of the roll. E.g. If he rolls a 10, he gives you 10 dubloons, but if he rolls a 19 you give him 100 dubloons. When he rolls again, he does not perform any other rules from the below list.

If Peg Leg Pascal Fermat rolls a 2, he must give you one gold dubloon.

If Peg Leg Pascal Fermat rolls a 14, you must give him two dubloons.

If Peg Leg Pascal Fermat rolls a 17, you must give him three dubloons.

If Peg Leg Pascal Fermat rolls a 1, the game ends.

If Peg Leg Pascal Fermat rolls a 3, he takes off his peg leg, and gives you the leg. Who knows... Maybe it will be useful if you lose your leg?

If Peg Leg Pascal Fermat rolls a 3 and he has already given you his peg leg, he must give you 3 dubloons.

If he rolls any other values, nothing happens.

Part A: Without doing any extensive math or simulations, predict whether this game will result in your making or losing money. Would you play it? Any logical non-empty answer will get credit here, so don't worry about whether your prediction is actually right.

Part B: Luckily, when you time traveled back to 1654, you brought your laptop with you! Use Python to simulate 10,000 games following the above rules. Record your winnings or losings for each game in an array. Record the games in which you lost money with a negative value (amount you lost), and games in which you won money with a positive value (amount you won). Assume you always go first. You may write multiple functions or use multiple Jupyter notebook cells to write your code; how you structure it is up to you. You may use any NumPy or Pandas functions you find useful, but may not import any additional libraries. Calculate the median amount you win or lose, and report it in markdown below. Based on the median value, would you play the game?
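One possible way to structure the simulation is sketched below. It is not a reference solution, and it rests on several readings of the rules that are assumptions: a matching roll overrides every other rule for that turn (including the game-ending 1), "last roll" always means the 20-sided roll, and the peg leg is handed over at most once per game. The names `play_one_game` and `simulate_games` and the seed are made up for illustration, and `np.random.default_rng` requires NumPy 1.17 or newer.

```python
import numpy as np


def play_one_game(rng):
    """Play one game and return net winnings in dubloons (negative means a loss)."""
    winnings = 0
    my_last = None        # my most recent d20 roll
    his_last = None       # Pascal's most recent d20 roll (None before his first turn)
    has_peg_leg = False   # assumption: the leg changes hands at most once per game

    while True:
        # My turn.
        my_last = int(rng.integers(1, 21))    # uniform on 1..20
        if my_last == his_last:
            winnings -= 5                     # matched his roll: pay 5, skip other rules
        elif my_last == 8:
            winnings += 2
        elif my_last == 7:
            winnings += 4
        elif my_last == 15:
            winnings -= 1
        elif my_last == 1:
            return winnings                   # game over

        # Pascal's turn.
        his_last = int(rng.integers(1, 21))
        if his_last == my_last:
            bonus = int(rng.integers(1, 20))  # second roll, on the 19-sided die
            winnings += -100 if bonus == 19 else bonus
        elif his_last == 2:
            winnings += 1
        elif his_last == 14:
            winnings -= 2
        elif his_last == 17:
            winnings -= 3
        elif his_last == 1:
            return winnings                   # game over
        elif his_last == 3:
            if has_peg_leg:
                winnings += 3                 # leg already handed over
            else:
                has_peg_leg = True            # first 3: he hands over the leg


def simulate_games(n_games, seed=None):
    """Return an integer array with the net winnings of n_games independent games."""
    rng = np.random.default_rng(seed)
    return np.array([play_one_game(rng) for _ in range(n_games)])


results = simulate_games(10_000, seed=3022)
print("median winnings:", np.median(results))
```

Passing a seed makes a run reproducible, which helps when checking that a change to the rules logic has the effect you expect.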

In [1]:

# Your code here.

Part C: Calculate a Tukey 5 Number Summary and the mean value for your array of simulated winnings/losings. Based on this information would you play the game? Are any of these metrics more useful than others? Which metrics would be important if we were deciding to play a single game? Which would be more important if we were deciding to play a very large number of games?
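As a rough illustration of how these metrics can disagree, the sketch below computes a five-number summary and the mean for a tiny made-up array. It assumes NumPy's default linear-interpolation percentiles are an acceptable definition of the quartiles; Tukey's original hinges can differ slightly, so use whichever definition the course specified. The helper name `five_number_summary` is hypothetical.

```python
import numpy as np


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max using NumPy's default percentile rule."""
    values = np.asarray(values, dtype=float)
    labels = ["min", "Q1", "median", "Q3", "max"]
    cut_points = np.percentile(values, [0, 25, 50, 75, 100])
    return dict(zip(labels, cut_points))


# Made-up data: four small wins and one large loss.
toy = [-100, 1, 2, 3, 4]
summary = five_number_summary(toy)
toy_mean = np.mean(toy)

print(summary)
print("mean:", toy_mean)
```

In this toy array a single large loss pulls the mean well below zero while the median stays positive, which is the kind of gap the question is asking you to think about.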

In [2]:

# Your code here.

Part D: After seeing the numbers in Part C, we decide to play a few games with Peg Leg Pascal Fermat (you might want to double check your simulation in Part B if the mean value doesn't come out slightly positive in Part C). You play 10 rounds and it seems like he might be cheating. You just can't win, and you're losing a ton of money! Maybe Peg Leg Pascal Fermat has a loaded die. However, you've found a .csv file buried in the sand. It has the winnings and losings of another player who was playing the same game against Peg Leg Pascal Fermat. We'll use this information to figure out if he is cheating. Read in the file Pascal_Fermat_Games.csv. Each row contains the results of a game that another player played against Peg Leg Pascal Fermat. It's a little bit dirty (after all, this .csv was buried in the sand). Perform the following cleaning tasks:

Read in the .csv. If you have any trouble reading in the file, open it in a text editor and take a look at it. You might find the pandas documentation for the read_csv function and some of the optional arguments useful.

Look at the two columns. One of them is useless. Drop the useless one.

Drop any strings of non-integer data.

Check to see if any of our values are floating point values. If there are any, drop them.

Drop any values that are over 1000 or under -1000.

Print the number of rows remaining.

After doing all that, you should have 9661 rows of data left.
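The cleaning steps above can be sketched on a small in-memory stand-in for the file. Everything about the stand-in is an assumption: the real file's delimiter, column names, and kinds of dirt are not described here, so the `;` separator and the `junk` and `winnings` columns are hypothetical. The sketch reads every value as text and keeps only strings that look like whole numbers, which removes both non-numeric strings and floating point values in one pass.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: separator and column names are assumed.
raw = io.StringIO(
    "junk;winnings\n"
    "x;12\n"
    "x;-7\n"
    "x;abc\n"      # non-numeric string
    "x;3.5\n"      # floating point value
    "x;5000\n"     # over 1000
    "x;-2000\n"    # under -1000
    "x;\n"         # missing value
    "x; 40\n"      # stray whitespace
)

df = pd.read_csv(raw, sep=";", dtype=str)   # keep everything as text for now
df = df.drop(columns=["junk"])              # drop the column that carries no information

text = df["winnings"].str.strip()
is_integer_text = text.str.match(r"^[+-]?\d+$", na=False)  # whole numbers only

clean = text[is_integer_text].astype(int)
clean = clean[clean.between(-1000, 1000)]   # inclusive on both ends
clean_df = clean.to_frame().reset_index(drop=True)

print(len(clean_df))
```

Treating a value such as `12.0` as floating point (and dropping it) is a design choice of this sketch; check it against the expected row count before relying on it.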

In [3]:

# Your code here.

Part E: Create a density histogram with both our simulated data and our data from the Pascal_Fermat_Games.csv file overlaid on the same set of axes. Ensure your plot is legible and contains all of the common labels/titles/etc. Make sure you use enough bins to make the data easily visible. To make the graph easy to read, it's fine to limit the x range to avoid showing large areas with very few occurrences of data. This problem will largely be graded based on how nice and easy to interpret your plot is, so do your best.

One annoying thing about matplotlib is how small the font is on the titles, axis labels, etc. Do some googling and figure out how to change the matplotlib font sizes. CITE YOUR SOURCES IF YOU USE ANYTHING OTHER THAN THE MATPLOTLIB DOCUMENTATION PAGES. Change the axes label font to be 14pt, the x-tick and y-tick font to be 8pt, the title font to be 16pt, and the legend font to be 12pt. Isn't that nicer?
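A minimal sketch of an overlaid density histogram with the requested font sizes follows. The two arrays are random placeholders, not game results, and the bin edges, x range, labels, and title are illustrative choices only. Sharing one set of bin edges between both calls keeps the bars directly comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder arrays standing in for the simulated and CSV winnings.
rng = np.random.default_rng(0)
simulated = rng.normal(loc=0, scale=20, size=1000)
observed = rng.normal(loc=-10, scale=20, size=1000)

bins = np.linspace(-100, 100, 41)   # 40 equal-width bins shared by both data sets

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(simulated, bins=bins, density=True, alpha=0.5, label="Simulated games")
ax.hist(observed, bins=bins, density=True, alpha=0.5, label="Games from the .csv file")

ax.set_xlim(-100, 100)                                    # trim sparse tails
ax.set_xlabel("Winnings per game (dubloons)", fontsize=14)
ax.set_ylabel("Density", fontsize=14)
ax.set_title("Simulated vs. recorded winnings", fontsize=16)
ax.tick_params(axis="both", labelsize=8)                  # x-tick and y-tick labels
ax.legend(fontsize=12)
```

The `fontsize` and `labelsize` arguments used here are all part of the matplotlib API; setting the same sizes globally through `plt.rcParams` is an alternative if several figures should share them.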

In [4]:

# Your code here.

Part F: Print out the Tukey 5 number summary of the data from the .csv file. Based on this and the histogram above, make an argument as to whether Peg Leg Pascal Fermat is cheating or not.

In [5]:

# Your code here.

Part G: In this problem you were transported back to the year 1654. Find out why the year 1654 is important in the fields of mathematics and data science.


[30 points] Problem 2: Sonic or Tails?

In the file flipadelphia.csv you will find the results of an experiment that was conducted by Amy, the famous hedgehog data scientist, as she was flipping a coin one sunny day in a meadow. This is no ordinary coin, however: this coin has on one side Sonic, and on the other side Tails! The two sides of this coin are above, and at this link.

In Amy's experiment she repeatedly flipped the coin until it came up Sonic. After each trial, she recorded her observed value for $X =$ the number of flips required to see the first Sonic. The results are stored in flipadelphia.csv.

Amy has a lot of coins for performing cool data science experiments, and these coins have different biases (not all unique). Amy is a forgetful hedgehog, so she isn't sure which coin she was flipping. Her coins have biases of $p_S = .2, .3, .4, .5, .6, .7$ and $.8$, where $p_S$ is the probability of any given flip coming up Sonic.

Part A: Read in the data set and make a frequency histogram of the data. Be sure to label your axes appropriately, and center your bins above the integer numbers of flips (0, 1, 2, etc...). What is the name of the distribution for the random variable that Amy observed and recorded in her data table?
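One way to center bins on integers is to place the bin edges at the half-integers, as in the sketch below. The `flips` array holds made-up values, not Amy's data, and the axis labels are only suggestions.

```python
import numpy as np
import matplotlib.pyplot as plt

flips = np.array([1, 1, 2, 3, 5])   # made-up values, not the contents of the file

# Edges at ..., 0.5, 1.5, 2.5, ... so that each bar is centered on an integer.
edges = np.arange(flips.min() - 0.5, flips.max() + 1.5, 1.0)

fig, ax = plt.subplots()
counts, _, _ = ax.hist(flips, bins=edges, edgecolor="black")
ax.set_xticks(np.arange(flips.min(), flips.max() + 1))
ax.set_xlabel("Number of flips until the first Sonic")
ax.set_ylabel("Frequency")
```

With unit-width bins each bar height is simply the count of trials that took that many flips.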

In [6]:

# Your code here.

Part B: Use the distribution that you identified in Part A to determine $P(X = n \mid p_S = 0.5)$, the probability that Amy would observe the first Sonic flip on the $n$-th flip, assuming that the coin is fair ($p_S = 0.5$), for each of the $n$ from her 10 trials in her data set. Then, combine these to find the overall likelihood that she would observe her entire data set, assuming that the coin was fair. That is, estimate $P(\text{data} \mid p_S = 0.5)$. Be sure to note any assumptions you make about how the outcome of one trial relates to the outcomes of the others.

If it helps to have some mathematical notation, consider that Amy's data set consists of the results of all 10 of her trials:

$$\text{data} = (X_1 = n_1) \cap (X_2 = n_2) \cap \ldots \cap (X_{10} = n_{10})$$
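If the trials are assumed independent, the probability of the intersection above factors into a product of per-trial probabilities. The sketch below illustrates that step with ten made-up trial values (not Amy's data) and assumes the per-trial probability of a first success on flip $n$ has the form $p(1-p)^{n-1}$; confirm that this matches the distribution you named in Part A before reusing it.

```python
import numpy as np

flips = np.array([1, 3, 2, 1, 4, 1, 2, 2, 1, 5])   # made-up trial results
p_sonic = 0.5                                      # fair-coin assumption

# Probability of each trial: (n - 1) non-Sonic flips followed by one Sonic.
per_trial = p_sonic * (1 - p_sonic) ** (flips - 1)

# Independence assumption: the joint probability is the product over trials.
likelihood = np.prod(per_trial)

# The same quantity on the log scale, which avoids underflow for long data sets.
log_likelihood = np.sum(np.log(per_trial))

print(per_trial)
print(likelihood)
```

`scipy.stats.geom.pmf(flips, p_sonic)` uses the same convention (support starting at 1) and can serve as a cross-check on `per_trial`.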

In [7]:

# Your code here.

Part C: Suppose that before we observed Amy's data set, we believed that each of the seven possible coin biases occurs with equal probability, $P(p_S)$. This is called the prior distribution for the coin bias, $p_S$, because we have not yet taken into account Amy's data set.

Now, estimate the probability of each possible bias, given the data: $P(p_S \mid \text{data})$. This is called the posterior distribution for the coin bias, because it is our assessment of the coin's bias after we have accounted for Amy's data.

Make a line plot of the bias along the x-axis versus the posterior probability of that bias along the y-axis, and be sure to label your axes.

Comment on your plot. What appears to be the most probable value for the bias, $p_S$? This is called the maximum a posteriori estimate, because it maximizes the posterior distribution and sounds very, very fancy.
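The posterior follows from Bayes' rule: multiply the likelihood of the data under each candidate bias by that bias's prior probability, then divide by the sum so the results add up to 1. The sketch below shows the mechanics with made-up trial values (not Amy's data), assuming independent trials and a per-trial probability of the form $p(1-p)^{n-1}$. The function name `likelihood` is hypothetical.

```python
import numpy as np

biases = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
flips = np.array([1, 3, 2, 1, 4, 1, 2, 2, 1, 5])   # made-up trial results


def likelihood(p, n):
    """P(data | p) for independent trials, each with probability p * (1 - p)**(n - 1)."""
    return np.prod(p * (1 - p) ** (n - 1))


likelihoods = np.array([likelihood(p, flips) for p in biases])
prior = np.full(len(biases), 1 / len(biases))      # uniform prior over the seven coins

unnormalized = likelihoods * prior                 # numerator of Bayes' rule
posterior = unnormalized / unnormalized.sum()      # divide by P(data)

map_estimate = biases[np.argmax(posterior)]        # bias with the largest posterior
print(posterior)
print("MAP estimate for the made-up data:", map_estimate)
```

The denominator `unnormalized.sum()` is the total probability of the data across all seven coins, which is why it never needs to be computed separately.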

In [8]:

# Your code here.

Part D: Now suppose the prior probability distribution of the coins is not uniform. Namely, suppose these probabilities follow a triangular distribution, centered at $p_S = 0.5$:

$$P(p_S = p) = \begin{cases} mp & p \leq 0.5 \\ m(1 - p) & p > 0.5 \end{cases}$$

Determine what value the constant $m$ should have in order to make $P(p_S = p)$ a valid probability mass function. Remember, $p_S \in \{.2, .3, \ldots, .7, .8\}$ and is discrete.

Part E: Compare, using words, the triangular prior distribution (from Part D) and the uniform prior distribution (from Part C). What does each represent in terms of our prior knowledge of the coin bias?

Part F: Modify your calculation of the posterior distribution from Part C to use the new triangular prior distribution from Part D. Make a plot of the results that includes both posterior distribution using the uniform prior (from Part C) and the posterior distribution using the triangular prior (from Part D) in the same figure panel. Be sure to label your axes and include a legend.
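A sketch of the comparison follows, again with made-up trial values rather than Amy's data and under the same independence and per-trial probability assumptions as before. It normalizes the triangular weights numerically, which is one way to check a hand-derived value of $m$. The helper name `posterior_from` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

biases = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
flips = np.array([1, 3, 2, 1, 4, 1, 2, 2, 1, 5])   # made-up trial results

# Likelihood of the data under each bias, assuming independent trials.
likelihoods = np.array([np.prod(p * (1 - p) ** (flips - 1)) for p in biases])


def posterior_from(prior):
    """Apply Bayes' rule on the discrete grid of biases."""
    unnormalized = likelihoods * prior
    return unnormalized / unnormalized.sum()


uniform_prior = np.full(len(biases), 1 / len(biases))

# Triangular weights: p on the left of 0.5, (1 - p) on the right.
weights = np.where(biases <= 0.5, biases, 1 - biases)
m = 1 / weights.sum()                 # the constant that makes the weights sum to 1
triangular_prior = m * weights

fig, ax = plt.subplots()
ax.plot(biases, posterior_from(uniform_prior), marker="o", label="Uniform prior")
ax.plot(biases, posterior_from(triangular_prior), marker="s", label="Triangular prior")
ax.set_xlabel("Coin bias $p_S$")
ax.set_ylabel("Posterior probability")
ax.legend()
```

Only the `prior` argument changes between the two curves, so any difference in the plot is attributable to the prior alone.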

In [9]:

# Your code here.
