#### Review

• Homework 4 to be posted
• Exam study guide, homework solution guides – to be posted
• Today: random samples
##### Preview
• We’ve talked about Bernoulli and Binomial distributions, but there are many more useful ones
• Here’s just a quick preview of a few of them
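```r
library(ggplot2)   # ggplot(), geom_bar(), ggtitle()
library(ggthemes)  # theme_tufte()

# dhyper(x, m, n, k): m = 100 successes, n = 100 failures, k = 100 draws
df <- data.frame(x = c(25:75),
                 y = dhyper(25:75, 100, 100, 100))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() + ggtitle("Hypergeometric distribution with 100 draws, 100 successes and 100 failures")
```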

```r
df <- data.frame(x = c(0:50),
                 y = dgeom(0:50, .1))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() + ggtitle("Geometric distribution with p = 1/10")
```

```r
# dnbinom(x, size, prob): failures before the 5th success, with p = 1/5
df <- data.frame(x = c(0:80),
                 y = dnbinom(0:80, 5, 1/5))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() + ggtitle("Negative Binomial distribution with p = 1/5, size = 5")
```

```r
df <- data.frame(x = c(0:25),
                 y = dpois(0:25, 10))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() + ggtitle(expression(paste("Poisson distribution with ", lambda, " = 10")))
```
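
As a quick sanity check on the four previews above, the sketch below computes each distribution's mean directly from its probability mass function, to be compared with the textbook formula noted in the comments. The wider ranges (for example `0:1000`) and the variable names are choices made for this illustration, not part of the original plots; they are there so that the truncated tails contribute essentially nothing.

```r
# Means computed as sum(x * P(X = x)) over a range wide enough to capture
# essentially all of the probability mass (ranges are illustrative choices).
x_hyper <- 0:100
mean_hyper <- sum(x_hyper * dhyper(x_hyper, 100, 100, 100))  # k * m / (m + n)

x_geom <- 0:1000
mean_geom <- sum(x_geom * dgeom(x_geom, 0.1))                # (1 - p) / p

x_nbinom <- 0:2000
mean_nbinom <- sum(x_nbinom * dnbinom(x_nbinom, 5, 1/5))     # size * (1 - p) / p

x_pois <- 0:200
mean_pois <- sum(x_pois * dpois(x_pois, 10))                 # lambda

round(c(hypergeometric = mean_hyper, geometric = mean_geom,
        negative_binomial = mean_nbinom, poisson = mean_pois), 4)
```

Note that R's `dgeom` and `dnbinom` count failures before the first (or `size`-th) success, which is why the geometric mean here is (1 - p)/p = 9 rather than 1/p = 10.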

##### When should you trust a number?
• Suppose you’re looking for a restaurant nearby, and checking the ratings on your favorite app
• The ratings are on a scale of 1 to 5 stars
• Restaurant \(A\) opened recently. It only has one rating, but it’s 5 stars
• Restaurant \(B\)’s rating is 4.8, based on 5 reviews
• Restaurant \(C\) is rated 4.7, but has 40 reviews
• So, Restaurant \(A\) is clearly the best, right? (Discuss)

• Without making any assumptions or leaps of intuition, what specifically does the data tell you? It tells you only about itself, nothing more
• A single 5 star rating just means that whoever wrote that 1 review thought Restaurant \(A\) was really good
• Even if a restaurant has many 5 star reviews, that doesn’t necessarily mean you will like it
• When we interpret data like this, we intuitively make assumptions, generalizations, and predictions, maybe not even consciously
• Sometimes it’s just because we’re being lazy, like when we use stereotypes or make unnecessary assumptions to avoid having to think about things more carefully
• Sometimes it’s unavoidable, because we don’t have access to any more information and we must make a decision based on the data we have

• Statistical methods using the kind of probability models we’ve been talking about give us a systematic, rigorous way of interpreting data and drawing useful conclusions from it

• Example ratings:

```r
restaurantB <- c(5,4,5,5,5)
restaurantC <- sample(c(rep(5,31), rep(4,8), 1))
restaurantB
```
```
## [1] 5 4 5 5 5
```
```r
restaurantC
```
```
##  [1] 4 4 5 5 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 5 5 5 4 5 5 5 1 5 5 4 4 5
## [36] 5 5 5 5 5
```
```r
mean(restaurantB)
```
```
## [1] 4.8
```
```r
mean(restaurantC)
```
```
## [1] 4.7
```
```r
table(restaurantC)
```
```
## restaurantC
##  1  4  5
##  1  8 31
```
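
The table makes it easy to check the 4.7 by hand: the mean is a weighted average of the star values, weighted by how often each one occurs. A minimal sketch, with the counts copied from the table above (the variable names are made up for this example):

```r
stars  <- c(1, 4, 5)    # distinct ratings for Restaurant C
counts <- c(1, 8, 31)   # number of reviews with each rating (from the table)

# (1*1 + 4*8 + 5*31) / 40 = 188 / 40 = 4.7
weighted_mean <- sum(stars * counts) / sum(counts)
weighted_mean
```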
• Imagine Restaurant \(C\) only had 5 ratings. Which 5 might they be?
• Let’s draw a random sample from the 40 ratings that it has
• In fact, why not do that many times?
• We can see how often its rating based on a random sample of 5 reviews is at least as good as Restaurant \(B\)’s
```r
sample(restaurantC, 5, replace = TRUE)
```
```
## [1] 5 5 5 4 5
```
```r
rand_first5 <- data.frame(rating = replicate(1000, mean(sample(restaurantC, 5, replace = TRUE))))
ggplot(rand_first5, aes(rating)) + stat_count() + theme_tufte()
```
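
To put a number on "how often", one option is to count the share of resampled means that reach Restaurant B's 4.8. The sketch below is one way to do that, not necessarily how it was done in lecture. It hard-codes Restaurant C's ratings so it runs on its own, fixes a seed (an arbitrary choice) for reproducibility, and compares totals instead of means (a mean of at least 4.8 over 5 ratings is a total of at least 24) to sidestep floating-point comparisons. Because the draws are made with replacement, the exact probability can also be worked out: the total reaches 24 only with five 5s, or with four 5s and one 4.

```r
set.seed(1)  # arbitrary seed, only so the simulation is reproducible

ratings_c <- c(rep(5, 31), rep(4, 8), 1)  # Restaurant C's 40 ratings

# Total of 5 ratings drawn with replacement, repeated many times
totals <- replicate(10000, sum(sample(ratings_c, 5, replace = TRUE)))
prop_sim <- mean(totals >= 24)            # share of samples with mean >= 4.8

# Exact probability under sampling with replacement
p5 <- 31 / 40                             # chance a single draw is 5 stars
p4 <- 8 / 40                              # chance a single draw is 4 stars
prop_exact <- p5^5 + 5 * p5^4 * p4        # five 5s, or four 5s and one 4

c(simulated = prop_sim, exact = prop_exact)
```

Under these assumptions the exact value is about 0.64, so a 5-review version of Restaurant C would match or beat Restaurant B's rating roughly two times out of three.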