Review

Preview
  • We’ve talked about Bernoulli and Binomial distributions, but there are many more useful ones
  • Here’s just a quick preview of a few of them
library(ggplot2)   # ggplot(), geom_bar(), ggtitle()
library(ggthemes)  # theme_tufte()

df <- data.frame(x = c(25:75),
                 y = dhyper(25:75, 100, 100, 100))
ggplot(df, aes(x, y)) + 
  geom_bar(stat = "identity", position = "identity") + 
  theme_tufte() + ggtitle("Hypergeometric distribution with 100 draws, 100 successes and 100 failures")

df <- data.frame(x = c(0:50),
                 y = dgeom(0:50, .1))
ggplot(df, aes(x, y)) + 
  geom_bar(stat = "identity", position = "identity") + 
  theme_tufte() + ggtitle("Geometric distribution with p = 1/10")

df <- data.frame(x = c(0:80),
                 y = dnbinom(0:80, 5, 1/5))
ggplot(df, aes(x, y)) + 
  geom_bar(stat = "identity", position = "identity") + 
  theme_tufte() + ggtitle("Negative Binomial distribution with p = 1/5, r = 5 successes")

df <- data.frame(x = c(0:25),
                 y = dpois(0:25, 10))
ggplot(df, aes(x, y)) + 
  geom_bar(stat = "identity", position = "identity") + 
  theme_tufte() + ggtitle(expression(paste("Poisson distribution with ", lambda, " = 10")))

When should you trust a number?
  • Suppose you’re looking for a restaurant nearby, and checking the ratings on your favorite app
  • The ratings are on a scale of 1 to 5 stars
  • Restaurant \(A\) opened recently. It only has one rating, but it’s 5 stars
  • Restaurant \(B\)’s rating is 4.8, based on 5 reviews
  • Restaurant \(C\) is rated 4.7, but has 40 reviews
  • So, Restaurant \(A\) is clearly the best, right? (Discuss)

  • Without making any assumptions or leaps of intuition, what specifically does data tell you? It only tells you about itself, nothing more
  • A single 5-star rating just means that whoever wrote that one review thought Restaurant \(A\) was really good
  • Even if a restaurant has many 5-star reviews, that doesn’t necessarily mean you will like it
  • When we interpret data like this, we intuitively make assumptions, generalizations, and predictions, maybe not even consciously
  • Sometimes it’s just because we’re being lazy, like when we use stereotypes or make unnecessary assumptions to avoid having to think about things more carefully
  • Sometimes it’s unavoidable, because we don’t have access to any more information and we must make a decision based on the data we have

  • Statistical methods using the kind of probability models we’ve been talking about give us a systematic, rigorous way of interpreting data and drawing useful conclusions from it

  • Example ratings:

restaurantB <- c(5,4,5,5,5)
restaurantC <- sample(c(rep(5,31), rep(4,8), 1))
restaurantB
## [1] 5 4 5 5 5
restaurantC
##  [1] 4 4 5 5 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 5 5 5 4 5 5 5 1 5 5 4 4 5
## [36] 5 5 5 5 5
mean(restaurantB)
## [1] 4.8
mean(restaurantC)
## [1] 4.7
table(restaurantC)
## restaurantC
##  1  4  5 
##  1  8 31
  • Imagine Restaurant \(C\) only had 5 ratings. Which 5 might they be?
  • Let’s draw a random sample of 5 from the 40 ratings that it has
  • In fact, why not do that many times?
  • We can see how often its rating based on a random sample of 5 reviews is at least as good as Restaurant \(B\)’s
sample(restaurantC, 5, replace = TRUE)
## [1] 5 5 5 4 5
rand_first5 <- data.frame(rating = replicate(1000, mean(sample(restaurantC, 5, replace = TRUE))))
ggplot(rand_first5, aes(rating)) + stat_count() + theme_tufte()
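
The histogram shows the spread of resampled ratings, but not the number the bullets ask for: how often a 5-review sample from Restaurant \(C\) does at least as well as Restaurant \(B\). A minimal sketch of that calculation follows. The names `n_sims`, `resampled_totals`, `prop_at_least_B` and `exact_prob` are made up for illustration, and the seed is arbitrary.

```r
# Assumption: same ratings as in the example above (order does not matter here)
restaurantB <- c(5, 4, 5, 5, 5)
restaurantC <- c(rep(5, 31), rep(4, 8), 1)

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Total stars in each resample of 5 ratings (with replacement, as above).
# Comparing totals instead of means avoids floating-point equality issues.
n_sims <- 10000
resampled_totals <- replicate(n_sims, sum(sample(restaurantC, 5, replace = TRUE)))

# Share of resamples whose average is at least Restaurant B's 4.8
prop_at_least_B <- mean(resampled_totals >= sum(restaurantB))
prop_at_least_B

# Exact value for comparison: a total of 24 or more needs five 5s,
# or four 5s and one 4 (in any of the 5 positions)
p5 <- 31 / 40
p4 <- 8 / 40
exact_prob <- p5^5 + 5 * p5^4 * p4
exact_prob  # about 0.64
```

If this proportion is large, a 4.8 from only 5 reviews is quite consistent with a restaurant whose 40 reviews average 4.7, so Restaurant \(B\)’s higher average on its own is weak evidence that it is the better choice.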