- Homework 4 to be posted
- Exam study guide, homework solution guides – to be posted
- Today: random samples

- We’ve talked about Bernoulli and Binomial distributions, but there are many more useful ones
- Here’s just a quick preview of a few of them

```
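library(ggplot2)
library(ggthemes)  # provides theme_tufte()
# dhyper(x, m, n, k): m = 100 successes, n = 100 failures, k = 100 draws
df <- data.frame(x = 25:75,
                 y = dhyper(25:75, 100, 100, 100))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() +
  ggtitle("Hypergeometric distribution with 100 draws, 100 successes and 100 failures")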
```

```
# dgeom(x, prob): probability of x failures before the first success
df <- data.frame(x = 0:50,
                 y = dgeom(0:50, .1))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() +
  ggtitle("Geometric distribution with p = 1/10")
```

```
# dnbinom(x, size, prob): probability of x failures before the size-th success
df <- data.frame(x = 0:80,
                 y = dnbinom(0:80, 5, 1/5))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() +
  ggtitle("Negative Binomial distribution with p = 1/5, size = 5")
```

```
# dpois(x, lambda): probability of x events when the mean count is lambda
df <- data.frame(x = 0:25,
                 y = dpois(0:25, 10))
ggplot(df, aes(x, y)) +
  geom_bar(stat = "identity", position = "identity") +
  theme_tufte() +
  ggtitle(expression(paste("Poisson distribution with ", lambda, " = 10")))
```
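As a quick sanity check on the four plots above, the sketch below adds up the plotted probabilities for each distribution. If the plotted range covers nearly all of the distribution, each sum should be close to 1. The vector name `coverage` and its element names are made up for this illustration; the ranges and parameters are copied from the plotting code.

```r
# Ranges and parameters copied from the plotting code above.
# `coverage` is a hypothetical name used only for this sketch.
coverage <- c(
  hypergeometric    = sum(dhyper(25:75, 100, 100, 100)),
  geometric         = sum(dgeom(0:50, 0.1)),
  negative_binomial = sum(dnbinom(0:80, 5, 1/5)),
  poisson           = sum(dpois(0:25, 10))
)

# Share of total probability shown in each plot
round(coverage, 4)
```

A sum noticeably below 1 would mean the plot cuts off part of the distribution, for example a long right tail.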

- Suppose you’re looking for a restaurant nearby, and checking the ratings on your favorite app
- The ratings are on a scale of 1 to 5 stars
- Restaurant \(A\) opened recently. It only has one rating, but it’s 5 stars
- Restaurant \(B\)’s rating is 4.8, based on 5 reviews
- Restaurant \(C\) is rated 4.7, but has 40 reviews
- So, Restaurant \(A\) is clearly the best, right? (Discuss)

- Without making any assumptions or leaps of intuition, what *specifically* does data tell you? It only tells you about itself, nothing more
- A single 5 star rating just means that whoever wrote that 1 review thought Restaurant \(A\) was really good
- Even if a restaurant has many 5 star reviews, that doesn’t necessarily mean *you* will like it
- When we interpret data like this, we intuitively make assumptions, generalizations, and predictions, maybe not even consciously
- Sometimes it’s just because we’re being lazy, like when we use stereotypes or make unnecessary assumptions to avoid having to think about things more carefully
- Sometimes it’s unavoidable, because we don’t have access to any more information and we must make a decision based on the data we have

Statistical methods using the kind of probability models we’ve been talking about give us a systematic, rigorous way of interpreting data and drawing useful conclusions from it

Example ratings:

```
restaurantB <- c(5,4,5,5,5)
restaurantC <- sample(c(rep(5,31), rep(4,8), 1))
restaurantB
```

`## [1] 5 4 5 5 5`

`restaurantC`

```
## [1] 4 4 5 5 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 5 5 5 4 5 5 5 1 5 5 4 4 5
## [36] 5 5 5 5 5
```

`mean(restaurantB)`

`## [1] 4.8`

`mean(restaurantC)`

`## [1] 4.7`

`table(restaurantC)`

```
## restaurantC
## 1 4 5
## 1 8 31
```

- Imagine Restaurant \(C\) only had 5 ratings. Which 5 might they be?
- Let’s draw a random sample of 5 from the 40 ratings that it has
- In fact, why not do that many times?
- We can see how often its rating based on a random sample of 5 reviews is at least as good as Restaurant \(B\)

`sample(restaurantC, 5, replace = TRUE)`

`## [1] 5 5 5 4 5`

```
rand_first5 <- data.frame(rating = replicate(1000, mean(sample(restaurantC, 5, replace = TRUE))))
ggplot(rand_first5, aes(rating)) + stat_count() + theme_tufte()
```
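To put a number on how often a random sample of 5 of Restaurant \(C\)’s ratings does at least as well as Restaurant \(B\), here is a minimal sketch. It assumes the same sampling with replacement as above. The seed, the number of replications (10000), and the names `totals`, `simulated`, `p5`, `p4`, and `exact` are choices made for this illustration only.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

restaurantB <- c(5, 4, 5, 5, 5)
restaurantC <- c(rep(5, 31), rep(4, 8), 1)

# Compare totals rather than means: with 5 ratings each, a mean of at
# least 4.8 is the same as a total of at least 24, and integer totals
# avoid floating-point equality problems.
totals <- replicate(10000, sum(sample(restaurantC, 5, replace = TRUE)))
simulated <- mean(totals >= sum(restaurantB))

# Exact value under sampling with replacement: a total of at least 24
# requires either five 5s, or four 5s and one 4.
p5 <- 31 / 40
p4 <- 8 / 40
exact <- p5^5 + 5 * p5^4 * p4

c(simulated = simulated, exact = exact)
```

The exact probability works out to about 0.64, and the simulated proportion should land close to it. So under this resampling scheme, Restaurant \(C\) would look at least as good as Restaurant \(B\) roughly two times out of three if it had only 5 reviews.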