Announcements

Recap

Parameters as “truth” or “population” values

Statistics

Interlude: normal distribution

n <- 20
Z <- rnorm(n, mean = true_mu) # I've hidden true_mu from you, it's an unknown parameter
Z
##  [1] -1.0352514  2.3692033  0.2054063  1.7285361  0.4189772  0.7611285
##  [7]  0.9860677  0.7708386  1.6217945  1.8319464  0.6873274  1.4826064
## [13]  1.0905233  2.0987816  0.7868562  0.3447476  1.2069459  0.6610843
## [19]  1.8859894  1.6652012
mean(Z)
## [1] 1.078436
Sampling distributions
  • The sample mean and other summary statistics are functions of the data
  • When we model them as random variables, the resulting distributions are called sampling distributions
  • We have data \(x_1, x_2, \ldots, x_n\) and a sample mean \(\bar x\)
  • We model the data as independent samples \(X_1, X_2, \ldots, X_n\) of a random variable \(X\), and write \(\mu = E[X]\) and \(\sigma^2 = \text{Var}(X)\).
  • Now consider the random variable \[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]

  • The distribution of \(\bar X\) is called the sampling distribution of the mean
  • Exercise: compute \(E[\bar X]\) (what property of expectation makes this easy?)
  • Next, use this fact: \(\text{Var}(cX) = c^2 \text{Var}(X)\) and compute \(\text{Var}(\bar X)\)
  • This is where the magic happens! \[ \text{Var}(\bar X) = \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^n X_i\right) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \sum_{i=1}^n \sigma^2 = \frac{\sigma^2}{n} \]
  • The first equality uses the fact from the previous point with \(c = 1/n\), the second uses additivity of variance for sums of independent random variables (variance is not linear in general), and the third uses the fact that the \(X_i\) all have the same distribution (hence the same variance)
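  • For \(n > 1\), the sample mean has less variance (or dispersion) than an individual measurement: its variance is smaller by a factor of \(n\), so its standard deviation is smaller by a factor of \(\sqrt{n}\)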
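  • A quick simulation can serve as a sanity check on \(\text{Var}(\bar X) = \sigma^2/n\). This is only a sketch: the values of mu, sigma, n, and reps below are illustrative choices, not anything from the course data, and the normal distribution is used purely for convenience.

```r
set.seed(1)     # arbitrary seed, only so the run is reproducible
mu    <- 0      # illustrative mean (an assumption, not the hidden true_mu)
sigma <- 2      # illustrative standard deviation
n     <- 10     # size of each sample
reps  <- 20000  # number of simulated samples

# Each replicate draws n iid observations and records their sample mean
xbars <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

var(xbars)   # empirical variance of the simulated sample means
sigma^2 / n  # theoretical variance of the sample mean
```

  • The two printed numbers should be close but not identical, because the empirical variance is itself computed from a finite number of simulated samples.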

  • Example: if \(Z_1, \ldots, Z_n \sim N(\mu, \sigma^2)\) are independent and identically distributed (iid) then \(\bar Z \sim N(\mu, \sigma^2/n)\)

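library(ggplot2)   # ggplot(), aes(), geom_density()
library(ggthemes)  # theme_tufte()

# Sample mean of n iid draws from N(true_mu, 2^2)
Zbar <- function(n) mean(rnorm(n, mean = true_mu, sd = 2))
# 5000 single draws, then 5000 sample means each for n = 10 and n = 30
df <- data.frame(Z = c(rnorm(5000, mean = true_mu, sd = 2),
                       replicate(5000, Zbar(10)),
                       replicate(5000, Zbar(30))),
                 Samples = factor(c(rep(1, 5000), rep(10, 5000), rep(30, 5000))))
ggplot(df, aes(Z, fill = Samples, linetype = Samples)) + geom_density(alpha = .2) + theme_tufte()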

  • Intuition: think of each individual sample as a noisy measurement of \(\mu\), the underlying truth. Sometimes the noise makes the sample larger than \(\mu\), and sometimes it makes it smaller. When we average many samples, the noise “cancels out”
  • By the way:
true_mu
## [1] 1.414214
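  • One way to watch the noise "cancel out" is to track the running mean as more samples arrive. The sketch below is an illustration under assumed settings: mu is set to the value of true_mu printed above, while sd = 2, the seed, and the 1000 draws are arbitrary choices.

```r
set.seed(2)                          # arbitrary seed for reproducibility
mu <- 1.414214                       # the value of true_mu printed above
z  <- rnorm(1000, mean = mu, sd = 2) # 1000 noisy measurements of mu

# Running mean: the average of the first k measurements, for k = 1, ..., 1000
running_mean <- cumsum(z) / seq_along(z)

# Distance from mu after 1, 10, 100, and 1000 measurements
abs(running_mean[c(1, 10, 100, 1000)] - mu)
```

  • The distances typically shrink as k grows, though not necessarily at every step, since each running mean is still random.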