- Will start posting suggested reading on course page
- Midterms returned tomorrow
- Homework 5 will be posted soon
- Today: parameter estimation, bias

The physicist Albert Michelson conducted experiments to measure the speed of light. Here is a dataset of 100 of his measurements:

`morley$Speed`

```
## [1] 850 740 900 1070 930 850 950 980 980 880 1000 980 930 650
## [15] 760 810 1000 1000 960 960 960 940 960 940 880 800 850 880
## [29] 900 840 830 790 810 880 880 830 800 790 760 800 880 880
## [43] 880 860 720 720 620 860 970 950 880 910 850 870 840 840
## [57] 850 840 840 840 890 810 810 820 800 770 760 740 750 760
## [71] 910 920 890 860 880 720 840 850 850 780 890 840 780 810
## [85] 760 810 790 810 820 850 870 870 810 740 810 940 950 800
## [99] 810 870
```

`ggplot(morley, aes(Speed + 299000)) + geom_histogram(bins = 16) + theme_tufte()`

To keep the number of digits manageable, we work with just the part above 299,000 km/s, i.e. the last three digits (which is how `morley$Speed` is recorded). These have mean and variance:

```
Xbar = mean(morley$Speed)
sigma2 = var(morley$Speed)
Xbar
```

`## [1] 852.4`

`sigma2`

`## [1] 6242.667`

`Xbar - 2*sqrt(sigma2/100)`

`## [1] 836.5979`

`Xbar + 2*sqrt(sigma2/100)`

`## [1] 868.2021`

- This means that if Michelson redid the same experiments and collected 100 measurements again, his new \(\bar X\) would probably fall within the range above
- But does it mean the true speed of light is somewhere in that range?…
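A quick simulation can make the first claim concrete. The sketch below uses R's built-in `morley` dataset; the resampling approach and the seed are my additions, not from the lecture. It redraws 100 measurements with replacement many times and checks how often the new sample mean lands in the interval computed above.

```r
# Bootstrap sketch: how often does a fresh sample mean land in Xbar +/- 2*SE?
speed <- morley$Speed
Xbar <- mean(speed)
se <- sqrt(var(speed) / length(speed))
lo <- Xbar - 2 * se
hi <- Xbar + 2 * se
set.seed(1)  # arbitrary seed for reproducibility
new_means <- replicate(10000, mean(sample(speed, replace = TRUE)))
mean(new_means >= lo & new_means <= hi)  # roughly 0.95 in repeated runs
```

Resampling from the data is only a stand-in for actually redoing the experiment, but it illustrates why roughly 95% of repeated \(\bar X\) values should fall in a \(\pm 2\) standard error window.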

- As we start thinking about parameters in general, rather than specifically the mean (\(\mu\)), we’ll use \(\theta\) to denote the unknown parameter, a representation of the underlying truth
- We can’t measure \(\theta\) perfectly, there is some variability or randomness causing us to get different answers each time we take the measurement
- We collect data and use the plug-in principle, constructing an **estimate** \(\hat \theta\) from the data
- Is \(\hat \theta\) close to the true \(\theta\)? How close? And how sure can we be about it?

- Last time we considered the specific example of estimating the mean \(\mu\)
- We take \(n\) measurements and use the mean of that sample, \(\bar x\), as an **estimate** of \(\mu\)
- To gain insight about how well this works, we need a mathematical model
- Think of the individual measurements as being samples of a random variable \(X\), and the mean of \(n\) of these as a random variable denoted \(\bar X\)
- We assume the individual measurements are independent and identically distributed (iid)
- In that case, \(E[\bar X] = E[X]\) and \(\text{Var}(\bar X) = \text{Var}(X)/n\)
- If we repeated the experiment many times and plotted a histogram of all the estimates, they would be centered at \(E[X]\), and their dispersion would be smaller than the dispersion of \(X\) – how much smaller depends on the sample size
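These two facts are easy to check by simulation. Here is a sketch of my own (the exponential distribution and the values of \(n\) and the repetition count are arbitrary choices, not from the lecture):

```r
# Simulate many repetitions of "take n measurements, compute the mean"
# and compare E[Xbar] and Var(Xbar) with the theory.
set.seed(1)  # arbitrary seed for reproducibility
n <- 25
means <- replicate(10000, mean(rexp(n, rate = 1)))  # E[X] = 1, Var(X) = 1
mean(means)  # should be close to E[X] = 1
var(means)   # should be close to Var(X)/n = 1/25 = 0.04
```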

- In general, if we have an estimator \(\hat \theta\) we might wonder what would happen if we repeated the experiment many times and plotted a histogram of the values of \(\hat \theta\) we got each time
- We learned about the *sampling distribution of the mean*, but what about the sampling distribution of another estimator? And if \(\theta\) is a parameter other than the population mean \(\mu\), how can we choose a good \(\hat \theta\) to estimate it?

- The uniform distribution \(U[0,\theta]\) is a probability distribution on the interval from 0 to \(\theta\) with a constant density: every subinterval of a given length is equally likely to contain the outcome
- Evenly distributed, flat, not like a bell curve or binomial
- The maximum possible value that it can take on is \(\theta\), this is the parameter we want to estimate
- Does it make sense to use the mean of a sample to estimate \(\theta\)?…
- (No, the mean will be near the center, not near the maximum!)

```
n <- 100
theta <- 2.2  # hypothetical value for the unknown parameter (not shown in the original)
X <- runif(n, min = 0, max = theta)
qplot(X, bins = 10) + theme_tufte()
```

- Use the maximum value of the sample! \(x_{(n)} = \max \{ x_1, \ldots, x_n \}\)

`max(X)`

`## [1] 2.102546`

- Does this work? Well, we need a mathematical model to answer that question
- Sampling distribution of the maximum \(X_{(n)} = \max \{ X_1, \ldots, X_n \}\)
- Can think of this as a random variable representing the variability that would happen if we took a new sample and found its maximum, and then repeated that many times

```
Xn <- replicate(1000, max(runif(n, min = 0, max = theta)))
qplot(Xn, bins = 40) + theme_tufte()
```

Interesting! This is not shaped like a bell curve at all, is it?

Is it centered at the true value \(\theta\)?…

`qplot(Xn - theta, bins = 40) + theme_tufte()`

- Seems like it’s always smaller than \(\theta\)…

- We say that \(\hat \theta\) is an **unbiased** estimate of \(\theta\) if \(E[\hat \theta] = \theta\)
- For the \(U[0,\theta]\) example, it turns out (it takes some calculus to show) that \[ E[X_{(n)}] = \frac{n}{n+1} \theta \]
- This is not unbiased. We say it’s a biased estimate.
- But we know how to fix it! Consider \(\hat \theta = \frac{n+1}{n} X_{(n)}\)
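The calculus being alluded to is short; here is a sketch. For \(0 \le x \le \theta\), the maximum is at most \(x\) exactly when all \(n\) independent draws are, so

\[ P(X_{(n)} \le x) = \left( \frac{x}{\theta} \right)^n \quad \Rightarrow \quad f_{X_{(n)}}(x) = \frac{n x^{n-1}}{\theta^n} \]

and therefore

\[ E[X_{(n)}] = \int_0^\theta x \cdot \frac{n x^{n-1}}{\theta^n} \, dx = \frac{n}{n+1} \theta \]

Multiplying by \(\frac{n+1}{n}\) then gives an estimator whose expectation is exactly \(\theta\).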

`qplot(Xn*(n+1)/n - theta, bins = 40) + theme_tufte()`

`mean(Xn) - theta`

`## [1] -0.02117188`

`mean(Xn*(n+1)/n) - theta`

`## [1] -0.0002032562`

- We were able to fix this example, but in general it is not always possible to remove bias
- Bias can be a very serious problem!

- Consider the sample mean example again
- We know \(E[\bar X] = E[X]\), and \(\text{Var}(\bar X) = \text{Var}(X)/n\)
- The fact that the variance decreases in \(n\) is what makes gathering more data useful! It means \(\bar X\) will be concentrated around \(E[X]\)…
- But what if \(E[X] \neq \mu\)?!
- How could this happen? Think of some examples. Usually it means there is a tendency to either overestimate or underestimate.

- Suppose that instead of the mean, the parameter we want to estimate is the variance \(\theta = \sigma^2\)
- Remember the sample variance formula (whose square root is the sample standard deviation)? It has \(n-1\) instead of \(n\) in the denominator. Why is that?
- With some algebra, it can be shown \[ E\left[\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 \right] = \frac{n-1}{n} \sigma^2 \]
- Since this is \(< \sigma^2\), the \(1/n\) formula underestimates the variance
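This, too, can be checked by simulation. The following sketch is my own illustration (normal data with true variance 4 and \(n = 10\) are arbitrary choices); it compares the \(1/n\) formula with R's `var()`, which uses the \(n-1\) denominator:

```r
# Compare the biased (1/n) and unbiased (1/(n-1)) variance formulas
# on many samples from a distribution with known variance.
set.seed(1)  # arbitrary seed for reproducibility
n <- 10
biased <- replicate(20000, { x <- rnorm(n, sd = 2); mean((x - mean(x))^2) })
unbiased <- replicate(20000, var(rnorm(n, sd = 2)))
mean(biased)    # close to (n-1)/n * 4 = 3.6, i.e. too small
mean(unbiased)  # close to the true variance, 4
```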

- Around the 2010’s, the “big data” trend was really taking off
- People believed the datasets used by modern internet companies were so large that, basically, data = truth
- No need for a model, no need to worry about variability or error, just assume \(\bar x = \mu\) because the sample is the whole population
- Not true!
- Always remember this: even if the *variance* of an estimator decreases as the sample size \(n\) increases, **bias is a separate issue**
- No sample size is large enough to remove bias; removing it requires understanding *why* the estimate is not centered at the truth
- e.g. why does it tend to overestimate, and by how much?
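To make the last point concrete, here is a sketch of my own: a hypothetical measuring device that always reads 5 units too high (the values of \(\mu\), the offset, and the noise level are all invented for illustration). Increasing \(n\) shrinks the variance of \(\bar X\), but \(\bar X\) concentrates around \(\mu + 5\), not \(\mu\).

```r
# A biased measurement process: a systematic offset of +5 on every reading.
# More data makes Xbar more precise, but not more correct.
set.seed(1)  # arbitrary seed for reproducibility
mu <- 100    # hypothetical true value
offset <- 5  # hypothetical systematic error
for (n in c(100, 10000)) {
  xbar <- mean(mu + offset + rnorm(n, sd = 10))
  cat("n =", n, " Xbar - mu =", xbar - mu, "\n")  # stays near 5 as n grows
}
```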