- Suggested reading on course page
- Homework 5 will be posted soon
- Mid-course survey also coming soon
- Today: bias variance tradeoff

- Using an estimator \(\hat \theta\) of the parameter \(\theta\). Is this a good idea?
- The bias of \(\hat \theta\) is \[ \text{Bias}(\hat \theta) = E[\hat \theta - \theta] = E[\hat \theta] - \theta \]
- We say \(\hat \theta\) is **unbiased** for \(\theta\) if \(\text{Bias}(\hat \theta) = 0\).
- Last time we considered \(U_1, \ldots, U_n \sim U[0,\theta]\) and using the maximum \(U_{(n)}\) as an estimate of \(\theta\)
- For that example we found \(E[U_{(n)}] = \frac{n}{n+1}\theta < \theta\): the sample maximum *underestimates* the true maximum.
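A quick simulation sketch shows the same thing (the sample size, \(\theta\), and replication count below are arbitrary choices):

```
# Simulate the bias of the sample maximum as an estimator of theta
set.seed(1)
n <- 10
theta <- 1
maxima <- replicate(10000, max(runif(n, 0, theta)))
mean(maxima)  # consistently below theta = 1
```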

- How good is an estimator? One way to quantify this is the mean squared error \[ \text{MSE}(\hat \theta) = E[(\hat \theta - \theta)^2] \]
- If we *imagine* repeating the data collection many times and computing \(\hat \theta\) each time, then the MSE is the average squared distance of \(\hat \theta\) from the true value.
- Exercise: what is the MSE of \(\bar X\) as an estimator of \(\mu = E[X]\)?
- Solving an earlier mystery: why is \(n-1\) the denominator in the sample standard deviation, instead of \(n\)? \[ E\left[\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2\right] = \frac{n-1}{n} \text{Var}(X) \]
- Using \(n-1\) gives an *unbiased* estimate of the variance.
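A simulation sketch confirms the \(\frac{n-1}{n}\) factor (the choices of \(n\) and the replication count are arbitrary):

```
# Compare the 1/n and 1/(n-1) variance estimators on N(0,1) data
set.seed(2)
n <- 5
sims <- replicate(20000, {
  x <- rnorm(n)                # true variance is 1
  ss <- sum((x - mean(x))^2)
  c(ss / n, ss / (n - 1))
})
rowMeans(sims)  # first entry near (n-1)/n = 0.8, second near 1
```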

- With a bit of algebra (maybe do this on the board), we can show \[ \text{MSE}(\hat \theta) = \text{Bias}(\hat \theta)^2 + \text{Var}(\hat \theta) \]
- To make MSE small, we want to make bias small and also make variance small…
- Unfortunately it is not always possible to do both. There are limits.
- Consider using a constant \(c\) as an estimator: \(\text{Bias}(c) = c - \theta\) and \(\text{Var}(c) = 0\). Low variance, but high bias (remember \(\theta\) is unknown, so \(c = \theta\) is not an *estimator*).
- See the dartboard figure here (maybe draw on board)
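- The algebra behind the decomposition: add and subtract \(E[\hat \theta]\) inside the square, \[ E[(\hat \theta - \theta)^2] = E[(\hat \theta - E[\hat \theta] + E[\hat \theta] - \theta)^2] = E[(\hat \theta - E[\hat \theta])^2] + (E[\hat \theta] - \theta)^2 = \text{Var}(\hat \theta) + \text{Bias}(\hat \theta)^2 \] since the cross term \(2(E[\hat \theta] - \theta)\,E[\hat \theta - E[\hat \theta]]\) is zero.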

- Probability models don’t always have just one parameter; complicated situations may require more. Modern statistics often deals with *high-dimensional* problems with many parameters.
- Suppose we have \(p\) parameters to estimate, \(\theta_1, \theta_2, \ldots, \theta_p\).
- e.g. Normal \(N(\mu, \sigma^2)\). Then \(\theta = (\mu, \sigma^2)\) and \(\hat \theta = (\bar X, S^2)\).
- e.g. Multivariate normal, multinomial
- e.g. volatility parameters (e.g. \(\beta\)) for \(p\) different investments
- e.g. genetic effects for \(p\) genes on a given phenotype of interest, such as risk for a certain disease
- Can think of these as a point in \(p\)-dimensional space, called a vector \(\theta = (\theta_1, \ldots, \theta_p)\).
- MSE still makes sense: \[ \text{MSE}(\hat \theta) = \sum_{j=1}^p E[(\hat \theta_j - \theta_j)^2] \]
- Just add up the MSEs for each one.
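As a sketch in R (the dimension, true vector, and replication count below are arbitrary), the coordinatewise squared errors simply add up:

```
# MSE of the observation X ~ N(theta, I) in p dimensions
set.seed(3)
p <- 3
theta <- c(0, 1, 2)
est <- replicate(10000, rnorm(p, mean = theta))  # one column per simulated estimate
mse_each <- rowMeans((est - theta)^2)            # per-coordinate MSE, each near 1
sum(mse_each)                                    # total MSE, near p
```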

- Suppose the parameters of interest are the mean of a multivariate normal \(\theta = \mu = (\mu_1, \ldots, \mu_p)\)
- Suppose \(X\) is multivariate normal with mean \(\mu\). Since \(E[X] = \mu\), using \(X\) itself is unbiased. What about MSE?
- If \(p = 1\) or \(2\), then \(X\) has the lowest MSE…
- If \(p \geq 3\), then \(X\) no longer has the lowest MSE!
- This is sometimes called Stein’s paradox after Charles Stein.
- In **high dimensions**, it’s usually **better to be biased**
- If you find this very interesting, there is a classic example about baseball you can read about. Summary: when estimating many players’ batting averages in the next season, instead of using *each* player’s average from this season as their *own* estimate, it’s better to make *all* the estimates *biased* toward the *overall average* of the players.

```
p <- 100
# Positive-part James-Stein estimator: shrink the observation x toward 0
JS <- function(x) max(1 - (p - 2) / sum(x^2), 0) * x
# True mean vector: p coordinates sampled from 1, ..., 5
mu <- sample(1:5, p, replace = TRUE)
mu
```

```
## [1] 3 5 2 3 4 1 4 4 5 5 3 3 4 3 3 2 5 1 2 2 4 4 1 5 1 5 4 5 4 3 3 5 4 4 3
## [36] 5 1 3 2 5 4 1 2 5 1 2 5 1 3 1 4 4 1 2 4 3 2 4 1 5 5 5 1 1 4 3 4 2 5 3
## [71] 5 5 4 4 5 2 4 1 3 4 3 2 4 2 5 3 3 1 4 5 2 1 3 4 2 4 2 3 2 2
```

```
# Total squared error over 10000 simulated data sets:
# X = rnorm(p) + mu is the unbiased estimate, JS(X) is the shrunken one
SEs <- replicate(10000, sum(((rnorm(p) + mu) - mu)^2))
JSSEs <- replicate(10000, sum((JS(rnorm(p) + mu) - mu)^2))
mean(SEs)
```

```
## [1] 99.88886
```

```
mean(JSSEs)
```

```
## [1] 92.44216
```

- The theme of trading off between bias and variance is something we’ll come back to later in the course.