This question will guide you through a *simulation study* in `R` to understand the bias of a certain estimator.

First, define a function that computes the biased version of the sample variance, which divides by \(n\) instead of \(n - 1\):

`varn <- function(x) mean((x - mean(x))^2)`
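As a quick sanity check (this snippet is our addition, not part of the exercise), `varn` is just a rescaling of the built-in `var`:

```
# varn(x) equals (n - 1) / n times the unbiased sample variance var(x)
varn <- function(x) mean((x - mean(x))^2)
x <- c(2, 4, 4, 6)                    # any small sample works
n <- length(x)
all.equal(varn(x), (n - 1) / n * var(x))
```

`## [1] TRUE`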

The code below generates a sample of size \(n\) and calculates both the sample variance and the biased version of the sample variance (the one with \(n\) in the denominator instead of \(n - 1\)). Modify the code to change \(n\) from 10 to some other value between 10 and 50.

```
# Create a sample of data by rolling a 6-sided die n times
n <- 20
data <- sample(1:6, n, replace = TRUE)
data
```

`## [1] 6 1 4 1 5 2 4 1 5 6 1 1 4 4 1 5 1 4 1 4`

```
# True variance
35/12
```

`## [1] 2.916667`

```
# Unbiased estimate of variance
var(data)
```

`## [1] 3.628947`

```
# Biased estimate of variance
varn(data)
```

`## [1] 3.4475`

Repeat the previous experiment many times and approximate the expected value of each variance estimator by averaging over the repetitions.

```
# Unbiased estimator
mean(replicate(10000, var(sample(1:6, n, replace = TRUE))))
```

`## [1] 2.911705`

```
# Biased estimator
mean(replicate(10000, varn(sample(1:6, n, replace = TRUE))))
```

`## [1] 2.76472`
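The simulated average of the biased estimator lines up with theory: since \(E[\texttt{varn}] = \frac{n-1}{n}\sigma^2\), with \(n = 20\) we expect

```
# Theoretical expected value of the biased estimator at n = 20:
# E[varn] = (n - 1) / n * sigma^2
n <- 20
sigma2 <- 35 / 12
(n - 1) / n * sigma2
```

`## [1] 2.770833`

which is close to the simulated average of about 2.76 above.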

After running the previous code, which estimator’s average value in the simulation is closer to the true value (roughly 2.9167)?

The **unbiased estimator**’s average value is closer to the true value.

Now go back and change n to another, larger value, and rerun all of the code. Do you notice anything about the average of the biased estimator?

`mean(replicate(10000, varn(sample(1:6, n + 10, replace = TRUE))))`

`## [1] 2.81582`

With a larger sample size, the biased estimator’s average value is now closer to the true value.
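To see this convergence directly, one could sweep over several sample sizes in a single run (a sketch we added; the exercise only asks you to change \(n\) by hand). Since the bias of `varn` is \(-\sigma^2/n\), the averages should climb toward \(35/12 \approx 2.9167\) as \(n\) grows:

```
# Average of the biased estimator for increasing n (sketch)
varn <- function(x) mean((x - mean(x))^2)
ns <- c(10, 20, 50, 100)
avgs <- sapply(ns, function(n) {
  mean(replicate(10000, varn(sample(1:6, n, replace = TRUE))))
})
data.frame(n = ns, average = round(avgs, 4))
```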

Copy the code from part (b) and paste it below, then change the `mean` function to `sd` instead.

```
# Unbiased estimator
sd(replicate(10000, var(sample(1:6, n, replace = TRUE))))
```

`## [1] 0.5904201`

```
# Biased estimator
sd(replicate(10000, varn(sample(1:6, n, replace = TRUE))))
```

`## [1] 0.5631186`

The biased estimator has lower variability (as measured by standard deviation).
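Lower variability alone doesn't settle which estimator is "better"; a common summary that combines bias and spread is the mean squared error, \(E[(\hat\theta - \theta)^2]\). The comparison below is our addition, not part of the exercise:

```
# Estimated mean squared error (MSE) of each estimator at n = 20
varn <- function(x) mean((x - mean(x))^2)
n <- 20
true_var <- 35 / 12
unbiased <- replicate(10000, var(sample(1:6, n, replace = TRUE)))
biased   <- replicate(10000, varn(sample(1:6, n, replace = TRUE)))
mean((unbiased - true_var)^2)   # MSE of the unbiased estimator
mean((biased - true_var)^2)     # MSE of the biased estimator
```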

According to a Marketplace/Edison survey in April 2017, about 23.4% of survey respondents agreed with the statement “the economic system in the U.S. is fair to all Americans.” In this question we’ll use a Bernoulli probability model to analyze this number. Suppose that there were 1,000 survey respondents and 234 agreed with the statement above. Define a Bernoulli random variable which is 1 if a person agrees and 0 otherwise. Assume the survey was done with independent sampling (with replacement), so these Bernoulli random variables are independent. Then the number of people in the sample of 1,000 who agree is a Binomial random variable.

- We have \(X_i\) i.i.d Ber(\(p\)) for \(i = 1, \ldots, 1000\).
- Let \(S_n = \sum_{i=1}^n X_i\), so \(S_n\) is Bin(\(n, p\)).

Using the fact that \(n \bar X_n = S_n\), how could you use the Binomial distribution to calculate \(P(a \leq \bar X_n \leq b)\)? How would you use `pbinom` with the given values of \(a, b, n\), and \(p\)?

`pbinom(n*b, size = n, prob = p) - pbinom(n*a, size = n, prob = p)`

(Note: since `pbinom(k, ...)` returns \(P(S_n \leq k)\), this difference is \(P(na < S_n \leq nb)\); to include the lower endpoint \(S_n = na\) exactly, you would subtract `pbinom(n*a - 1, size = n, prob = p)` instead.)

Instead of the Binomial distribution, how would we use the central limit theorem to calculate the same probability? Hint: your answer should use `pnorm` and involve \(\sqrt{n}\) (and \(a\), \(b\), and \(p\)).

**Solution**: The CLT says \(\bar X_n\) is approximately normal with mean \(p\) and standard deviation \(\sqrt{p(1-p)/n}\). In `R` we could do:

`pnorm(b, mean = p, sd = sqrt(p*(1-p)/n)) - pnorm(a, mean = p, sd = sqrt(p*(1-p)/n))`

Now let \(n = 1000\), \(p = 0.234\), \(a = 0.239\), and \(b = 0.250\), and compute the desired probability with both methods.

```
n <- 1000
p <- 0.234
a <- 0.239
b <- 0.250
pbinom(n*b, size = n, prob = p) - pbinom(n*a, size = n, prob = p)
```

`## [1] 0.2291025`

`pnorm(b, mean = p, sd = sqrt(p*(1-p)/n)) - pnorm(a, mean = p, sd = sqrt(p*(1-p)/n))`

`## [1] 0.2383744`
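The two answers differ because the CLT is only an approximation. A standard refinement (not asked for in the exercise) is the continuity correction: the `pbinom` line computes \(P(na < S_n \leq nb)\) on the integer scale, so we approximate it with the normal probability of the interval \((na + 0.5,\, nb + 0.5)\):

```
# Continuity-corrected normal approximation (a standard refinement;
# matches the P(n*a < S_n <= n*b) convention of the pbinom line above)
n <- 1000
p <- 0.234
a <- 0.239
b <- 0.250
sd_S <- sqrt(n * p * (1 - p))   # sd of S_n = n * Xbar_n
pnorm(n * b + 0.5, mean = n * p, sd = sd_S) -
  pnorm(n * a + 0.5, mean = n * p, sd = sd_S)
```

This lands noticeably closer to the exact Binomial answer than the uncorrected `pnorm` calculation.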