- Computing coefficients in R
- Predictions and errors
- Extrapolation

- Last time we learned how to calculate sample estimates \(\hat \beta_0\) and \(\hat \beta_1\) of the regression coefficients
- Use sample standard deviations and correlation to get the slope: \[ \hat \beta_1 = r \cdot \frac{s_y}{s_x} \]
Then use sample means and slope to get the intercept: \[ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]

Calculating two variable summary in

`R`

:

```
gm2007 <- filter(gapminder, year == "2007")
gm2007 %>% summarize(
meanx = mean(gdpPercap),
meany = mean(lifeExp),
sdx = sd(gdpPercap),
sdy = sd(lifeExp),
r = cor(lifeExp, gdpPercap)
)
```

```
## # A tibble: 1 x 5
## meanx meany sdx sdy r
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11680 67.0 12860 12.1 0.679
```

- Calculating regression coefficients from two variable summary

```
beta1 <- cor(gm2007$lifeExp, gm2007$gdpPercap) * sd(gm2007$lifeExp) / sd(gm2007$gdpPercap)
beta0 <- mean(gm2007$lifeExp) - beta1 * mean(gm2007$gdpPercap)
c(beta0, beta1)
```

`## [1] 59.5656500780 0.0006371341`

- Compare to coefficients calculated by
`lm`

```
gm_model <- lm(lifeExp ~ gdpPercap, data = gm2007)
gm_model
```

```
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gm2007)
##
## Coefficients:
## (Intercept) gdpPercap
## 59.5656501 0.0006371
```

- Each data point has an actual outcome \(y_i\), but we can also predict the outcome using the regression model
- Predicted values are written as: \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)
- In other words, plug the value of the predictor variable \(x\) into the linear equation
- The error or
**residual**is \(\hat e_i = y_i - \hat y_i\) We’ll go into more detail about residuals in a future lecture, but for now let’s see an example

What’s the predicted life expectancy for the United States?

```
gm2007 %>%
filter(country == "United States") %>%
select(gdpPercap, lifeExp)
```

```
## # A tibble: 1 x 2
## gdpPercap lifeExp
## <dbl> <dbl>
## 1 42952 78.2
```

`59.6 + 0.000637 * 42951`

`## [1] 86.95979`

- What’s the residual?

`78.24 - 86.96`

`## [1] -8.72`

- Plotting all the predicted values

```
gm2007$lifeExpPrediction <- predict(gm_model)
ggplot(gm2007, aes(gdpPercap, lifeExpPrediction)) +
geom_point() + theme_tufte()
```

- Plotting all the residuals

```
gm2007$errors <- residuals(gm_model)
qplot(gm2007$errors, bins = 20) + theme_tufte()
```

```
ggplot(gm2007, aes(gdpPercap, errors)) +
geom_point() +
geom_hline(yintercept = 0) +
theme_tufte()
```

- We’ll come back to plots of residuals when we discuss regression diagnostics

- Can use the regression line to get predictions for new points, ones that are not in our original data
- A new point may not even have the outcome variable, but we can predict it using just the predictor variable
But beware! It may not work as well…

- Suppose the population of Wakanda is about 12 million
- GDP: over 90 trillion (an estimate of T’Challa’s net worth)
- GDP per capita is then at least 7,500,000
Predicted life expectancy based on this calculation…

`59.6 + 0.000637 * (7500000)`

`## [1] 4837.1`

- It can be dangerous to predict for values
**outside the range**of the predictor in the data This is called

**extrapolation**and it will probably yield larger errors than when predicting on values inside the original data- Safer, and useful: predicting an outcome for a new point when the predictor variable lies
**within the range**seen in the original data. (e.g. for`gdpPercap`

between 277 and 49,357) Useful because we may not even have the actual outcome variable for a new data point. In this case, we can’t calculate a residual

- Plugging any value of \(x\) into \(\hat \beta_0 + \hat \beta_1 x\) gives a predicted outcome, but this may be a bad idea if \(x\) is outside the range of the data
Remember that the fitted regression line comes from an equation using

**estimates**\(\hat \beta_0, \hat \beta_1\)- A few things to think about:
- What would happen if we gathered a new sample of data and calculated the estimated coefficients again?
How accurate are the estimates? Does the accuracy depend on sample size?