- Equations for a line
- Coefficient estimation
- Predictions, errors, extrapolation
- Coefficient interpretation

- What is covariance? Correlation?
- How is a regression line different from these?
- What do they have in common?
- Which variable is the outcome?
- How are the errors measured?
- What's special about the regression line?
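As a quick refresher before answering these, covariance and correlation can be computed directly from their definitions. A minimal sketch in base R, using toy vectors invented for illustration:

```
# Toy paired data (hypothetical values, just for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Covariance: average co-movement of x and y around their means
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Correlation: covariance rescaled to [-1, 1] by the standard deviations
cor_xy <- cov_xy / (sd(x) * sd(y))

c(covariance = cov_xy, correlation = cor_xy)
```

These match the built-in `cov()` and `cor()` functions.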

Since linear regression uses straight lines, it will help to remember these facts about them:

- Equation for a line using slope and intercept: \(y = mx + b\)
- We use different notation: \(y = \beta_0 + \beta_1 x\)
- \(\beta_0\) and \(\beta_1\) are often called **coefficients** (in computer science: weights)
- Suppose \((x_0, y_0)\) is a point on the line, so \(y_0 = \beta_0 + \beta_1 x_0\)
- Equation for a line using slope and point: \(y - y_0 = \beta_1(x - x_0)\)

- Slope measures change in \(y\) when \(x\) changes by one unit
- If \(x\) changes from \(x_0\) to \(x_0 + 1\), then \(y\) changes from \(y_0\) to \(y_0 + \beta_1\)
- This fact can be used to draw the line
- Another strategy for drawing: find two points on the line and connect them. We'll come back to this later

```
range <- data.frame(x = c(-2, 0), y = c(0, 1))
ggplot(range, aes(x, y)) +
  ylim(c(-3, 3)) + xlim(c(-3, 3)) +
  geom_point(size = 2) +
  geom_abline(slope = 1/2, intercept = 1, color = "blue") +
  geom_abline(slope = -1, intercept = -1, linetype = 2) +
  theme_minimal()
```

- Data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\); assuming \(X\) and \(Y\) are continuous, these are \(n\) points
- Data usually doesn't fit exactly on a line (unless the *correlation* equals…?)
- So we allow **errors** in the equation for the line \[ y_i = \beta_0 + \beta_1 x_i + e_i \]

- Everything we've learned so far about **estimation** applies here!
- Goal: use the data to calculate sample estimates of the coefficients: \(\hat \beta_0\) and \(\hat \beta_1\)
- Previously we saw some regression lines calculated for us by the `lm` function in `R`
- How does that work? How can we do it ourselves?

```
gm2007 <- filter(gapminder, year == 2007)
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() + theme_minimal() +
  ggtitle("How can we compute the regression line?")
```

- We learned that the slope \(\beta_1\) has the same *sign* as the correlation between \(x\) and \(y\) (both are positive or both are negative)
- Let \(r = \text{cor}(x, y)\), let \(s_x\) be the sample standard deviation of \(x\), and likewise \(s_y\) for \(y\)
- The exact relationship is this: \[ \hat \beta_1 = r \cdot \frac{s_y}{s_x} \]
- **This is an important relationship to remember**; we'll come back to it when we think about interpretation

- Now that we know the slope, if we knew a point on the line then we could use the slope + point equation for a line
- Let \(\bar x\) and \(\bar y\) be the sample means of the \(x\) and \(y\) variables
- The estimated regression line always passes through one interesting point: the mean of the data \((\bar x, \bar y)\)
- So if \(\hat \beta_0\) and \(\hat \beta_1\) are the sample linear regression coefficients, we know \[ \bar y = \hat \beta_0 + \hat \beta_1 \bar x \]
- We can use this to calculate \(\hat \beta_0\) if we already know \(\hat \beta_1\) and the means: \[ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]
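Putting the two formulas together, we can check that they reproduce `lm`'s coefficients exactly. A small sketch using simulated data (the numbers here are made up, base R only):

```
# Simulated data: true intercept 2, true slope 3, plus noise
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# Slope from the correlation and standard deviations
beta1_hat <- cor(x, y) * sd(y) / sd(x)

# Intercept from the fact that the line passes through the means
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Compare with the coefficients lm() computes
fit <- lm(y ~ x)
rbind(formulas = c(beta0_hat, beta1_hat), lm = coef(fit))
```

The two rows agree up to floating-point precision.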

```
gm2007 %>% summarize(
  meanx = mean(gdpPercap),
  meany = mean(lifeExp),
  sdx = sd(gdpPercap),
  sdy = sd(lifeExp),
  r = cor(lifeExp, gdpPercap)
)
```

```
## # A tibble: 1 x 5
##   meanx meany   sdx   sdy     r
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11680  67.0 12860  12.1 0.679
```

- Calculating the regression coefficients from the two-variable summary statistics

```
beta1 <- cor(gm2007$lifeExp, gm2007$gdpPercap) *
  sd(gm2007$lifeExp) / sd(gm2007$gdpPercap)
beta0 <- mean(gm2007$lifeExp) - beta1 * mean(gm2007$gdpPercap)
c(beta0, beta1)
```

`## [1] 59.5656500780 0.0006371341`

- Compare to the coefficients calculated by `lm`

```
gm_model <- lm(lifeExp ~ gdpPercap, data = gm2007)
gm_model
```

```
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gm2007)
##
## Coefficients:
## (Intercept) gdpPercap
## 59.5656501 0.0006371
```

- Each data point has an actual outcome \(y_i\), but we can also predict the outcome using the regression model
- Predicted values are written as: \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)
- In other words, plug the value of the predictor variable \(x\) into the linear equation
- The error or
**residual**is \(\hat e_i = y_i - \hat y_i\) Weâ€™ll go into more detail about residuals in a future lecture, but for now letâ€™s see an example
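Before the gapminder example, here is a self-contained sketch of predictions and residuals on simulated data (toy values, base R only):

```
# Simulated data: true intercept 5, true slope 0.5, plus noise
set.seed(2)
x <- runif(30, 0, 10)
y <- 5 + 0.5 * x + rnorm(30)

fit <- lm(y ~ x)

# Predicted values: plug each x_i into the fitted line
y_hat <- coef(fit)[1] + coef(fit)[2] * x

# Residuals: actual outcome minus predicted outcome
e_hat <- y - y_hat

summary(e_hat)
```

The hand-computed `y_hat` and `e_hat` match `fitted(fit)` and `residuals(fit)`, and the residuals of a least-squares fit always sum to (essentially) zero.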

What's the predicted life expectancy for the United States?

```
gm2007 %>%
  filter(country == "United States") %>%
  select(gdpPercap, lifeExp)
```

```
## # A tibble: 1 x 2
##   gdpPercap lifeExp
##       <dbl>   <dbl>
## 1     42952    78.2
```

`59.6 + 0.000637 * 42951`

`## [1] 86.95979`

- What's the residual?

`78.24 - 86.96`

`## [1] -8.72`

- Plotting all the predicted values

```
gm2007$lifeExpPrediction <- predict(gm_model)
ggplot(gm2007, aes(gdpPercap, lifeExpPrediction)) +
  geom_point() + theme_tufte()
```

- Plotting all the residuals

```
gm2007$errors <- residuals(gm_model)
qplot(gm2007$errors, bins = 20) + theme_tufte()
```