- Equations for a line
- Coefficient estimation
- Coefficient interpretation

- What is covariance? Correlation?
- How is a regression line different from these?
- What do they have in common?
- Which variable is the outcome?
- How are the errors measured?
- What’s special about the regression line?

Since linear regression uses straight lines, it will help to remember these facts about them:

- Equation for a line using slope and intercept: \(y = mx + b\)
- We use different notation: \(y = \beta_0 + \beta_1 x\)
- \(\beta_0\) and \(\beta_1\) are often called **coefficients** (in computer science: weights)
- Suppose \((x_0, y_0)\) is a point on the line, so \(y_0 = \beta_0 + \beta_1 x_0\)
- Equation for a line using slope and point: \(y - y_0 = \beta_1(x - x_0)\)
What point can be used in this formula to get the slope and intercept version?

- Slope measures change in \(y\) when \(x\) changes by one unit
- If \(x\) changes from \(x_0\) to \(x_0 + 1\), then \(y\) changes from \(y_0\) to \(y_0 + \beta_1\)
- This fact can be used to draw the line
Another strategy for drawing: find two points on the line and connect them. We’ll come back to this later

```
library(ggplot2)

# Two points lying on the blue line (slope 1/2, intercept 1)
pts <- data.frame(x = c(-2, 0), y = c(0, 1))
ggplot(pts, aes(x, y)) + ylim(c(-3, 3)) + xlim(c(-3, 3)) +
  geom_point(size = 2) +
  geom_abline(slope = 1/2, intercept = 1, color = "blue") +
  geom_abline(slope = -1, intercept = -1, linetype = 2) +
  theme_minimal()
```

- Data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\); assuming \(X\) and \(Y\) are continuous, these are \(n\) points
- Data usually doesn’t fit exactly on a line (unless the *correlation* equals…?)
- So we allow **errors** in the equation for the line \[ y_i = \beta_0 + \beta_1 x_i + e_i \]
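Concretely, each error \(e_i\) is the vertical gap between the observed \(y_i\) and the line’s value at \(x_i\). A minimal sketch on simulated data (the simulated line and variable names are illustrative, not from the lecture):

```r
# The error for each point is the vertical gap between y_i and the line
set.seed(2)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)
b <- unname(coef(fit))                # b[1] is the intercept, b[2] the slope
e <- y - (b[1] + b[2] * x)            # e_i = y_i - (beta0_hat + beta1_hat * x_i)
all.equal(e, unname(residuals(fit)))  # same quantities lm calls "residuals"
```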

- Everything we’ve learned so far about **estimation** applies here!
- Goal: use the data to calculate sample estimates of the coefficients: \(\hat \beta_0\) and \(\hat \beta_1\)
- Previously we saw some regression lines calculated for us by the `lm` function in `R`

- How does that work? How can we do it ourselves?

```
library(gapminder)
library(dplyr)
library(ggplot2)

gm2007 <- filter(gapminder, year == 2007)  # year is numeric, not a string
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() + theme_minimal() +
  ggtitle("How can we compute the regression line?")
```
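Before deriving the answer by hand, it helps to see what `lm` gives us. A small sketch on simulated data (the true line \(y = 2 + 0.5x\) is an assumption of the simulation, not a fact about gapminder):

```r
# lm fits the line for us; simulate from a known line and check that the
# estimates land near the truth (illustrative data, not gapminder)
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
coef(fit)  # estimated intercept near 2, estimated slope near 0.5
```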

- Last time we learned that the slope \(\beta_1\) has the same *sign* as the correlation between \(x\) and \(y\) (both are positive or both are negative)
- Let \(r = \text{cor}(x, y)\), let \(s_x\) be the sample standard deviation of \(x\), and likewise \(s_y\) for \(y\)
- The exact relationship is this: \[ \hat \beta_1 = r \cdot \frac{s_y}{s_x} \]
- **This is an important relationship to remember**; we’ll come back to it when we think about interpretation
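We can verify this relationship directly in `R`; a sketch on simulated data (the simulated line is illustrative):

```r
# Check beta1_hat = r * s_y / s_x against the slope lm computes
set.seed(42)
x <- rnorm(50)
y <- 1 + 3 * x + rnorm(50)
slope_formula <- cor(x, y) * sd(y) / sd(x)
slope_lm <- unname(coef(lm(y ~ x))[2])
all.equal(slope_formula, slope_lm)  # the two agree
```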

- Now that we know the slope, if we knew a point on the line then we could use the slope + point equation for a line
- Let \(\bar x\) and \(\bar y\) be the sample means of the \(x\) and \(y\) variables
- The estimated regression line always passes through one interesting point: the mean of the data \((\bar x, \bar y)\)
- So if \(\hat \beta_0\) and \(\hat \beta_1\) are the sample linear regression coefficients, we know \[ \bar y = \hat \beta_0 + \hat \beta_1 \bar x \]
- We can use this to calculate \(\hat \beta_0\) if we already know \(\hat \beta_1\) and the means: \[ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]
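Putting the two formulas together, we can compute both coefficients by hand and compare with `lm`; a sketch on simulated data (the simulated line is illustrative):

```r
# Compute beta1_hat from r, s_y, s_x, then beta0_hat from the sample means,
# and compare with what lm reports
set.seed(7)
x <- rnorm(50, mean = 5)
y <- 2 - 1.5 * x + rnorm(50)
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)
all.equal(c(b0, b1), unname(coef(lm(y ~ x))))  # hand formulas match lm
```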

- Fitted regression line passes through the center of the data \((\bar x, \bar y)\)
- The slope \(\hat \beta_1\) tells us how the predicted outcome \(\hat y\) changes if the predictor variable \(x\) changes
- If \(x\) increases by one unit, then \(\hat y\) increases by \(\hat \beta_1\) units
- (Note that \(x\) and \(y\) may use different units, e.g. 2007 inflation-adjusted US dollars per person for GDP per capita and years for life expectancy)
- Recall the relationship \[ \hat \beta_1 = r \cdot \frac{s_y}{s_x} \]
- If \(x\) increases by one standard deviation \(s_x\), then the predicted outcome \(\hat y\) increases by \(r\) times the standard deviation \(s_y\)
- Plugging any value of \(x\) into \(\hat \beta_0 + \hat \beta_1 x\) gives a predicted outcome, but this may be a bad idea if \(x\) is outside the range of the data
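A sketch of this plug-in prediction using `predict` (simulated data; the particular values of \(x\) are illustrative):

```r
# Plugging x into beta0_hat + beta1_hat * x gives a prediction; the formula
# happily returns a number even far outside the range of the data
set.seed(3)
x <- runif(30, 0, 10)
y <- 4 + 2 * x + rnorm(30)
fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = 5))    # inside the data's range
predict(fit, newdata = data.frame(x = 100))  # far outside [0, 10]: extrapolation
```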
Remember that the fitted regression line comes from an equation using **estimates** \(\hat \beta_0, \hat \beta_1\)

- A few things to think about:
  - What would happen if we gathered a new sample of data and calculated the estimated coefficients again?
  - How accurate are the estimates? Does the accuracy depend on sample size?
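One way to explore these questions is simulation: refit the line on many fresh samples and look at how the estimates vary. A sketch (the sample sizes and simulated line are illustrative):

```r
# Refit the line on many fresh samples: the slope estimates vary from
# sample to sample, and the spread shrinks as the sample size grows
set.seed(9)
sim_slope <- function(n) {
  x <- runif(n)
  y <- 1 + 2 * x + rnorm(n)
  unname(coef(lm(y ~ x))[2])
}
slopes_small <- replicate(500, sim_slope(20))
slopes_large <- replicate(500, sim_slope(200))
c(sd(slopes_small), sd(slopes_large))  # the second is much smaller
```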