## Outline

• Equations for a line
• Coefficient estimation
• Predictions, errors, extrapolation
• Coefficient interpretation

## Review

• What is covariance? Correlation?
• How is a regression line different from these?
• What do they have in common?
• Which variable is the outcome?
• How are the errors measured?
• What’s special about the regression line?

## Equations for a line

• Since linear regression uses straight lines, it will help to remember these facts about them

• Equation for a line using slope and intercept: $$y = mx + b$$
• We use different notation: $$y = \beta_0 + \beta_1 x$$
• $$\beta_0$$ and $$\beta_1$$ are often called coefficients (in computer science: weights)
• Suppose $$(x_0, y_0)$$ is a point on the line, so $$y_0 = \beta_0 + \beta_1 x_0$$
• Equation for a line using slope and point: $$y - y_0 = \beta_1(x - x_0)$$
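As a tiny worked example (the numbers here are made up, not from the lecture): take a line through the point $$(x_0, y_0) = (1, 3)$$ with slope $$\beta_1 = 2$$, and solve the point-slope form for the intercept.

```r
# Hypothetical numbers: a line through (1, 3) with slope 2
slope <- 2
x0 <- 1
y0 <- 3
# From y - y0 = slope * (x - x0), setting x = 0 gives the intercept:
intercept <- y0 - slope * x0
c(intercept = intercept, slope = slope)   # intercept 1, slope 2
```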

• Slope measures change in $$y$$ when $$x$$ changes by one unit
• If $$x$$ changes from $$x_0$$ to $$x_0 + 1$$, then $$y$$ changes from $$y_0$$ to $$y_0 + \beta_1$$
• This fact can be used to draw the line
• Another strategy for drawing: find two points on the line and connect them. We’ll come back to this later

```r
library(ggplot2)

# Two points, plus two lines drawn from slope and intercept
range <- data.frame(x = c(-2, 0), y = c(0, 1))
ggplot(range, aes(x, y)) + ylim(c(-3, 3)) + xlim(c(-3, 3)) +
  geom_point(size = 2) +
  geom_abline(slope = 1/2, intercept = 1, color = "blue") +
  geom_abline(slope = -1, intercept = -1, linetype = 2) +
  theme_minimal()
```

## Coefficient estimation

• Data: $$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$. Assuming $$X$$ and $$Y$$ are continuous, these are $$n$$ points in the plane
• Data usually doesn’t fit exactly on a line (unless the correlation equals…?)
• So we allow errors in the equation for the line $y_i = \beta_0 + \beta_1 x_i + e_i$
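To make the error term concrete, here is a small simulation with made-up coefficients (not from the gapminder data): generating $$y$$ from the model produces points that scatter around the line rather than falling exactly on it.

```r
set.seed(1)
n <- 100
beta0 <- 1                        # hypothetical intercept
beta1 <- 2                        # hypothetical slope
x <- runif(n, 0, 10)
e <- rnorm(n, sd = 2)             # the errors e_i
y <- beta0 + beta1 * x + e        # y_i = beta0 + beta1 * x_i + e_i
# Correlation is high but not 1, so the points don't lie exactly on a line
cor(x, y)
```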

### The coefficients $$\beta_0, \beta_1$$ and the errors $$e_i$$ are unknown

• Everything we’ve learned so far about estimation applies here!
• Goal: use the data to calculate sample estimates of coefficients: $$\hat \beta_0$$ and $$\hat \beta_1$$
• Previously we saw some regression lines calculated for us by the lm function in R
• How does that work? How can we do it ourselves?
```r
library(dplyr)
library(gapminder)

gm2007 <- filter(gapminder, year == 2007)
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() + theme_minimal() +
  ggtitle("How can we compute the regression line?")
```

### Slope: calculating from sample correlations and standard deviations

• We learned that the slope $$\beta_1$$ has the same sign as the correlation between $$x$$ and $$y$$ (both are positive or both are negative)
• Let $$r = \text{cor}(x, y)$$, and $$s_x$$ is the sample standard deviation of $$x$$, likewise for $$s_y$$
• The exact relationship is this: $\hat \beta_1 = r \cdot \frac{s_y}{s_x}$
• This is an important relationship to remember
• We’ll come back to it when we think about interpretation
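A quick sanity check on simulated data (coefficients made up for illustration): the formula $$\hat \beta_1 = r \cdot s_y / s_x$$ reproduces exactly the slope that `lm` computes.

```r
set.seed(42)
x <- rnorm(50)
y <- 3 + 0.5 * x + rnorm(50)      # made-up true coefficients
slope_formula <- cor(x, y) * sd(y) / sd(x)
slope_lm <- coef(lm(y ~ x))[["x"]]
all.equal(slope_formula, slope_lm)   # TRUE
```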

### Intercept: calculating from sample means

• Now that we know the slope, if we knew a point on the line then we could use the slope + point equation for a line
• Let $$\bar x$$ and $$\bar y$$ be the sample means of the $$x$$ and $$y$$ variables
• The estimated regression line always passes through one interesting point: the mean of the data $$(\bar x, \bar y)$$
• So if $$\hat \beta_0$$ and $$\hat \beta_1$$ are the sample linear regression coefficients, we know $\bar y = \hat \beta_0 + \hat \beta_1 \bar x$
• We can use this to calculate $$\hat \beta_0$$ if we already know $$\hat \beta_1$$ and the means: $\hat \beta_0 = \bar y - \hat \beta_1 \bar x$
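The fact that the fitted line passes through $$(\bar x, \bar y)$$ can be checked directly on simulated data (coefficients made up for illustration):

```r
set.seed(7)
x <- rnorm(50)
y <- 2 - 1.5 * x + rnorm(50)      # made-up true coefficients
fit <- lm(y ~ x)
# The predicted value at the mean of x equals the mean of y
at_xbar <- predict(fit, newdata = data.frame(x = mean(x)))
all.equal(unname(at_xbar), mean(y))   # TRUE
```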
```r
gm2007 %>% summarize(
  meanx = mean(gdpPercap),
  meany = mean(lifeExp),
  sdx = sd(gdpPercap),
  sdy = sd(lifeExp),
  r = cor(lifeExp, gdpPercap)
)
## # A tibble: 1 x 5
##   meanx meany   sdx   sdy     r
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11680  67.0 12860  12.1 0.679
```
• Calculating regression coefficients from two variable summary
```r
beta1 <- cor(gm2007$lifeExp, gm2007$gdpPercap) * sd(gm2007$lifeExp) / sd(gm2007$gdpPercap)
beta0 <- mean(gm2007$lifeExp) - beta1 * mean(gm2007$gdpPercap)
c(beta0, beta1)
## [1] 59.5656500780  0.0006371341
```
• Compare to coefficients calculated by lm
```r
gm_model <- lm(lifeExp ~ gdpPercap, data = gm2007)
gm_model
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gm2007)
##
## Coefficients:
## (Intercept)    gdpPercap
##  59.5656501    0.0006371
```

## Prediction and errors

### For points in the data

• Each data point has an actual outcome $$y_i$$, but we can also predict the outcome using the regression model
• Predicted values are written as: $$\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$$
• In other words, plug the value of the predictor variable $$x$$ into the linear equation
• The error or residual is $$\hat e_i = y_i - \hat y_i$$
• We’ll go into more detail about residuals in a future lecture, but for now let’s see an example
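Computing residuals by hand matches what R's `residuals()` returns; here is a small check on simulated data (coefficients made up for illustration):

```r
set.seed(3)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)        # made-up true coefficients
fit <- lm(y ~ x)
e_hand <- y - predict(fit)        # e_i = y_i - yhat_i
all.equal(unname(e_hand), unname(residuals(fit)))   # TRUE
```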

• What’s the predicted life expectancy for the United States?

```r
gm2007 %>%
  filter(country == "United States") %>%
  select(gdpPercap, lifeExp)
## # A tibble: 1 x 2
##   gdpPercap lifeExp
##       <dbl>   <dbl>
## 1     42952    78.2
```

```r
59.6 + 0.000637 * 42951
## [1] 86.95979
```
• What’s the residual?
```r
78.24 - 86.96
## [1] -8.72
```
• Plotting all the predicted values
```r
library(ggthemes)  # provides theme_tufte()

gm2007$lifeExpPrediction <- predict(gm_model)
ggplot(gm2007, aes(gdpPercap, lifeExpPrediction)) +
  geom_point() + theme_tufte()
```
• Plotting all the residuals

```r
gm2007$errors <- residuals(gm_model)
qplot(gm2007$errors, bins = 20) + theme_tufte()
```