Outline

Review

Equations for a line

library(ggplot2)
# two example points; both lie on the solid blue line y = x/2 + 1
pts <- data.frame(x = c(-2, 0), y = c(0, 1))
ggplot(pts, aes(x, y)) + ylim(c(-3, 3)) + xlim(c(-3, 3)) +
  geom_point(size = 2) +
  geom_abline(slope = 1/2, intercept = 1, color = "blue") +
  geom_abline(slope = -1, intercept = -1, linetype = 2) +
  theme_minimal()
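  • As a quick arithmetic check (not part of the plotting code above), plugging the plotted x values into the solid line’s equation \(y = x/2 + 1\) recovers the plotted y values
(1/2) * c(-2, 0) + 1  # gives 0 and 1, matching the two points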

Coefficient estimation

In the model \(y_i = \beta_0 + \beta_1 x_i + e_i\), the coefficients \(\beta_0, \beta_1\) and the errors \(e_i\) are unknown

  • Everything we’ve learned so far about estimation applies here!
  • Goal: use the data to calculate sample estimates of coefficients: \(\hat \beta_0\) and \(\hat \beta_1\)
  • Previously we saw some regression lines calculated for us by the lm function in R
  • How does that work? How can we do it ourselves?
library(dplyr)
library(gapminder)
# restrict to a single year of the gapminder data
gm2007 <- filter(gapminder, year == 2007)
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() + theme_minimal() +
  ggtitle("How can we compute the regression line?")

Slope: calculating from sample correlations and standard deviations

  • We learned that the slope \(\beta_1\) has the same sign as the correlation between \(x\) and \(y\) (both are positive or both are negative)
  • Let \(r = \text{cor}(x, y)\), and \(s_x\) is the sample standard deviation of \(x\), likewise for \(s_y\)
  • The exact relationship is this: \[ \hat \beta_1 = r \cdot \frac{s_y}{s_x} \]
  • This is an important relationship to remember
  • We’ll come back to it when we think about interpretation; a quick numerical check of the formula follows below
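  • A minimal check of this relationship on simulated data (not from the lecture), comparing the formula with the slope fitted by lm
# simulate data where the true slope is 0.5
set.seed(42)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
# the formula and lm agree
c(formula = cor(x, y) * sd(y) / sd(x),
  lm = unname(coef(lm(y ~ x))["x"]))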

Intercept: calculating from sample means

  • Now that we know the slope, if we knew a point on the line then we could use the point-slope equation for a line
  • Let \(\bar x\) and \(\bar y\) be the sample means of the \(x\) and \(y\) variables
  • The estimated regression line always passes through one interesting point: the mean of the data \((\bar x, \bar y)\)
  • So if \(\hat \beta_0\) and \(\hat \beta_1\) are the sample linear regression coefficients, we know \[ \bar y = \hat \beta_0 + \hat \beta_1 \bar x \]
  • We can use this to calculate \(\hat \beta_0\) if we already know \(\hat \beta_1\) and the means: \[ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]
# two-variable summary statistics needed for the coefficients
gm2007 %>% summarize(
  meanx = mean(gdpPercap),
  meany = mean(lifeExp),
  sdx = sd(gdpPercap),
  sdy = sd(lifeExp),
  r = cor(lifeExp, gdpPercap)
)
## # A tibble: 1 x 5
##   meanx meany   sdx   sdy     r
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11680  67.0 12860  12.1 0.679
  • Calculating the regression coefficients from this two-variable summary
# slope: r * s_y / s_x
beta1 <- cor(gm2007$lifeExp, gm2007$gdpPercap) * sd(gm2007$lifeExp) / sd(gm2007$gdpPercap)
# intercept: solve ybar = beta0 + beta1 * xbar for beta0
beta0 <- mean(gm2007$lifeExp) - beta1 * mean(gm2007$gdpPercap)
c(beta0, beta1)
## [1] 59.5656500780  0.0006371341
  • Compare to coefficients calculated by lm
gm_model <- lm(lifeExp ~ gdpPercap, data = gm2007)
gm_model
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gm2007)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##  59.5656501    0.0006371

Prediction and errors

For points in the data

  • Each data point has an actual outcome \(y_i\), but we can also predict the outcome using the regression model
  • Predicted values are written as: \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)
  • In other words, plug the value of the predictor variable \(x\) into the linear equation
  • The error or residual is \(\hat e_i = y_i - \hat y_i\)
  • We’ll go into more detail about residuals in a future lecture, but for now let’s check the prediction formula and see an example
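  • A quick consistency check, using gm_model and the hand-computed beta0 and beta1 from above: predict applies exactly this linear equation to each \(x_i\)
# fitted values from lm agree with beta0 + beta1 * x (returns TRUE)
all.equal(unname(predict(gm_model)),
          beta0 + beta1 * gm2007$gdpPercap)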

  • What’s the predicted life expectancy for the United States?

gm2007 %>% 
  filter(country == "United States") %>%
  select(gdpPercap, lifeExp)
## # A tibble: 1 x 2
##   gdpPercap lifeExp
##       <dbl>   <dbl>
## 1     42952    78.2
59.6 + 0.000637 * 42951  # rounded coefficients, rounded US gdpPercap
## [1] 86.95979
  • What’s the residual?
78.24 - 86.96  # actual minus predicted, both rounded
## [1] -8.72
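  • To avoid the rounding in the hand calculation, we can let R compute the exact prediction and residual for the United States from the fitted model
us <- filter(gm2007, country == "United States")
predict(gm_model, newdata = us)                 # exact predicted value
us$lifeExp - predict(gm_model, newdata = us)    # exact residual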
  • Plotting all the predicted values
library(ggthemes)  # for theme_tufte()
# predicted life expectancy for every country; these lie on the regression line
gm2007$lifeExpPrediction <- predict(gm_model)
ggplot(gm2007, aes(gdpPercap, lifeExpPrediction)) +
  geom_point() + theme_tufte()

  • Plotting all the residuals
# residuals: actual minus predicted outcome, one per country
gm2007$errors <- residuals(gm_model)
ggplot(gm2007, aes(errors)) +
  geom_histogram(bins = 20) + theme_tufte()
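  • One consequence of the line passing through \((\bar x, \bar y)\) is that the residuals average to zero; a quick check on the fitted model
mean(residuals(gm_model))  # zero, up to floating point error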