Multiple regression

library(fivethirtyeight)   # provides the candy_rankings data
library(ggplot2)

candy_model <- lm(winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
summary(candy_model)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.880  -8.575  -0.759   8.142  37.502 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    39.787      3.438  11.574 < 0.0000000000000002 ***
## sugarpercent    6.731      5.664   1.188              0.23808    
## pricepercent   15.586      5.605   2.781              0.00672 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.86 on 82 degrees of freedom
## Multiple R-squared:  0.1342, Adjusted R-squared:  0.113 
## F-statistic: 6.353 on 2 and 82 DF,  p-value: 0.002722
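
A quick sanity check of what this fitted equation says (the 0.5 values below are just an illustrative candy, not from the data): plugging sugarpercent = 0.5 and pricepercent = 0.5 into the coefficients above should agree with predict().

coef(candy_model)
39.787 + 6.731 * 0.5 + 15.586 * 0.5   # by hand: roughly 50.9
predict(candy_model, newdata = data.frame(sugarpercent = 0.5, pricepercent = 0.5))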

Interpreting coefficients

  • What is the meaning of the coefficient \(\beta_j\) of the predictor variable \(x_j\)?
  • Something about the change in \(y\) if variable \(x_j\) changes, right?
  • Be careful when interpreting coefficients in multiple regression
  • \(\beta_j\) is the expected change in \(y\) if we increase \(x_j\) by one unit while holding all the other predictor variables constant
  • Think carefully about what it means to hold the other predictor variables constant, especially if any other predictors are correlated with \(x_j\)

  • Consider:

  • Suppose you had a model with GDP per capita as one predictor and GDP as another. What would it mean to hold GDP per capita constant and increase GDP? What else would be changing in that case? What would it mean to hold GDP constant while increasing GDP per capita? And if we also include population as a third predictor, is it even meaningful to hold two of them constant while changing the third? (The simulation sketched below makes this concrete.)
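
The notes don't work this scenario out; here is a small simulation sketch with made-up variables (pop, gdp_pc, gdp, life_exp), in which the outcome truly depends only on GDP per capita:

set.seed(1)
pop      <- runif(200, min = 1, max = 100)           # population (millions), made up
gdp_pc   <- runif(200, min = 1, max = 50)            # GDP per capita (thousands), made up
gdp      <- gdp_pc * pop                             # total GDP is determined by the other two
life_exp <- 50 + 0.5 * gdp_pc + rnorm(200, sd = 2)   # outcome depends only on GDP per capita
sim <- data.frame(life_exp, gdp_pc, gdp, pop)
round(coef(summary(lm(life_exp ~ gdp_pc, data = sim))), 3)
round(coef(summary(lm(life_exp ~ gdp_pc + gdp, data = sim))), 3)
round(coef(summary(lm(life_exp ~ gdp_pc + gdp + pop, data = sim))), 3)
# In every fit the gdp_pc estimate is near the true 0.5, but once gdp (and pop)
# are in the model it answers a different question: the expected change in
# life_exp when gdp_pc rises while total GDP stays fixed -- which can only
# happen if population falls -- and its standard error grows because the
# predictors are correlated.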

Categorical predictors

candy_model <- lm(winpercent ~ sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
## 
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy_rankings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     38.262      2.552  14.990 < 0.0000000000000002 ***
## sugarpercent     8.567      4.355   1.967               0.0525 .  
## chocolateTRUE   18.273      2.469   7.401       0.000000000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.4181 
## F-statistic: 31.18 on 2 and 82 DF,  p-value: 0.000000000085
candy <- candy_rankings
candy$predicted <- predict(candy_model)   # fitted values from the model above
ggplot(candy, aes(sugarpercent, winpercent,
                  color = chocolate)) + 
  geom_point() +
  # two parallel fitted lines: same sugarpercent slope, different intercept per chocolate group
  geom_line(aes(sugarpercent, predicted, linetype = chocolate))

Categories relative to what baseline group?

  • If the model has an overall intercept, shared by all points, then it cannot have a separate intercept for each categorical group. If there are \(L\) groups, it can only have \(L-1\) intercepts.
  • In the example above, candies with chocolate (chocolateTRUE) and without are two groups, but there is only one coefficient, for chocolateTRUE; there is none for chocolateFALSE
  • This is because of a problem in linear algebra called “collinearity.” Without going into details, imagine putting the same predictor variable into the model several times, with each copy getting its own coefficient: the model has no way to decide how to split that variable’s true coefficient among the copies. Keeping the overall intercept and also giving every group its own intercept creates exactly the same ambiguity
  • To get around this, you can delete the overall intercept term. In R you just put \(-1\) in the formula with your predictors, like this:
candy_model <- lm(winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
## 
## Call:
## lm(formula = winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6981  -8.5153  -0.4489   7.7602  27.4820 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## sugarpercent      8.567      4.355   1.967              0.0525 .  
## chocolateFALSE   38.262      2.552  14.990 <0.0000000000000002 ***
## chocolateTRUE    56.535      2.894  19.535 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 82 degrees of freedom
## Multiple R-squared:  0.9557, Adjusted R-squared:  0.9541 
## F-statistic: 590.2 on 3 and 82 DF,  p-value: < 0.00000000000000022
  • Now each group gets its own intercept: for example, the estimated intercept for candies without chocolate is about 38, i.e. the predicted win percent for a chocolate-free candy with sugarpercent = 0 (a quick check of how the two fits relate follows this list)
  • (We’ll revisit this problem, collinearity, when we talk about model selection)
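
The two parameterizations are the same fit reported differently. As a quick check using the models above, the chocolateTRUE group intercept equals the baseline intercept plus the chocolateTRUE coefficient from the first fit (38.262 + 18.273 = 56.535), and model.matrix() shows the dummy coding R builds for the categorical predictor:

coef(candy_model)   # group intercepts from the no-intercept fit
38.262 + 18.273     # matches the chocolateTRUE intercept, 56.535
# default coding: an intercept column plus a 0/1 chocolateTRUE dummy
head(model.matrix(~ sugarpercent + chocolate, data = candy_rankings))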

Non-linear transformations of predictors

n <- 1000
X <- runif(n, min = -2, max = 2)
Y <- 1 + 2 * X - 1.7 * X^2 + rnorm(n)   # the true relationship is quadratic in X
nldata <- data.frame(Y = Y, X = X)
ggplot(nldata, aes(X, Y)) + 
  geom_point(alpha = .5) +
  stat_smooth(method = "lm", se = FALSE)   # a straight-line fit misses the curvature
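
The straight line clearly misses the curvature. As a sketch of where this is heading, one way to include the squared term is to put it directly in the lm formula (and in the plot's smoother):

nl_model <- lm(Y ~ X + I(X^2), data = nldata)   # I() protects the arithmetic inside the formula
coef(nl_model)                                  # estimates should be near 1, 2, and -1.7
ggplot(nldata, aes(X, Y)) + 
  geom_point(alpha = .5) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE)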