The \(F\)-test for nested models

# candy_rankings comes from the fivethirtyeight package;
# these three models are nested: each one's predictors are a subset of the next's
full_model <- lm(winpercent ~ chocolate + fruity + caramel + hard + bar +
                   sugarpercent + pricepercent,
                 data = candy_rankings)
sub_model <- lm(winpercent ~ chocolate + sugarpercent,
                data = candy_rankings)
chocolate_model <- lm(winpercent ~ chocolate, data = candy_rankings)
summary(full_model)
## 
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + hard + 
##     bar + sugarpercent + pricepercent, data = candy_rankings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.068  -6.598  -0.437   7.850  26.406 
## 
## Coefficients:
##               Estimate Std. Error t value         Pr(>|t|)    
## (Intercept)      34.85       4.00    8.72 0.00000000000041 ***
## chocolateTRUE    21.22       4.00    5.30 0.00000106982941 ***
## fruityTRUE        7.56       3.83    1.98            0.052 .  
## caramelTRUE       1.83       3.61    0.51            0.614    
## hardTRUE         -6.39       3.50   -1.83            0.072 .  
## barTRUE           3.00       3.77    0.80            0.428    
## sugarpercent      9.40       4.73    1.98            0.051 .  
## pricepercent     -3.34       5.52   -0.61            0.547    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.1 on 77 degrees of freedom
## Multiple R-squared:  0.479,  Adjusted R-squared:  0.431 
## F-statistic: 10.1 on 7 and 77 DF,  p-value: 0.0000000068
summary(sub_model)
## 
## Call:
## lm(formula = winpercent ~ chocolate + sugarpercent, data = candy_rankings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.698  -8.515  -0.449   7.760  27.482 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      38.26       2.55   14.99 < 0.0000000000000002 ***
## chocolateTRUE    18.27       2.47    7.40        0.00000000011 ***
## sugarpercent      8.57       4.35    1.97                0.053 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.418 
## F-statistic: 31.2 on 2 and 82 DF,  p-value: 0.000000000085

Hypothesis testing approach

  • We may be interested in choosing between the models using a hypothesis test, e.g. for scientific purposes
  • The null hypothesis is that the smaller, simpler model is good enough
  • The alternative hypothesis is that at least one of the additional variables in the larger model provides a significant improvement
  • The residual sum of squares (RSS) is \[ \sum_{i=1}^n (y_i - \hat y_i)^2 \]
  • Let’s write \(RSS_f\) for the residual sum of squares when predictions \(\hat y\) are given by the full model, and \(RSS_s\) for the sub model
  • Note: \(RSS_f \leq RSS_s\), since the full model contains the sub model and so can fit the data at least as well
  • The \(F\) statistic for comparing these two models is \[ F = \frac{\left(\frac{RSS_s - RSS_f}{p - k}\right)}{\left(\frac{RSS_f}{n-p}\right)} \] where \(n\) is the number of observations and \(k\) and \(p\) count the coefficients (including the intercept) in the sub and full models, respectively
  • The observed value is compared to an \(F_{p-k, n-p}\) distribution to find a \(p\)-value; the sketch just below computes this by hand
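
  • A minimal sketch of that computation, assuming only the two models fitted above (every quantity is pulled from them):

RSS_s <- sum(residuals(sub_model)^2)
RSS_f <- sum(residuals(full_model)^2)
n <- nobs(full_model)                 # number of candies
k <- length(coef(sub_model))          # coefficients in the sub model
p <- length(coef(full_model))         # coefficients in the full model
F_stat <- ((RSS_s - RSS_f) / (p - k)) / (RSS_f / (n - p))
p_value <- pf(F_stat, df1 = p - k, df2 = n - p, lower.tail = FALSE)
c(F = F_stat, p = p_value)            # should match the anova() output below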

  • In R, after fitting both models, use the anova() function on them like this:

anova(sub_model, full_model)
## Analysis of Variance Table
## 
## Model 1: winpercent ~ chocolate + sugarpercent
## Model 2: winpercent ~ chocolate + fruity + caramel + hard + bar + sugarpercent + 
##     pricepercent
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     82 10331                         
## 2     77  9483  5       847 1.38   0.24
  • Advantage: the hypothesis-testing framing may be a natural way to answer the scientific or practical question that the regression models are being used to study
  • Disadvantage: it applies only to nested models, and hypothesis testing may be beside the point if we simply want the “best” model

Automatic model selection algorithms

Create a sequence of models with increasing complexity

  • General idea: start with no variables and add one at a time, each time picking the “best” of the remaining predictors (a one-step sketch follows this list)
  • Special cases of this idea, each with a different definition of “best” for the next variable to add: forward stepwise selection, the lasso, the elastic net, and others
  • Now we just have to pick one model out of a (possibly small) list of models
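
  • One forward step can be done by hand with add1(), which fits every one-variable addition to a current model and reports the same RSS and AIC columns that step() prints below (a sketch, reusing chocolate_model fitted at the top of these notes):

add1(chocolate_model,
     scope = winpercent ~ chocolate + fruity + caramel + hard + bar +
       sugarpercent + pricepercent)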

Penalize complexity

  • Increasing complexity will make the model fit closer and closer to the data, reducing the RSS
  • Making a model fit the data more closely is never impressive by itself; it only matters when other constraints on the model keep it useful
  • Common approach: penalize complexity. Instead of just trying to minimize RSS (fit) by itself, try to minimize \[ \text{fit error} + \text{complexity} \]
  • Typically the fit error is measured by RSS and the complexity is quantified in some way related to the number of variables in the model
  • For example, one method, called AIC, is to pick the model with the lowest \[ n\log(RSS/n) + 2k \] where \(k\) is the number of coefficients in the model; this matches the AIC values that step() prints below (a hand computation follows this list).
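
  • That computation done by hand for sub_model from earlier, checked against extractAIC(), the helper that step() uses internally (a sketch):

rss <- sum(residuals(sub_model)^2)
n <- nobs(sub_model)
k <- length(coef(sub_model))   # coefficients, including the intercept
n * log(rss / n) + 2 * k       # about 414, matching the trace below
extractAIC(sub_model)          # same value, preceded by the equivalent d.f.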

  • The code below shows the output for this sequence of models; at the end is a summary of the model chosen by the AIC criterion.

candy <- candy_rankings %>% select(-competitorname)  # drop the name (ID) column
full_model <- lm(winpercent ~ ., data = candy)       # all remaining predictors
const_model <- lm(winpercent ~ 1, data = candy)      # intercept-only model
selected_model <- step(const_model,
                       scope = list(lower = const_model, upper = full_model),
                       direction = "forward")
## Start:  AIC=458.1
## winpercent ~ 1
## 
##                    Df Sum of Sq   RSS AIC
## + chocolate         1      7369 10818 416
## + bar               1      3362 14825 443
## + peanutyalmondy    1      3001 15186 445
## + fruity            1      2639 15548 447
## + pricepercent      1      2169 16018 449
## + crispedricewafer  1      1917 16270 451
## + hard              1      1752 16435 451
## + pluribus          1      1114 17073 455
## + sugarpercent      1       955 17232 456
## + caramel           1       828 17359 456
## + nougat            1       723 17464 457
## <none>                          18187 458
## 
## Step:  AIC=415.9
## winpercent ~ chocolate
## 
##                    Df Sum of Sq   RSS AIC
## + peanutyalmondy    1       583 10236 413
## + sugarpercent      1       488 10331 414
## + fruity            1       336 10482 415
## <none>                          10818 416
## + crispedricewafer  1       238 10581 416
## + hard              1       172 10646 417
## + bar               1        70 10749 417
## + caramel           1        57 10761 417
## + nougat            1        27 10792 418
## + pluribus          1        20 10798 418
## + pricepercent      1        14 10804 418
## 
## Step:  AIC=413.2
## winpercent ~ chocolate + peanutyalmondy
## 
##                    Df Sum of Sq   RSS AIC
## + fruity            1       547  9689 411
## + sugarpercent      1       434  9802 412
## + crispedricewafer  1       391  9845 412
## <none>                          10236 413
## + hard              1       122 10114 414
## + caramel           1        73 10163 415
## + bar               1        52 10184 415
## + pluribus          1         5 10231 415
## + nougat            1         4 10232 415
## + pricepercent      1         0 10236 415
## 
## Step:  AIC=410.6
## winpercent ~ chocolate + peanutyalmondy + fruity
## 
##                    Df Sum of Sq  RSS AIC
## + crispedricewafer  1       450 9239 409
## + sugarpercent      1       366 9323 409
## + hard              1       261 9428 410
## <none>                          9689 411
## + caramel           1       215 9474 411
## + bar               1       106 9583 412
## + nougat            1        20 9669 412
## + pluribus          1        14 9675 412
## + pricepercent      1         3 9685 413
## 
## Step:  AIC=408.5
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer
## 
##                Df Sum of Sq  RSS AIC
## + sugarpercent  1       326 8913 407
## + hard          1       242 8997 408
## <none>                      9239 409
## + caramel       1       146 9092 409
## + nougat        1        72 9167 410
## + bar           1        17 9222 410
## + pricepercent  1         8 9231 410
## + pluribus      1         1 9238 411
## 
## Step:  AIC=407.5
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer + 
##     sugarpercent
## 
##                Df Sum of Sq  RSS AIC
## + hard          1       327 8586 406
## <none>                      8913 407
## + pricepercent  1        84 8828 409
## + caramel       1        67 8846 409
## + nougat        1        42 8871 409
## + bar           1        11 8901 409
## + pluribus      1         7 8906 409
## 
## Step:  AIC=406.3
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer + 
##     sugarpercent + hard
## 
##                Df Sum of Sq  RSS AIC
## <none>                      8586 406
## + pricepercent  1     125.1 8461 407
## + caramel       1      61.2 8524 408
## + nougat        1      31.6 8554 408
## + pluribus      1      30.6 8555 408
## + bar           1       6.0 8580 408
summary(selected_model)
## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity + 
##     crispedricewafer + sugarpercent + hard, data = candy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.503  -6.703   0.567   5.880  24.011 
## 
## Coefficients:
##                      Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)             32.94       3.52    9.36 0.000000000000021 ***
## chocolateTRUE           19.15       3.59    5.34 0.000000897709318 ***
## peanutyalmondyTRUE       9.48       3.45    2.75            0.0074 ** 
## fruityTRUE               8.88       3.56    2.49            0.0147 *  
## crispedricewaferTRUE     8.39       4.48    1.87            0.0653 .  
## sugarpercent             7.98       4.13    1.93            0.0569 .  
## hardTRUE                -5.67       3.29   -1.72            0.0887 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.5 on 78 degrees of freedom
## Multiple R-squared:  0.528,  Adjusted R-squared:  0.492 
## F-statistic: 14.5 on 6 and 78 DF,  p-value: 0.0000000000462
  • Check: is the adjusted R-squared for this model better than the one for sub_model back near the top of these notes? How much better?
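
  • A one-line way to compute it (both values also appear in the summaries above):

summary(selected_model)$adj.r.squared - summary(sub_model)$adj.r.squared
# about 0.492 - 0.418 = 0.074, from the summaries above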

Complexity and bias-variance tradeoff

The two loess curves below differ only in their span, i.e. how much of the data each local fit uses: the large span gives a smooth, more biased fit, the small span a wiggly, more variable one.

gm2007 <- gapminder %>% filter(year == 2007)  # year is numeric, so compare to a number
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() +
  stat_smooth(se = FALSE, method = "loess", span = 1) +                # smooth: higher bias, lower variance
  stat_smooth(se = FALSE, method = "loess", span = 0.2, linetype = 5)  # wiggly (dashed): lower bias, higher variance