#### Outline

• Comparing two nested models with the $$F$$-test
• Automatic model selection algorithms
• Complexity and bias-variance tradeoff

## The $$F$$-test for nested models

• A method that uses hypothesis tests to compare linear models
• The models must be nested: the larger of the two models contains all of the predictor variables in the smaller one, plus some extra variables
• Motivating question: do we really need this larger model, or would the smaller, simpler one be sufficient?
• Larger model, often called the “full” model, with $$p$$ predictors: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \cdots + \beta_p x_{ip} + e_i$
• Simpler model, often called the “sub” model, with $$k < p$$ predictors: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i$
• The sub-model omits variables $$k+1, k+2, \ldots, p$$
• (We can always rearrange the variables, so it doesn’t matter if we’re leaving out the first, last, or variables in the middle)
full_model <- lm(winpercent ~ chocolate + fruity + caramel +
                   hard + bar + sugarpercent + pricepercent,
                 data = candy_rankings)
sub_model <- lm(winpercent ~ chocolate + sugarpercent,
                data = candy_rankings)
chocolate_model <- lm(winpercent ~ chocolate, data = candy_rankings)
summary(full_model)
##
## Call:
## lm(formula = winpercent ~ chocolate + fruity + caramel + hard +
##     bar + sugarpercent + pricepercent, data = candy_rankings)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -24.068  -6.598  -0.437   7.850  26.406
##
## Coefficients:
##               Estimate Std. Error t value         Pr(>|t|)
## (Intercept)      34.85       4.00    8.72 0.00000000000041 ***
## chocolateTRUE    21.22       4.00    5.30 0.00000106982941 ***
## fruityTRUE        7.56       3.83    1.98            0.052 .
## caramelTRUE       1.83       3.61    0.51            0.614
## hardTRUE         -6.39       3.50   -1.83            0.072 .
## barTRUE           3.00       3.77    0.80            0.428
## sugarpercent      9.40       4.73    1.98            0.051 .
## pricepercent     -3.34       5.52   -0.61            0.547
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.1 on 77 degrees of freedom
## Multiple R-squared:  0.479,  Adjusted R-squared:  0.431
## F-statistic: 10.1 on 7 and 77 DF,  p-value: 0.0000000068
summary(sub_model)
##
## Call:
## lm(formula = winpercent ~ chocolate + sugarpercent, data = candy_rankings)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.698  -8.515  -0.449   7.760  27.482
##
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)
## (Intercept)      38.26       2.55   14.99 < 0.0000000000000002 ***
## chocolateTRUE    18.27       2.47    7.40        0.00000000011 ***
## sugarpercent      8.57       4.35    1.97                0.053 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.418
## F-statistic: 31.2 on 2 and 82 DF,  p-value: 0.000000000085
• How could we compare and choose between these models?

• (Could compare residual standard error or adjusted R-squared)
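As a quick check on those numbers, adjusted R-squared can be computed from the plain R-squared, the sample size, and the number of predictors. This sketch plugs in the (rounded) values printed in the two summaries above, with $$n = 85$$ candies recovered from the residual degrees of freedom:

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2(0.479, 85, 7)  # full model: about 0.431, matching its summary
adj_r2(0.432, 85, 2)  # sub model:  about 0.418, matching its summary
```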

### Hypothesis testing approach

• We may be interested in choosing between the models using a hypothesis test, e.g. for scientific purposes
• The null hypothesis is that the smaller, simpler model is good enough
• The alternative hypothesis is that at least one of the additional variables in the larger model provides a significant improvement
• The residual sum of squares (RSS) is $\sum_{i=1}^n (y_i - \hat y_i)^2$
• Let’s write $$RSS_f$$ for the residual sum of squares when predictions $$\hat y$$ are given by the full model, and $$RSS_s$$ for the sub model
• Note: $$RSS_f \leq RSS_s$$
• The $$F$$ statistic for comparing these two models is $F = \frac{\left(\frac{RSS_s - RSS_f}{p - k}\right)}{\left(\frac{RSS_f}{n-p-1}\right)}$ (the extra $$-1$$ in the denominator degrees of freedom accounts for the intercept)
• The observed value is compared to an $$F_{p-k,\, n-p-1}$$ distribution to find a $$p$$-value

• In R, after fitting both models, use the anova() function on them like this:

anova(sub_model, full_model)
## Analysis of Variance Table
##
## Model 1: winpercent ~ chocolate + sugarpercent
## Model 2: winpercent ~ chocolate + fruity + caramel + hard + bar + sugarpercent +
##     pricepercent
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     82 10331
## 2     77  9483  5       847 1.38   0.24
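The F statistic in that table can be reproduced by hand. This sketch plugs the RSS values and residual degrees of freedom printed by anova() (rounded, so results agree to about two decimal places) into the formula from above:

```r
# Reproduce the F test by hand from the anova() output above
rss_s <- 10331  # residual sum of squares, sub model (82 residual df)
rss_f <- 9483   # residual sum of squares, full model (77 residual df)
df_s <- 82
df_f <- 77

f_stat <- ((rss_s - rss_f) / (df_s - df_f)) / (rss_f / df_f)
f_stat  # about 1.38, matching the table

# p-value from the upper tail of an F distribution with 5 and 77 df
pf(f_stat, df_s - df_f, df_f, lower.tail = FALSE)  # about 0.24
```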
• Advantage: hypothesis test framing may be a natural way to answer the scientific or practical question that the regression models are being used to study
• Disadvantage: it only applies to nested models, and hypothesis testing may not be relevant if we simply want the “best” model

## Automatic model selection algorithms

• Suppose there are multiple predictor variables, possibly many of them, and we don’t know which ones to include in the linear model
• How about letting the computer pick them for us?
• General definition: models with more variables are considered more complex

### Create a sequence of models with increasing complexity

• General idea: start with no variables and add one at a time, each time picking the “best” out of the remaining predictors
• Special cases of this idea, each with a different definition of “best” for the next variable to add: forward stepwise, lasso, elastic net, and others
• Now we just have to pick one model out of a (possibly small) list of models

### Penalize complexity

• Increasing complexity will make the model fit closer and closer to the data, reducing the RSS
• It’s never impressive to make a model fit the data more closely unless there are other constraints on the model that keep it useful
• Common approach: penalize complexity. Instead of just trying to minimize RSS (fit) by itself, try to minimize $\text{fit error} + \text{complexity}$
• Typically the fit error is measured by RSS and the complexity is quantified in some way related to the number of variables in the model
• For example, one method, based on AIC (Akaike’s information criterion), picks the model with the lowest $n\log(RSS) + 2k$, where $$k$$ is the number of variables in the model (this matches AIC up to additive constants that don’t affect the comparison)
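To make the criterion concrete, here is a sketch that computes the $n\log(RSS) + 2k$ score for two nested models fit to R’s built-in mtcars data (not the candy data, so the snippet runs on its own). R’s step() reports a variant, $n\log(RSS/n)$ plus twice the number of parameters, which differs only by quantities that are the same for every model on the same data, so it produces the same ranking:

```r
# AIC-style score from the notes: n * log(RSS) + 2k,
# where k is the number of predictor variables
aic_score <- function(model) {
  rss <- sum(resid(model)^2)
  n <- nobs(model)
  k <- length(coef(model)) - 1  # subtract 1 for the intercept
  n * log(rss) + 2 * k
}

m1 <- lm(mpg ~ wt, data = mtcars)       # 1 predictor
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # 2 predictors
aic_score(m1)
aic_score(m2)  # lower score, so this model is preferred
```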

• The code below shows the output for a sequence of models. At the end is a summary of the model chosen by the AIC method.

candy <- candy_rankings %>% select(-competitorname)
full_model <- lm(winpercent ~ ., data = candy)
const_model <- lm(winpercent ~ 1, data = candy)
selected_model <- step(const_model,
                       scope = list(lower = const_model, upper = full_model),
                       direction = "forward")
## Start:  AIC=458.1
## winpercent ~ 1
##
##                    Df Sum of Sq   RSS AIC
## + chocolate         1      7369 10818 416
## + bar               1      3362 14825 443
## + peanutyalmondy    1      3001 15186 445
## + fruity            1      2639 15548 447
## + pricepercent      1      2169 16018 449
## + crispedricewafer  1      1917 16270 451
## + hard              1      1752 16435 451
## + pluribus          1      1114 17073 455
## + sugarpercent      1       955 17232 456
## + caramel           1       828 17359 456
## + nougat            1       723 17464 457
## <none>                          18187 458
##
## Step:  AIC=415.9
## winpercent ~ chocolate
##
##                    Df Sum of Sq   RSS AIC
## + peanutyalmondy    1       583 10236 413
## + sugarpercent      1       488 10331 414
## + fruity            1       336 10482 415
## <none>                          10818 416
## + crispedricewafer  1       238 10581 416
## + hard              1       172 10646 417
## + bar               1        70 10749 417
## + caramel           1        57 10761 417
## + nougat            1        27 10792 418
## + pluribus          1        20 10798 418
## + pricepercent      1        14 10804 418
##
## Step:  AIC=413.2
## winpercent ~ chocolate + peanutyalmondy
##
##                    Df Sum of Sq   RSS AIC
## + fruity            1       547  9689 411
## + sugarpercent      1       434  9802 412
## + crispedricewafer  1       391  9845 412
## <none>                          10236 413
## + hard              1       122 10114 414
## + caramel           1        73 10163 415
## + bar               1        52 10184 415
## + pluribus          1         5 10231 415
## + nougat            1         4 10232 415
## + pricepercent      1         0 10236 415
##
## Step:  AIC=410.6
## winpercent ~ chocolate + peanutyalmondy + fruity
##
##                    Df Sum of Sq  RSS AIC
## + crispedricewafer  1       450 9239 409
## + sugarpercent      1       366 9323 409
## + hard              1       261 9428 410
## <none>                          9689 411
## + caramel           1       215 9474 411
## + bar               1       106 9583 412
## + nougat            1        20 9669 412
## + pluribus          1        14 9675 412
## + pricepercent      1         3 9685 413
##
## Step:  AIC=408.5
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer
##
##                Df Sum of Sq  RSS AIC
## + sugarpercent  1       326 8913 407
## + hard          1       242 8997 408
## <none>                      9239 409
## + caramel       1       146 9092 409
## + nougat        1        72 9167 410
## + bar           1        17 9222 410
## + pricepercent  1         8 9231 410
## + pluribus      1         1 9238 411
##
## Step:  AIC=407.5
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
##     sugarpercent
##
##                Df Sum of Sq  RSS AIC
## + hard          1       327 8586 406
## <none>                      8913 407
## + pricepercent  1        84 8828 409
## + caramel       1        67 8846 409
## + nougat        1        42 8871 409
## + bar           1        11 8901 409
## + pluribus      1         7 8906 409
##
## Step:  AIC=406.3
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
##     sugarpercent + hard
##
##                Df Sum of Sq  RSS AIC
## <none>                      8586 406
## + pricepercent  1     125.1 8461 407
## + caramel       1      61.2 8524 408
## + nougat        1      31.6 8554 408
## + pluribus      1      30.6 8555 408
## + bar           1       6.0 8580 408
summary(selected_model)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity +
##     crispedricewafer + sugarpercent + hard, data = candy)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.503  -6.703   0.567   5.880  24.011
##
## Coefficients:
##                      Estimate Std. Error t value          Pr(>|t|)
## (Intercept)             32.94       3.52    9.36 0.000000000000021 ***
## chocolateTRUE           19.15       3.59    5.34 0.000000897709318 ***
## peanutyalmondyTRUE       9.48       3.45    2.75            0.0074 **
## fruityTRUE               8.88       3.56    2.49            0.0147 *
## crispedricewaferTRUE     8.39       4.48    1.87            0.0653 .
## sugarpercent             7.98       4.13    1.93            0.0569 .
## hardTRUE                -5.67       3.29   -1.72            0.0887 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.5 on 78 degrees of freedom
## Multiple R-squared:  0.528,  Adjusted R-squared:  0.492
## F-statistic: 14.5 on 6 and 78 DF,  p-value: 0.0000000000462
• Check: is the adjusted R-squared for this model better than the one for sub_model back near the top of these notes? How much better?

## Complexity and bias-variance tradeoff

• In general, models with greater complexity will have lower bias, but higher variance in their predictions
• See the last plot near the bottom of this page

• Example: for a single non-linear relationship (instead of having multiple variables), the degree of nonlinearity is a measure of complexity
• A more “wiggly” curve comes from a more complex model: it can wiggle around to get closer to points in the data, making it less biased but giving it more variance
• The dashed line below has higher complexity (more variance, less bias)

gm2007 <- gapminder %>% filter(year == 2007)
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() +
  stat_smooth(se = FALSE, method = "loess", span = 1) +
  stat_smooth(se = FALSE, method = "loess", span = 0.2, linetype = 5)
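The tradeoff can also be checked with a small simulation (synthetic data, not from the notes): fit a straight line (low complexity) and a degree-10 polynomial (high complexity) to repeated samples from the same curve, then compare their predictions at a fixed point:

```r
# Bias-variance tradeoff: repeated fits to noisy samples of sin(2*pi*x),
# comparing predictions at x = 0.25 (where the truth is 1)
set.seed(1)
truth <- function(x) sin(2 * pi * x)

preds <- replicate(200, {
  x <- runif(50)
  y <- truth(x) + rnorm(50, sd = 0.3)
  line   <- predict(lm(y ~ x), newdata = data.frame(x = 0.25))
  wiggly <- predict(lm(y ~ poly(x, 10)), newdata = data.frame(x = 0.25))
  c(line, wiggly)
})

rowMeans(preds) - truth(0.25)  # bias: much larger for the straight line
apply(preds, 1, var)           # variance: larger for the polynomial
```

The straight line cannot bend toward the peak of the sine curve, so its predictions are systematically off (bias); the wiggly polynomial tracks the curve on average but its prediction jumps around from sample to sample (variance).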