Outline

The \(F\)-test for nested models

full_model <- lm(winpercent ~ chocolate + 
             fruity + caramel + hard + bar + 
             sugarpercent + pricepercent,
           data = candy_rankings)
sub_model <- lm(winpercent ~ chocolate + sugarpercent,
           data = candy_rankings)
chocolate_model <- lm(winpercent ~ chocolate, data = candy_rankings)
tidy(full_model)
##            term  estimate std.error  statistic      p.value
## 1   (Intercept) 34.849651  3.995614  8.7219761 4.121255e-13
## 2 chocolateTRUE 21.222221  4.004069  5.3001635 1.069829e-06
## 3    fruityTRUE  7.563380  3.828542  1.9755248 5.179194e-02
## 4   caramelTRUE  1.829913  3.614236  0.5063069 6.140878e-01
## 5      hardTRUE -6.389220  3.496633 -1.8272494 7.153689e-02
## 6       barTRUE  3.001409  3.766306  0.7969105 4.279534e-01
## 7  sugarpercent  9.396004  4.734918  1.9844066 5.077349e-02
## 8  pricepercent -3.337411  5.516023 -0.6050393 5.469319e-01
tidy(sub_model)
##            term estimate std.error statistic      p.value
## 1   (Intercept) 38.26211  2.552430 14.990465 3.308889e-25
## 2 chocolateTRUE 18.27331  2.468991  7.401124 1.058930e-10
## 3  sugarpercent  8.56662  4.354581  1.967266 5.253395e-02
rbind(glance(full_model), glance(sub_model))
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.4785593     0.4311556 11.09783  10.09540 6.800303e-09  8 -320.9825
## 2 0.4319631     0.4181085 11.22438  31.17841 8.500396e-11  3 -324.6201
##        AIC      BIC  deviance df.residual
## 1 659.9651 681.9489  9483.461          77
## 2 657.2403 667.0109 10330.907          82

Hypothesis testing approach

  • We may be interested in choosing between the models using a hypothesis test, e.g. for scientific purposes
  • The null hypothesis is that the smaller, simpler model is good enough
  • The alternative hypothesis is that at least one of the additional variables in the larger model provides a significant improvement
  • The residual sum of squares (RSS) is \[ \sum_{i=1}^n (y_i - \hat y_i)^2 \]
  • Let’s write \(RSS_f\) for the residual sum of squares when predictions \(\hat y\) are given by the full model, and \(RSS_s\) for the sub model
  • Note: \(RSS_f \leq RSS_s\), since the full model contains the sub model as a special case and can always fit the data at least as well
  • The \(F\) statistic for comparing these two models is \[ F = \frac{\left(\frac{RSS_s - RSS_f}{p - k}\right)}{\left(\frac{RSS_f}{n-p}\right)} \] where \(n\) is the number of observations, \(p\) is the number of coefficients in the full model, and \(k\) is the number of coefficients in the sub model
  • The observed value is compared to an \(F_{p-k, n - p}\) distribution to find a \(p\)-value (a by-hand computation is sketched below)
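
The same statistic can also be computed directly from the two fitted models. A minimal sketch, assuming the full_model and sub_model fit above; the result should match the anova() output shown below.

rss_f <- sum(residuals(full_model)^2)    # RSS of the full model
rss_s <- sum(residuals(sub_model)^2)     # RSS of the sub model
df_num <- df.residual(sub_model) - df.residual(full_model)  # p - k
df_den <- df.residual(full_model)                           # n - p
F_stat <- ((rss_s - rss_f) / df_num) / (rss_f / df_den)
pf(F_stat, df_num, df_den, lower.tail = FALSE)              # p-value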

  • In R, after fitting both models, use the anova() function on them like this:

anova(sub_model, full_model)
## Analysis of Variance Table
## 
## Model 1: winpercent ~ chocolate + sugarpercent
## Model 2: winpercent ~ chocolate + fruity + caramel + hard + bar + sugarpercent + 
##     pricepercent
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     82 10330.9                           
## 2     77  9483.5  5    847.45 1.3762 0.2426
  • Advantage: the hypothesis-test framing may be a natural way to answer the scientific or practical question that the regression models are being used to study
  • Disadvantage: it only applies to nested models, and hypothesis testing may not even be relevant if we just want the “best” model

Optimization and model selection algorithms

Simple linear regression solves the optimization problem

\[ \text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

The intercept-only (constant) model solves

\[ \text{minimize}_{\beta_0} \sum_{i=1}^n (y_i - \beta_0)^2 \]

More generally, we can minimize over some class of prediction functions \(f\),

\[ \text{minimize}_{f} \sum_{i=1}^n (y_i - f(x_i))^2 \]

corresponding to the model

\[ y_i = f(x_i) + e_i \]

Some methods (neural networks, for example) build the prediction function as a composition of simpler functions:

\[ \text{minimize}_{f_1, f_2, f_3} \sum_{i=1}^n (y_i - f_3[f_2(f_1[x_i])])^2 \]

  • If you compose linear functions, the resulting function is also linear, just with different coefficients
  • If you compose simple non-linear functions, the resulting function can be a very complicated non-linear one (a tiny illustration follows)
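
A tiny sketch (with made-up functions, not from the original notes) of why composition matters: composing linear functions only gives another linear function, while composing even one simple non-linear function with linear ones gives something more flexible.

f1 <- function(x) 2 * x + 1                 # linear
f2 <- function(x) -3 * x + 5                # linear
lin_comp <- function(x) f2(f1(x))           # -3*(2x + 1) + 5 = -6x + 2, still linear
relu <- function(x) pmax(x, 0)              # a simple non-linear function
nonlin_comp <- function(x) f2(relu(f1(x)))  # piecewise linear: already more complicated
lin_comp(0:3)                               # -6*(0:3) + 2 = 2, -4, -10, -16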

Model selection algorithms

  • Suppose there are multiple predictor variables, possibly many of them, and we don’t know which ones to include in the linear model
  • How about letting the computer pick them for us?
  • General definition: models with more variables are considered more complex

Create a sequence of models with increasing complexity

  • General idea: start with no variables and add one at a time, each time picking the “best” out of the remaining predictors
  • Special cases of this idea, each with a different definition of “best” for the next variable to add: forward stepwise, lasso, elastic net, and others
  • Now we just have to pick one model out of a (possibly small) list of models

Penalize complexity

  • Increasing complexity will make the model fit closer and closer to the data, reducing the RSS
  • It’s never impressive to make a model fit more closely to data unless there are some other constraints on the model that make it useful
  • Common approach: penalize complexity. Instead of just trying to minimize RSS (fit) by itself, try to minimize \[ \text{fit error} + \text{complexity} \]
  • Typically the fit error is measured by RSS and the complexity is quantified in some way related to the number of variables in the model
  • For example, one method, called AIC, is to pick the model with the lowest \[ n\log(RSS) + 2k \] where \(k\) is the number of variables in the model (a direct computation of this criterion is sketched below).
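
As a check, this criterion can be computed directly for the two models fit near the top of these notes (a minimal sketch; R’s built-in AIC() uses a slightly different formula, but the difference is an additive constant for models fit to the same data, so model comparisons agree).

aic_like <- function(model) {
  rss <- sum(residuals(model)^2)
  n <- length(residuals(model))
  k <- length(coef(model))   # here k counts all estimated coefficients, including the intercept
  n * log(rss) + 2 * k
}
c(full = aic_like(full_model), sub = aic_like(sub_model))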

  • The code below shows output for a sequence of models built by forward stepwise selection. At the end is a summary of the model chosen by the AIC method.

candy <- candy_rankings %>% select(-competitorname)  # drop the candy name column
full_model <- lm(winpercent ~ ., data = candy)       # model with every remaining predictor
const_model <- lm(winpercent ~ 1, data = candy)      # intercept-only model
selected_model <- step(const_model,
     scope = list(lower = const_model, upper = full_model),
     direction = "forward")                          # add one variable at a time, chosen by AIC
## Start:  AIC=458.09
## winpercent ~ 1
## 
##                    Df Sum of Sq   RSS    AIC
## + chocolate         1    7368.5 10818 415.94
## + bar               1    3361.7 14825 442.72
## + peanutyalmondy    1    3000.7 15186 444.77
## + fruity            1    2639.2 15548 446.77
## + pricepercent      1    2168.8 16018 449.30
## + crispedricewafer  1    1917.2 16270 450.63
## + hard              1    1752.1 16435 451.48
## + pluribus          1    1113.6 17073 454.72
## + sugarpercent      1     955.0 17232 455.51
## + caramel           1     828.4 17359 456.13
## + nougat            1     722.9 17464 456.65
## <none>                          18187 458.09
## 
## Step:  AIC=415.94
## winpercent ~ chocolate
## 
##                    Df Sum of Sq   RSS    AIC
## + peanutyalmondy    1    582.51 10236 413.24
## + sugarpercent      1    487.59 10331 414.02
## + fruity            1    336.12 10482 415.26
## <none>                          10818 415.94
## + crispedricewafer  1    237.84 10581 416.05
## + hard              1    172.00 10646 416.58
## + bar               1     69.75 10749 417.39
## + caramel           1     57.34 10761 417.49
## + nougat            1     26.82 10792 417.73
## + pluribus          1     20.06 10798 417.78
## + pricepercent      1     14.16 10804 417.83
## 
## Step:  AIC=413.24
## winpercent ~ chocolate + peanutyalmondy
## 
##                    Df Sum of Sq     RSS    AIC
## + fruity            1    547.28  9688.7 410.57
## + sugarpercent      1    434.23  9801.7 411.55
## + crispedricewafer  1    390.81  9845.2 411.93
## <none>                          10236.0 413.24
## + hard              1    122.32 10113.7 414.21
## + caramel           1     72.63 10163.4 414.63
## + bar               1     52.29 10183.7 414.80
## + pluribus          1      5.44 10230.5 415.19
## + nougat            1      4.19 10231.8 415.20
## + pricepercent      1      0.04 10235.9 415.24
## 
## Step:  AIC=410.57
## winpercent ~ chocolate + peanutyalmondy + fruity
## 
##                    Df Sum of Sq    RSS    AIC
## + crispedricewafer  1    449.97 9238.7 408.52
## + sugarpercent      1    365.51 9323.2 409.30
## + hard              1    260.55 9428.2 410.25
## <none>                          9688.7 410.57
## + caramel           1    214.82 9473.9 410.66
## + bar               1    105.95 9582.8 411.63
## + nougat            1     19.77 9668.9 412.39
## + pluribus          1     14.01 9674.7 412.44
## + pricepercent      1      3.48 9685.2 412.53
## 
## Step:  AIC=408.52
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer
## 
##                Df Sum of Sq    RSS    AIC
## + sugarpercent  1    326.01 8912.7 407.47
## + hard          1    242.01 8996.7 408.27
## <none>                      9238.7 408.52
## + caramel       1    146.27 9092.5 409.17
## + nougat        1     72.08 9166.7 409.86
## + bar           1     17.04 9221.7 410.37
## + pricepercent  1      8.19 9230.6 410.45
## + pluribus      1      0.69 9238.0 410.52
## 
## Step:  AIC=407.47
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer + 
##     sugarpercent
## 
##                Df Sum of Sq    RSS    AIC
## + hard          1    327.07 8585.7 406.29
## <none>                      8912.7 407.47
## + pricepercent  1     84.45 8828.3 408.66
## + caramel       1     67.21 8845.5 408.83
## + nougat        1     41.81 8870.9 409.07
## + bar           1     11.39 8901.3 409.36
## + pluribus      1      6.61 8906.1 409.41
## 
## Step:  AIC=406.29
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer + 
##     sugarpercent + hard
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      8585.7 406.29
## + pricepercent  1   125.077 8460.6 407.04
## + caramel       1    61.164 8524.5 407.68
## + nougat        1    31.646 8554.0 407.98
## + pluribus      1    30.632 8555.0 407.99
## + bar           1     6.049 8579.6 408.23
summary(selected_model)
## 
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity + 
##     crispedricewafer + sugarpercent + hard, data = candy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5030  -6.7026   0.5668   5.8802  24.0107 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            32.941      3.518   9.365 2.12e-14 ***
## chocolateTRUE          19.147      3.587   5.338 8.98e-07 ***
## peanutyalmondyTRUE      9.483      3.446   2.752  0.00737 ** 
## fruityTRUE              8.881      3.561   2.494  0.01473 *  
## crispedricewaferTRUE    8.385      4.484   1.870  0.06525 .  
## sugarpercent            7.979      4.129   1.932  0.05693 .  
## hardTRUE               -5.669      3.289  -1.724  0.08871 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 78 degrees of freedom
## Multiple R-squared:  0.5279, Adjusted R-squared:  0.4916 
## F-statistic: 14.54 on 6 and 78 DF,  p-value: 4.622e-11
  • Check: is the adjusted R-squared for this model better than the one for sub_model back near the top of these notes? How much better?
rbind(glance(selected_model), glance(sub_model))
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.5279241     0.4916106 10.49155  14.53794 4.621890e-11  7 -316.7557
## 2 0.4319631     0.4181085 11.22438  31.17841 8.500396e-11  3 -324.6201
##        AIC      BIC  deviance df.residual
## 1 649.5113 669.0525  8585.661          78
## 2 657.2403 667.0109 10330.907          82

Overfitting, model complexity, bias-variance tradeoff

gm2007 <- gapminder %>% filter(year == "2007")
# Two loess smooths of the same data: a smooth fit (span = 1, solid)
# and a much wigglier fit (span = 0.2, dashed)
ggplot(gm2007, aes(gdpPercap, lifeExp)) + geom_point() +
  stat_smooth(se = FALSE, method = "loess", span = 1) +
  stat_smooth(se = FALSE, method = "loess", span = 0.2, linetype = 5)

What if we don’t know how to penalize complexity?

  • In a previous example we used a method called AIC, which measures how well a model fits while penalizing the number of variables it uses
  • But what if we aren’t using a linear model, but some other machine learning method for which classical statistical theory provides no analogue of the AIC?
  • There is a general method called validation that works for almost every kind of model
  • First, we’ll go into more detail about why too much complexity is bad

Overfitting

Validation

n <- 300   # observations
p <- 100   # predictors
beta <- c(rep(1, 3), rep(0, p - 3))   # only the first 3 predictors truly affect y
X <- matrix(rnorm(n*p), nrow = n)
y <- X %*% beta + rnorm(n)
data <- data.frame(y = y, X = X)
train <- data %>% sample_frac(0.5)    # random half for training
test <- setdiff(data, train)          # remaining half held out for testing
simple_model <- lm(y ~ X.1 + X.2 + X.3, data = train)  # the three truly relevant variables
full_model <- lm(y ~ ., data = train) # the . means "every variable in the data"
rbind(glance(simple_model), glance(full_model))
##   r.squared adj.r.squared     sigma  statistic      p.value  df    logLik
## 1 0.7882191     0.7838674 0.9699262 181.130554 5.311113e-49   4 -206.2333
## 2 0.9334304     0.7975742 0.9386669   6.870722 1.197464e-11 101 -119.4355
##        AIC      BIC  deviance df.residual
## 1 422.4667 437.5199 137.35051         146
## 2 442.8709 749.9557  43.17368          49
# Squared prediction errors on the held-out test set for each model
errors <- data.frame(
  model = c(rep("simple", n/2), rep("full", n/2)),
  test_error = c((predict(simple_model, newdata = test) - test$y)^2,
                 (predict(full_model, newdata = test) - test$y)^2))

# The full model fit the training data more closely (higher R-squared above),
# but it predicts worse on the test data
errors %>% group_by(model) %>% summarize(
  median = median(test_error),
  mean = mean(test_error)
)
## # A tibble: 2 x 3
##   model  median  mean
##   <fct>   <dbl> <dbl>
## 1 full    1.23  2.87 
## 2 simple  0.458 0.914

Cross-validation

x_train <- as.matrix(train[,-1])   # predictor matrix (drop the y column)
y_train <- train[,1]               # response
x_test <- as.matrix(test[,-1])
y_test <- test[,1]
# cv.glmnet (from the glmnet package) fits the lasso and uses
# 5-fold cross-validation to choose the penalty
model <- cv.glmnet(x = x_train, y = y_train, nfolds = 5)
autoplot(model)

coef(model)
## 101 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 0.0680559
## X.1         0.7057829
## X.2         0.6983142
## X.3         0.8568229
## X.4         .        
## X.5         .        
## X.6         .        
## X.7         .        
## X.8         .        
## X.9         .        
## X.10        .        
## X.11        .        
## X.12        .        
## X.13        .        
## X.14        .        
## X.15        .        
## X.16        .        
## X.17        .        
## X.18        .        
## X.19        .        
## X.20        .        
## X.21        .        
## X.22        .        
## X.23        .        
## X.24        .        
## X.25        .        
## X.26        .        
## X.27        .        
## X.28        .        
## X.29        .        
## X.30        .        
## X.31        .        
## X.32        .        
## X.33        .        
## X.34        .        
## X.35        .        
## X.36        .        
## X.37        .        
## X.38        .        
## X.39        .        
## X.40        .        
## X.41        .        
## X.42        .        
## X.43        .        
## X.44        .        
## X.45        .        
## X.46        .        
## X.47        .        
## X.48        .        
## X.49        .        
## X.50        .        
## X.51        .        
## X.52        .        
## X.53        .        
## X.54        .        
## X.55        .        
## X.56        .        
## X.57        .        
## X.58        .        
## X.59        .        
## X.60        .        
## X.61        .        
## X.62        .        
## X.63        .        
## X.64        .        
## X.65        .        
## X.66        .        
## X.67        .        
## X.68        .        
## X.69        .        
## X.70        .        
## X.71        .        
## X.72        .        
## X.73        .        
## X.74        .        
## X.75        .        
## X.76        .        
## X.77        .        
## X.78        .        
## X.79        .        
## X.80        .        
## X.81        .        
## X.82        .        
## X.83        .        
## X.84        .        
## X.85        .        
## X.86        .        
## X.87        .        
## X.88        .        
## X.89        .        
## X.90        .        
## X.91        .        
## X.92        .        
## X.93        .        
## X.94        .        
## X.95        .        
## X.96        .        
## X.97        .        
## X.98        .        
## X.99        .        
## X.100       .
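
The cross-validated lasso kept only X.1, X.2, and X.3 (the three predictors that truly have nonzero coefficients in the simulation) and set everything else to zero. A minimal follow-up sketch, using the x_test and y_test created above, estimates this model's test error so it can be compared with the lm test errors computed earlier.

lasso_pred <- predict(model, newx = x_test, s = "lambda.min")  # penalty with lowest cross-validated error
mean((lasso_pred - y_test)^2)                                  # mean squared test error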

Summary

  • Many cutting edge methods in ML/AI rely on optimization, just like linear regression
  • More complex models have more parameters and/or more complex types of functions to predict the outcome variable
  • Too much model complexity can lead to overfitting: for example, including too many predictor variables can make it seem like the model predicts well, when it is really just fitting the training data closely
  • Test or validation prediction accuracy is a higher standard
  • Keep this image in mind, and remember the bias-variance tradeoff
  • Cross-validation is a very useful method (when the sample size is large enough) for automatically picking models with low test error