#### Outline

• Comparing two nested models with the $$F$$-test
• Optimization and model selection algorithms
• Overfitting, model complexity, bias-variance tradeoff
• Validation and cross-validation

## The $$F$$-test for nested models

• A method that uses hypothesis tests to compare linear models
• The models must be nested: the larger of the two models contains all of the predictor variables in the smaller one, plus some extra variables
• Motivating question: do we really need this larger model, or would the smaller, simpler one be sufficient?
• Larger model, often called the “full” model, with $$p$$ predictors: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \cdots + \beta_p x_{ip} + e_i$
• Simpler model, often called the “sub” model, with $$k < p$$ predictors: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i$
• The sub-model omits variables $$k+1, k+2, \ldots, p$$
• (We can always rearrange the variables, so it doesn’t matter if we’re leaving out the first, last, or variables in the middle)
full_model <- lm(winpercent ~ chocolate +
                   fruity + caramel + hard + bar +
                   sugarpercent + pricepercent,
                 data = candy_rankings)
sub_model <- lm(winpercent ~ chocolate + sugarpercent,
                data = candy_rankings)
chocolate_model <- lm(winpercent ~ chocolate, data = candy_rankings)
tidy(full_model)
##            term  estimate std.error  statistic      p.value
## 1   (Intercept) 34.849651  3.995614  8.7219761 4.121255e-13
## 2 chocolateTRUE 21.222221  4.004069  5.3001635 1.069829e-06
## 3    fruityTRUE  7.563380  3.828542  1.9755248 5.179194e-02
## 4   caramelTRUE  1.829913  3.614236  0.5063069 6.140878e-01
## 5      hardTRUE -6.389220  3.496633 -1.8272494 7.153689e-02
## 6       barTRUE  3.001409  3.766306  0.7969105 4.279534e-01
## 7  sugarpercent  9.396004  4.734918  1.9844066 5.077349e-02
## 8  pricepercent -3.337411  5.516023 -0.6050393 5.469319e-01
tidy(sub_model)
##            term estimate std.error statistic      p.value
## 1   (Intercept) 38.26211  2.552430 14.990465 3.308889e-25
## 2 chocolateTRUE 18.27331  2.468991  7.401124 1.058930e-10
## 3  sugarpercent  8.56662  4.354581  1.967266 5.253395e-02
rbind(glance(full_model), glance(sub_model))
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.4785593     0.4311556 11.09783  10.09540 6.800303e-09  8 -320.9825
## 2 0.4319631     0.4181085 11.22438  31.17841 8.500396e-11  3 -324.6201
##        AIC      BIC  deviance df.residual
## 1 659.9651 681.9489  9483.461          77
## 2 657.2403 667.0109 10330.907          82
• How could we compare and choose between these models?

• (Could compare residual standard error or adjusted R-squared)

### Hypothesis testing approach

• We may be interested in choosing between the models using a hypothesis test, e.g. for scientific purposes
• The null hypothesis is that the smaller, simpler model is good enough
• The alternative hypothesis is that at least one of the additional variables in the larger model provides a significant improvement
• The residual sum of squares (RSS) is $\sum_{i=1}^n (y_i - \hat y_i)^2$
• Let’s write $$RSS_f$$ for the residual sum of squares when predictions $$\hat y$$ are given by the full model, and $$RSS_s$$ for the sub model
• Note: $$RSS_f \leq RSS_s$$, because the full model can always reproduce the sub model’s fit by setting the extra coefficients to zero, so its least-squares fit can only do at least as well
• The $$F$$ statistic for comparing these two models is $F = \frac{\left(\frac{RSS_s - RSS_f}{p - k}\right)}{\left(\frac{RSS_f}{n-p}\right)}$
• The observed value is compared to an $$F_{p-k, n - p}$$ distribution to find a $$p$$-value

• In R, after fitting both models, pass them to the anova() function, smaller model first:

anova(sub_model, full_model)
## Analysis of Variance Table
##
## Model 1: winpercent ~ chocolate + sugarpercent
## Model 2: winpercent ~ chocolate + fruity + caramel + hard + bar + sugarpercent +
##     pricepercent
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     82 10330.9
## 2     77  9483.5  5    847.45 1.3762 0.2426
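The $$F$$ statistic and $$p$$-value in the anova() table can be reproduced by hand from the RSS and degrees-of-freedom columns above, as a quick sanity check on the formula:

```r
# Recompute the F statistic from the RSS values printed by anova() above.
rss_s <- 10330.9   # sub model RSS, 82 residual degrees of freedom
rss_f <- 9483.5    # full model RSS, 77 residual degrees of freedom
f_stat <- ((rss_s - rss_f) / (82 - 77)) / (rss_f / 77)
f_stat                                                   # about 1.376, matching the table
pf(f_stat, df1 = 82 - 77, df2 = 77, lower.tail = FALSE)  # p-value, about 0.24
```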
• Advantage: hypothesis test framing may be a natural way to answer the scientific or practical question that the regression models are being used to study
• Disadvantage: only for nested models, maybe hypothesis testing isn’t relevant and we just want the “best” model

## Optimization and model selection algorithms

• Optimization is an area of math that develops and analyzes methods for finding maxima or minima of functions
• Much of it is based on calculus, but not all
• (It’s extremely useful and, along with linear algebra, a common prerequisite for much of the most successful research and practice in machine learning, AI, and statistics. If you have the chance to study optimization and it aligns with your goals, I strongly recommend it)

• We have already thought of linear regression as an optimization problem: find the best line (or plane/hyperplane for multiple regression), where best means smallest sum of squared errors

$\text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$

• This notation means to pick the parameters $$\beta_0$$ and $$\beta_1$$ to minimize the sum of squared errors
• The sample mean can also be thought of as the solution to an optimization problem: find the best constant to minimize the squared errors

$\text{minimize}_{\beta_0} \sum_{i=1}^n (y_i - \beta_0)^2$
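• To see why the sample mean is the solution, set the derivative with respect to $$\beta_0$$ equal to zero:

$\frac{d}{d\beta_0} \sum_{i=1}^n (y_i - \beta_0)^2 = -2 \sum_{i=1}^n (y_i - \beta_0) = 0 \quad \Rightarrow \quad \hat\beta_0 = \frac{1}{n} \sum_{i=1}^n y_i = \bar y$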

• Many methods in statistics and machine learning can be described as optimization problems, with possibly more complicated kinds of functions

$\text{minimize}_{f} \sum_{i=1}^n (y_i - f(x_i))^2$

• The resulting model predicts $$y$$ using the function $$f(x)$$:

$y_i = f(x_i) + e_i$

• So far we have used a linear function: $$f(x) = \beta_0 + \beta_1 x$$
• A method that has recently become very popular, deep learning, constructs complicated non-linear functions by composing many “layers” of simple non-linear functions

$\text{minimize}_{f_1, f_2, f_3} \sum_{i=1}^n (y_i - f_3[f_2(f_1[x_i])])^2$

• (If you compose linear functions, the resulting function is also linear, just with different coefficients)
• (If you compose simple non-linear functions, the resulting function can be a very complicated non-linear one)
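• For example, composing two linear functions $$f_1(x) = a_1 x + b_1$$ and $$f_2(u) = a_2 u + b_2$$ gives

$f_2(f_1(x)) = a_2(a_1 x + b_1) + b_2 = (a_2 a_1)\, x + (a_2 b_1 + b_2)$

• which is again linear, just with different coefficients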

• The issues we talk about today are not just issues with linear models, but with all kinds of similar methods in stats/ML/data science/AI
• Common elements: data, modeling assumptions about what kinds of functions are reasonable, and algorithms to find a function in that class that “best” fits the data
• Whenever we’re trying to get the “best” of something, there is a danger of overfitting…

### Model selection algorithms

• Suppose there are multiple predictor variables, possibly many of them, and we don’t know which ones to include in the linear model
• How about letting the computer pick them for us?
• General definition: models with more variables are considered more complex

### Create a sequence of models with increasing complexity

• General idea: start with no variables and add one at a time, each time picking the “best” out of the remaining predictors
• Special cases of this idea, each with a different definition of “best” for the next variable to add: forward stepwise, lasso, elastic net, and others
• Now we just have to pick one model out of a (possibly small) list of models

### Penalize complexity

• Increasing complexity will make the model fit closer and closer to the data, reducing the RSS
• It’s never impressive to make a model fit more closely to data unless there are some other constraints on the model that make it useful
• Common approach: penalize complexity. Instead of just trying to minimize RSS (fit) by itself, try to minimize $\text{fit error} + \text{complexity}$
• Typically the fit error is measured by RSS and the complexity is quantified in some way related to the number of variables in the model
• For example, one method, called AIC, picks the model with the lowest value of $n\log(RSS) + 2k$, where $$k$$ is the number of variables in the model

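As a minimal sketch of this rule (using made-up RSS values, not a real fit), the criterion can be computed directly and the model with the lowest value chosen:

```r
# Sketch of the AIC-style rule n*log(RSS) + 2k, with hypothetical RSS values.
n <- 85                                        # sample size
rss <- c(bigger = 9483.5, smaller = 10330.9)   # hypothetical RSS for two models
k   <- c(bigger = 7,      smaller = 2)         # number of predictors in each
aic_like <- n * log(rss) + 2 * k
aic_like                      # lower is better
names(which.min(aic_like))    # here the penalty makes the smaller model win
```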
• This code below shows output for a sequence of models. At the end is a summary of the model chosen by the AIC method.

candy <- candy_rankings %>% select(-competitorname)
full_model <- lm(winpercent ~ ., data = candy)
const_model <- lm(winpercent ~ 1, data = candy)
selected_model <- step(const_model,
                       scope = list(lower = const_model, upper = full_model),
                       direction = "forward")
## Start:  AIC=458.09
## winpercent ~ 1
##
##                    Df Sum of Sq   RSS    AIC
## + chocolate         1    7368.5 10818 415.94
## + bar               1    3361.7 14825 442.72
## + peanutyalmondy    1    3000.7 15186 444.77
## + fruity            1    2639.2 15548 446.77
## + pricepercent      1    2168.8 16018 449.30
## + crispedricewafer  1    1917.2 16270 450.63
## + hard              1    1752.1 16435 451.48
## + pluribus          1    1113.6 17073 454.72
## + sugarpercent      1     955.0 17232 455.51
## + caramel           1     828.4 17359 456.13
## + nougat            1     722.9 17464 456.65
## <none>                          18187 458.09
##
## Step:  AIC=415.94
## winpercent ~ chocolate
##
##                    Df Sum of Sq   RSS    AIC
## + peanutyalmondy    1    582.51 10236 413.24
## + sugarpercent      1    487.59 10331 414.02
## + fruity            1    336.12 10482 415.26
## <none>                          10818 415.94
## + crispedricewafer  1    237.84 10581 416.05
## + hard              1    172.00 10646 416.58
## + bar               1     69.75 10749 417.39
## + caramel           1     57.34 10761 417.49
## + nougat            1     26.82 10792 417.73
## + pluribus          1     20.06 10798 417.78
## + pricepercent      1     14.16 10804 417.83
##
## Step:  AIC=413.24
## winpercent ~ chocolate + peanutyalmondy
##
##                    Df Sum of Sq     RSS    AIC
## + fruity            1    547.28  9688.7 410.57
## + sugarpercent      1    434.23  9801.7 411.55
## + crispedricewafer  1    390.81  9845.2 411.93
## <none>                          10236.0 413.24
## + hard              1    122.32 10113.7 414.21
## + caramel           1     72.63 10163.4 414.63
## + bar               1     52.29 10183.7 414.80
## + pluribus          1      5.44 10230.5 415.19
## + nougat            1      4.19 10231.8 415.20
## + pricepercent      1      0.04 10235.9 415.24
##
## Step:  AIC=410.57
## winpercent ~ chocolate + peanutyalmondy + fruity
##
##                    Df Sum of Sq    RSS    AIC
## + crispedricewafer  1    449.97 9238.7 408.52
## + sugarpercent      1    365.51 9323.2 409.30
## + hard              1    260.55 9428.2 410.25
## <none>                          9688.7 410.57
## + caramel           1    214.82 9473.9 410.66
## + bar               1    105.95 9582.8 411.63
## + nougat            1     19.77 9668.9 412.39
## + pluribus          1     14.01 9674.7 412.44
## + pricepercent      1      3.48 9685.2 412.53
##
## Step:  AIC=408.52
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer
##
##                Df Sum of Sq    RSS    AIC
## + sugarpercent  1    326.01 8912.7 407.47
## + hard          1    242.01 8996.7 408.27
## <none>                      9238.7 408.52
## + caramel       1    146.27 9092.5 409.17
## + nougat        1     72.08 9166.7 409.86
## + bar           1     17.04 9221.7 410.37
## + pricepercent  1      8.19 9230.6 410.45
## + pluribus      1      0.69 9238.0 410.52
##
## Step:  AIC=407.47
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
##     sugarpercent
##
##                Df Sum of Sq    RSS    AIC
## + hard          1    327.07 8585.7 406.29
## <none>                      8912.7 407.47
## + pricepercent  1     84.45 8828.3 408.66
## + caramel       1     67.21 8845.5 408.83
## + nougat        1     41.81 8870.9 409.07
## + bar           1     11.39 8901.3 409.36
## + pluribus      1      6.61 8906.1 409.41
##
## Step:  AIC=406.29
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
##     sugarpercent + hard
##
##                Df Sum of Sq    RSS    AIC
## <none>                      8585.7 406.29
## + pricepercent  1   125.077 8460.6 407.04
## + caramel       1    61.164 8524.5 407.68
## + nougat        1    31.646 8554.0 407.98
## + pluribus      1    30.632 8555.0 407.99
## + bar           1     6.049 8579.6 408.23
summary(selected_model)
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity +
##     crispedricewafer + sugarpercent + hard, data = candy)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21.5030  -6.7026   0.5668   5.8802  24.0107
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            32.941      3.518   9.365 2.12e-14 ***
## chocolateTRUE          19.147      3.587   5.338 8.98e-07 ***
## peanutyalmondyTRUE      9.483      3.446   2.752  0.00737 **
## fruityTRUE              8.881      3.561   2.494  0.01473 *
## crispedricewaferTRUE    8.385      4.484   1.870  0.06525 .
## sugarpercent            7.979      4.129   1.932  0.05693 .
## hardTRUE               -5.669      3.289  -1.724  0.08871 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 78 degrees of freedom
## Multiple R-squared:  0.5279, Adjusted R-squared:  0.4916
## F-statistic: 14.54 on 6 and 78 DF,  p-value: 4.622e-11
• Check: is the adjusted R-squared for this model better than the one for sub_model back near the top of these notes? How much better?
rbind(glance(selected_model), glance(sub_model))
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.5279241     0.4916106 10.49155  14.53794 4.621890e-11  7 -316.7557
## 2 0.4319631     0.4181085 11.22438  31.17841 8.500396e-11  3 -324.6201
##        AIC      BIC  deviance df.residual
## 1 649.5113 669.0525  8585.661          78
## 2 657.2403 667.0109 10330.907          82

## Overfitting, model complexity, bias-variance tradeoff

• In general, models with greater complexity will have lower bias, but higher variance in their predictions

• Example: for a single non-linear relationship (instead of having multiple variables), the degree of nonlinearity is a measure of complexity
• A more “wiggly” curve comes from a more complex model: it can wiggle around to get closer to the points in the data, making it less biased but giving it more variance
• The dashed line below has higher complexity (more variance, less bias)

gm2007 <- gapminder %>% filter(year == "2007")
ggplot(gm2007, aes(gdpPercap, lifeExp)) +
  geom_point() +
  stat_smooth(se = FALSE, method = "loess", span = 1) +
  stat_smooth(se = FALSE, method = "loess", span = 0.2, linetype = 5)

### What if we don’t know how to penalize complexity?

• In a previous example, we used AIC, a measure of model quality that penalizes the number of variables in the model
• But what if we aren’t using a linear model, but some other machine learning method where there is no classical statistical theory developing an analogous version of the AIC?
• There is a general method called validation that works for almost every kind of model
• First, we’ll go into more detail about why too much complexity is bad

## Overfitting

• Spurious correlations
• History example
• Math example

• In our previous example, model complexity was measured using the number of predictor variables
• But it may not always be easy to figure out how to measure or penalize model complexity, so it would be nice to have a more general method to avoid overfitting
• Validation and cross-validation are very general, practical, popular, and especially appropriate if there is a large sample of data

## Validation

• Split the data into two sets: a training set and a test set
• Use optimization on the training set to fit models
• Measure the accuracy of fitted models on the test set
• Plot this test accuracy or test error rate as a function of model complexity
• Choose complexity that gives the best test error

• In this example I create a fake data set (simulation) with sample size $$n = 300$$ and $$p = 100$$ predictor variables. The outcome variable $$y_i = x_{i1} + x_{i2} + x_{i3} + e_i$$ only depends on the first 3 out of the 100 predictors.

n <- 300
p <- 100
beta <- c(rep(1, 3), rep(0, p - 3))
X <- matrix(rnorm(n*p), nrow = n)
y <- X %*% beta + rnorm(n)
data <- data.frame(y = y, X = X)
• Split the data:
train <- data %>% sample_frac(0.5)
test <- setdiff(data, train)
• Fit models on training data
simple_model <- lm(y ~ X.1 + X.2 + X.3, data = train)
full_model <- lm(y ~ ., data = train) # the . means "every variable in the data"
rbind(glance(simple_model), glance(full_model))
##   r.squared adj.r.squared     sigma  statistic      p.value  df    logLik
## 1 0.7882191     0.7838674 0.9699262 181.130554 5.311113e-49   4 -206.2333
## 2 0.9334304     0.7975742 0.9386669   6.870722 1.197464e-11 101 -119.4355
##        AIC      BIC  deviance df.residual
## 1 422.4667 437.5199 137.35051         146
## 2 442.8709 749.9557  43.17368          49
• Which model is better, the one with 3 predictors or the one with 100 predictors including those 3?

• Evaluate models on the test data:

errors <- data.frame(
  model = c(rep("simple", n/2), rep("full", n/2)),
  test_error = c((predict(simple_model, newdata = test) - test$y)^2,
                 (predict(full_model, newdata = test) - test$y)^2))

errors %>%
  group_by(model) %>%
  summarize(
    median = median(test_error),
    mean = mean(test_error)
  )
## # A tibble: 2 x 3
##   model  median  mean
##   <fct>   <dbl> <dbl>
## 1 full    1.23  2.87
## 2 simple  0.458 0.914
• This was kind of cheating, because I knew the first 3 variables were the right ones to use in the simple model
• What if we don’t know that? Can we use a model selection algorithm to pick the right variables?

## Cross-validation

• Split the data into $$K$$ sets, often $$K = 5$$ or 10, for example.
• Use one of the sets as a test set, fitting models on the remaining $$K-1$$ sets and measuring their accuracy on this test set
• Repeat the above for each of the $$K$$ sets
• There are now $$K$$ measurements of accuracy for each model
• Average these $$K$$ measurements
• Plot this average as a function of model complexity
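The steps above can be written out by hand for a small simulated data set (a minimal sketch; in practice functions like cv.glmnet automate this):

```r
# Hand-written K-fold cross-validation, mirroring the steps listed above.
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] + X[, 2] + X[, 3] + rnorm(n)   # y depends only on first 3 predictors
df <- data.frame(y = y, X = X)             # columns y, X.1, ..., X.10

K <- 5
folds <- sample(rep(1:K, length.out = n))  # randomly assign each row to a fold
cv_mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ X.1 + X.2 + X.3, data = df[folds != k, ])  # fit on K-1 folds
  preds <- predict(fit, newdata = df[folds == k, ])        # predict held-out fold
  mean((df$y[folds == k] - preds)^2)                       # accuracy on fold k
})
mean(cv_mse)  # average the K measurements; close to 1, the noise variance
```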

• The code below uses the glmnet library, whose cv.glmnet() function uses cross-validation to pick variables automatically via a method called the lasso

library(glmnet)  # for cv.glmnet()
x_train <- as.matrix(train[, -1])
y_train <- train[, 1]
x_test <- as.matrix(test[, -1])
y_test <- test[, 1]
model <- cv.glmnet(x = x_train, y = y_train, nfolds = 5)
• Let’s look at a plot of the cross-validation model accuracy as a function of complexity
autoplot(model)  # autoplot() for cv.glmnet objects comes from the ggfortify package

• What are the fitted coefficients of the best model?
coef(model)
## 101 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 0.0680559
## X.1         0.7057829
## X.2         0.6983142
## X.3         0.8568229
## X.4         .
##  ...        .
## X.100       .
• Cross-validation and the lasso found the right variables to include in the model! Only the first 3, and the rest of the variables are not included

• As we saw before, the model using only the first three variables has better test error than the model which uses all 100 variables

### Summary

• Many cutting edge methods in ML/AI rely on optimization, just like linear regression
• More complex models have more parameters and/or more complex types of functions to predict the outcome variable
• Too much model complexity can lead to overfitting: for example, including too many predictor variables can make it seem like the model does a better job of prediction than it actually does on new data
• Test or validation prediction accuracy is a higher standard
• Keep the picture of the overly wiggly curve in mind, and remember the bias-variance tradeoff
• Cross-validation is a very useful method (when the sample size is large enough) to automatically pick models with low test error