- Comparing two nested models with the \(F\)-test
- Optimization and model selection algorithms
- Overfitting, model complexity, bias-variance tradeoff
- Validation and cross-validation

- A method that uses hypothesis tests to compare linear models
- The models must be nested: the larger of the two models contains all of the predictor variables in the smaller one, plus some extra variables
- Motivating question: do we really need this larger model, or would the smaller, simpler one be sufficient?
- Larger model, often called the “full” model, with \(p\) predictors: \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \cdots + \beta_p x_{ip} + e_i \]
- Simpler model, often called the “sub” model, with \(k < p\) predictors: \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i \]
- The sub-model omits variables \(k+1, k+2, \ldots, p\)
- (We can always rearrange the variables, so it doesn’t matter whether we’re leaving out the first, the last, or variables in the middle)

```
full_model <- lm(winpercent ~ chocolate + fruity + caramel + hard + bar +
                   sugarpercent + pricepercent,
                 data = candy_rankings)
sub_model <- lm(winpercent ~ chocolate + sugarpercent,
                data = candy_rankings)
chocolate_model <- lm(winpercent ~ chocolate, data = candy_rankings)
tidy(full_model)
```

```
## term estimate std.error statistic p.value
## 1 (Intercept) 34.849651 3.995614 8.7219761 4.121255e-13
## 2 chocolateTRUE 21.222221 4.004069 5.3001635 1.069829e-06
## 3 fruityTRUE 7.563380 3.828542 1.9755248 5.179194e-02
## 4 caramelTRUE 1.829913 3.614236 0.5063069 6.140878e-01
## 5 hardTRUE -6.389220 3.496633 -1.8272494 7.153689e-02
## 6 barTRUE 3.001409 3.766306 0.7969105 4.279534e-01
## 7 sugarpercent 9.396004 4.734918 1.9844066 5.077349e-02
## 8 pricepercent -3.337411 5.516023 -0.6050393 5.469319e-01
```

`tidy(sub_model)`

```
## term estimate std.error statistic p.value
## 1 (Intercept) 38.26211 2.552430 14.990465 3.308889e-25
## 2 chocolateTRUE 18.27331 2.468991 7.401124 1.058930e-10
## 3 sugarpercent 8.56662 4.354581 1.967266 5.253395e-02
```

`rbind(glance(full_model), glance(sub_model))`

```
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.4785593 0.4311556 11.09783 10.09540 6.800303e-09 8 -320.9825
## 2 0.4319631 0.4181085 11.22438 31.17841 8.500396e-11 3 -324.6201
## AIC BIC deviance df.residual
## 1 659.9651 681.9489 9483.461 77
## 2 657.2403 667.0109 10330.907 82
```

How could we compare and choose between these models?

(Could compare residual standard error or adjusted R-squared)

- We may be interested in choosing between the models using a hypothesis test, e.g. for scientific purposes
- The null hypothesis is that the smaller, simpler model is good enough
- The alternative hypothesis is that **at least one** of the additional variables in the larger model provides a significant improvement
- The residual sum of squares (**RSS**) is \[ \sum_{i=1}^n (y_i - \hat y_i)^2 \]
- Let’s write \(RSS_f\) for the residual sum of squares when the predictions \(\hat y\) come from the full model, and \(RSS_s\) for when they come from the sub model
- Note: \(RSS_f \leq RSS_s\), because the full model can fit the data at least as closely as the sub model
- The \(F\) statistic for comparing these two models is \[ F = \frac{\left(\frac{RSS_s - RSS_f}{p - k}\right)}{\left(\frac{RSS_f}{n-p-1}\right)} \]
- The observed value is compared to an \(F_{p-k,\, n-p-1}\) distribution to find a \(p\)-value
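Using the RSS values and residual degrees of freedom reported in the output above for the two candy models, we can reproduce the `anova()` computation by hand. The numbers are hard-coded from that output so this sketch is self-contained:

```r
rss_sub  <- 10330.9  # RSS of the sub model  (82 residual df)
rss_full <- 9483.5   # RSS of the full model (77 residual df)
df_extra <- 82 - 77  # p - k: number of extra variables being tested
df_resid <- 77       # residual df of the full model

f_stat  <- ((rss_sub - rss_full) / df_extra) / (rss_full / df_resid)
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_value)  # about 1.376 and 0.243, matching anova()
```

The large \(p\)-value means we fail to reject the null hypothesis: the five extra variables do not provide a significant improvement over the sub model.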

In `R`, after fitting both models, use the `anova()` function on them like this:

`anova(sub_model, full_model)`

```
## Analysis of Variance Table
##
## Model 1: winpercent ~ chocolate + sugarpercent
## Model 2: winpercent ~ chocolate + fruity + caramel + hard + bar + sugarpercent +
## pricepercent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 10330.9
## 2 77 9483.5 5 847.45 1.3762 0.2426
```

- Advantage: hypothesis test framing may be a natural way to answer the scientific or practical question that the regression models are being used to study
- Disadvantage: only works for nested models; also, maybe hypothesis testing isn’t relevant and we just want the “best” model

- Optimization is an area of math that develops and analyzes methods for finding maximums or minimums of functions
- Much of it is based on calculus, but not all
(It’s extremely useful and one of the common prerequisites for many of the most successful researchers and practitioners of machine learning, AI, and even statistics. Along with linear algebra, if you have the chance to study optimization and it aligns with your goals, I strongly recommend it)

We have already thought of linear regression as an optimization problem: find the best line (or plane/hyperplane for multiple regression), where best means smallest sum of squared errors

\[ \text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

- This notation means to pick the parameters \(\beta_0\) and \(\beta_1\) to minimize the sum of squared errors
- The sample mean can also be thought of as the solution to an optimization problem: find the best constant to minimize the squared errors

\[ \text{minimize}_{\beta_0} \sum_{i=1}^n (y_i - \beta_0)^2 \]
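Both optimization problems can be checked numerically with R's general-purpose optimizers. This is only an illustrative sketch on simulated data; in practice `lm()` and `mean()` use closed-form solutions rather than a numerical search:

```r
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50)

# Minimize the sum of squared errors over (beta0, beta1) numerically
sse <- function(beta) sum((y - beta[1] - beta[2] * x)^2)
fit <- optim(c(0, 0), sse)
fit$par  # agrees with coef(lm(y ~ x)) up to numerical tolerance

# The best constant predictor is the sample mean
best_const <- optimize(function(b0) sum((y - b0)^2), interval = range(y))
best_const$minimum  # agrees with mean(y)
```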

- Many methods in statistics and machine learning can be described as optimization problems, with possibly more complicated kinds of functions

\[ \text{minimize}_{f} \sum_{i=1}^n (y_i - f(x_i))^2 \]

- The resulting model predicts \(y\) using the function \(f(x)\):

\[ y_i = f(x_i) + e_i \]

- So far we have used a linear function: \(f(x) = \beta_0 + \beta_1 x\)
- A method that has recently become very popular, deep learning, constructs complicated non-linear functions by composing many “layers” of simple non-linear functions

\[ \text{minimize}_{f_1, f_2, f_3} \sum_{i=1}^n (y_i - f_3[f_2(f_1[x_i])])^2 \]

- (If you compose linear functions, the resulting function is also linear, just with different coefficients)
- (If you compose simple non-linear functions, the resulting function can be a very complicated non-linear one)
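The first parenthetical claim is easy to verify with a small sketch (the two linear functions here are made up for illustration):

```r
f1 <- function(x) 2 * x + 1   # linear: slope 2, intercept 1
f2 <- function(x) -3 * x + 5  # linear: slope -3, intercept 5
composed <- function(x) f2(f1(x))

# Algebra: f2(f1(x)) = -3 * (2 * x + 1) + 5 = -6 * x + 2, still linear
composed(0)                # 2, the new intercept
composed(1) - composed(0)  # -6, the new slope
```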

- The issues we talk about today are not just issues with linear models, but with all kinds of similar methods in stats/ML/data science/AI
- Common elements: data, modeling assumptions about what kinds of functions are reasonable, and algorithms to find a function in that class that “best” fits the data
- Whenever we’re trying to get the “best” of something, there is a danger of overfitting…

- Suppose there are multiple predictor variables, possibly many of them, and we don’t know which ones to include in the linear model
- How about letting the computer pick them for us?
- General definition: models with more variables are considered more complex

- General idea: start with no variables and add one at a time, each time picking the “best” out of the remaining predictors
- Special cases of this idea, each with a different definition of “best” for the next variable to add: forward stepwise, lasso, elastic net, and others
- Now we just have to pick one model out of a (possibly small) list of models

- Increasing complexity will make the model fit closer and closer to the data, reducing the RSS
- It’s never impressive to make a model fit more closely to data unless there are some other constraints on the model that make it useful
- Common approach: penalize complexity. Instead of just trying to minimize RSS (fit) by itself, try to minimize \[ \text{fit error} + \text{complexity} \]
- Typically the fit error is measured by RSS and the complexity is quantified in some way related to the number of variables in the model
For example, one method, called AIC, is to pick the model with the lowest \[ n\log(RSS/n) + 2k \] where \(k\) counts the estimated coefficients in the model (this is the version used by R for linear models).
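To make this concrete: for linear models, R's `step()` function (used below) computes AIC as \(n\log(RSS/n) + 2k\) via `extractAIC()`, where \(k\) counts all estimated coefficients including the intercept. We can reproduce two of the AIC values from the stepwise output below by hand, using the \(n = 85\) candies:

```r
# AIC as computed by step()/extractAIC() for a linear model
aic <- function(rss, n, k) n * log(rss / n) + 2 * k

aic(rss = 18187, n = 85, k = 1)  # intercept-only model: about 458.09
aic(rss = 10818, n = 85, k = 2)  # winpercent ~ chocolate: about 415.94
```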

The code below shows the output for a sequence of models; at the end is a summary of the model chosen by the AIC method.

```
candy <- candy_rankings %>% select(-competitorname)
full_model <- lm(winpercent ~ ., data = candy)
const_model <- lm(winpercent ~ 1, data = candy)
selected_model <- step(const_model,
                       scope = list(lower = const_model, upper = full_model),
                       direction = "forward")
```

```
## Start: AIC=458.09
## winpercent ~ 1
##
## Df Sum of Sq RSS AIC
## + chocolate 1 7368.5 10818 415.94
## + bar 1 3361.7 14825 442.72
## + peanutyalmondy 1 3000.7 15186 444.77
## + fruity 1 2639.2 15548 446.77
## + pricepercent 1 2168.8 16018 449.30
## + crispedricewafer 1 1917.2 16270 450.63
## + hard 1 1752.1 16435 451.48
## + pluribus 1 1113.6 17073 454.72
## + sugarpercent 1 955.0 17232 455.51
## + caramel 1 828.4 17359 456.13
## + nougat 1 722.9 17464 456.65
## <none> 18187 458.09
##
## Step: AIC=415.94
## winpercent ~ chocolate
##
## Df Sum of Sq RSS AIC
## + peanutyalmondy 1 582.51 10236 413.24
## + sugarpercent 1 487.59 10331 414.02
## + fruity 1 336.12 10482 415.26
## <none> 10818 415.94
## + crispedricewafer 1 237.84 10581 416.05
## + hard 1 172.00 10646 416.58
## + bar 1 69.75 10749 417.39
## + caramel 1 57.34 10761 417.49
## + nougat 1 26.82 10792 417.73
## + pluribus 1 20.06 10798 417.78
## + pricepercent 1 14.16 10804 417.83
##
## Step: AIC=413.24
## winpercent ~ chocolate + peanutyalmondy
##
## Df Sum of Sq RSS AIC
## + fruity 1 547.28 9688.7 410.57
## + sugarpercent 1 434.23 9801.7 411.55
## + crispedricewafer 1 390.81 9845.2 411.93
## <none> 10236.0 413.24
## + hard 1 122.32 10113.7 414.21
## + caramel 1 72.63 10163.4 414.63
## + bar 1 52.29 10183.7 414.80
## + pluribus 1 5.44 10230.5 415.19
## + nougat 1 4.19 10231.8 415.20
## + pricepercent 1 0.04 10235.9 415.24
##
## Step: AIC=410.57
## winpercent ~ chocolate + peanutyalmondy + fruity
##
## Df Sum of Sq RSS AIC
## + crispedricewafer 1 449.97 9238.7 408.52
## + sugarpercent 1 365.51 9323.2 409.30
## + hard 1 260.55 9428.2 410.25
## <none> 9688.7 410.57
## + caramel 1 214.82 9473.9 410.66
## + bar 1 105.95 9582.8 411.63
## + nougat 1 19.77 9668.9 412.39
## + pluribus 1 14.01 9674.7 412.44
## + pricepercent 1 3.48 9685.2 412.53
##
## Step: AIC=408.52
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer
##
## Df Sum of Sq RSS AIC
## + sugarpercent 1 326.01 8912.7 407.47
## + hard 1 242.01 8996.7 408.27
## <none> 9238.7 408.52
## + caramel 1 146.27 9092.5 409.17
## + nougat 1 72.08 9166.7 409.86
## + bar 1 17.04 9221.7 410.37
## + pricepercent 1 8.19 9230.6 410.45
## + pluribus 1 0.69 9238.0 410.52
##
## Step: AIC=407.47
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
## sugarpercent
##
## Df Sum of Sq RSS AIC
## + hard 1 327.07 8585.7 406.29
## <none> 8912.7 407.47
## + pricepercent 1 84.45 8828.3 408.66
## + caramel 1 67.21 8845.5 408.83
## + nougat 1 41.81 8870.9 409.07
## + bar 1 11.39 8901.3 409.36
## + pluribus 1 6.61 8906.1 409.41
##
## Step: AIC=406.29
## winpercent ~ chocolate + peanutyalmondy + fruity + crispedricewafer +
## sugarpercent + hard
##
## Df Sum of Sq RSS AIC
## <none> 8585.7 406.29
## + pricepercent 1 125.077 8460.6 407.04
## + caramel 1 61.164 8524.5 407.68
## + nougat 1 31.646 8554.0 407.98
## + pluribus 1 30.632 8555.0 407.99
## + bar 1 6.049 8579.6 408.23
```

`summary(selected_model)`

```
##
## Call:
## lm(formula = winpercent ~ chocolate + peanutyalmondy + fruity +
## crispedricewafer + sugarpercent + hard, data = candy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5030 -6.7026 0.5668 5.8802 24.0107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.941 3.518 9.365 2.12e-14 ***
## chocolateTRUE 19.147 3.587 5.338 8.98e-07 ***
## peanutyalmondyTRUE 9.483 3.446 2.752 0.00737 **
## fruityTRUE 8.881 3.561 2.494 0.01473 *
## crispedricewaferTRUE 8.385 4.484 1.870 0.06525 .
## sugarpercent 7.979 4.129 1.932 0.05693 .
## hardTRUE -5.669 3.289 -1.724 0.08871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 78 degrees of freedom
## Multiple R-squared: 0.5279, Adjusted R-squared: 0.4916
## F-statistic: 14.54 on 6 and 78 DF, p-value: 4.622e-11
```

- Check: is the adjusted R-squared for this model better than the one for `sub_model` back near the top of these notes? How much better?

`rbind(glance(selected_model), glance(sub_model))`

```
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.5279241 0.4916106 10.49155 14.53794 4.621890e-11 7 -316.7557
## 2 0.4319631 0.4181085 11.22438 31.17841 8.500396e-11 3 -324.6201
## AIC BIC deviance df.residual
## 1 649.5113 669.0525 8585.661 78
## 2 657.2403 667.0109 10330.907 82
```

- In general, models with greater complexity will have lower bias, but higher variance in their predictions (see the plot below)

- Example: for a single non-linear relationship (instead of having multiple variables), the degree of nonlinearity is a measure of complexity
- A more “wiggly” curve comes from a more complex model: it can wiggle around to get closer to points in the data, making it less biased but giving it more variance
- The dashed line below has higher complexity (more variance, less bias)

```
gm2007 <- gapminder %>% filter(year == 2007)
ggplot(gm2007, aes(gdpPercap, lifeExp)) + geom_point() +
  stat_smooth(se = FALSE, method = "loess", span = 1) +
  stat_smooth(se = FALSE, method = "loess", span = 0.2, linetype = 5)
```

- In a previous example, we used a method called AIC, a measure of how good the model is that penalizes the number of variables it has
- But what if we aren’t using a linear model, but some other machine learning method for which there is no classical statistical theory providing an analogous version of the AIC?
- There is a general method called validation that works for almost every kind of model
- First, weâ€™ll go into more detail about why too much complexity is bad

- Google flu trends
- Spurious correlations
- History example
- In our previous example, model complexity was measured using the number of predictor variables
- But it may not always be easy to figure out how to measure or penalize model complexity, so it would be nice to have a more general method to avoid overfitting
- Validation and cross-validation are very general, practical, and popular, and they are especially appropriate when there is a large sample of data

- Split the data into two sets: a **training** set and a **test** set
- Use optimization on the training set to fit models
- Measure the accuracy of fitted models on the test set
- Plot this test accuracy or test error rate as a function of model complexity
- Choose the complexity that gives the best test error

In this example I create a fake data set (simulation) with sample size \(n = 300\) and \(p = 100\) predictor variables. The outcome variable \(y_i = x_{i1} + x_{i2} + x_{i3} + e_i\) only depends on the first 3 out of the 100 predictors.

```
n <- 300
p <- 100
beta <- c(rep(1, 3), rep(0, p - 3))
X <- matrix(rnorm(n*p), nrow = n)
y <- X %*% beta + rnorm(n)
data <- data.frame(y = y, X = X)
```

- Split the data:

```
train <- data %>% sample_frac(0.5)
test <- setdiff(data, train)
```

- Fit models on training data

```
simple_model <- lm(y ~ X.1 + X.2 + X.3, data = train)
full_model <- lm(y ~ ., data = train) # the . means "every variable in the data"
rbind(glance(simple_model), glance(full_model))
```

```
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.7882191 0.7838674 0.9699262 181.130554 5.311113e-49 4 -206.2333
## 2 0.9334304 0.7975742 0.9386669 6.870722 1.197464e-11 101 -119.4355
## AIC BIC deviance df.residual
## 1 422.4667 437.5199 137.35051 146
## 2 442.8709 749.9557 43.17368 49
```

Which model is better, the one with 3 predictors or the one with 100 predictors *including* those 3?

Evaluate the models on the test data:

```
errors <- data.frame(
  model = c(rep("simple", n/2), rep("full", n/2)),
  test_error = c((predict(simple_model, newdata = test) - test$y)^2,
                 (predict(full_model, newdata = test) - test$y)^2))
errors %>% group_by(model) %>% summarize(
  median = median(test_error),
  mean = mean(test_error)
)
```

```
## # A tibble: 2 x 3
## model median mean
## <fct> <dbl> <dbl>
## 1 full 1.23 2.87
## 2 simple 0.458 0.914
```

- This was kind of cheating, because I knew the first 3 variables were the right ones to use in the simple model
- What if we don’t know that? Can we use a model selection algorithm to pick the right variables?

- Split the data into \(K\) sets, often \(K = 5\) or 10, for example.
- Use one of the sets as a test set, fitting models on the remaining \(K-1\) sets and measuring their accuracy on this test set
- Repeat the above for each of the \(K\) sets
- There are now \(K\) measurements of accuracy for each model
- Average these \(K\) measurements
- Plot this average as a function of model complexity
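The steps above can be sketched directly in base R. This toy example uses simulated data (made up for illustration, not the candy data) and \(K = 5\) folds to estimate the test error of a simple linear model:

```r
set.seed(42)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # randomly assign each row to a fold

cv_errors <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])         # fit on the other K - 1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out fold
  mean((dat$y[fold == k] - pred)^2)                 # test MSE on this fold
})

mean(cv_errors)  # average of the K accuracy measurements
```

With several candidate models, we would repeat this for each one and plot the averaged error against model complexity.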

The code below uses the `glmnet` library, which has a function that uses cross-validation to pick variables automatically via a method called the lasso

```
x_train <- as.matrix(train[,-1])
y_train <- train[,1]
x_test <- as.matrix(test[,-1])
y_test <- test[,1]
model <- cv.glmnet(x = x_train, y = y_train, nfolds = 5)
```

- Let’s look at a plot of the cross-validation model accuracy as a function of complexity

`autoplot(model)`

- What are the fitted coefficients of the best model?

`coef(model)`

```
## 101 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.0680559
## X.1 0.7057829
## X.2 0.6983142
## X.3 0.8568229
## X.4 .
## X.5 .
##  [ rows X.6 through X.100 omitted: every remaining coefficient is ".", i.e. zero ]
```

Cross-validation and the lasso found the right variables to include in the model! Only the first 3 have non-zero coefficients; the rest of the variables are excluded

As we saw before, the model using only the first three variables has better test error than the model which uses all 100 variables

- Many cutting edge methods in ML/AI rely on optimization, just like linear regression
- More complex models have more parameters and/or more complex types of functions to predict the outcome variable
- Too much model complexity can lead to overfitting: for example, including too many predictor variables can make it seem like the model does a better job of prediction than it really does…
- Test or validation prediction accuracy is a higher standard
- Keep the bias-variance tradeoff in mind
- Cross-validation is a very useful method (when the sample size is large enough) to automatically pick models with low test error