- Optimization: math behind machine learning/AI
- Statistical problem: overfitting
- Solution: validation / testing / data splitting
- Cross-validation: uses all the data for training and testing

- Optimization is an area of math that develops and analyzes methods for finding maxima or minima of functions
- Much of it is based on calculus, but not all
(It's extremely useful and a common prerequisite for many of the most successful researchers and practitioners in machine learning, AI, and even statistics. Along with linear algebra, if you have the chance to study optimization and it aligns with your goals, I strongly recommend it)

We have already thought of linear regression as an optimization problem: find the best line (or plane/hyperplane for multiple regression), where best means smallest sum of squared errors

\[ \text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]
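This problem has a closed-form solution: setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) to zero and solving gives

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]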

- This notation means to pick the parameters \(\beta_0\) and \(\beta_1\) to minimize the sum of squared errors
- The sample mean can also be thought of as the solution to an optimization problem: find the best constant to minimize the squared errors

\[ \text{minimize}_{\beta_0} \sum_{i=1}^n (y_i - \beta_0)^2 \]
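This one is easy to solve by calculus: the derivative with respect to \(\beta_0\) is \(-2 \sum_{i=1}^n (y_i - \beta_0)\), and setting it to zero gives

\[ \hat\beta_0 = \frac{1}{n} \sum_{i=1}^n y_i = \bar y \]

so the best constant predictor, in the squared-error sense, is the sample mean.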

- Many methods in statistics and machine learning can be described as optimization problems, with possibly more complicated kinds of functions

\[ \text{minimize}_{f} \sum_{i=1}^n (y_i - f(x_i))^2 \]

- The resulting model predicts \(y\) using the function \(f(x)\):

\[ y_i = f(x_i) + e_i \]

- So far we have used a linear function: \(f(x) = \beta_0 + \beta_1 x\)
- A method that has recently become very popular, deep learning, constructs complicated non-linear functions by composing many "layers" of simple non-linear functions

\[ \text{minimize}_{f_1, f_2, f_3} \sum_{i=1}^n (y_i - f_3[f_2(f_1[x_i])])^2 \]

- (If you compose linear functions, the resulting function is also linear, just with different coefficients)
- (If you compose simple non-linear functions, the resulting function can be a very complicated non-linear one)
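To see why composing linear functions gives nothing new: if \(f_1(x) = a_1 x + b_1\) and \(f_2(x) = a_2 x + b_2\), then

\[ f_2(f_1(x)) = a_2(a_1 x + b_1) + b_2 = (a_2 a_1) x + (a_2 b_1 + b_2) \]

which is again linear, just with different coefficients. This is why deep learning puts non-linearities between the layers.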

- The issues we talk about today are not just issues with linear models, but with all kinds of similar methods in stats/ML/data science/AI
- Common elements: data, modeling assumptions about what kinds of functions are reasonable, and algorithms to find a function in that class that "best" fits the data
- Whenever we're trying to get the "best" of something, there is a danger of overfitting...

- Google flu trends https://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
- History example: https://xkcd.com/1122/
- Math example: https://i.stack.imgur.com/2AUPV.jpg

- Last time we talked about one approach to prevent overfitting: penalizing complexity
- This changes the optimization problem from minimizing the fit error alone to minimizing that error plus a penalty for the complexity of the model
- In our previous example, model complexity was measured using the number of predictor variables
- But it may not always be easy to figure out how to measure or penalize model complexity, so it would be nice to have a more general method to avoid overfitting
- The methods we'll study today are very general, practical, popular, and especially appropriate when there is a large sample of data
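Schematically, the penalized approach from last time looks like the following, where \(\lambda \ge 0\) controls how heavily complexity is penalized (here "complexity" is left abstract, e.g. the number of predictor variables):

\[ \text{minimize}_{f} \; \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \cdot \text{complexity}(f) \]

Today's methods avoid having to write down a complexity measure at all.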

- Split the data into two sets: a **training** set and a **test** set
- Use optimization on the training set to fit models
- Measure the accuracy of the fitted models on the test set
- Plot this test accuracy or test error rate as a function of model complexity
- Choose the complexity that gives the best test error

Split the data:

```
n <- 300  # sample size
p <- 100  # number of predictor variables
beta <- c(rep(1, 3), rep(0, p - 3))  # only the first 3 predictors matter
X <- matrix(rnorm(n*p), nrow = n)
y <- X %*% beta + rnorm(n)
data <- data.frame(y = y, X = X)
split <- sample(1:n, n/2, replace = FALSE)  # random half of the rows for training
train <- data[split, ]
test <- data[-split, ]
```

- Fit models on training data

```
simple_model <- lm(y ~ X.1 + X.2 + X.3, data = train)
full_model <- lm(y ~ ., data = train) # the . means "every variable in the data"
# Number of estimated parameters (intercept + predictors) in each model
c(summary(simple_model)$df[1],
  summary(full_model)$df[1])
```

`## [1] 4 101`

```
# Adjusted R-squared
c(summary(simple_model)$adj.r,
summary(full_model)$adj.r)
```

`## [1] 0.7839 0.7976`

Which model is better, the one with 3 predictors or the one with 100 predictors *including* those 3?

Evaluate models on the test data:

```
library(dplyr) # for the pipe, group_by, and summarize
errors <- data.frame(
  model = c(rep("simple", n/2), rep("full", n/2)),
  test_error = c((predict(simple_model, newdata = test) - test$y)^2,
                 (predict(full_model, newdata = test) - test$y)^2))
errors %>% group_by(model) %>% summarize(
  median = median(test_error),
  mean = mean(test_error)
)
```

```
## # A tibble: 2 x 3
## model median mean
## <fct> <dbl> <dbl>
## 1 full 1.23 2.87
## 2 simple 0.458 0.914
```

- This was kind of cheating, because I knew the first 3 variables were the right ones to use in the simple model
- What if we don't know that? Can we use a model selection algorithm to pick the right variables?

- Split the data into \(K\) sets, for example \(K = 5\) or \(10\)
- Use one of the sets as a test set, fitting models on the remaining \(K-1\) sets and measuring their accuracy on this test set
- Repeat the above for each of the \(K\) sets
- There are now \(K\) measurements of accuracy for each model
- Average these \(K\) measurements
- Plot this average as a function of model complexity
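The steps above can be sketched by hand. This is a minimal sketch of 5-fold cross-validation for the simple model, reusing the `data` frame and `n` from the simulation earlier; real code would loop over several candidate models and compare their averages:

```
# Assign each row to one of K folds at random
K <- 5
folds <- sample(rep(1:K, length.out = n))
# For each fold k: train on the other K-1 folds, test on fold k
cv_errors <- sapply(1:K, function(k) {
  fit <- lm(y ~ X.1 + X.2 + X.3, data = data[folds != k, ])
  mean((predict(fit, newdata = data[folds == k, ]) - data$y[folds == k])^2)
})
mean(cv_errors)  # average test error across the K folds
```

In practice we rarely write this loop ourselves; packages like `glmnet` do it internally.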

The code below uses the `glmnet` library, which has a function that uses cross-validation to pick variables automatically via a method called the lasso
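For a linear model, the lasso replaces the abstract complexity penalty with the sum of the absolute values of the coefficients (written here in glmnet's scaling convention):

\[ \text{minimize}_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \]

The \(|\beta_j|\) penalty forces many coefficients to be exactly zero, which is how it selects variables; cross-validation is used to choose \(\lambda\).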

```
library(glmnet) # provides cv.glmnet
x_train <- X[split, ]
y_train <- y[split]
x_test <- X[-split, ]
y_test <- y[-split]
model <- cv.glmnet(x_train, y_train, nfolds = 5)
```

- Let's look at a plot of the cross-validation model accuracy as a function of complexity

`autoplot(model)`

- What are the fitted coefficients of the best model?

`coef(model)`

```
## 101 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.06806
## V1 0.70578
## V2 0.69831
## V3 0.85682
## V4 .
## (rows V5 through V100 omitted: all are ., i.e. not selected)
```

Cross-validation and the lasso found the right variables to include in the model! Only the first 3 have non-zero coefficients; the rest of the variables are not included

As we saw before, the model using only the first three variables has better test error than the model which uses all 100 variables

- Many cutting edge methods in ML/AI rely on optimization, just like linear regression
- More complex models have more parameters and/or more complex types of functions to predict the outcome variable
- Too much model complexity can lead to overfitting: for example, including too many predictor variables can make it seem like the model does a better job of prediction than it really does
- Test or validation prediction accuracy is a higher standard
- Keep this image in mind, and remember the bias-variance tradeoff
- Cross-validation is a very useful method (when the sample size is large enough) to automatically pick models with low test error