- Multiple regression
- Categorical predictors
- Non-linear transformations of predictors

- Regression is not just about lines in 2-dimensions!
- What if we have more than one predictor variable?
- More than one “slope” coefficient
If you like to think geometrically: instead of a line, we now have a \(p\) dimensional plane to predict \(y\). The actual values of \(y\) can be above or below the plane, and the errors of the predictions are given by the vertical distance from the actual \(y\) to the plane

- Now we need two indexes, \(1 \leq i \leq n\) for observations (rows of spreadsheet) and \(1 \leq j \leq p\) for variables (columns of spreadsheet)
- For individual \(i\), the value of predictor variable \(j\) is \(x_{ij}\)
- Linear model: \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i \]
- (This next point is optional)
If you’re like me and can’t get enough matrix/vector notation, this could also be written as \[ y = X \beta + e \] where \(X\) is the “design matrix” of predictor variabes, \(y\) is the vector of outcomes, \(\beta\) the vector of parameters, and \(e\) the vector of errors

- For example, in the
`candy_rankings`

data, there are \(n = 85\) observations in the sample, and if we use both`sugarpercent`

and`pricepercent`

as predictor variables then \(p = 2\), and for the \(i\)th candy we have \[ \text{winpercent}_i = \beta_0 + \beta_1 \text{sugarpercent}_i + \beta_2 \text{pricepercent}_i + e_i \] All the \(x_{ij}\) values are known from the data, and the unknown parameters are \(\beta_j\) for \(j = 0,1, \ldots, p\)

- Formulas to calculate the coefficients now require
**linear algebra**, so we won’t go into them (but if you plan to take a more quantitative career path I strongly recommend trying to learn as much linear algebra as you can) - Aside from formulas to calculate coefficients, almost everything else we’ve learned so far about linear models works as straightforwardly as you could hope
- After getting coefficients \(\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p\), the predicted the outcome variable \(\hat y_i\) is \[ \hat y_i = \hat \beta_0 + \hat \beta_1 x_{i1} + \cdots + \hat \beta_p x_{ip} \] and the residual is \[ y_i - \hat y_i \]
Hypothesis tests for coefficients from the

`summary()`

function:

```
candy_model <- lm(winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
summary(candy_model)
```

```
##
## Call:
## lm(formula = winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.88 -8.57 -0.76 8.14 37.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.79 3.44 11.57 <0.0000000000000002 ***
## sugarpercent 6.73 5.66 1.19 0.2381
## pricepercent 15.59 5.60 2.78 0.0067 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.9 on 82 degrees of freedom
## Multiple R-squared: 0.134, Adjusted R-squared: 0.113
## F-statistic: 6.35 on 2 and 82 DF, p-value: 0.00272
```

- What is the meaning of the coefficient \(\beta_j\) of the predictor variable \(x_j\)?
- Something about the change in \(y\) if variable \(x_j\) changes, right?
**Be careful when interpreting coefficients in multiple regression**- \(\beta_j\) is the expected change in \(y\) if we increase \(x_j\) by one unit
*while holding all the other predictor variables constant* Think carefully about what it means to hold the other predictor variables constant, especially if any other predictors are correlated with \(x_j\)

Consider:

Suppose you had a model with GDP per capita as one predictor and GDP as another predictor. What would it mean to hold GDP per capita constant and increase GDP? What else would be changing in that case? What would it mean to hold GDP constant while increasing GDP per capita? What if we also include population as a third predictor, is it even meaningful in that case to hold two of them constant while changing the other?

- It’s common to have variables in your data that are not continuous, but categorical
Pretty simple to include in the linear model. There’s one complication to watch out for, which we’ll come back to after seeing an example

- Interpret coefficients for categorical variables as intercepts for subsets of the data corresponding to certain categories
- The
`chocolate`

variable in the`candy_rankings`

data is`TRUE`

for candies that have chocolate in them and`FALSE`

otherwise The estimated coefficient for

`chocolateTRUE`

below shows that candies with chocolate have on average about 18% more popularity

```
candy_model <- lm(winpercent ~ sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
```

```
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy_rankings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.698 -8.515 -0.449 7.760 27.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.26 2.55 14.99 < 0.0000000000000002 ***
## sugarpercent 8.57 4.35 1.97 0.053 .
## chocolateTRUE 18.27 2.47 7.40 0.00000000011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared: 0.432, Adjusted R-squared: 0.418
## F-statistic: 31.2 on 2 and 82 DF, p-value: 0.000000000085
```

- Common to display information about categorical variables in plots using color, line types, point types, or facets (different panels for each group)
- The plot below uses color and linetype, and demonstrates interpretation of coefficient for chocolate variable as giving a different intercept for candies that have chocolate

```
candy <- candy_rankings
candy$predicted <- predict(candy_model)
ggplot(candy, aes(sugarpercent, winpercent,
color = chocolate)) +
geom_point() +
geom_line(aes(sugarpercent, predicted, linetype = chocolate))
```

- If the model has an overall intercept, shared by all points, then it cannot have a separate intercept for each categorical group. If there are \(L\) groups, it can only have \(L-1\) intercepts.
- In the above example, candies with chocolate (
`chocolateTRUE`

) and without are two groups, but there is only one coefficient for`chocolateTRUE`

, not one for`chocolateFALSE`

- This is because of a problem in linear algebra called “collinearity.” Without going into details, you can imagine it’s a problem if you put the same predictor variable in the model multiple times, with each copy getting its own separate coefficient. Now the model doesn’t know how to distribute the true coefficient for that variable between its different copies
- To get around this, you can delete the overall intercept term. In
`R`

you just put \(-1\) in the formula with your predictors, like this:

```
candy_model <- lm(winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
```

```
##
## Call:
## lm(formula = winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.698 -8.515 -0.449 7.760 27.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sugarpercent 8.57 4.35 1.97 0.053 .
## chocolateFALSE 38.26 2.55 14.99 <0.0000000000000002 ***
## chocolateTRUE 56.54 2.89 19.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared: 0.956, Adjusted R-squared: 0.954
## F-statistic: 590 on 3 and 82 DF, p-value: <0.0000000000000002
```

- Now we can see, for example, that the average win percent for all candies without chocolate is about 38%
- (We’ll revisit this problem, collinearity, when we talk about model selection)

- Suppose there is a non-linear relationship between the outcome and one of the predictors:

```
n <- 1000
X <- runif(n, min = -2, max = 2)
Y <- 1 + 2 * X - 1.7 * X^2 + rnorm(n)
nldata <- data.frame(Y = Y, X = X)
ggplot(nldata, aes(X, Y)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm", se = F)
```