#### Outline

• Multiple regression
• Categorical predictors
• Non-linear transformations of predictors

## Multiple regression

• Regression is not just about lines in two dimensions!
• What if we have more than one predictor variable?
• More than one “slope” coefficient
• If you like to think geometrically: instead of a line, we now have a $$p$$-dimensional plane (a hyperplane) for predicting $$y$$. The actual values of $$y$$ can be above or below the plane, and the prediction errors are the vertical distances from the actual $$y$$ values to the plane

• Now we need two indexes, $$1 \leq i \leq n$$ for observations (rows of spreadsheet) and $$1 \leq j \leq p$$ for variables (columns of spreadsheet)
• For individual $$i$$, the value of predictor variable $$j$$ is $$x_{ij}$$
• Linear model: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i$
• (This next point is optional)
• If you’re like me and can’t get enough matrix/vector notation, this could also be written as $y = X \beta + e$ where $$X$$ is the “design matrix” of predictor variables, $$y$$ is the vector of outcomes, $$\beta$$ the vector of parameters, and $$e$$ the vector of errors

• For example, in the candy_rankings data, there are $$n = 85$$ observations in the sample, and if we use both sugarpercent and pricepercent as predictor variables then $$p = 2$$, and for the $$i$$th candy we have $\text{winpercent}_i = \beta_0 + \beta_1 \text{sugarpercent}_i + \beta_2 \text{pricepercent}_i + e_i$
• All the $$x_{ij}$$ values are known from the data, and the unknown parameters are $$\beta_j$$ for $$j = 0,1, \ldots, p$$
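The design matrix mentioned above can be inspected directly with model.matrix(). Here is a small self-contained sketch using simulated data (the names n, d, and X are just for illustration):

```r
# model.matrix() builds the matrix X that lm() uses internally:
# a column of 1s for the intercept plus one column per predictor,
# so X has n rows and p + 1 columns.
set.seed(1)
n <- 5
d <- data.frame(y = rnorm(n), x1 = runif(n), x2 = runif(n))
X <- model.matrix(~ x1 + x2, data = d)
dim(X)       # 5 rows, 3 columns: (Intercept), x1, x2
colnames(X)
```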

• Formulas to calculate the coefficients now require linear algebra, so we won’t go into them (but if you plan to take a more quantitative career path I strongly recommend trying to learn as much linear algebra as you can)
• Aside from formulas to calculate coefficients, almost everything else we’ve learned so far about linear models works as straightforwardly as you could hope
• After getting coefficients $$\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p$$, the predicted outcome $$\hat y_i$$ is $\hat y_i = \hat \beta_0 + \hat \beta_1 x_{i1} + \cdots + \hat \beta_p x_{ip}$ and the residual is $y_i - \hat y_i$
• Hypothesis tests for coefficients from the summary() function:

candy_model <- lm(winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
summary(candy_model)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + pricepercent, data = candy_rankings)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -33.88  -8.57  -0.76   8.14  37.50
##
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)
## (Intercept)     39.79       3.44   11.57 <0.0000000000000002 ***
## sugarpercent     6.73       5.66    1.19              0.2381
## pricepercent    15.59       5.60    2.78              0.0067 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.9 on 82 degrees of freedom
## Multiple R-squared:  0.134,  Adjusted R-squared:  0.113
## F-statistic: 6.35 on 2 and 82 DF,  p-value: 0.00272
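To see that predictions and residuals work the same way with multiple predictors, here is a self-contained sketch on simulated data (variable names are illustrative):

```r
# Fitted values plus residuals reconstruct the observed outcome exactly:
# y_i = yhat_i + (y_i - yhat_i)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
yhat <- predict(fit)         # beta0_hat + beta1_hat * x1 + beta2_hat * x2
res <- residuals(fit)        # y - yhat
all.equal(unname(yhat + res), y)   # TRUE
```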

### Interpreting coefficients

• What is the meaning of the coefficient $$\beta_j$$ of the predictor variable $$x_j$$?
• Something about the change in $$y$$ if variable $$x_j$$ changes, right?
• Be careful when interpreting coefficients in multiple regression
• $$\beta_j$$ is the expected change in $$y$$ if we increase $$x_j$$ by one unit while holding all the other predictor variables constant
• Think carefully about what it means to hold the other predictor variables constant, especially if any other predictors are correlated with $$x_j$$
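One way to make “holding the other predictors constant” concrete: in a fitted model, the change in the prediction from a one-unit increase in $$x_1$$ with $$x_2$$ fixed is exactly $$\hat \beta_1$$, even when the predictors are correlated. A minimal sketch with simulated data:

```r
# The predicted change from x1 -> x1 + 1, holding x2 fixed, equals
# the fitted coefficient on x1, even when x1 and x2 are correlated.
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)   # x2 correlated with x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p0 <- predict(fit, newdata = data.frame(x1 = 0, x2 = 1))
p1 <- predict(fit, newdata = data.frame(x1 = 1, x2 = 1))
unname(p1 - p0)             # same as coef(fit)["x1"]
```

Of course, in the observed data a one-unit change in $$x_1$$ would usually come with a change in $$x_2$$ as well, which is exactly the interpretive caution raised above.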

• Consider:

• Suppose you had a model with GDP per capita as one predictor and GDP as another predictor. What would it mean to hold GDP per capita constant and increase GDP? What else would be changing in that case? What would it mean to hold GDP constant while increasing GDP per capita? What if we also include population as a third predictor, is it even meaningful in that case to hold two of them constant while changing the other?

## Categorical predictors

• It’s common to have variables in your data that are not continuous, but categorical
• Pretty simple to include in the linear model. There’s one complication to watch out for, which we’ll come back to after seeing an example

• Interpret coefficients for categorical variables as intercepts for subsets of the data corresponding to certain categories
• The chocolate variable in the candy_rankings data is TRUE for candies that have chocolate in them and FALSE otherwise
• The estimated coefficient for chocolateTRUE below shows that candies with chocolate have a predicted winpercent about 18 percentage points higher, holding sugarpercent constant

candy_model <- lm(winpercent ~ sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
##
## Call:
## lm(formula = winpercent ~ sugarpercent + chocolate, data = candy_rankings)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.698  -8.515  -0.449   7.760  27.482
##
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)
## (Intercept)      38.26       2.55   14.99 < 0.0000000000000002 ***
## sugarpercent      8.57       4.35    1.97                0.053 .
## chocolateTRUE    18.27       2.47    7.40        0.00000000011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.418
## F-statistic: 31.2 on 2 and 82 DF,  p-value: 0.000000000085
• Common to display information about categorical variables in plots using color, line types, point types, or facets (different panels for each group)
• The plot below uses color and line type, and illustrates the interpretation of the chocolate coefficient as a separate intercept for candies that contain chocolate
candy <- candy_rankings
candy$predicted <- predict(candy_model)
ggplot(candy, aes(sugarpercent, winpercent, color = chocolate)) +
  geom_point() +
  geom_line(aes(sugarpercent, predicted, linetype = chocolate))

### Categories relative to what baseline group?

• If the model has an overall intercept, shared by all points, then it cannot have a separate intercept for each categorical group. If there are $$L$$ groups, it can only have $$L-1$$ intercepts.
• In the above example, candies with chocolate (chocolateTRUE) and without are two groups, but there is only one coefficient for chocolateTRUE, not one for chocolateFALSE
• This is because of a problem in linear algebra called “collinearity.” Without going into details: imagine putting the same predictor variable into the model twice, with each copy getting its own coefficient. The model then has no way to decide how to split that variable’s true effect between the copies
• To get around this, you can delete the overall intercept term. In R you just put $$-1$$ in the formula with your predictors, like this:
candy_model <- lm(winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
summary(candy_model)
##
## Call:
## lm(formula = winpercent ~ -1 + sugarpercent + chocolate, data = candy_rankings)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.698  -8.515  -0.449   7.760  27.482
##
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)
## sugarpercent       8.57       4.35    1.97               0.053 .
## chocolateFALSE    38.26       2.55   14.99 <0.0000000000000002 ***
## chocolateTRUE     56.54       2.89   19.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.2 on 82 degrees of freedom
## Multiple R-squared:  0.956,  Adjusted R-squared:  0.954
## F-statistic:  590 on 3 and 82 DF,  p-value: <0.0000000000000002
• Now we can see, for example, that the predicted win percent for candies without chocolate (at sugarpercent equal to zero) is about 38
• (We’ll revisit this problem, collinearity, when we talk about model selection)
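Here is a self-contained sketch with a simulated two-level factor, showing both parameterizations and what happens if a redundant predictor is forced into the model (names like g and b_indicator are illustrative):

```r
# (1) With an intercept, the factor gets L - 1 = 1 coefficient: the
#     difference between groups. Without it, each level gets its own mean.
set.seed(11)
g <- factor(rep(c("A", "B"), each = 50))
y <- ifelse(g == "B", 5, 2) + rnorm(100)
with_int <- coef(lm(y ~ g))          # (Intercept) = A's mean, gB = B - A
without_int <- coef(lm(y ~ -1 + g))  # gA = A's mean, gB = B's mean
unname(without_int["gB"] - without_int["gA"])   # equals with_int["gB"]

# (2) Keeping the intercept AND indicators for both levels would be
#     collinear; lm() reports NA for the redundant predictor.
b_indicator <- as.numeric(g == "B")
coef(lm(y ~ g + b_indicator))        # b_indicator coefficient is NA
```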

## Non-linear transformations of predictors

• Suppose there is a non-linear relationship between the outcome and one of the predictors:
n <- 1000
X <- runif(n, min = -2, max = 2)
Y <- 1 + 2 * X - 1.7 * X^2 + rnorm(n)
nldata <- data.frame(Y = Y, X = X)
ggplot(nldata, aes(X, Y)) +
  geom_point(alpha = .5) +
  stat_smooth(method = "lm", se = FALSE)
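A natural next step (sketched here with a fresh simulation, seeded for reproducibility) is to include the squared term in the formula with I(X^2). The model is still “linear” in the coefficients even though the fitted curve is no longer a line:

```r
# Regenerate data like the example above and fit a quadratic.
set.seed(1)
n <- 1000
X <- runif(n, min = -2, max = 2)
Y <- 1 + 2 * X - 1.7 * X^2 + rnorm(n)
nldata <- data.frame(Y = Y, X = X)
quad_model <- lm(Y ~ X + I(X^2), data = nldata)
coef(quad_model)   # estimates should be close to 1, 2, and -1.7
```

The I() wrapper tells the formula interface to treat X^2 as literal squaring rather than as formula syntax.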