- Classification: categorical outcome variables
- Detour/diversion: a method for missing values (for a predictor variable)
- Logistic regression

- So far we have only considered methods for predicting continuous outcome variables
- We’ve seen that categorical *predictor* variables can easily be included in such models. But what about categorical *outcome* variables?

Precision medicine example: use DNA measurements from tumors to determine the type of cancer, for a dataset with measurements from many patients who each have one of several types of cancer. After training a good model, a new patient could have a culture of cancer cells taken and sequenced; the model would then “classify” the type of cancer, and the best treatment for that type could then be pursued

Loan default example: use predictor variables like credit rating and income of a borrower to predict which loans will default and which ones will not

- Voting example: use a person’s online data, e.g. Facebook likes, to predict which party they will vote for (or perhaps whether they will bother voting at all)

Image classification: input an image, and classify what kind of object appears in the image (demo TF classify?)

Can you think of some examples? The categorical outcome variable could be binary, with only two possibilities, or it could have more categories, perhaps many!

Linear regression no longer applies! What could possibly be the meaning of an intercept or slope when the outcome is a category?

- Good news: just like for regression, there are also many methods for classification.
- Linear regression is probably the simplest method for predicting continuous outcomes. Today we’ll study logistic regression, which is probably the simplest method for classification of a binary outcome variable

- We’ll use this dataset about passengers on the Titanic (a passenger ship which sank in 1912, killing over 1,500 of the roughly 2,200 people on board)
- The outcome variable we wish to classify is `Survived`, which is self-explanatory…

`head(titanic_train)`

```
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583             Q
```

- Lots of statistical software, `R` included, tends to ignore missing values by default, and may not even give you a warning or error to tell you that it has done this! This is bad. It relies on the assumption that the values are missing *at random*, i.e. that the fact that they are missing tells you nothing about the outcome variable… and that assumption is rarely true.
- Missing data is a very widespread problem in the real world
- The `flights`, `gapminder`, and `fivethirtyeight` datasets we’ve used in this class are somewhat unrealistic in this respect. This is generally what happens in most classes, because dealing with missing data is a bit irritating and could take up a lot of our time.
- But I don’t want to send you into the world totally helpless if you encounter missing data, so I’ll show you one possible way of dealing with missing values in a predictor variable
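To see how silent this default behavior is, here is a toy illustration (made-up data, not the Titanic set): `lm()` drops the row with a missing predictor value without any warning or message.

```r
# Toy data with one missing predictor value
d <- data.frame(y = c(1, 2, 3, 4),
                x = c(1, NA, 3, 4))

fit_toy <- lm(y ~ x, data = d)  # fits with no warning or error
nobs(fit_toy)                   # only 3 observations were actually used
```

The default `na.action = na.omit` is what quietly discards the incomplete row.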

```
# Count number of missing values, if any, for each variable in the data
titanic_train %>%
  summarise_if(function(var) any(is.na(var)),
               function(var) sum(is.na(var)))
```

```
## Age
## 1 177
```

- The `Age` variable has 177 missing values, out of a sample of size 891
- No other variables have any missing values
- (The code above can be used on other datasets with multiple variables containing missing values; it would count the number of missing values for each)
- (`dplyr` in the `tidyverse` is so cool, srsly)
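For comparison, a base-`R` one-liner gets similar counts without `dplyr`; the difference is that it reports a zero for every column rather than showing only the columns that have missing values.

```r
# Count NAs in every column of the data frame (zeros included)
colSums(is.na(titanic_train))
```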

```
# leave out variables we don't want
# create a new variable to indicate missing Age
# set missing values of Age to 0
titanic <- titanic_train %>%
  select(-PassengerId, -Name, -Ticket, -Cabin, -Embarked) %>%
  mutate(ageNA = as.numeric(is.na(Age))) %>%
  replace_na(list(Age = 0))
head(titanic)
```

```
##   Survived Pclass    Sex Age SibSp Parch    Fare ageNA
## 1        0      3   male  22     1     0  7.2500     0
## 2        1      1 female  38     1     0 71.2833     0
## 3        1      3 female  26     0     0  7.9250     0
## 4        1      1 female  35     1     0 53.1000     0
## 5        0      3   male  35     0     0  8.0500     0
## 6        0      3   male   0     0     0  8.4583     1
```

- Now the coefficient for the `Age` variable will have to be interpreted carefully: it quantifies the relationship between `Age` and the predicted outcome variable *conditional* on `Age` not being missing. This means the relationship does not apply to the whole sample, but only to the subset of the sample without missing data.
- The coefficient for our new categorical predictor, `ageNA`, quantifies how the predicted outcome changes for people whose `Age` variable is missing from the data. It’s an overall shift, like an intercept, for this subset of the sample

- When the outcome variable has only two categories, we can call it binary, and name those categories `TRUE`/`FALSE` or 0/1. These are just names, so they don’t change anything about how the model works.
- So we’ll consider an outcome variable \(Y\) which is either 0 or 1, but everything we say here applies if the names are different. The `Survived` variable in the `titanic` data is 0 if the person died and 1 if they survived.
- In logistic regression, the goal is to predict \(P(Y_i = 1 | X_i = x_i)\), where \(X_i\) is the predictor variable
- The original outcome is either 0 or 1, but the predicted outcome will be a probability, so it can be anything between 0 and 1
- There’s one complication: to get a linear model, we need to first do a non-linear transformation of \(P(Y = 1|X = x)\); the specific function we use to transform it is called the “logit” or “log-odds” function
- Let’s use \(p(x) = P(Y = 1 | X = x)\) as a shorthand to simplify the formula; then we can write the logistic regression model as \[ \ln \left( \frac{p(x_i)}{1-p(x_i)} \right) = \beta_0 + \beta_1 x_i \]

- The left hand side is the logit function; the right hand side is linear, just like in regression. Solving for the probability gives the inverse transformation \( p(x_i) = e^{\beta_0 + \beta_1 x_i} / (1 + e^{\beta_0 + \beta_1 x_i}) \), which is always between 0 and 1
- Just like with linear regression, we can also have more predictor variables on the right hand side, each getting their own coefficient
- I’m not going to test you on these details. But you should be aware of logistic regression, and understand generally what its purpose is and how to interpret it
- Once you already know how to fit, summarize, and interpret linear models in `R`, it’s easy to do the same for logistic regression!
- Let’s resume the `titanic` example
- To fit a logistic regression model in `R`, use the `glm` function **and remember to specify** `family = binomial`

```
fit <- glm(Survived ~ ., data = titanic, family = binomial)
summary(fit)
```

```
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7113  -0.6110  -0.4287   0.6115   2.4606  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.963812   0.534151   9.293  < 2e-16 ***
## Pclass      -1.072628   0.140386  -7.641 2.16e-14 ***
## Sexmale     -2.767107   0.199524 -13.869  < 2e-16 ***
## Age         -0.039494   0.007795  -5.066 4.05e-07 ***
## SibSp       -0.352819   0.110555  -3.191  0.00142 ** 
## Parch       -0.122595   0.118718  -1.033  0.30176    
## Fare         0.002874   0.002368   1.214  0.22484    
## ageNA       -1.317566   0.320926  -4.106 4.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  788.33  on 883  degrees of freedom
## AIC: 804.33
## 
## Number of Fisher Scoring iterations: 5
```
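To connect the fitted coefficients back to the model formula, we can compute a predicted probability by hand: plug predictor values into the linear (right hand) side, then apply the inverse logit. A sketch for a hypothetical passenger (the values below are chosen purely for illustration):

```r
# Hypothetical passenger: 2nd class, male, age 30, no relatives aboard,
# fare of 20, age not missing (values chosen for illustration only)
b <- coef(fit)
eta <- b["(Intercept)"] + b["Pclass"] * 2 + b["Sexmale"] * 1 +
  b["Age"] * 30 + b["SibSp"] * 0 + b["Parch"] * 0 +
  b["Fare"] * 20 + b["ageNA"] * 0

# Inverse logit: p = exp(eta) / (1 + exp(eta)), built into R as plogis()
plogis(eta)
```

This is exactly what `predict(fit, type = "response")` does for every row of the data.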

- Difference: there is no longer an \(R^2\); instead, for models like logistic regression (called generalized linear models, or GLMs) the analogous measure is called “deviance.” This can be used to compare or choose models, with lower deviance being considered a good thing
- Difference: it’s a little more complicated to interpret the coefficients. Usually you want to *exponentiate* them first
- Consider the coefficient for the categorical predictor `Sexmale`, which is -2.7671072. To exponentiate it we can use the function `exp()` like this:

`exp(coef(fit)[3])`

```
## Sexmale
## 0.06284354
```

- The interpretation of this (exponentiated) coefficient is that the predicted *odds* of survival for males are about 0.06 times the odds of survival for non-males, **holding all other variables constant**
- (Remember the bolded phrase above? It’s the same as for coefficients in linear regression, and it’s one of the most important things to remember when interpreting coefficients!)
- The *odds* is just the probability of surviving divided by the probability of not surviving: \[ P(Y = 1 | X)/P(Y = 0 | X) \] When does the odds equal 1? What does it mean if the odds is less than 1, or greater than 1?
- To restate the interpretation of the coefficient: the predicted odds of survival for males, holding other predictors constant, equal about 0.06 times the predicted odds of survival for non-males, holding other predictors constant

```
sv <- titanic %>%
  group_by(Sex) %>%
  summarize(Survival = mean(Survived)) %>%
  mutate(Odds = Survival / (1 - Survival))
sv
```

```
## # A tibble: 2 x 3
## Sex Survival Odds
## <chr> <dbl> <dbl>
## 1 female 0.742 2.88
## 2 male 0.189 0.233
```

`sv$Odds[2] / sv$Odds[1]`

`## [1] 0.08096732`

- OMG, why is this about 0.08 instead of about 0.06, like we expected from the model?…

`titanic %>% group_by(Sex) %>% summarize_if(is.numeric, list(mean = mean))`

```
## # A tibble: 2 x 8
## Sex Survived_mean Pclass_mean Age_mean SibSp_mean Parch_mean Fare_mean
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fema… 0.742 2.16 23.2 0.694 0.650 44.5
## 2 male 0.189 2.39 24.1 0.430 0.236 25.5
## # ... with 1 more variable: ageNA_mean <dbl>
```

- Because “all other things being equal” is not true on average in the sample: there are differences in the other predictor variables between males and non-males

`exp(coef(fit))`

```
## (Intercept) Pclass Sexmale Age SibSp
## 143.13841375 0.34210842 0.06284354 0.96127588 0.70270421
## Parch Fare ageNA
## 0.88462151 1.00287837 0.26778625
```

- What does this mean? What’s the relationship between the odds of survival for people whose age was missing in this data, compared to those whose age was not missing?
- What about the meaning of the `Age` coefficient?
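One way to read the `Age` coefficient: each additional year of age multiplies the predicted odds of survival by \(e^{-0.0395} \approx 0.96\), holding the other predictors constant. Because odds multipliers compound, the effect over several years is the single-year multiplier raised to a power:

```r
exp(coef(fit)["Age"])       # odds multiplier per additional year, about 0.96
exp(coef(fit)["Age"])^10    # per 10 additional years, about 0.96^10 (roughly 0.67)
exp(coef(fit)["Age"] * 10)  # same number, computed directly on the log-odds scale
```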

- Just like with linear models, we can use the `predict()` function
- Tip: for logistic regression, to get the predicted probability of \(Y = 1\) instead of the predicted log-odds, specify `type = "response"`
```
titanic$predicted <- predict(fit, type = "response")
tplot <- ggplot(titanic, aes(factor(Survived), predicted))
tplot + geom_boxplot()
```

`tplot + geom_boxplot(aes(fill = Sex))`

- How do we get to an actual classification of 0 or 1? By “thresholding” the predicted probability
- Pick a cutoff \(c\); if the predicted survival probability is higher than \(c\), classify that observation as 1, otherwise classify it as 0
- Can choose \(c\) to trade off between false positives and false negatives. This is useful if one of those kinds of errors is costlier than the other
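A sketch of thresholding at \(c = 0.5\), tabulating the resulting classifications against the true outcome (a simple confusion matrix):

```r
cutoff <- 0.5  # raise or lower to trade off false positives vs false negatives
classified <- as.numeric(titanic$predicted > cutoff)
table(truth = titanic$Survived, classified = classified)
```

The off-diagonal cells of this table count the two kinds of errors, so re-running it with different cutoffs shows the tradeoff directly.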

- Getting predicted *probabilities* out of this model is a cool advantage of logistic regression compared to some other classification methods which don’t output probabilities
- Why? Because I can also tell you, for each observation that I classify, how *confident* I am about that classification. This is useful for identifying the cases that are harder to classify than others (ones with predicted probabilities close to the cutoff \(c\)); that information could inform further scientific/business decisions, for example about what kind of data to try to gather in the future to improve predictions

`autoplot(fit)`