Outline

Classification of categorical outcome variables
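
These notes don’t show the setup code; the examples below appear to rely on the titanic package for the data and the tidyverse for the pipelines. A sketch of the assumed setup:

# assumed setup (not shown in the original notes)
library(titanic)    # provides the titanic_train data
library(tidyverse)  # dplyr, tidyr, and ggplot2 are used throughout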

head(titanic_train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

Missing values

Check if you have missing values

# Count number of missing values, if any, for each variable in the data
titanic_train %>% 
  summarise_if(function(var) any(is.na(var)),
               function(var) sum(is.na(var)))
##   Age
## 1 177
  • The Age variable has 177 missing values, out of a sample of size 891
  • No other variables have any missing values
  • (The code above also works on data sets where several variables have missing values; it counts the missing values for each. A newer across() spelling is sketched below)
  • (dplyr in the tidyverse is so cool, srsly)
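
In newer dplyr (1.0 and later), the same count can be written with across(); a sketch, with where() keeping only the variables that actually have missing values:

# count missing values per variable, then keep variables with any
titanic_train %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% 
  select(where(~ .x > 0))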

Create a new categorical predictor to indicate missing values

# leave out variables we don't want
# create a new variable to indicate missing Age
# set missing values of Age to 0
titanic <- titanic_train %>% 
  select(-PassengerId, -Name, -Ticket, -Cabin, -Embarked) %>% 
  mutate(ageNA = as.numeric(is.na(Age))) %>% 
  replace_na(list(Age = 0)) 
head(titanic)
##   Survived Pclass    Sex Age SibSp Parch    Fare ageNA
## 1        0      3   male  22     1     0  7.2500     0
## 2        1      1 female  38     1     0 71.2833     0
## 3        1      3 female  26     0     0  7.9250     0
## 4        1      1 female  35     1     0 53.1000     0
## 5        0      3   male  35     0     0  8.0500     0
## 6        0      3   male   0     0     0  8.4583     1
  • Now the coefficient for the Age variable will have to be interpreted carefully: it quantifies the relationship between Age and the predicted outcome variable conditional on Age not being missing. This means the relationship does not apply to the whole sample, but only the subset of the sample without missing data.
  • The coefficient for our new categorical predictor, ageNA, quantifies how the predicted outcome changes for people whose Age is missing from the data. Since their Age was set to 0, the Age term contributes nothing for them, and ageNA acts as an overall shift, like an intercept, for this subset of the sample
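
As a quick sanity check on the new indicator (a sketch):

# the indicator should sum to the 177 missing ages counted earlier
sum(titanic$ageNA)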

Logistic regression

fit <- glm(Survived ~ ., data = titanic, family = binomial)
summary(fit)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7113  -0.6110  -0.4287   0.6115   2.4606  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.963812   0.534151   9.293  < 2e-16 ***
## Pclass      -1.072628   0.140386  -7.641 2.16e-14 ***
## Sexmale     -2.767107   0.199524 -13.869  < 2e-16 ***
## Age         -0.039494   0.007795  -5.066 4.05e-07 ***
## SibSp       -0.352819   0.110555  -3.191  0.00142 ** 
## Parch       -0.122595   0.118718  -1.033  0.30176    
## Fare         0.002874   0.002368   1.214  0.22484    
## ageNA       -1.317566   0.320926  -4.106 4.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  788.33  on 883  degrees of freedom
## AIC: 804.33
## 
## Number of Fisher Scoring iterations: 5
Exponentiating a coefficient moves it from the log-odds scale to the odds-ratio scale. For the Sex coefficient:

exp(coef(fit)[3])
##    Sexmale 
## 0.06284354
sv <- titanic %>% group_by(Sex) %>% 
  summarize(Survival = mean(Survived)) %>% 
  mutate(Odds = Survival/(1-Survival))
sv
## # A tibble: 2 x 3
##   Sex    Survival  Odds
##   <chr>     <dbl> <dbl>
## 1 female    0.742 2.88 
## 2 male      0.189 0.233
sv$Odds[2] / sv$Odds[1]
## [1] 0.08096732
The unadjusted odds ratio (about 0.081) differs from the model’s (about 0.063) because the model conditions on the other predictors, and the sexes differ on several of them; for example, women on average traveled in higher classes and paid higher fares:

titanic %>% group_by(Sex) %>% summarize_if(is.numeric, list(mean = mean))
## # A tibble: 2 x 8
##   Sex   Survived_mean Pclass_mean Age_mean SibSp_mean Parch_mean Fare_mean
##   <chr>         <dbl>       <dbl>    <dbl>      <dbl>      <dbl>     <dbl>
## 1 fema…         0.742        2.16     23.2      0.694      0.650      44.5
## 2 male          0.189        2.39     24.1      0.430      0.236      25.5
## # ... with 1 more variable: ageNA_mean <dbl>

What about the other variables? ageNA?

exp(coef(fit))
##  (Intercept)       Pclass      Sexmale          Age        SibSp 
## 143.13841375   0.34210842   0.06284354   0.96127588   0.70270421 
##        Parch         Fare        ageNA 
##   0.88462151   1.00287837   0.26778625
  • What does this mean? What’s the relationship between the odds of survival for people whose age was missing in this data, compared to those whose age was not missing?

  • What about the meaning of the Age coefficient? (a quick numeric sketch follows)
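
One way to poke at these questions numerically, using the fitted coefficients above (a sketch, not a full answer):

# odds ratio for missing vs. non-missing Age, other predictors held fixed
exp(coef(fit)["ageNA"])
# odds multiplier per extra year of age, among passengers with observed ages;
# for ten extra years: exp(10 * (-0.039494)) is about 0.67
exp(10 * coef(fit)["Age"])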

Predicting the outcome variable

  • Just like with linear models, we can use the predict() function
  • Tip: for logistic regression, to get predicted probability of \(Y = 1\) instead of predicted log-odds, specify type = "response"
titanic$predicted <- predict(fit, type = "response")
tplot <- ggplot(titanic, aes(factor(Survived), predicted))
tplot + geom_boxplot()

tplot + geom_boxplot(aes(fill = Sex))

  • How do we get to an actual classification of 0 or 1? By “thresholding” the predicted probability (a sketch follows this list)
  • Pick a cutoff \(c\): if the predicted survival probability is higher than \(c\), classify that observation as 1; otherwise classify it as 0
  • We can choose \(c\) to trade off between false positives and false negatives
  • This is useful if one of those kinds of errors is costlier than the other
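
A minimal sketch of the thresholding step, assuming a cutoff of \(c = 0.5\) (the cutoff and the variable names below are our choices, not part of the model):

# classify as survived (1) when the predicted probability exceeds the cutoff
cutoff <- 0.5
titanic$classified <- as.numeric(titanic$predicted > cutoff)
# cross-tabulate the classifications against the actual outcomes
table(truth = titanic$Survived, classified = titanic$classified)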

  • Getting predicted probabilities out of this model is a cool advantage of logistic regression compared to some other classification methods that don’t output probabilities
  • Why? Because for each observation we classify, we can also say how confident we are in that classification. This is useful for identifying the cases that are harder to classify than others, namely those with predicted probabilities close to the cutoff \(c\). That information could inform further scientific or business decisions, for example about what kind of data to gather in the future to improve predictions (the sketch below counts these borderline cases)
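
One way to count those borderline cases (a sketch; the 0.1 band around the cutoff is an arbitrary choice for illustration):

# passengers whose predicted probability falls near the cutoff of 0.5
titanic %>% 
  filter(abs(predicted - 0.5) < 0.1) %>% 
  nrow()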

Diagnostic plots

# autoplot() for fitted model objects is provided by the ggfortify package
library(ggfortify)
autoplot(fit)