Why it’s bad

  • Prevents us from including all levels of a categorical variable as dummy predictors (perfect collinearity), and makes coefficients harder to interpret when several categorical predictors appear in the same model

  • Increases uncertainty about coefficients for the collinear variables (higher standard errors, wider confidence intervals)

  • Test error will likely be much worse than training error, a sign of overfitting

What to do

  • Avoid redundancy. If your data contains several variables that are essentially different ways of measuring the same thing, don’t use all of them as predictors

  • Other methods are available but outside the scope of this class: principal components regression, ridge regression
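The effect on standard errors can be seen directly in a small simulation. This is an illustrative sketch, not part of the notes: the data, variable names, and helper `ols_se` are all made up, and the standard-error formula is the usual OLS one, sqrt of the diagonal of σ²(XᵀX)⁻¹.

```python
# Sketch: a nearly redundant predictor inflates coefficient standard errors.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # essentially a second copy of x1
y = 2.0 * x1 + rng.normal(size=n)

def ols_se(X, y):
    """OLS coefficient standard errors: sqrt(diag(sigma^2 * (X'X)^-1))."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

X_single = np.column_stack([np.ones(n), x1])
X_both = np.column_stack([np.ones(n), x1, x2])

se_single = ols_se(X_single, y)[1]  # SE of x1's coefficient, on its own
se_both = ols_se(X_both, y)[1]      # SE of x1's coefficient, with x2 included
# se_both is many times larger than se_single, even though adding x2
# barely changes the fitted values
```

Dropping the redundant predictor (or using ridge/principal components regression) shrinks those standard errors back down.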

Simpson’s paradox and ecological correlation

What to do

  • Use regression, including potential confounding variables as predictors

  • e.g. to determine the effect of life jackets on survival, we need to control for weather by including it as a predictor variable in some way

  • How could we control for a grouping variable, like states? (Include categorical predictor, for example)
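The grouping idea above can be sketched with a tiny made-up dataset: within each group y rises with x, but the pooled regression slope is negative; adding the group as a dummy-coded predictor recovers the within-group slope. All numbers here are invented for illustration.

```python
# Sketch: Simpson's paradox, and controlling for a grouping variable
# by including it as a dummy-coded predictor.
import numpy as np

x = np.array([0.0, 0.5, 1.0, 2.0, 2.5, 3.0])
g = np.array([0, 0, 0, 1, 1, 1])   # grouping variable (e.g. which state)
y = x + 3.0 - 3.0 * g              # slope is exactly 1 inside each group

# Pooled regression, ignoring the groups
X_pooled = np.column_stack([np.ones_like(x), x])
slope_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

# Regression that controls for the group with a dummy variable
X_adj = np.column_stack([np.ones_like(x), x, g])
slope_adj = np.linalg.lstsq(X_adj, y, rcond=None)[0][1]

# slope_pooled is negative (about -0.29) even though every group trends up;
# slope_adj recovers the within-group slope of 1.0
```

With more than two groups, the same trick works by dummy-coding each group level (minus one baseline), exactly as with any categorical predictor.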

Association is not causation

What to do

  • Keep in mind the Bradford Hill criteria, such as strength (an effect size large enough to actually explain the outcomes), consistency (multiple studies from different sources, using different kinds of data and analysis), and a plausible mechanism

  • Use methods designed specifically for causal inference (a more advanced topic you can learn more about now if you’re interested)