Why it’s bad

  • If the collinearity is very strong (or exact), it may not be possible to compute the coefficients at all; even when it is, the estimates can carry large numerical error.

  • Prevents us from including all the levels of each categorical variable (doing so creates exact collinearity with the intercept), which makes it harder to interpret coefficients when several categorical predictors appear in the same model

  • Increases uncertainty about coefficients for the collinear variables (higher standard errors, wider confidence intervals)

  • Test error will likely be much worse than training error (a symptom of overfitting)
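The inflated standard errors are easy to see in a small simulation. The sketch below (plain NumPy, made-up data and variable names) fits the same response twice, once with an unrelated second predictor and once with a nearly duplicated one, and compares the standard error on the first slope:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
noise = rng.normal(size=n)

def slope_standard_errors(x2):
    """Fit y = b0 + b1*x1 + b2*x2 by OLS; return standard errors of all coefficients."""
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 0.5 * x2 + noise
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])        # residual variance estimate
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

se_indep  = slope_standard_errors(rng.normal(size=n))           # x2 unrelated to x1
se_collin = slope_standard_errors(x1 + 0.01 * rng.normal(size=n))  # x2 nearly equal to x1

print("SE on x1, independent x2:", se_indep[1])
print("SE on x1, collinear  x2:", se_collin[1])
```

With x2 almost equal to x1, the model cannot tell the two slopes apart, so the standard errors on both blow up even though the fitted values for y are essentially unchanged.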

What to do

  • Avoid redundancy. If your data contains several variables which are essentially different ways of measuring the same thing, don’t use all of them as predictors
  • Other methods are available but outside the scope of this class: principal components regression, ridge regression
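Although ridge regression is outside the scope of this class, its basic idea fits in a few lines: add a penalty λ to the diagonal of XᵀX, which stabilizes the matrix inversion that collinearity makes ill-conditioned. A minimal NumPy sketch on simulated data (for simplicity this version also penalizes the intercept, which standard implementations avoid):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

def ridge(X, y, lam):
    """Ridge solution (X'X + lam*I)^{-1} X'y; lam = 0 recovers ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 10.0)

# The OLS slopes on x1 and x2 individually can be wild (only their sum is
# well determined); ridge shrinks them toward each other and stabilizes the fit.
print("OLS slopes:  ", beta_ols[1:])
print("Ridge slopes:", beta_ridge[1:])
```

The sum of the two slopes is well estimated either way, but ridge keeps the individual coefficients from taking huge offsetting values.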

Simpson’s paradox and ecological correlation

Association is not causation