#### Outline

• Collinearity
• Simpson’s paradox and ecological regressions
• Association is not causation
• Measurement

## Collinearity

• Suppose we are doing a regression $$y \sim x_1 + x_2 + x_3$$
• Collinearity is a potential problem among the predictor variables
• Occurs when two predictor variables are highly correlated with each other $$x_1 \sim x_2$$
• Or when some subset of predictor variables can be used to predict one of the other predictor variables $$x_3 \sim x_1 + x_2$$

• Stops us from including all the levels of different categorical variables, making it harder to interpret coefficients for several categorical predictors in the same model

• Increases uncertainty about coefficients for the collinear variables (higher standard errors, wider confidence intervals)

• Test error will likely be worse compared to training error (overfitting)

### What to do

• Avoid redundancy. If your data contains several variables which are essentially different ways of measuring the same thing, don’t use all of them as predictors

• Other methods are available but outside the scope of this class: principal components regression, ridge regression

## Simpson’s paradox and ecological correlation

• Ecological correlation refers to when data has been aggregated in some way before being used for a correlation/regression analysis. For example, comparing data at the level of states or countries, rather than at the individual level

• Ecological correlation can be misleading because of “Simpson’s paradox,” a term to describe when the relationship within each group is different than in the overall population

• For example, it is possible there is a positive correlation between $$Y$$ and $$X$$ in the country overall, but within every state the correlation between $$Y$$ and $$X$$ is negative. How is this possible? (draw pictures)

• Simpson’s paradox can also happen outside of ecological correlation, when instead of a grouping/aggregating variable (like state or country) there is some other kind of confounding variable that reverses the direction of the relationship

• e.g. The Navy noticed that, for sailors who fell overboard, there was a negative correlation between survival and wearing a life jacket. Why? Because they usually wore life jackets during bad weather, but not during good weather, so those who fell over without jackets were falling into calm water, and those who fell with jackets were falling during storms. In this case the weather is a confounding variable

### What to do

• Regression, including every possible confounding variable as a predictor

• e.g. to determine effect of life jackets on survival, need to control for weather by including it as a predictor varible in some way

• How could we control for a grouping variable, like states? (Include categorical predictor, for example)

## Association is not causation

• Chocolate and Nobel prize example seems funny because it’s obviously wrong, but what about life expectancy and GDP?

• What about some genetic measurement and some health outcome?

• A business decision and the subsequent performance of the firm?

### What to do

• Keep in mind the Bradford Hill criteria like strength (a large enough effect size to actually explain the outcomes), consistency (multiple studies from different sources, using different kinds of data and analysis), plausible mechanism