- Collinearity
- Simpson’s paradox and ecological regressions
- Association is not causation
- Measurement

- Suppose we are doing a regression \(y \sim x_1 + x_2 + x_3\)
- Collinearity is a potential problem among the predictor variables
- Occurs when two predictor variables are highly correlated with each other \(x_1 \sim x_2\)
- Or when some subset of predictor variables can be used to predict one of the other predictor variables \(x_3 \sim x_1 + x_2\)

Stops us from including all the levels of different categorical variables, making it harder to interpret coefficients for several categorical predictors in the same model

Increases uncertainty about coefficients for the collinear variables (higher standard errors, wider confidence intervals)

Test error will likely be worse compared to training error (overfitting)

Avoid redundancy. If your data contains several variables which are essentially different ways of measuring the same thing, don’t use all of them as predictors

Other methods are available but outside the scope of this class: principal components regression, ridge regression

Ecological correlation refers to when data has been aggregated in some way before being used for a correlation/regression analysis. For example, comparing data at the level of states or countries, rather than at the individual level

Ecological correlation can be misleading because of “Simpson’s paradox,” a term to describe when the relationship within each group is different than in the overall population

For example, it is possible there is a positive correlation between \(Y\) and \(X\) in the country overall, but within every state the correlation between \(Y\) and \(X\) is negative. How is this possible? (draw pictures)

Example: https://economix.blogs.nytimes.com/2013/05/01/can-every-group-be-worse-than-average-yes/

Simpson’s paradox can also happen outside of ecological correlation, when instead of a grouping/aggregating variable (like state or country) there is some other kind of

**confounding**variable that reverses the direction of the relationshipe.g. The Navy noticed that, for sailors who fell overboard, there was a negative correlation between survival and wearing a life jacket.

**Why?**Because they usually wore life jackets during bad weather, but not during good weather, so those who fell over without jackets were falling into calm water, and those who fell with jackets were falling during storms. In this case the**weather**is a confounding variable

Regression, including every possible confounding variable as a predictor

e.g. to determine effect of life jackets on survival, need to

**control for**weather by including it as a predictor varible in some wayHow could we control for a grouping variable, like states? (Include categorical predictor, for example)

Chocolate and Nobel prize example seems funny because it’s

*obviously*wrong, but what about life expectancy and GDP?What about some genetic measurement and some health outcome?

A business decision and the subsequent performance of the firm?

Keep in mind the Bradford Hill criteria like strength (a large enough effect size to actually explain the outcomes), consistency (multiple studies from different sources, using different kinds of data and analysis), plausible mechanism

Use methods designed specifically for causal inference (a more advanced topic you can learn more about now if you’re interested)

- Andrew Gelman: the most important topic not covered in statistics textbooks is measurement
Reliability: like reproducibility, if we did it again and possibly in a slightly different way would we get the same result?

Validity: are the variables in our data actually measuring what we claim to be using them for?

Is the thing being measured actually worth measuring? Is the data just noisey garbage?

There is generally going to be a big difference between data that was collected passively and then later used for some reason that wasn’t anticipated at the time of collection, and data that was collected to answer a specific question