#### Outline

• Covariance: interaction between variables
• Correlation: standardized covariance
• Summarizing two continuous variables

## Covariance

• Until this point we’ve focused on methods for studying one variable at a time, or independent and identically distributed samples of one random variable
• Now we’ll start considering relationships between two variables, and dependence
• Here’s an example plot to get a visual idea about what we’re doing:
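library(dplyr)     # filter()
library(ggplot2)   # mpg data and ggplot()
library(ggthemes)  # theme_tufte()
library(ggrepel)   # geom_text_repel()
# Keep only the 2008 cars (year is stored as a number)
mpg2008 <- filter(mpg, year == 2008)
# Base plot: city vs. highway mpg, with labels on the extreme cars
mpgplot <- ggplot(mpg2008, aes(cty, hwy)) +
  theme_tufte() +
  ggtitle("Miles per gallon") +
  geom_text_repel(data = subset(mpg2008, cty >= 26 | cty == 9),
                  mapping = aes(label = model))
mpgplot + geom_point(alpha = .5)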

# Jitter alone: adding geom_point() as well would draw every car twice
mpgplot + geom_jitter()

### Mathematical background

• We’ll start with a bit of math and then move to data, interpretation, more plots, etc.

• Think about two random variables, $$X$$ and $$Y$$.
• Covariance measures how they vary together: $\text{Cov}(X, Y) = E[(X- E[X])(Y - E[Y])]$
• Roughly speaking:
• If larger values of $$X$$ occur together with larger values of $$Y$$, their covariance is positive
• If larger values of $$X$$ occur with smaller values of $$Y$$, their covariance is negative
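
To make the sign rule concrete, here is a small simulation sketch in base R. The variable names, the seed, and the helper `avg_dev_product()` are assumptions made for illustration; the helper is just the sample analogue of $$E[(X - E[X])(Y - E[Y])]$$, with means replacing expectations.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible
n <- 1000
x <- rnorm(n)
noise <- rnorm(n)

y_up   <-  x + noise  # larger x tends to occur with larger y
y_down <- -x + noise  # larger x tends to occur with smaller y

# Hypothetical helper: average product of deviations from the means
avg_dev_product <- function(a, b) mean((a - mean(a)) * (b - mean(b)))

cov_up   <- avg_dev_product(x, y_up)    # positive
cov_down <- avg_dev_product(x, y_down)  # negative
print(c(up = cov_up, down = cov_down))
```

The products of deviations are mostly positive when both variables sit on the same side of their means at the same time, and mostly negative when they sit on opposite sides, which is what drives the sign.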

• What is $$\text{Cov}(Y,X)$$? (symmetry)
• What is $$\text{Cov}(X,X)$$? ($$\text{Var}(X)$$)
• If $$E[X] = E[Y] = 0$$, then $$\text{Cov}(X,Y) = E[XY]$$
• What if we add a constant to $$X$$? What happens to the covariance?
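
These properties can be checked numerically. The sketch below uses base R's `cov()` and `var()` on simulated data (the data and the constant 10 are arbitrary choices for illustration). The same identities hold for sample covariance as for the population version, up to floating-point error.

```r
set.seed(2)  # arbitrary seed for reproducibility
x <- runif(50)
y <- x + rnorm(50)

# Symmetry: Cov(X, Y) equals Cov(Y, X)
symmetric <- isTRUE(all.equal(cov(x, y), cov(y, x)))

# Covariance of a variable with itself is its variance
self_is_var <- isTRUE(all.equal(cov(x, x), var(x)))

# Adding a constant shifts X and its mean by the same amount,
# so the deviations (and hence the covariance) do not change
shift_invariant <- isTRUE(all.equal(cov(x + 10, y), cov(x, y)))

print(c(symmetric, self_is_var, shift_invariant))
```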

• Variance of a sum involves covariance: $\text{Var}(X + Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y)$
• Does this formula remind you of something in algebra? $$(x+y)^2 = x^2 + 2xy + y^2$$ (not a coincidence!)
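
As a quick sanity check, the variance-of-a-sum formula can be verified on simulated data. This is only a sketch with arbitrary simulated inputs. The identity also holds exactly for the sample versions computed by `var()` and `cov()`, because both divide by $$n - 1$$.

```r
set.seed(3)  # arbitrary seed for reproducibility
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)  # y is built to depend on x

lhs <- var(x + y)
rhs <- var(x) + 2 * cov(x, y) + var(y)

# The two sides agree up to floating-point error
print(all.equal(lhs, rhs))
```

Dropping the `2 * cov(x, y)` term would only be safe when the covariance is zero, for example when the two variables are independent.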

### Calculating from samples of data

• Covariance between two random variables can be thought of as another unknown parameter
• We now focus on calculating the sample covariance when we have data
• (This is another example of using the plug-in principle to estimate an unknown parameter)

• The data has to come in pairs: $$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$$.
• One way of thinking about this is that it makes sense to put them in columns next to each other in the same spreadsheet
• Remember the paired $$t$$-test for a difference in means? Similarly, it makes sense if observations of $$X$$ and $$Y$$ are measuring two aspects of one underlying thing, like city and highway mpg for the same car, or expenses and earnings for the same company
• If the two variables aren’t measured in pairs like this, we can’t calculate covariance

• The sample covariance formula is $\text{cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)$
• In spreadsheet terms: first center each variable by subtracting its mean, then multiply the two centered variables together, then add up the products and divide by $$n - 1$$ (almost the average of that product, but with $$n - 1$$ in place of $$n$$)
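
The spreadsheet recipe can be sketched step by step in base R and compared with the built-in `cov()`. The choice of the built-in `mtcars` data (car weight and miles per gallon) is an assumption made here for illustration. Any two paired numeric columns would do.

```r
# Paired measurements: weight and mpg for the same cars
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Step 1: center each variable by subtracting its mean
x_centered <- x - mean(x)
y_centered <- y - mean(y)

# Step 2: multiply the centered variables pair by pair
products <- x_centered * y_centered

# Step 3: add up the products and divide by n - 1
manual_cov <- sum(products) / (n - 1)

# The built-in function gives the same number
builtin_cov <- cov(x, y)
print(c(manual = manual_cov, builtin = builtin_cov))
```

The result is negative here, since heavier cars tend to get fewer miles per gallon.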

• Example using R functions for covariance:

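n <- 30
# Simulate paired data: Y follows a quadratic trend in X plus noise
X <- runif(n)
Y <- 2*X^2 + .5*rnorm(n)
# Scatterplot (ggplot() is used because qplot() is deprecated in recent ggplot2)
ggplot(data.frame(X, Y), aes(X, Y)) + geom_point() + theme_tufte()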