- Covariance: interaction between variables
- Correlation: standardized covariance
- Summarizing two continuous variables
- Things to worry about

- Until this point we’ve focused on methods for studying one variable (at a time), or *independent* and identically distributed samples of one random variable
- Now we’ll start considering relationships between two variables, and dependence
- Here’s an example plot to get a visual idea about what we’re doing:

```
library(ggplot2)   # plotting; also provides the mpg dataset
library(dplyr)     # for filter()
library(ggthemes)  # for theme_tufte()
library(ggrepel)   # for geom_text_repel()

mpg2008 <- filter(mpg, year == 2008)
mpgplot <- ggplot(mpg2008, aes(cty, hwy)) +
  theme_tufte() +
  ggtitle("Miles per gallon") +
  geom_text_repel(data = subset(mpg2008, cty >= 26 | cty == 9),
                  mapping = aes(label = model))
mpgplot + geom_point(alpha = .5)
```

`mpgplot + geom_point() + geom_jitter()`

We’ll start with a bit of math and then move on to data, interpretation, more plots, etc.

- Think about two random variables, \(X\) and \(Y\).
- Remember linearity for expectation? \(E[X + Y] = E[X] + E[Y]\)
- For variance, we needed independence: if \(X\) and \(Y\) are independent, then \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)
- What if \(X\) and \(Y\) are *not* independent (i.e. are dependent)? **Covariance** measures how they vary together: \[ \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] \]
- Roughly speaking:
  - If larger values of \(X\) occur together with larger values of \(Y\), their covariance is positive
  - If larger values of \(X\) occur with smaller values of \(Y\), their covariance is negative
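
A quick simulation can make the sign intuition concrete. This sketch (not from the original notes) generates one `Y` that tends to increase with `X` and one that tends to decrease:

```
# Illustrative simulation: the sign of the covariance tracks whether
# Y tends to move with X or against it.
set.seed(1)
n <- 1000
X <- rnorm(n)
Y_pos <- X + rnorm(n)    # larger X tends to go with larger Y
Y_neg <- -X + rnorm(n)   # larger X tends to go with smaller Y
cov(X, Y_pos)  # positive
cov(X, Y_neg)  # negative
```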

- What is \(\text{Cov}(Y,X)\)? (symmetry)
- What is \(\text{Cov}(X,X)\)? (\(\text{Var}(X)\))
- If \(E[X] = E[Y] = 0\), then \(\text{Cov}(X,Y) = E[XY]\)
- If we add a constant to \(X\), what happens to the covariance?
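
Each of these properties can be checked numerically. A small sketch (illustrative, with arbitrary simulated data):

```
# Numerically checking the properties listed above.
set.seed(2)
X <- rnorm(100)
Y <- rnorm(100)
all.equal(cov(X, Y), cov(Y, X))      # symmetry: Cov(Y, X) = Cov(X, Y)
all.equal(cov(X, X), var(X))         # Cov(X, X) = Var(X)
all.equal(cov(X + 5, Y), cov(X, Y))  # adding a constant changes nothing:
                                     # the deviations from the mean are unchanged
```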

- Now we can answer this old question: \[ \text{Var}(X + Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y) \]
- Does this formula remind you of something in algebra? \((x+y)^2 = x^2 + 2xy + y^2\) (not a coincidence!)
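
The identity holds exactly for the sample versions too (they all share the same \(1/(n-1)\) factor), so we can verify it on simulated data. A hedged sketch:

```
# Checking Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y) on sample statistics.
set.seed(3)
X <- rnorm(200)
Y <- X + rnorm(200)  # deliberately dependent on X, so Cov(X, Y) != 0
all.equal(var(X + Y), var(X) + 2 * cov(X, Y) + var(Y))
```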

- Covariance between two random variables can be thought of as another unknown parameter
- We now focus on calculating the **sample covariance** when we have data (this is another example of using the plug-in principle to estimate an unknown parameter)

- The data has to come in pairs: \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\).
- One way of thinking about this is that it makes sense to put them in columns next to each other in the same spreadsheet
- Remember the paired \(t\)-test for a difference in means? Similarly, it makes sense if observations of \(X\) and \(Y\) are measuring two aspects of one underlying thing, like city and highway mpg for the same car, or expenses and earnings for the same company
- If the two variables aren’t measured in pairs like this, we can’t calculate covariance
  - Why is that? (Example: midterm grades and hours studying per week)
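
`R` enforces this pairing requirement directly: `cov()` refuses vectors of different lengths, since there is no way to match up the observations. A small sketch with made-up numbers:

```
# The sample covariance needs paired observations (x_i, y_i).
x <- c(1, 2, 3, 4)
y <- c(2, 4, 5)   # one observation missing: no pairing possible
try(cov(x, y))    # errors: incompatible dimensions
```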

- The sample covariance formula is \[ \text{cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \]
  - In spreadsheet terms: first center each variable by subtracting its mean, then multiply the two centered variables together, then sum those products and divide by \(n - 1\) (almost an average, with the same \(n-1\) divisor as the sample variance)
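
The formula is easy to compute “by hand” and check against `R`’s built-in `cov()`. A sketch with made-up numbers (not from the original notes):

```
# Sample covariance from the formula, compared with cov().
x <- c(1, 3, 5, 7)
y <- c(2, 3, 7, 9)
n <- length(x)
by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
by_hand           # 25/3, about 8.33
cov(x, y)         # same value
```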

Example using `R` functions for covariance:

```
library(ggplot2)   # for qplot()
library(ggthemes)  # for theme_tufte()

set.seed(42)  # make the simulation reproducible
n <- 30
X <- runif(n)
Y <- 2*X^2 + .5*rnorm(n)
# Scatterplot
qplot(X, Y) + theme_tufte()
# Sample covariance
cov(X, Y)
```
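
One caveat worth previewing: covariance depends on the units of the variables, which is why the outline moves on to correlation, its standardized version. A hedged sketch with similar simulated data:

```
# Covariance changes with the scale of the variables; correlation does not.
set.seed(4)
n <- 30
X <- runif(n)
Y <- 2 * X^2 + .5 * rnorm(n)
cov(X, Y)        # depends on the units of X and Y
cov(10 * X, Y)   # rescaling X rescales the covariance by the same factor
cor(X, Y)        # correlation is scale-free: cov(X, Y) / (sd(X) * sd(Y))
```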