Outline

Covariance

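library(ggplot2)   # ggplot(), mpg data
library(dplyr)     # filter()
library(ggthemes)  # theme_tufte()
library(ggrepel)   # geom_text_repel()
# City vs. highway mpg for 2008 models; label the most extreme cars
mpg2008 <- filter(mpg, year == 2008)  # year is stored as an integer
mpgplot <- ggplot(mpg2008, aes(cty, hwy)) +
    theme_tufte() +
    ggtitle("Miles per gallon") +
    geom_text_repel(data = subset(mpg2008, cty >= 26 | cty == 9),
                      mapping = aes(label = model))
mpgplot + geom_point(alpha = .5)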

# Jitter the points to reduce overplotting (mpg values are integers)
mpgplot + geom_jitter()

Mathematical background

  • We’ll start with a bit of math and then move to data, interpretation, more plots, etc.

  • Think about two random variables, \(X\) and \(Y\).
  • Remember linearity of expectation? \(E[X + Y] = E[X] + E[Y]\)
  • For variance, we needed independence: if \(X\) and \(Y\) are independent, then \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)
  • What if \(X\) and \(Y\) are not independent? (i.e. are dependent)
  • Covariance measures how they vary together: \[ \text{Cov}(X, Y) = E[(X- E[X])(Y - E[Y])] \]
  • Roughly speaking:
      • If larger values of \(X\) occur together with larger values of \(Y\), their covariance is positive
      • If larger values of \(X\) occur with smaller values of \(Y\), their covariance is negative
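
A small simulation can make the sign of the covariance concrete. This is only a sketch on made-up simulated data: the names `x`, `y_up`, `y_down` and the seed are illustrative assumptions, not part of the mpg example.

```r
# Illustrative simulation (hypothetical data, not the mpg data)
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 1000
x <- rnorm(n)
y_up   <-  x + rnorm(n)          # tends to be large when x is large
y_down <- -x + rnorm(n)          # tends to be small when x is large

cov_up   <- cov(x, y_up)         # sample covariance, should come out positive
cov_down <- cov(x, y_down)       # sample covariance, should come out negative
print(c(cov_up = cov_up, cov_down = cov_down))
```

Plotting `x` against `y_up` and against `y_down` should show an upward and a downward cloud of points, matching the two signs.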

  • What is \(\text{Cov}(Y,X)\)? (symmetry)
  • What is \(\text{Cov}(X,X)\)? (\(\text{Var}(X)\))
  • If \(E[X] = E[Y] = 0\), then \(\text{Cov}(X,Y) = E[XY]\)
  • What happens to the covariance if we add a constant to \(X\)?
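
These properties can be checked numerically with R's `cov()` and `var()`. The sketch below uses simulated data (the vectors and the constant 10 are arbitrary choices for illustration). Adding a constant to \(X\) shifts \(E[X]\) by the same amount, so the centered values \(X - E[X]\), and hence the covariance, do not change.

```r
# Numerical check of the properties above on simulated (hypothetical) data
set.seed(2)                      # arbitrary seed
x <- rnorm(50)
y <- x + rnorm(50)

symmetric   <- isTRUE(all.equal(cov(x, y), cov(y, x)))       # Cov(X, Y) = Cov(Y, X)
self_is_var <- isTRUE(all.equal(cov(x, x), var(x)))          # Cov(X, X) = Var(X)
shift_same  <- isTRUE(all.equal(cov(x + 10, y), cov(x, y)))  # a constant shift changes nothing
print(c(symmetric = symmetric, self_is_var = self_is_var, shift_same = shift_same))
```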

  • Now we can answer this old question: \[ \text{Var}(X + Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y) \]
  • Does this formula remind you of something in algebra? \((x+y)^2 = x^2 + 2xy + y^2\) (not a coincidence!)
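
The same identity holds for sample variances and covariances, since both use the \(n-1\) divisor, so it is easy to check on data. A minimal sketch with simulated values (the data-generating choices are arbitrary assumptions):

```r
# Check Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y) on simulated data
set.seed(3)                      # arbitrary seed
x <- runif(40)
y <- 2 * x + rnorm(40)           # y depends on x, so Cov(X, Y) is not zero

lhs <- var(x + y)
rhs <- var(x) + 2 * cov(x, y) + var(y)
naive <- var(x) + var(y)         # what we would get by wrongly assuming independence
print(c(lhs = lhs, rhs = rhs, naive = naive))
```

Here `lhs` and `rhs` agree up to floating-point error, while `naive` misses the `2 * cov(x, y)` term.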

Calculating from samples of data

  • Covariance between two random variables can be thought of as another unknown parameter
  • We now focus on calculating the sample covariance when we have data
  • (This is another example of using the plug-in principle to estimate an unknown parameter)

  • The data has to come in pairs: \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\).
  • One way of thinking about this is that it makes sense to put them in columns next to each other in the same spreadsheet
  • Remember the paired \(t\)-test for a difference in means? Similarly, pairing makes sense when observations of \(X\) and \(Y\) measure two aspects of one underlying thing, like city and highway mpg for the same car, or expenses and earnings for the same company
  • If the two variables aren’t measured in pairs like this, we can’t calculate covariance
  • Why is that? (Example: midterm grades and hours studying per week)
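
One way to see why the pairing matters: the same two columns of numbers give different covariances depending on which \(x\) goes with which \(y\). The numbers below are invented for five hypothetical students, purely as an illustration.

```r
# Made-up numbers for five hypothetical students (not real data)
hours <- c(2, 4, 6, 8, 10)       # hours studied per week
grade <- c(60, 65, 75, 80, 90)   # midterm grade, paired with hours by student

cov_paired   <- cov(hours, grade)        # uses the true pairing
cov_reversed <- cov(hours, rev(grade))   # same grades, matched to the wrong students
print(c(paired = cov_paired, reversed = cov_reversed))
```

The means and variances of `hours` and `grade` are identical in both cases, yet the covariance flips from 37.5 to -37.5. Without knowing which grade belongs to which student, the covariance is not determined.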

  • The sample covariance formula is \[ \text{cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \]
  • In spreadsheet terms: first center each variable by subtracting its mean, then multiply the two centered variables together, then add up those products and divide by \(n-1\) (almost, but not exactly, the average of the products)
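
The spreadsheet recipe can be followed step by step and compared with R's built-in `cov()`. The two short vectors are made-up numbers chosen so the arithmetic is easy to follow by hand.

```r
# Tiny made-up data set, chosen only to keep the arithmetic simple
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

x_centered <- x - mean(x)                 # step 1: center each variable
y_centered <- y - mean(y)
products   <- x_centered * y_centered     # step 2: multiply the centered values
cov_by_hand <- sum(products) / (n - 1)    # step 3: sum and divide by n - 1

cov_builtin <- cov(x, y)                  # R's sample covariance
print(c(by_hand = cov_by_hand, builtin = cov_builtin))
```

Both values are 1.5. Using `mean(products)` instead would divide by \(n\) and give 1.2, which is why the last step is not quite a plain average.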

  • Example using R functions for covariance:

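n <- 30
X <- runif(n)
Y <- 2*X^2 + .5*rnorm(n)   # Y depends on X (nonlinearly) plus noise
# Scatterplot (qplot() is deprecated in recent ggplot2, so use ggplot())
ggplot(data.frame(X, Y), aes(X, Y)) + geom_point() + theme_tufte()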