## Outline

• Interpreting p-values
• P-hacking
• Multiple testing
• Confidence intervals are better

## Interpreting p-values

• We’ve learned about a few kinds of tests. They all involve a null hypothesis and (almost always) an alternative hypothesis, and the outcome can be reported as a p-value
• In a 2012 blog post called “R. A. Fisher is the most influential scientist ever,” statisticians Jeff Leek and Rafael Irizarry estimated from Google Scholar that Fisher’s work on p-values would have been cited about 3 million times if everyone who published p-values cited him
• p-values are everywhere and it’s important to understand what they mean
# Assumption: bechdel is the data set from the fivethirtyeight package
library(fivethirtyeight)
movies <- bechdel[complete.cases(bechdel),]
movies$return <- movies$intgross_2013/movies$budget_2013
wilcox.test(return ~ binary, data = movies)
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  return by binary
## W = 299610, p-value = 0.0412
## alternative hypothesis: true location shift is not equal to 0
• This p-value is less than 0.05, so we would reject the null hypothesis at a 5% significance level
• In general, for any significance level higher than the p-value you would reject the null
• P-values can be thought of as the cutoff for significance: here we would reject the null at any significance level above 0.0412 and not reject it at any level below 0.0412
library(MASS)  # provides the survey data
counts <- table(survey$Smoke, survey$Sex)
counts
##
##         Female Male
##   Heavy      5    6
##   Never     99   89
##   Occas      9   10
##   Regul      5   12
chisq.test(counts)
##
##  Pearson's Chi-squared test
##
## data:  counts
## X-squared = 3.5536, df = 3, p-value = 0.3139
• Another interpretation: the p-value is the probability, assuming the null is true, of a test statistic value at least as extreme as the observed value
• The more extreme the test statistic value, the greater the evidence against the null hypothesis
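
To make the "at least as extreme" interpretation concrete, here is a minimal sketch that recomputes the chi-squared p-value by hand. It retypes the counts from the table printed above into a matrix (the variable names are only illustrative) and uses the standard statistic, the sum over cells of (observed - expected)^2 / expected.

```r
# Counts copied from the Smoke-by-Sex table above (filled column by column)
counts <- matrix(c(5, 99, 9, 5, 6, 89, 10, 12), nrow = 4,
                 dimnames = list(c("Heavy", "Never", "Occas", "Regul"),
                                 c("Female", "Male")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-squared statistic: large values mean observed is far from expected
stat <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# p-value: probability, under the null, of a statistic at least this large
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
c(statistic = stat, df = df, p_value = p_value)
```

The result should agree with the `chisq.test(counts)` output above (which may also print a warning, because some expected counts are small).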

• Remember that the chi-squared test statistic adds up, over each cell of the table, $$(\text{observed} - \text{expected})^2 / \text{expected}$$, where the expected count is calculated assuming the null hypothesis of independence. If the test statistic is large, that means the observed values are far from the expected values
• Similarly, remember that the $$t$$-test statistic for comparing two groups involves the difference between their means $$\bar X_1 - \bar X_2$$. If the absolute value of this difference is large (extreme) relative to its standard error, that is evidence against the null hypothesis that the two group means are equal
• The definition of “extreme” depends on the alternative hypothesis. For one-sided alternatives, it is the extreme in the direction of that one side. For example, if the null is $$\mu = \mu_0$$ and the alternative is $$\mu > \mu_0$$, then extreme would correspond to $$\bar X$$ being much higher than $$\mu_0$$ (not much lower)
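
A small sketch of how the alternative changes what counts as "extreme". The data are simulated purely for illustration (the sample size, mean, and seed are arbitrary choices, not from the lecture), and the same sample is tested against each alternative.

```r
set.seed(1)
x <- rnorm(30, mean = 0.3, sd = 1)  # simulated sample, for illustration only
mu0 <- 0                            # null value of the mean

# Extreme = sample mean much higher than mu0
p_greater <- t.test(x, mu = mu0, alternative = "greater")$p.value
# Extreme = sample mean much lower than mu0
p_less <- t.test(x, mu = mu0, alternative = "less")$p.value
# Extreme = sample mean far from mu0 in either direction
p_two <- t.test(x, mu = mu0, alternative = "two.sided")$p.value

c(greater = p_greater, less = p_less, two.sided = p_two)
```

For the t-test, the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.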

• p-values are not the probability that the null hypothesis is true
• p-values are not the probability that the outcome is due to chance
• The null hypothesis is either true or false; it’s not random
• These misinterpretations are astonishingly common…
• They can be avoided by remembering which things are considered random: the data, the test statistic, but not the hypothesis
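
One way to see that the randomness lives in the data, not in the hypothesis, is a simulation sketch like the following. The null hypothesis is true in every repetition (both groups are drawn from the same normal distribution, an assumption made only for this illustration), yet the p-value changes from sample to sample.

```r
set.seed(42)
n_sims <- 2000

# In every repetition the null (equal means) is true by construction
p_values <- replicate(n_sims, {
  x <- rnorm(20)
  y <- rnorm(20)
  t.test(x, y)$p.value
})

# Fraction of repetitions that would be declared significant at the 5% level
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

The fraction should come out close to 0.05: the null is true 100% of the time here, and the p-value is small about 5% of the time.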

## P-hacking

• P-hacking means using various tricks to try to get a p-value to be small (less than 0.05)
• I have to tell you what it is so you know what not to do
• Don’t do it
• It’s a form of cheating
• There are lots of embarrassing examples of it coming back to bite people

• Examples of what not to do
• Keep asking slightly different questions until the p-value is small
• Keep adding new data to your sample but stop once the p-value is small
• Remove certain data points (call them “outliers”) until the p-value is small
• Measure and test lots of variables, but only report the ones with small p-values
• Change the model in various ways (we’ll see more in regression) until the p-value is small
• Do tests for lots of different subgroups until you find a group with a significant p-value
• Transform the data and try different kinds of tests
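
To see why one of these tricks is cheating, here is a rough simulation sketch of "keep adding data but stop once the p-value is small". The null is true throughout, and the function names, batch size, and maximum sample size are hypothetical choices made for this illustration.

```r
set.seed(123)

# Add data in batches and stop as soon as p < 0.05 (what NOT to do)
peek_until_significant <- function(max_n = 100, step = 10) {
  x <- numeric(0)
  while (length(x) < max_n) {
    x <- c(x, rnorm(step))  # the null (mean 0) is true
    if (t.test(x, mu = 0)$p.value < 0.05) return(TRUE)
  }
  FALSE
}

# Decide the sample size in advance and test once
test_once <- function(n = 100) t.test(rnorm(n), mu = 0)$p.value < 0.05

rate_peeking <- mean(replicate(1000, peek_until_significant()))
rate_honest <- mean(replicate(1000, test_once()))
c(peeking = rate_peeking, honest = rate_honest)
```

With a fixed plan the false positive rate should stay near 5%, while the peeking strategy rejects a true null far more often.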

• What are some best practices?
• Have a plan! Decide in advance what is going to be tested
• They call this pre-registration in clinical trials
• Be open! Keep a record of everything you try, and report all of it
• Reproducible research

## Multiple testing

• Suppose you have many different hypotheses to test, and you’re going to report all of them, or at least report how many are being tested
• This is different from only reporting the ones that are significant!
• For example, you have measurements for thousands of genes and want to do a test for each gene to see if it is associated with some kind of phenotype or medical condition
• If each test has a 5% probability of type 1 error, and the tests are independent, what is the probability of making at least one type 1 error?

$P(\text{at least one false positive out of } m \text{ tests}) = 1 - P(\text{no false positives out of } m)$

We can do this using independence!

$P(\text{no false positives out of } m) = (0.95)^m$

library(ggplot2)
library(ggthemes)  # provides theme_tufte()
m <- 1:90
FWER <- data.frame(m = m, FWER = 1 - 0.95^m)
ggplot(FWER, aes(x = m, y = FWER)) + geom_point() + theme_tufte()
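
As a sanity check on the formula, here is a small simulation sketch for one value of m. It assumes that every null is true and that the tests are independent, so each p-value can be simulated as a uniform random number; m = 20 and the number of repetitions are arbitrary choices.

```r
set.seed(2024)
m <- 20         # number of independent tests, all with a true null
n_sims <- 5000  # number of simulated "studies"

# Under a true null, a p-value is uniform on [0, 1] for a continuous test statistic
any_false_positive <- replicate(n_sims, any(runif(m) < 0.05))

simulated <- mean(any_false_positive)  # simulated probability of >= 1 type 1 error
theoretical <- 1 - 0.95^m              # formula above
c(simulated = simulated, theoretical = theoretical)
```

The two numbers should be close: with only 20 tests, the chance of at least one false positive is already well over one half.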

• Understanding errors when there are many tests is different from understanding one test
• Interpreting many p-values simultaneously is different from interpreting one at a time
• In this class we will mostly limit ourselves to dealing with simple enough cases that we don’t have to worry about this
• But it’s becoming more important as technological progress makes more data available

• Statisticians have developed methods for “adjusting” p-values when there are many to be interpreted together, for example in genomics settings
• One method is based on the “false discovery rate” (FDR)
• Given $$p$$-values for $$m$$ tests, this method picks the $$0 \leq k \leq m$$ smallest $$p$$-values for rejection, and $$k$$ is chosen in a way to make it so the expected proportion of false discoveries is small (like 5% or 10%)
• (You will not be tested on this, but it’s good to know about)
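
For the curious, base R has `p.adjust()` for this kind of adjustment. The sketch below uses ten made-up p-values (purely hypothetical, not from any real study) and compares the Benjamini-Hochberg method, one common FDR approach, with the more conservative Bonferroni correction.

```r
# Hypothetical p-values from m = 10 tests (made up for illustration)
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(p, method = "BH")
# Bonferroni adjustment (controls the chance of any false positive)
p_bonf <- p.adjust(p, method = "bonferroni")

data.frame(raw = p, BH = p_bh, bonferroni = p_bonf)

# Number of rejections at the 5% level under each approach
n_rejected <- c(raw = sum(p <= 0.05),
                BH = sum(p_bh <= 0.05),
                bonferroni = sum(p_bonf <= 0.05))
n_rejected
```

Adjusted p-values are never smaller than the raw ones, so fewer tests end up significant after adjustment.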

## Confidence intervals are better

• Confidence intervals give more information than a hypothesis test
• “There is a significant difference between the means of two groups” vs. “The CI for the difference between the group means is from 0.005 to 0.025”
• Can also be plotted!
• “A p-value is just a crude measure of sample size” (what?! see next lecture)
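
A short sketch of getting both pieces of information from the same test. The two groups are simulated (the means, standard deviation, and sample sizes are arbitrary assumptions), and `t.test()` returns the confidence interval alongside the p-value.

```r
set.seed(7)
group1 <- rnorm(50, mean = 10, sd = 2)  # simulated data, for illustration only
group2 <- rnorm(50, mean = 11, sd = 2)

result <- t.test(group1, group2)

# 95% confidence interval for the difference in means (group1 - group2)
ci <- result$conf.int
ci

# The p-value only tells us whether 0 is a plausible value for the difference
result$p.value
```

The interval shows the direction and plausible size of the difference, and the 95% interval excludes 0 exactly when the two-sided p-value is below 0.05.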