## Outline

• Interpreting p-values
• P-hacking
• Multiple testing
• Confidence intervals are better
• Statistical vs practical significance
• How large of a sample is enough?

## Interpreting p-values

• p-values are everywhere (Fisher would have over 3 million citations for them)
• It’s important to understand what they mean
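```r
# Keep only the rows with no missing values
movies <- bechdel[complete.cases(bechdel), ]
# Return: international gross divided by budget (both in 2013 dollars)
movies$return <- movies$intgross_2013 / movies$budget_2013
# Compare the distribution of returns between the two levels of `binary`
wilcox.test(return ~ binary, data = movies)
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  return by binary
## W = 299610, p-value = 0.0412
## alternative hypothesis: true location shift is not equal to 0
```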
• This p-value is less than 0.05, so we would reject the null hypothesis at a 5% significance level
• In general, for any significance level higher than the p-value you would reject the null
• P-values can be thought of as the cutoff for significance: 0.0412 is the smallest significance level at which we would still reject the null
• Another interpretation: the p-value is the probability, assuming the null is true, of a test statistic value at least as extreme as the observed value
• The more extreme the test statistic value, the greater the evidence against the null hypothesis

• The definition of “extreme” depends on the alternative hypothesis. For one-sided alternatives, it is the extreme in the direction of that one side. For example, if the null is $$\mu = \mu_0$$ and the alternative is $$\mu > \mu_0$$, then extreme would correspond to $$\bar X$$ being much higher than $$\mu_0$$ (not much lower)
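To make "at least as extreme" concrete, here is a minimal sketch for a z-test with known variance. The observed statistic `z_obs` is a made-up value chosen for illustration; it does not come from the movie data above.

```r
# Hypothetical observed z statistic (illustrative value, not from the bechdel data)
z_obs <- 1.8

# One-sided alternative mu > mu_0: "extreme" means large positive values
p_greater <- pnorm(z_obs, lower.tail = FALSE)

# One-sided alternative mu < mu_0: "extreme" means large negative values
p_less <- pnorm(z_obs)

# Two-sided alternative mu != mu_0: "extreme" means far from 0 in either direction
p_two_sided <- 2 * pnorm(-abs(z_obs))

round(c(greater = p_greater, less = p_less, two_sided = p_two_sided), 4)
```

With this particular value, the one-sided p-value in the direction of the data falls below 0.05 while the two-sided p-value does not, which is why the alternative has to be chosen before looking at the data.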

• p-values are not the probability that the null hypothesis is true
• p-values are not the probability that the null hypothesis is true
• p-values are not the probability that the null hypothesis is true
• p-values are not the probability that the outcome is due to chance
• p-values are not the probability that the outcome is due to chance
• p-values are not the probability that the outcome is due to chance
• The null hypothesis is either true or false; it is not random
• These misinterpretations are astonishingly common…
• They can be avoided by remembering which things are considered random: the data, the test statistic, but not the hypothesis

## P-hacking

• P-hacking means using various tricks to try to get a p-value to be small (less than 0.05)
• I have to tell you what it is so you know what not to do
• Don’t do it
• It’s a form of cheating
• There are lots of embarrassing examples of it coming back to bite people

• Examples of what not to do
• Keep asking slightly different questions until the p-value is small
• Keep adding new data to your sample but stop once the p-value is small
• Remove certain data points (call them “outliers”) until the p-value is small
• Measure and test lots of variables, but only report the ones with small p-values
• Change the model in various ways (we’ll see more in regression) until the p-value is small
• Do tests for lots of different subgroups until you find a group with a significant p-value
• Transform the data and try different kinds of tests
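The "keep adding data but stop once the p-value is small" trick can be checked by simulation. The sketch below is one possible setup, not a standard recipe: the null is true in every simulated study, and the batch size of 10, the maximum of 100 observations, and the function names are all arbitrary choices made for illustration.

```r
set.seed(1)

# One simulated "study": the null (mean = 0) is true, and we peek at the
# p-value after every 10 new observations, stopping as soon as p < 0.05.
peek_until_significant <- function(max_n = 100, step = 10) {
  x <- numeric(0)
  while (length(x) < max_n) {
    x <- c(x, rnorm(step))
    if (t.test(x)$p.value < 0.05) return(TRUE)  # "found" an effect that is not there
  }
  FALSE
}

# One honest study: fix n = 100 in advance and test exactly once
test_once <- function(n = 100) t.test(rnorm(n))$p.value < 0.05

n_sim <- 2000
rate_peeking <- mean(replicate(n_sim, peek_until_significant()))
rate_fixed   <- mean(replicate(n_sim, test_once()))
c(peeking = rate_peeking, fixed_n = rate_fixed)
```

The fixed-sample rate should land near the nominal 5%, while the peeking rate comes out noticeably higher, even though no single test was done "wrong".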

• What are some best practices?
• Have a plan: decide in advance what is going to be tested
• They call this pre-registration in clinical trials
• Be open: keep a record of everything you try, and report all of it
• Reproducible research

## Multiple testing

• Suppose you have many different hypotheses to test, and you’re going to report all of them, or at least report how many are being tested
• This is different from only reporting the ones that are significant!
• If each test has a 5% probability of type 1 error, the tests are independent, and all the null hypotheses are true, what is the probability of making at least one type 1 error?

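$P(\text{at least one false positive out of } m \text{ tests}) = 1 - P(\text{no false positives out of } m \text{ tests})$

Because the tests are independent, the probability of no false positives is a product of $m$ factors of 0.95:

$P(\text{no false positives out of } m \text{ tests}) = (0.95)^m$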

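```r
library(ggplot2)
library(ggthemes)  # provides theme_tufte()

# Probability of at least one type 1 error among m independent tests at the 5% level
m <- 1:90
FWER <- data.frame(m = m, FWER = 1 - 0.95^m)
ggplot(FWER, aes(x = m, y = FWER)) + geom_point() + theme_tufte()
```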
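The formula can be sanity-checked by simulation. This sketch relies on the fact that, when the null is true and the test statistic is continuous, the p-value is uniformly distributed on (0, 1); the choice of `m = 20` tests and 10,000 simulated studies is arbitrary.

```r
set.seed(1)
m <- 20          # number of independent tests (arbitrary choice for illustration)
n_sim <- 10000   # number of simulated "studies"

# Under a true null, a p-value from a continuous test statistic is Uniform(0, 1)
p_values <- matrix(runif(n_sim * m), nrow = n_sim, ncol = m)

# A study makes at least one type 1 error if its smallest p-value is below 0.05
fwer_simulated <- mean(apply(p_values, 1, min) < 0.05)
fwer_formula   <- 1 - 0.95^m

c(simulated = fwer_simulated, formula = fwer_formula)
```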

• Understanding errors when there are many tests is different from understanding one test
• Interpreting many p-values simultaneously is different from interpreting one at a time
• In this class we will mostly limit ourselves to dealing with simple enough cases that we don’t have to worry about this
• But it’s becoming more important as technological progress makes more data available

• Statisticians have developed methods for “adjusting” p-values when there are many to be interpreted together, for example in genomics settings
• Two of the most common are called “Bonferroni correction” and “false discovery rate” (FDR)
• You will not be tested on this, but it’s good to have heard about it once so you can remember to look it up or ask about it if you ever find yourself needing to interpret many $$p$$-values together
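For reference, base R exposes both adjustments through `p.adjust()` in the `stats` package. The five p-values below are invented for illustration, and `"BH"` (Benjamini-Hochberg) is one common way of controlling the false discovery rate.

```r
# Hypothetical p-values from five tests (made up for illustration)
p_raw <- c(0.001, 0.008, 0.012, 0.041, 0.20)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative
p_bh <- p.adjust(p_raw, method = "BH")

data.frame(raw = p_raw, bonferroni = p_bonf, BH = p_bh)

# Number of tests still significant at the 5% level under each approach
c(raw = sum(p_raw < 0.05), bonferroni = sum(p_bonf < 0.05), BH = sum(p_bh < 0.05))
```

In this made-up example, four raw p-values are below 0.05, but fewer survive adjustment, and Bonferroni keeps fewer than Benjamini-Hochberg.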

## Confidence intervals are better

• Compare “There is a significant difference between the means of two groups” with “The CI for the difference between the groups is from 0.005 to 0.025”: the second statement also tells you how large the difference plausibly is
• Can also be plotted!
• Issues with sample sizes: when samples are large, significant $$p$$-values may not be interesting (see next section)
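As a small sketch of the contrast, `t.test()` returns both a p-value and a confidence interval for the difference in means. The data here are simulated purely for illustration, with a true difference of 0.5.

```r
set.seed(1)
# Simulated data, purely for illustration: true difference in means is 0.5
group_a <- rnorm(50, mean = 0.5)
group_b <- rnorm(50, mean = 0)

result <- t.test(group_a, group_b)
result$p.value    # only says whether the difference is distinguishable from 0
result$conf.int   # gives a range of plausible values for the difference
```

The two outputs always agree about significance: the 95% interval excludes 0 exactly when the two-sided p-value is below 0.05. The interval simply carries more information.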

## Statistical significance and practical significance

• Suppose you read this headline: Diet X is associated with lower risk of cancer
• You check out the study: the null hypothesis is no association, and the $$p$$-value is $$<0.00001$$
• Very significant result!
• But what if the risk reduction was, e.g., from 2.5% to 2.47% risk?
• The result is highly statistically significant, but not very practically significant
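The arithmetic behind the example is worth spelling out. This short sketch uses only the two risk figures quoted above; the variable names are illustrative.

```r
risk_without <- 0.025    # 2.5% risk without diet X (from the example above)
risk_with    <- 0.0247   # 2.47% risk with diet X

absolute_reduction <- risk_without - risk_with           # in probability units
relative_reduction <- absolute_reduction / risk_without  # as a fraction of baseline

absolute_reduction * 100   # percentage points: 0.03
relative_reduction * 100   # percent of baseline risk: 1.2
```

A tiny p-value says nothing about either number; it only says the reduction is unlikely to be exactly zero.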

• (Statistical) power is the probability of rejecting the null hypothesis when it is not true
• Important to think about effect size
• Effect size, one sample: $$\mu/\sigma$$
• Effect size, two groups: $$(\mu_1 - \mu_2)/\sigma$$
• Large effect size and/or large sample size lead to high power

• Note: to keep the formulas simple for now, we assume the variances are known and equal to 1, so we use the normal distribution instead of the t-distribution
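To see how effect size and sample size drive power, here is a rough sketch using `power.t.test()` from base R's `stats` package. Note that it works with the t-distribution rather than the known-variance simplification in the note above, and the values of `n` and `d` in the grid are arbitrary choices for illustration.

```r
# Power of a two-sided, two-sample t-test at the 5% level, for a given
# per-group sample size n and effect size d = (mu_1 - mu_2) / sigma
power_for <- function(n, d) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power
}

# A small grid of sample sizes and effect sizes (illustrative values)
grid <- expand.grid(n = c(20, 100, 500), d = c(0.1, 0.5))
grid$power <- mapply(power_for, grid$n, grid$d)
grid
```

Reading across the table, power rises with `n` for a fixed `d`, and rises with `d` for a fixed `n`.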

• Let’s be more mathematical: consider an example for differences between groups
• Suppose the true difference is $$\mu_1 - \mu_2 = 0.01$$
• If the sample size is very large, the test will almost certainly reject the null hypothesis
• But is that really useful information? It depends on whether $$0.01$$ is large enough to be of any practical importance

```r
# Two large samples whose true means differ by only 0.01 (sd = 1 in both)
group1 <- rnorm(1000000, mean = 0.01)
group2 <- rnorm(1000000, mean = 0)
t.test(group1, group2)
##
##  Welch Two Sample t-test
##
## data:  group1 and group2
## t = 5.646, df = 2e+06, p-value = 1.642e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005209812 0.010750209
## sample estimates:
##    mean of x    mean of y
## 0.0082484169 0.0002684068
```
```r
# Overlay the two densities, N(0.01, 1) and N(0, 1)
range <- data.frame(x = c(-2, 2))
ggplot(range, aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0.01, sd = 1)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  theme_tufte() +
  ggtitle("Two significantly different distributions?")
```

• Below we consider an even smaller effect of $$\mu = 0.001$$
• Plot the probability of rejecting the null at 5% significance (one-sided test) as a function of sample size $$n$$
```r
# One-sided z-test at the 5% level, known sd = 1
c_null <- qnorm(.95)  # critical value under the null
mu <- 0.001           # true mean under the alternative
powern <- function(n) {
  1 - pnorm(c_null - mu * sqrt(n))
}
range <- data.frame(n = 10^c(1:7))
ggplot(range, aes(n)) +
  stat_function(fun = powern) + theme_tufte() +
  ylab("Power") +
  ggtitle("Power as a function of sample size, mu = 0.001")
```

• Hypothesis tests are still useful if you must make a decision, e.g. A/B testing, summarizing the conclusion of a scientific study, etc
• But beware: very large sample sizes might mean any test you do will be significant
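The same power formula can be inverted to ask how large a sample would be needed to detect the tiny effect reliably. This is a sketch under the assumptions of the one-sided z-test with known variance 1; the 80% power target is a conventional choice and not something fixed by the text.

```r
mu    <- 0.001        # true effect, as in the power plot for mu = 0.001
alpha <- 0.05         # one-sided significance level
target_power <- 0.8   # conventional target (an assumption, not from the text)

# Solve 1 - pnorm(qnorm(1 - alpha) - mu * sqrt(n)) = target_power for n
n_needed <- ((qnorm(1 - alpha) + qnorm(target_power)) / mu)^2
ceiling(n_needed)

# Check: plugging n back into the power formula recovers the target
1 - pnorm(qnorm(1 - alpha) - mu * sqrt(n_needed))
```

The answer is a little over six million observations, which is exactly the regime where "significant" stops implying "important".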

• Let’s also look at the “power function” for $$n = 100$$ and a two sided alternative

```r
# Two-sided z-test at the 5% level with n = 100, so sqrt(n) = 10
c_null <- qnorm(.975)
powermu <- function(mu) {
  pnorm(-c_null - mu * 10) + pnorm(c_null - mu * 10, lower.tail = FALSE)
}
range <- data.frame(mu = seq(from = -.5, to = .5, length.out = 100))
ggplot(range, aes(mu)) +
  stat_function(fun = powermu) + theme_tufte() +
  ylab("Power") +
  ggtitle("Power as a function of true mean (two-sided), n = 100")
```
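A few spot checks help in reading the curve. The block below redefines the same `powermu` function so that it runs on its own; the evaluation points are arbitrary.

```r
# Same two-sided power function as above: n = 100, known sd = 1, 5% level
c_null <- qnorm(.975)
powermu <- function(mu) {
  pnorm(-c_null - mu * 10) + pnorm(c_null - mu * 10, lower.tail = FALSE)
}

powermu(0)                      # at the null, power equals the significance level, 0.05
powermu(c(-0.3, 0.3))           # symmetric in mu for a two-sided test
powermu(c(0.1, 0.2, 0.3, 0.5))  # increases toward 1 as mu moves away from 0
```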