Hypothesis testing

The basic idea

  • Null hypothesis: the goal of scientific discovery is to try to disprove this.
  • Test statistic computed from data
  • The null hypothesis tells us the distribution of the test statistic
  • If the observed value seems very extreme, reject the null hypothesis

  • Why the name “null”? Usually this hypothesis will mean that nothing interesting is going on. People sometimes refer to it as the “dull hypothesis” for this reason.
  • Skeptical approach: when you think you’ve seen a pattern, assume it’s actually just due to randomness, and only change this belief if you can quantify sufficient evidence to disprove it

  • Coin toss example: the number of heads \(X\) in 10 coin tosses is Bin(10, \(p\)). If \(p = 1/2\), what is \(P(X \geq 8)\)?

1 - pbinom(7, 10, .5)
## [1] 0.0546875

So if the coin is fair, an outcome at least as extreme as 8 heads in 10 tosses is pretty rare, occurring only about 5.5% of the time.
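As a sanity check on the exact calculation, the same tail probability can be approximated by simulation. This is only a sketch: the seed and the number of simulated experiments (`n_sims`) are arbitrary choices, not part of the original example.

```r
# Simulate many experiments of 10 fair coin tosses and count how often
# we see 8 or more heads. The seed and n_sims are arbitrary assumptions.
set.seed(1)
n_sims <- 100000
heads <- rbinom(n_sims, size = 10, prob = 0.5)

sim_tail   <- mean(heads >= 8)                         # simulated P(X >= 8)
exact_tail <- pbinom(7, 10, 0.5, lower.tail = FALSE)   # same as 1 - pbinom(7, 10, .5)

c(simulated = sim_tail, exact = exact_tail)
```

The two numbers should agree up to simulation noise, which shrinks as `n_sims` grows.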

Types of errors

  • We are reasoning statistically or probabilistically, using data that may be noisy or imperfect
  • We want to avoid committing errors, but can’t guarantee this 100%
  • What if the null hypothesis is true, but we reject it? This is a type 1 error or false positive
  • If the null hypothesis is false, but we fail to reject it, that’s a type 2 error or false negative
  • Our first goal will be to guarantee that the probability of a type 1 error (false positive) is low, say 5%
  • Our second goal will be to choose, among all possible tests that satisfy the first goal, the one that has the lowest probability of type 2 error (false negative). The power of a test is 1 minus its probability of type 2 error, so such a test is called the most powerful test
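To make the two error types concrete, here is a rough sketch for the coin toss setting. The rejection rule (reject when there are 9 or more heads) and the alternative value `p_alt = 0.8` are hypothetical choices made for illustration only.

```r
# Hypothetical rule: reject "the coin is fair" when X >= cutoff heads out of 10.
cutoff <- 9

# Type 1 error: probability of rejecting when the null (p = 0.5) is true.
alpha <- pbinom(cutoff - 1, 10, 0.5, lower.tail = FALSE)

# Power: probability of rejecting when the coin is actually biased.
# p_alt = 0.8 is an assumed alternative, not a value from the notes.
p_alt <- 0.8
power <- pbinom(cutoff - 1, 10, p_alt, lower.tail = FALSE)

# Type 2 error: probability of failing to reject under that alternative.
beta <- 1 - power

c(type1 = alpha, type2 = beta, power = power)
```

Lowering `cutoff` raises the power but also raises the type 1 error, which is why the first goal fixes the type 1 error before the second goal compares tests.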

Formalities: definitions and notation

Significance level is determined by the context or convention, often 5% or 1%. It plays a role analogous to the confidence level for intervals: a 5% significance level corresponds to a 95% confidence level. For simplicity we’ll use 5% throughout.

Hypotheses for most common scenarios

  • Testing a parameter \(\theta\)
  • Null hypothesis \(H_0: \theta = \theta_0\)
  • Alternative hypothesis (simple) \(H_1 : \theta = \theta_1\) for some \(\theta_1 \neq \theta_0\)
  • Alternative hypothesis (composite) \(H_1: \theta \neq \theta_0\) (two-sided), or \(\theta > \theta_0\) (greater), or \(\theta < \theta_0\) (less)


  • Use data to compute a statistic \(T\). The choice of statistic will usually be determined by the specific example; past research by statisticians has worked out the details of the “best” statistic in many cases.
  • The distribution (cdf) of \(T\), if the null hypothesis is true, is \(P_{H_0}(T \leq t)\) or \(P_0(T \leq t)\)
  • Terminology: this is often called the distribution “under the null”
  • One-sided (greater): suppose \(P_0(T > t_{.95}) = 0.05\). If we observe \(T > t_{.95}\) then we reject the null hypothesis. Otherwise (if \(T \leq t_{.95}\)) we fail to reject the null.
  • Two-sided: suppose \(P_0(-t_{.975} \leq T \leq t_{.975}) = 0.95\). If we observe \(T\) outside the interval \([-t_{.975}, t_{.975}]\) then we reject the null hypothesis, otherwise fail to reject.
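The rejection rules above can be sketched in R. This example assumes, purely for illustration, that \(T\) is standard normal under the null, and the observed value `t_obs = 1.8` is hypothetical.

```r
alpha <- 0.05
t_obs <- 1.8   # hypothetical observed test statistic

# Critical values, assuming T ~ N(0, 1) under the null (an assumption).
crit_one <- qnorm(1 - alpha)       # plays the role of t_{.95}
crit_two <- qnorm(1 - alpha / 2)   # plays the role of t_{.975}

reject_one <- t_obs > crit_one         # one-sided (greater) test
reject_two <- abs(t_obs) > crit_two    # two-sided test

c(one_sided = reject_one, two_sided = reject_two)
```

With this particular value the one-sided test rejects while the two-sided test does not, because the two-sided test splits the 5% between both tails and so needs a more extreme statistic.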


  • The \(p\)-value is the probability, under the null, that the test statistic is at least as extreme as the observed value
  • e.g. Suppose \(T = 2.8\) is computed from the data. Then the \(p\)-value for a one-sided (greater) test is \(P_0(T \geq 2.8)\).
  • e.g. Same as above, but two-sided: \(P_0(|T| \geq 2.8)\) or \(P_0(T \geq 2.8) + P_0(T \leq -2.8)\).
  • Tail area starting from the observed value of the test statistic
  • Low \(p\)-values correspond to rejecting the null
  • If \(p < 0.05\) then the test rejects at the 5% significance level
  • If \(p < 0.01\), reject at the 1% significance level
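For the \(T = 2.8\) example, the tail areas can be computed once a null distribution is fixed. The notes do not say which distribution \(T\) follows, so this sketch assumes a standard normal null.

```r
t_obs <- 2.8

# Assumption: T ~ N(0, 1) under the null hypothesis.
p_one <- pnorm(t_obs, lower.tail = FALSE)            # one-sided (greater)
p_two <- 2 * pnorm(abs(t_obs), lower.tail = FALSE)   # two-sided, both tails

c(one_sided = p_one, two_sided = p_two)

# Decisions at the 5% and 1% significance levels
c(reject_5pct = p_two < 0.05, reject_1pct = p_two < 0.01)
```

For a symmetric null distribution like this one, the two-sided \(p\)-value is exactly twice the one-sided value.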


Fair coin toss or cheating?

qbinom(.95, 10, .5)
## [1] 8
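One way to read this quantile: since `qbinom(.95, 10, .5)` is 8, a one-sided test at the 5% level rejects only when the number of heads exceeds 8. The sketch below checks this and compares it with base R's `binom.test()`, assuming the observed count is 8 heads in 10 tosses and the alternative is a coin biased toward heads.

```r
# Assumed data: 8 heads in 10 tosses; alternative: p > 1/2.
res <- binom.test(8, 10, p = 0.5, alternative = "greater")
res$p.value    # equals P(X >= 8) under the null

# Critical value and the actual level of the rule "reject if X > crit".
crit <- qbinom(0.95, 10, 0.5)
level <- pbinom(crit, 10, 0.5, lower.tail = FALSE)   # P(X > crit) under the null

c(critical_value = crit, level = level, reject = res$p.value < 0.05)
```

Because the binomial distribution is discrete, the achievable level is below 5% rather than exactly 5%, and 8 heads falls just short of rejection.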

Testing for one mean

If \(\text{Var}(X_i) = \sigma^2\) is known, test is based on the normal distribution

\[ P_0\left (-z_{.975} < \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} < z_{.975} \right) = 0.95 \]

  • Reject if \(|\bar X - \mu_0|/(\sigma/\sqrt{n})\) is larger than \(z_{.975} \approx 1.96\)
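A minimal sketch of this two-sided \(z\)-test, computed by hand on simulated data. The values of `mu0`, `sigma`, `n`, the seed, and the true mean used to generate the data are all made up for illustration.

```r
set.seed(42)   # arbitrary seed
mu0   <- 10    # hypothesized mean (assumed)
sigma <- 2     # known standard deviation (assumed)
n     <- 25

# Simulated sample whose true mean (11) differs from mu0.
x <- rnorm(n, mean = 11, sd = sigma)

z <- (mean(x) - mu0) / (sigma / sqrt(n))           # test statistic
reject  <- abs(z) > qnorm(0.975)                   # compare with z_{.975}
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value

c(z = z, p_value = p_value, reject = reject)
```

Comparing \(|z|\) with \(z_{.975}\) and comparing the two-sided \(p\)-value with 0.05 always give the same decision.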

If \(\sigma\) is unknown (most common in real data examples), use \(t\)-test

\[ P_0\left (\frac{\bar X - \mu_0}{S/\sqrt{n}} < t_{n-1,.95} \right) = 0.95 \]

  • One-sided (greater): reject if \((\bar X - \mu_0)/SE\), where \(SE = S/\sqrt{n}\), is larger than \(t_{n-1,.95}\). For example, if \(n = 100\) the critical value is

qt(.95, 99)
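## [1] 1.660391

# Assumption: `flights` is the nycflights13 data set; %>%, filter() and pull() come from dplyr
library(dplyr)
library(nycflights13)
X <- flights %>% filter(carrier == "HA") %>% pull(air_time)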
c(mean(X), sd(X))
## [1] 623.08772  20.68882
t.test(X, mu = 620, alt = "greater")
##  One Sample t-test
## data:  X
## t = 2.76, df = 341, p-value = 0.003046
## alternative hypothesis: true mean is greater than 620
## 95 percent confidence interval:
##  621.2426      Inf
## sample estimates:
## mean of x 
##  623.0877

Since this \(p\)-value (about 0.003) is below 0.05, we reject the null hypothesis that the average flight time is 620 minutes, in favor of the alternative that it is greater.
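The `t.test` output can be reproduced by hand from the summary statistics printed above. This sketch takes the mean and standard deviation from the output and infers \(n = 342\) from the reported 341 degrees of freedom. Small differences from the printed values are possible because the inputs are rounded.

```r
xbar <- 623.08772   # sample mean, from the output above
s    <- 20.68882    # sample standard deviation, from the output above
n    <- 342         # inferred from df = 341
mu0  <- 620         # null value

se      <- s / sqrt(n)                              # standard error
t_stat  <- (xbar - mu0) / se                        # test statistic
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # one-sided (greater)

# Lower end of the one-sided 95% confidence interval
lower <- xbar - qt(0.95, df = n - 1) * se

c(t = t_stat, p_value = p_value, lower = lower)
```

The one-sided interval excludes 620 exactly when the one-sided test rejects at the 5% level, which previews the relation between tests and intervals.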

Relation to intervals