This course surveys theory and methods addressing important statistical aspects of data science, with a focus on high-dimensional data, statistical learning, and causal inference. We will begin with advances in hypothesis testing, such as control of the false discovery rate in multiple comparisons. Then we will discuss statistical theory for popular learning and model selection methods such as the lasso, including recent advances in post-selection inference. Finally, after reviewing frameworks for causal inference, the course will conclude with readings from the literature on applying statistical learning methods to causal inference.
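As a taste of the first topic, here is a minimal sketch of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. This is illustrative code, not course material; the function name and example p-values are made up for the demonstration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Return a boolean mask of hypotheses rejected under BH FDR control at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # BH step-up rule: find the largest k with p_(k) <= alpha * k / m,
    # then reject the k hypotheses with the smallest p-values.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = sorted_p <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # last (0-indexed) position passing its threshold
        reject[order[:k + 1]] = True
    return reject

# Toy example: 10 p-values, FDR controlled at 10%.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, alpha=0.10))
# Rejects the six smallest p-values: note 0.060 is rejected even though
# 0.039 alone fails its threshold, because the rule is a step-up procedure.
```

The step-up structure (scan from the largest p-value down) is what distinguishes BH from a per-test Bonferroni-style cutoff.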
Updated syllabus here.
All notebooks linked below, and their .Rmd source files, are available in this GitHub repository.
The following references are good general resources, and the parts listed below closely match the material we are covering.
Computer Age Statistical Inference: Algorithms, Evidence and Data Science by Bradley Efron and Trevor Hastie. PDF.
- Part I, especially Chapters 4 and 5.
- Chapter 7.
- Chapter 15.
- Chapter 16.
- Chapter 20.
Statistical Learning with Sparsity: The Lasso and Generalizations by Trevor Hastie, Robert Tibshirani, and Martin Wainwright. PDF.
- Chapter 2, skipping 2.8-2.9.
- Chapter 4, sections 4.1-4.3.1.
- Chapter 5, sections 5.1, 5.2, and 5.6.
- Chapter 11, sections 11.1 onward.
- Chapter 6.
Supplemental: Stats 300C lecture notes by Emmanuel Candès. Course page.
- Lectures on multiple testing, FDR, knockoffs.
See syllabus for additional readings.