Academic homepage and blog

I write about statistics, data science, and machine learning, and sometimes opine on politics and current events, data journalism, science, and academia.

I’m a statistician and data scientist with a broad range of interests including theory, applications, and teaching with the R statistical programming language. My research focuses on common practices in machine learning and data science pipelines, addressing sources and types of error that have previously been overlooked. This includes, for example:

  • Developing methods for inference after model selection, such as p-values adjusted for selection bias
  • Analyzing the social fairness of machine learning algorithms from a causal perspective

My work has been published in the Annals of Statistics and Advances in Neural Information Processing Systems (NIPS).

As a first-generation college graduate, I began my journey in higher education at a community college. I care about diversity and inclusion, and I’m happy to speak with, mentor, or help students from any background.

History

  • Assistant Professor, New York University, Department of Technology, Operations, and Statistics, 2017-present.
  • Research Fellow, Alan Turing Institute and University of Cambridge, 2016-17.
  • Ph.D. Statistics (Biostatistics trainee), Stanford University, 2016.
  • M.A. Mathematics (concentration in computational biology), Rutgers University, 2011.
  • B.S. Mathematics (summa cum laude), Western Michigan University, 2009.

Selected Honors and Awards

  • Statistics Department Teaching Award, 2014.
  • Alan M. Abrams Memorial Fellowship, 2013-2015.
  • Phi Beta Kappa.

Least squares as springs

Physics intuition for regression

During a recent "Zoom" lecture a student asked me a question about outliers. In the process of answering I realized something that I knew was true but had never seen explained in any source. This post is my first attempt to develop an analogy that connects least squares methods, like regression or PCA, to physical intuition about springs or elastics. […] To illustrate I will use data from gapminder, conveniently provided in an R package by Jenny Bryan. Consider these two variables in the dataset, GDP per capita and life expectancy, plotted here in a standard scatterplot. (To have a less busy plot with fewer points, I've subsetted the data to the year 2007 and countries in Asia.) Now we're going to bring physical intuition into this by imagining these points as … [Read More]
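To make the setup concrete, here is a minimal R sketch of the scatterplot and ordinary least squares fit described above. It assumes the gapminder and ggplot2 packages are installed; the spring reading of the residuals is only noted in a comment.

```r
# GDP per capita vs. life expectancy for Asian countries in 2007, with an
# ordinary least squares fit overlaid. Assumes gapminder and ggplot2 are installed.
library(gapminder)
library(ggplot2)

asia_2007 <- subset(gapminder, continent == "Asia" & year == 2007)

ggplot(asia_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# The same fit as an explicit model object. In the spring analogy, each
# residual is the stretch of a spring connecting a point vertically to the line.
fit <- lm(lifeExp ~ gdpPercap, data = asia_2007)
summary(fit)
```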

A concise defense of statistical significance

A letter, signed by over 800 scientists and published in Nature, called for an end to using p-values to decide whether data refutes or supports a scientific hypothesis. The letter has received widespread coverage and reignited an old debate. […] Most of the objections to p-values or the p < 0.05 threshold in these articles can be summarized into two categories: […] Banning p-values or “p < 0.05” thresholds wouldn’t address these objections. We will still have to make decisions; we can’t just report a Bayes factor (or a p-value) and refuse to decide whether a drug trial should continue or not. So our decisions will still sometimes be wrong, and in both directions. […] The last kind of objection is more sensible–though less often the focal point of this debate–and I … [Read More]

Data for good talk at Columbia Data Science Institute

(Note: links don’t work in this preview; click on the post to view.) I’m happy to be speaking at 1pm EST today at Columbia University on the topics of causal inference and selection bias in algorithmic fairness. I believe video will be available at the webinar link, and here are my slides. The talk is based on work described in this survey with my coauthors Matt Kusner, Chris Russell, and Ricardo Silva. See here for the video of Matt’s oral presentation of our first paper in this line of work at NIPS 2017. [Read More]

Russian Twitter trolls attacked Bernie too

You may have seen stories about Twitter accounts operated by Russians attempting to influence the 2016 election in the United States. Much of the reporting that I’ve seen described a Simple Narrative: Russians tried to help Trump and hurt Clinton, even supporting Bernie Sanders in order to attack Clinton. I’ve also seen plenty of Democrats on Twitter attacking Sanders over this. I have not seen any stories reporting the fact that many of these bots also attacked Sanders. (If you’re aware of such stories, I’d be happy to hear about them). […] Recently, FiveThirtyEight published a repository with roughly 3 million tweets from about 2,800 accounts that Twitter concluded were associated with the Russian effort. About the data: […] The data set is the work of two professors at … [Read More]
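As a rough sketch of how one might start looking for tweets that mention Sanders in that release (not the analysis from the post), the snippet below reads one of the CSV files and counts matching tweets per account. The file name and column names are assumptions; check the actual FiveThirtyEight repository for the real ones.

```r
# Hypothetical sketch: count tweets mentioning Sanders per account.
# File and column names (content, author) are assumed, not verified.
library(readr)
library(dplyr)

troll_tweets <- read_csv("IRAhandle_tweets_1.csv")  # assumed local copy of one file

troll_tweets %>%
  filter(grepl("bernie|sanders", content, ignore.case = TRUE)) %>%
  count(author, sort = TRUE) %>%
  head(10)
```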

A conditional approach to inference after model selection

Model selection can invalidate inference, such as significance tests, but statisticians have recently made progress developing methods to adjust for this bias. One approach uses conditional probability, adjusting inferences by conditioning on selecting the chosen model. This post motivates the conditional approach with a simple screening rule example and introduces the selectiveInference R package that can compute adjusted significance tests after popular model selection methods like forward stepwise and LASSO. [Read More]
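As a rough sketch of that workflow (a toy example, not code from the post), the snippet below simulates data, runs forward stepwise selection with the selectiveInference package, and asks for selection-adjusted p-values.

```r
# Toy example of conditional (selective) inference after forward stepwise.
# Assumes the selectiveInference package is installed.
library(selectiveInference)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 0.5 * x[, 1] + rnorm(n)   # only the first variable has a real effect

fsfit <- fs(x, y)   # forward stepwise selection path
fsInf(fsfit)        # significance tests adjusted for the selection event
```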

Algorithmic fairness is as hard as causation

This post describes a simple example that illustrates why algorithmic fairness is a hard problem. I claim it is at least as hard as doing causal inference from observational data, i.e. distinguishing between mere association and actual causation. In the process, we will also see that SCOTUS Chief Justice Roberts has a mathematically incorrect theory on how to stop discrimination. Unfortunately, that theory persists as one of the most common constraints on fairness. [Read More]

Model selection bias invalidates significance tests

People often do regression model selection, either by hand or using algorithms like forward stepwise or the lasso. Sometimes they also report significance tests for the variables in the chosen model. After all, a significant p-value means they’ve found something real. But there’s a problem: the reason for that significant p-value may just be something called model selection bias. This bias can invalidate inferences done after model selection, and may be one of the contributors to the reproducibility crisis in science. Adjusting inference methods to account for model selection is an area of ongoing research where I have done some work. [Read More]
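A small simulation (an illustration, not code from the post) makes the bias concrete: every predictor below is pure noise, yet a naive t-test applied only to the best-looking selected variable rejects far more often than the nominal 5% level.

```r
# Simulate pure-noise data, select the predictor most correlated with y,
# and compute the naive (unadjusted) p-value for that selected variable.
set.seed(1)
n <- 50; p <- 20; reps <- 2000
selected_pvals <- replicate(reps, {
  x <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                      # y is unrelated to every column of x
  best <- which.max(abs(cor(x, y)))  # model selection step
  summary(lm(y ~ x[, best]))$coefficients[2, 4]
})
mean(selected_pvals < 0.05)  # well above 0.05: naive tests are invalid here
```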
