This is Jessica. In a paper to appear at AIES 2022, Sayash Kapoor, Priyanka Nanayakkara, Arvind Narayanan, Andrew, and I write:
Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for greater integration of statistical approaches to causal inference and predictive modeling.
A deeper understanding of what reproducibility critiques in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective, and help researchers avoid “the worst of both worlds,” where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML.
Our results highlight where problems discussed across the two domains stem from similar types of oversights, including overreliance on theory, underspecification of learning goals, non-credible beliefs about real-world data generating processes, overconfidence based in conventional faith in certain procedures (e.g., randomization, test-train splits), and tendencies to reason dichotomously about empirical results. In both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often difficult to refute due to underspecification of the learning pipeline. We note how many of the errors recently discussed in ML expose the cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. At the same time, the goals of ML are inherently oriented toward addressing learning failures, suggesting that lessons about irreproducibility could be resolved through further methodological innovation in a way that seems unlikely in social psychology. This assumes, however, that ML researchers take concerns seriously and avoid overconfidence in attempts to reform. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role that human inductive biases play in learning and reform.
As someone who has followed the replication crisis in social science for years and now sits in a computer science department where it’s virtually impossible to avoid engaging with the huge crushing bulldozer that is modern ML, I often find myself trying to make sense of ML methods and their limitations by comparison to estimation and explanatory modeling. At some point I started trying to organize these thoughts, then enlisted Sayash and Arvind, who had done some work on ML reproducibility, Priyanka, who follows work on ML ethics and related topics, and Andrew as an authority on empirical research failures. It was a good coming together of perspectives, and an excuse to read a lot of interesting critiques and foundational stuff on inference and prediction (we cite over 200 papers!). As a ten-page conference-style paper this was obviously ambitious, but the hope is that it will be helpful to others who have found themselves trying to understand how, if at all, these two sets of critiques relate. On some level I wrote it with computer science grad students in mind–I teach a course to first-year PhDs where I talk a little about reproducibility problems in CS research and what’s unique compared to reproducibility issues in other fields, and they seem to find it helpful.
The term learning in the title is overloaded. By “errors in learning” here we are talking not just about problems with whatever the fitted models have inferred–we mean the combination of the model implications and the human interpretation of what we can learn from them, i.e., the scientific claims being made by researchers. We break down the comparison based on whether the problems are framed as stemming from data problems, model representation bias, model inference and evaluation problems, or bad communication.

The types of data issues that get discussed are pretty different – small samples with high measurement error versus datasets that are too big to understand or document. The underrepresentation of subsets of the population to which the results are meant to generalize comes up in both fields, but with a lot more emphasis in ML on the implications for fairness in decision pipelines, given ML’s applied status. ML critics also talk about unique data issues like “harms of representation,” where model predictions reinforce some historical bias, like when you train a model to make admissions decisions based on past decisions that were biased against some group. The idea that there is no value-neutral approach to creating technology, so we need to consider normative ethical stances, is much less prevalent in mainstream psych reform, where most of the problems imply ways that modeling diverges from its ideal value-neutral status. There are clearer analogies, though, if you look at concerns about overlooking sampling error and power when assessing the performance of an ML model.
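To make the sampling-error point concrete, here’s a minimal sketch of my own (not from the paper), which treats accuracy measured on a finite test set as a binomial proportion; the accuracy value and test-set size are made up for illustration.

```python
# Minimal sketch: sampling error in a reported benchmark accuracy.
# Treat each test example as a Bernoulli trial and compute a normal-approximation
# 95% interval for accuracy on the (hypothetical) test distribution.
import math

def accuracy_interval(acc, n_test, z=1.96):
    """Normal-approximation interval for accuracy measured on n_test examples."""
    se = math.sqrt(acc * (1 - acc) / n_test)
    return acc - z * se, acc + z * se

# A half-point "improvement" over a baseline can sit well inside this interval
# when the test set is small.
print(accuracy_interval(0.912, n_test=2000))  # roughly (0.900, 0.924)
```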
Choosing representations and doing inference also obviously look different on the surface in ML versus psych, but here the parallels in the critiques that reformers are making are kind of interesting. In ML there’s nominally no need to think about the psychological plausibility of the solutions that a learner might produce; it’s more about finding the representation whose inductive bias, i.e., the properties of the solutions it tends to find, is desirable for the learning conditions. But if you consider all the work in recent years aimed at improving the robustness of models to adversarial manipulations of input data, which basically grew out of the acknowledgment that perturbations of input data can throw a classifier off completely, it’s often implicit that successful learning means the model learns a function that seems plausible to a human. E.g., some of the original results motivating the need for adversarial robustness were surprising because they showed that manipulations a human doesn’t perceive as important (like slight noising of images or masking of parts that don’t seem crucial) can cause prediction failures. Simplicity bias in stochastic gradient descent can be cast as a bad thing when it causes a model to overrely on a small set of features (in the worst case, features that correlate with the correct labels as a result of biases in the input distribution, like background color or camera angle being strongly correlated with what object is in the picture). Some recent work explicitly argues that this kind of “shortcut learning” is bad because it defies the expectations of a human, who is likely to consider multiple attributes when doing the same task (e.g., the size, color, and shape of the object). Another recent explanation is underspecification, which is related but is more about how you can have many functions that achieve roughly the same performance under a standard train-validate-test approach but whose accuracy degrades at very different rates when you probe them along some dimension that a human thinks is important, like fairness. So we can’t really escape caring about how the features of the solutions learned by a model compare to what we as humans consider valid ways to learn the task.
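To illustrate the shortcut-learning idea, here’s a toy sketch (again mine, not an example from the paper): a synthetic “spurious” feature stands in for something like background color, tracking the label almost perfectly in training but not at test time, and a simple classifier happily rides it.

```python
# Toy illustration of "shortcut learning": a spurious feature that tracks the
# label in training but not at test time. Features and probabilities are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

def make_data(spurious_agreement):
    y = rng.integers(0, 2, n)                        # true label, e.g. object class
    core = y + rng.normal(0, 1.0, n)                 # weakly predictive "real" feature
    # spurious feature (think background color) matches the label with given probability
    flip = rng.random(n) > spurious_agreement
    spurious = np.where(flip, 1 - y, y) + rng.normal(0, 0.1, n)
    return np.column_stack([core, spurious]), y

X_train, y_train = make_data(spurious_agreement=0.98)   # shortcut works in training
X_test,  y_test  = make_data(spurious_agreement=0.50)   # shortcut breaks at test time

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))   # high: model rides the shortcut
print("test accuracy: ", clf.score(X_test, y_test))     # degrades once the shortcut fails
print("coefficients:  ", clf.coef_)                     # spurious feature dominates
```

The point is just that high accuracy under the training distribution can coexist with a learned function that no human would consider a valid way to do the task.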
We also compare model-based inference and evaluation across social psych and ML. In both fields, implicit optimization–for statistical significance in psych and for better-than-SOTA performance in ML–is suggested to be a big issue. However, in contrast to the analytical solutions like MLE common in psych, optimization in ML is typically non-convex, such that the hyperparameters, initial conditions, and computational budget you use in training the model can matter a lot. One problem critics point to is that researchers don’t always acknowledge this in their reporting. How you define the baselines you test against is another source of variance, and potentially of bias if they’re chosen in a way that improves your chances of beating SOTA.
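As a toy illustration of how much the incidental details of training can matter (my own sketch, not any particular paper’s setup), here’s the same small network fit to the same synthetic data with only the random seed varied:

```python
# Sketch: identical architecture and data, only the random seed varies,
# yet test accuracy can spread across runs. Dataset and model sizes are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print("mean test accuracy:", np.mean(scores))
print("min/max across seeds:", min(scores), max(scores))
```

Reporting only the best of these runs, against a baseline that wasn’t tuned as carefully, is exactly the kind of implicit optimization critics worry about.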
In terms of high-level takeaways, we point out ways that claims are irrefutable by convention across the two fields. In ML research one could say there’s confusion about what’s a scientific claim and what’s an engineering artifact. When a paper claims to have achieved X% accuracy on benchmark YZ with some particular learning pipeline, this might be useful for other researchers to know when attempting progress on the same problem, but the results are more possibilistic than probabilistic, especially when they’re based on only one possible configuration of hyperparameters, etc., and reported with an implicit goal of showing that one’s method worked. The problem is that the claims are often stated more broadly, suggesting that certain innovations (a new training trick, a model type) led to better performance on a loosely defined learning task like ‘reading comprehension,’ ‘object recognition,’ etc. In a field like social psych, on the other hand, you have a sort of inversion of NHST as intended, where a significant p-value leads to acceptance of a loosely defined alternative hypothesis, and subject samples are often chosen by convenience and underdescribed, yet claims imply learning something about people in general.
There’s also some interesting stuff related to how the two fields fail in different ways based on unrealistic expectations about reality. Meehl’s crud factor implies that using noisy measurements, small samples, and misspecified models to argue that some class of interventions has large, predictable effects on some well-studied class of outcomes (e.g., political behavior) is out of touch with common sense about how we would expect multiple large effects to interact. In ML, the idea that we can leverage many weak predictors to make good predictions is accepted, but assumptions that distributions are stationary and that good predictive accuracy can stand alone as a measure of successful learning imply a similarly naive view of the world.
So… what can ML learn from the replication crisis in psych about fixing its problems? This is where our paper (intentionally) disappoints! Some researchers are proposing solutions to ML’s problems, ranging from fairly obvious steps like releasing all code and data, to templates for reporting on the limitations of datasets and the behavior of models, to suggestions of registered reports or preregistration. Especially in an engineering community there’s a strong desire to propose fixes when a problem becomes apparent, and we had several reviewers who seemed to think the work was only really valuable if we made specific recommendations about which psych reform methods can be ported to ML. But instead the lesson we point out from the replication crisis is that if we ignore the various sources of uncertainty we face about how to reform a field—in how we identify problematic claims, how we define the core reasons for the problems, and how we know that a particular reform will be more successful than others—it’s questionable whether we’re making real progress in reform. Wrapping up a pretty nuanced comparison with a few broad suggestions based on our instincts just didn’t feel right.
Ultimately this is the kind of paper that I’ll never feel is done to satisfaction, since there’s always some new way to look at it, or type of problem we didn’t include. There are also various parts where I think a more technical treatment would have been nice to relate the differences. But as I think Andrew has said on the blog, sometimes you have to accept you’ve done as much as you’re going to and move on from a project.