The mysteries of statistics
For many, statistics is one of the few fields of study that leave one feeling more mystified the more one learns of them. I would guess that many chalk this up to their own limitations. Anyway, shouldn't a science like statistics be self-correcting?
'Self-correcting' is a very flattering way to describe the way scientific theories change. It encourages one to think of science as a closed system with error-detecting mechanisms that spring into action before too much damage can be done. But the nice thing about science is surely not that it is self-correcting but, more simply, that it may still have the capacity to correct mistaken ideas even at the expense of many experts' pride and prestige.
Here is a claim that is sure to prick someone's pride:
The methods of modern statistics...are founded on a logical error. These methods are not just wrong in a minor way...They are simply and irredeemably wrong. They are logically bankrupt, with severe consequences for the world of science that depends on them.
So begins Aubrey Clayton's Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science (Columbia University Press, 2021). The book is written less as scholarship than as 'a piece of wartime propaganda, designed to be printed on leaflets and dropped from planes over enemy territory' (p. xv). The leaflets are for researchers whose life's work is dominated by modern statistical methods.
Perhaps you are one of them. If so, Clayton has a lot to say to you but, in the end, I think he has one big point which deserves a broader audience than his book will be able to reach. That one point is what he calls Bernoulli's fallacy. Clayton provides many examples from law, science, and games. Here I've tried to put together a review of Bernoulli's Fallacy that doesn't overtly use any statistics or equations (though the book has plenty of numerical examples). After sharing some of the big ideas from the book I'll describe what I think is still missing from it, namely a critique of the empiricist philosophy standing behind the failures of statistics as a discipline of methodological specialists.
If it's not sufficiently clear, I'll just add that Clayton clearly loves probability and science. So far as I can tell, if he's a harsh critic it's only because his high expectations have been disappointed and he would like to see some serious problems corrected.
Objectivity and probability
Before getting into the substance of the book, it will help to introduce one concept that appears over and over again in debates about probability, including in Bernoulli's Fallacy.
The core question here is about objectivity—what does it mean for an analysis or its method to be objective? One answer is that an objective method is one that is both transparently stated and fixed, so that one at least knows when the rules are being applied and when not. One can apply it equally to contending ideas.1 Moreover, only if one can extract the logical structure from an argument ('objectifying' it) can that logic be subjected to scrutiny. According to one theory, probability is a part of logic and it consists of formal logical structures, a standard of sorts, that can be used to make assessments of arguments and evidence. Here, objectivity is about logic and critical analysis. This is the view that Clayton favors.
There are other meanings to the word 'objective'. One of them refers to things that exist independently of our thoughts. Trees and earth have objective existence, as do symmetry and light; beauty does not. According to the single most influential theory in statistics, probabilities have objective existence in just this sense. The probability of an event is its long-run rate of occurrence. This is known as the frequency theory of probability and its greatest defenders include Venn, von Mises, and Fisher. Probability is said to be a very scientific term which need not resemble the meaning of 'probability' in common parlance, law, and so forth.
Defenders of the frequency theory ardently object to the claim that probability theory is mere logic2. They have been known to doubt whether there could even be a logic of induction. The frequency theory is a cordon that is supposed to keep inductive logic out of probability theory. The rub is that none of these beliefs ever diminished the will of (frequency-theory) statisticians to tell scientists how to learn from data. If anything, these ideas seemed only to strengthen their confidence and to lend to their inductive practices an air of objectivity (in the second sense).
If probability is objective like logic then it can be used to probe ideas, to construct arguments, and to explode them. If probabilities are objective like stones then it would seem pointless to argue with them. In that sense the 'objective' frequency theory of probability is distinctly soporific.
Barbarians with pencils
The soporific power of 'objective' probabilities was, on Clayton's telling, in no sense incidental to the development of modern statistical methods. It was part of an effort to promote the ideology of eugenics. The eugenics movement consisted mostly of British and American men of the professional class arguing that men of their class and nation deserved special privileges over others due to their innate qualities, and that some ambiguously defined 'races' or 'racial stocks' were best exterminated. Francis Galton, Karl Pearson, Ronald Fisher, and others in the eugenics movement preferred to portray their social ideology as objective fact.
Consider Pearson's study of students at the Jews' Free School in East London. He and his coauthor Margaret Moul examined 600 children, Clayton says, 'to see if it would be appropriate for the British government to prejudicially deny entry to Jews', who Pearson feared 'will develop into a parasitic race' (145). Examining table after table of averages and correlations of variables such as vision, home cleanliness, and skull shapes, they write:
we must admit that...we have not reached close correlations...But in breaking what we believe to be new ground we have come across indications that such correlations probably exist [between vision and either eye color or head shape]...there is far more hope of showing vision as a function of anthropometric characters than a product of environment. In other words, it is a question of race, rather than of immediate surroundings. (264)
They found 'indications' that some correlations may exist. Not impressive but enough in their view to support spiteful political action.
Clayton rightly points to Pearson's flexibility when it comes to determining desirable traits, readily altering his position to accommodate his antisemitism, as when saving money became bad only after he found that Jews did it. See Stephen Jay Gould's The Mismeasure of Man for a more intensive study of the eugenics literature.
The eugenicists' beliefs were unshakable. After the catastrophe of Nazism—the ultimate test of eugenics—Fisher felt this way (p. 158):
I have no doubt also that the [Nazi] party sincerely wished to benefit the German racial stock, especially by the elimination of manifest defectives, such as those deficient mentally, and I do not doubt that von Verschuer gave, as I should have done, his support to such a movement.
He wrote this in defense of the Nazi Otmar Freiherr von Verschuer who, Clayton reminds us, 'used data collected by [his mentee Josef] Mengele in his Auschwitz experiments' (158).
Free Sally Clark!
This history of eugenics and statistical methods is important and intriguing but not central to Clayton's argument. It may even consume more pages than the argument warrants, because the central thesis of Bernoulli's Fallacy is that the problem with (orthodox) statistics stems from a single logical fallacy repeated over and over again. Clayton attributes the mistake to Jacob Bernoulli (1655–1705). Thankfully, the problem can be understood without any statistical training.
Here's a very serious and real example of this logic at work (97-101): Sally Clark of Manchester, England, gave birth to two baby boys, one in 1996 and a second in 1997. Sally relays that both babies died under similar circumstances; essentially, the baby lost consciousness and stopped breathing. This is known as sudden infant death syndrome (SIDS). In the second case, the baby showed signs of trauma and, according to Sally and her husband, this was due to attempts to resuscitate the boy.
The reasoning that sent Sally to prison for double murder goes like this:
The probability that two boys of the same mother would die of SIDS is extremely small. Therefore the probability that Sally is innocent of murdering them is also very small.
Indeed, SIDS is rare, so it would seem extremely unlikely for a mother to lose two babies to it. The jury was convinced that this reasoning was objective, mathematically sound evidence which amounted nearly to proof that Sally could not really be innocent. Sally was convicted of murdering her infants and sentenced to life in prison.
What is the logical fallacy here? The statistical reasoning provides an answer to the wrong question. The statistical question was presented this way:
What is the probability that both of Sally's two boys would die of SIDS?
That is known as a sampling probability. It may be an interesting question but the jury had to answer this question:
Did Sally kill her children?
The logical fallacy consists of the claim that to answer the second question, one need only answer the first question. It is a case of false substitution.
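In symbols (my shorthand, not Clayton's): write E for the evidence that both boys died and H for the hypothesis that Sally is innocent. The substitution treats the first of these probabilities as if it were the second.

```latex
% The transposed conditional, in illustrative shorthand (not from the book):
% a small P(E \mid H) does not, by itself, make P(H \mid E) small.
P(E \mid H) \neq P(H \mid E)
```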
The usual way of reasoning one's way through a problem, and what probability theory actually tells us we must do, is to weigh all of the available evidence. We do have to think about sampling probabilities (and there is more to that than we will mention) but we also have to ask the following question:
Setting aside the untimely death of her boys, what is the probability that Sally would kill her own children?
This is known as the prior probability. Put differently, what grounds does the prosecutor have to propose that Sally would even consider committing such an act? That a seemingly normal, loving mother like Sally would murder her babies is ridiculously improbable. The good prosecutor wants a motive. They want character witnesses who could puncture the image of Sally as a normal mother and thereby remove a source of reasonable doubt.
So we have two kinds of evidence. First, the 'sampling probabilities' or 'likelihoods' which are like the relative explanatory power of the competing theories presented by prosecution and defense. Then we have the prior probability. If these two forms of evidence push in opposite directions then they can balance or cancel each other out.
Here is how Clayton puts it:
Two children dying in infancy by whatever means is already an extremely unlikely event...The whole landscape of our probability assignments needs to change to reflect the fact that, by necessity, we are dealing with an extremely rare circumstance. And the prior probability we should reasonably assign to the proposition "Sally Clark murdered her two children," determined before considering the evidence, is itself extremely low because double homicide within a family is also incredibly rare! (99)
So a second crucial fact which could have informed the prior probability is the rate at which mothers are found to murder their children. If SIDS is rare, mothers who murder their own children are rarer still. The jury should have been shown a comparison of these two rates. They would have seen that the two rates not only offset one another but, on balance, favor Sally's innocence.
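To make the balancing concrete, here is a minimal sketch of Bayes' rule in odds form, in Python. The rates are invented placeholders of my own, not figures from the book or from the actual case.

```python
# Illustration only: these rates are invented placeholders, not estimates
# from the book or from the actual court case.
p_double_sids = 1e-6     # hypothetical rate of two SIDS deaths in one family
p_double_murder = 1e-7   # hypothetical rate of a mother murdering two infants

# Prior odds of innocence versus guilt, set before weighing courtroom evidence.
prior_odds = p_double_sids / p_double_murder

# Likelihood ratio: how much better one hypothesis explains the deaths than
# the other. Both hypotheses account for two infant deaths, so suppose it is
# roughly 1 (the medical evidence would refine this in a real analysis).
likelihood_ratio = 1.0

# Bayes' rule in odds form: posterior odds = likelihood ratio * prior odds.
posterior_odds = likelihood_ratio * prior_odds
print(f"posterior odds of innocence to guilt: about {posterior_odds:.0f} to 1")
```

The structure, not the numbers, is the point: the 'one in millions' figure presented to the jury corresponds to the numerator alone, whereas the inference turns on the ratio of the two rates.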
Sally Clark was tragically imprisoned for three years before the courts reversed this mistake and began revisiting similar cases as well.
Being wrong precisely
Bernoulli's Fallacy is at the heart of what Clayton calls 'the crisis of modern science', by which he means the failure of an embarrassing number of peer-reviewed studies (in certain fields) to survive attempts at replication. Failure to replicate results is an issue for those fields that (a) rely heavily on the statistician's method of 'null-hypothesis significance testing' (NHST) and (b) adopt an experimental method or laboratory-based observation (unlike social research, geology, astrophysics, climatology, and other fields). This includes psychology but also neuroscience, genetics, and others.
NHST is nothing but Bernoulli's fallacious logic. For example, 'Sally is innocent' would be the prosecutor's null hypothesis, which we could reject if, while holding it as a supposition, the observations appear sufficiently unlikely. I don't doubt that the damage done by this fallacy is countered somewhat by other scientific practices. For example, medical research does use NHST, but new medicines go through multiple stages of evaluation on multiple criteria, including knowledge of mechanisms.
Clayton does a nice job of recounting some of the severe criticisms of NHST that were voiced before it became dogma. Dr. Joseph Berkson of the Mayo Clinic, for one, complained in 1942,
There is no logical warrant for considering an event known to occur in a given hypothesis, even if infrequently, as disproving the hypothesis. (Berkson cited 241)
P-values provide a formal, quantifiable way to commit Bernoulli's fallacy. They appear in journal articles next to estimates of all sorts of quantities, like differences between various groups or effect sizes for some treatment. A small p-value (< .01) means that an estimate at least as large as the one actually observed would (supposedly) be unlikely to arise if the quantity being estimated were, in reality, equal to zero (e.g., if the treatment had no effect).
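For readers who want to see the mechanics, here is a minimal sketch of one common way to compute such a p-value, a permutation test for a difference in group means under a null of no effect. The data and group labels are invented for illustration; this is my example, not Clayton's.

```python
import random

# Invented measurements for two small groups (treated vs. control); illustration only.
treated = [2.1, 1.8, 2.5, 2.9, 2.2]
control = [1.6, 1.9, 1.4, 2.0, 1.7]

observed_diff = sum(treated) / len(treated) - sum(control) / len(control)

# Under the null hypothesis of 'no effect', the group labels are arbitrary,
# so shuffle the pooled measurements and record how often a difference at
# least as large as the observed one arises from relabeling alone.
pooled = treated + control
n_treated = len(treated)
n_sims = 10_000
n_extreme = 0
random.seed(1)
for _ in range(n_sims):
    random.shuffle(pooled)
    sim_diff = (sum(pooled[:n_treated]) / n_treated
                - sum(pooled[n_treated:]) / (len(pooled) - n_treated))
    if abs(sim_diff) >= abs(observed_diff):
        n_extreme += 1

p_value = n_extreme / n_sims
print(f"observed difference: {observed_diff:.2f}, p-value: {p_value:.3f}")
# Note what the p-value is NOT: it is not the probability that the null
# hypothesis is true. That reading is exactly Bernoulli's fallacy.
```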
Clayton provides a fun review of one of the more scandalous papers from the replication crisis. Some psychologists tested for extrasensory perception (yes, ESP!) and obtained 'significant' results (small enough p-values). The study is especially troublesome because the authors followed all of the standard statistical procedures. The 'right' methods gave ridiculous results, and this called into question the methods.
But I think we could have learned this lesson long ago just by reading papers from the eugenics literature. All of today's problems are there (plus some others). Consider Pearson and Moul's study of the 'alien Jew' children in London. Using statistical methods, they claim to find objectively existing associations. They dutifully acknowledge whenever one of the 'significant' associations is quite small but, like a good detective, they treat each as 'a clue to something more important' (207).3 After finding out which correlations are 'significant' and which are not, they attempt to explain the constellation of results (i.e., test a was 'significant' but, how curious, b was not). At this stage the authors engage in speculation and 'just so' storytelling based on their concept of racial stocks. It all looks naive in the extreme.
So the eugenicists' methodology was dual: correlations were subjected to seemingly serious 'tests of significance', but explanations were unconstrained except by the researchers' own 'common sense'. They tested 'statistical hypotheses' but they never actually tested their theory of racial stocks. They wielded their theory like an oversized medieval sword.
For any of these statistical 'tests' to work, one has to have a method for connecting observations to theory. One has to reason about the connections between the 'statistical hypothesis', plus all sorts of other information, and one's actual hypothesis, the theory that one needs to formulate somehow. But the frequency theory says that probability theory is silent on this matter, that probability does not apply to theories themselves (whether that is actually consistent with NHST itself is another matter). Researchers are then left to figure out the actual logic of science by themselves or, even worse, by reading Karl Popper.
Alternative to what?
Clayton calls his work 'propaganda'. The propagandistic element of the book, the subtle rhetorical maneuver, is that he coins the term Bernoulli's fallacy to talk about something that we've known for a long time. The term forces us to deal with statistics from the vantage point of logic, rather than portraying it as pure mathematics or as a body of objective facts. It punctures the false air of objectivity and brings in a more robust, critical kind of objectivity. I hope it catches on.
When it comes to fixing the problems of statistics Clayton has a number of suggestions, all of which are worth considering. The main line of argument is to abandon the frequency theory of probability. 'The better, more complete interpretation of probability', he writes, 'is that it measures the plausibility of a proposition given some assumed information' (281). This means making use of Bayes' theorem which, in Clayton's view, places a premium on two things (which he helpfully repeats throughout the text): (1) you have to formulate clear alternative explanations and (2) you have to assess their prior probabilities.
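Stated in its general form (generic textbook notation on my part, not a quotation from the book, with X standing for whatever background information is assumed), Bayes' theorem makes both requirements visible: the alternative hypotheses appear in the denominator and each one needs a prior probability.

```latex
% Bayes' theorem for competing hypotheses H_1, ..., H_n given evidence E and
% background information X (standard notation, not a quotation from Clayton):
P(H_i \mid E, X) = \frac{P(E \mid H_i, X)\, P(H_i \mid X)}{\sum_{j=1}^{n} P(E \mid H_j, X)\, P(H_j \mid X)}
```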
What is missing from Bernoulli's Fallacy is the bigger philosophy of science that stands behind the frequency theory. That is, empiricism. The point of the empiricist philosophy championed by the likes of Fisher and von Mises was to banish 'speculative' concepts from science so that it could rest securely on observational evidence. It was a dangerous, naive failure. Modern Bayesian statistics is also part of empiricist philosophy, having been developed mostly by adherents of logical positivism (an extreme branch of the empiricist tradition). A certain kind of Bayesian analysis (not quite like Clayton's position) has already become a standard part of statistical practice, without changing much at all with respect to research methodology.
But prior probabilities are needed precisely because one is no longer pretending to be making inferences just about 'empirical' correlations. Researchers are concerned with substantive explanations and the mechanisms 'speculated' or hypothesized to have generated the observed phenomena. This implies that non-statistical, qualitative ways of reasoning must, at appropriate times, come to the foreground of scientific inference. In science as in law, what really matter are the actual lines of reasoning and forms of evidence that enter into 'prior probability', not the performance of plugging numbers into a formula.
One way to improve statistical methods would be to place less of a burden on them by acknowledging that most of scientific inference is not statistical in nature and cannot be made so. While Clayton's work is guided by Jaynes, a physicist who made great use of quantitative analysis, it would be a mistake to skip over the mathematician who inspired Jaynes to theorize probability as a measure of plausibility, George Pólya. Pólya's point was that probability theory is highly relevant even when one does not apply quantitative analysis. He provides all sorts of carefully theorized examples that somewhat resemble our discussion of Sally Clark's court case.
For logical probability to work well, researchers need an alternative to empiricist philosophy itself. The most powerful alternative is realist philosophy, and it so happens that some of the key figures behind logical probability made their case for it explicitly on the grounds of realism, or as Harold Jeffreys called it, 'critical realism'. Since then, the realist literature has developed its own way of understanding research methodology, concept formation, explanation, and causality. Realists promote the development of theories of mechanism and explanation that are, in my view, simply more compelling, deeper, and more like 'good science' than the empiricist theories promoted by statisticians today ('causal effect estimates', 'formal causal inference').
Logical probability will have far more to offer if it can shed its current 'Bayesian'/positivist image and reunite with realist philosophy.
General references
Aubrey Clayton (2021). Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press.
Connor Donegan (2025). 'Probability and the philosophies of science: A realist view'. SocArXiv preprint. https://osf.io/preprints/socarxiv/k3nf5_v2
James Franklin (2001). 'Resurrecting Logical Probability'. Erkenntnis 55: 277-305. https://philarchive.org/rec/FRARLP
Rom Harré (1972). The Philosophies of Science. Oxford University Press.
George Pólya (1954). Mathematics and Plausible Reasoning. Princeton University Press, 2 volumes.
Douglas Porpora (2015). Reconstructing Sociology: The Critical Realist Approach. Cambridge University Press.
Notes
This meaning of 'objectivity' is based on some ideas found in Harold Jeffreys' Theory of Probability. His emphasis was on 1) knowing when we are following certain rules and when we are not, and 2) being able to apply the same rules to different ideas, so that the ideas I favor can be assessed using the same standards as the ideas I do not favor.
They want you to think about games of dice or cards or other gambling machines. You'll notice that probability distributions, once you learn them, can be very good at predicting average outcomes. In fact, there is an incredibly wide range of situations which show aggregate results that superficially resemble a probability distribution. The frequentist sees this and says, 'it must not be logic because logic lives in your head; but this here is an objective fact'. The response to this argument is that the same is true of all logic. Deductive logic accords with all experience but it is logic nonetheless. That is the nature of (sound) logic and the subject of a different kind of discussion. The way we feel about the consistency of probability theory with relevant experience is due, first, to the correctness of that logic and, second, to how readily we dismiss all those cases where results don't resemble expectations.
Pay attention to their use of data visualization and curve-fitting. They draw curved lines through plots that contain no more than seven points, and then use the slightest curvature to 'support' their argument. Their justification offers insight into their methods: 'It has struck us repeatedly in this work that occurrences in the tail values of various characters present peculiarities which deserve to be pursued further. As a rule tail frequencies are small and their eccentricities within the limits of random sampling, but when they are significant and anomalous, they may be most suggestive for evolutionary interpretation or as indications of racial admixtures' (Pearson and Moul 1928, 241).