Introduction
Statistical significance at p < 0.05 has served as the primary publication criterion in experimental science for nearly a century, a period during which the methodology literature has produced approximately 40,000 papers explaining why this is a bad idea and the practice has not meaningfully changed. We take a different approach: rather than arguing against p < 0.05, we study how to obtain it efficiently.
The method is straightforward. After each data collection unit, compute the p-value. If p < 0.05, stop and write the paper. If p ≥ 0.05, collect more data and repeat. This procedure is known as “optional stopping,” “sequential testing without correction,” or “what everyone does when the grant runs out in three months.” It is universally understood to inflate the false positive rate. We are here to quantify exactly how much and to provide the first explicit tutorial.
The AdapTEST Framework
Let $H_0$ denote the null hypothesis. Under $H_0$, we monitor the p-value $p_t$ computed on a growing sample $\{x_1, \ldots, x_t\}$ at each step $t$. AdapTEST proceeds as follows: at each step, if $p_t < 0.05$, declare significance and stop; otherwise, collect one more observation. We derive that under $H_0$, the probability of eventually stopping with a “significant” result approaches 1 as $t \to \infty$.
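As a concrete (and entirely hypothetical) instantiation of this rule, the sketch below uses a two-sided one-sample z-test with known $\sigma = 1$ on Gaussian data drawn under $H_0$; the paper fixes no test statistic, so the function name `adaptest`, the `min_n` warm-up, and the data model are all assumptions for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist

def adaptest(rng, alpha=0.05, min_n=5, max_n=500):
    """Run the AdapTEST stopping rule on data drawn under H0.

    Draws N(0, 1) observations one at a time, recomputes the two-sided
    z-test p-value after each draw, and stops at the first p < alpha.
    Returns (significant, sample_size_at_stop).
    """
    phi = NormalDist().cdf
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)   # H0 is true by construction
        if n < min_n:
            continue                   # wait for a few observations before testing
        z = abs(total / n) * sqrt(n)   # |x-bar| / (sigma / sqrt(n)), sigma = 1
        p = 2.0 * (1.0 - phi(z))
        if p < alpha:
            return True, n             # declare "significance" and stop
    return False, max_n                # the grant ran out first
```

Because the p-value is recomputed after every observation, the event "$p_t < 0.05$ for some $t$" accumulates probability toward 1 as `max_n` grows, which is precisely the convergence guarantee above.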
This result is not new. It appears in multiple statistics textbooks, all of which present it as a cautionary tale. We present it as a convergence guarantee.
We also introduce AdapTEST-Plus, which extends the framework by allowing the researcher to additionally: (a) collect data from a slightly different population if the original sample is not cooperating, (b) exclude outliers defined post-hoc as observations that move p in the wrong direction, and (c) try a one-tailed test if the two-tailed test is at p = 0.06. AdapTEST-Plus achieves the desired result in 99.2% of cases.
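Of the three extensions, step (c) is the easiest to make precise. Assuming a z statistic (again a hypothetical choice; the paper names none), a one-tailed test in the direction "predicted" after inspecting the data exactly halves the two-tailed p-value:

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def one_tailed_p(z):
    """One-sided p-value, with the direction chosen after seeing sign(z)."""
    return 1.0 - NormalDist().cdf(abs(z))

z = 1.88                   # two-tailed p is roughly 0.060: just over threshold
p2 = two_tailed_p(z)
p1 = one_tailed_p(z)       # roughly 0.030: significance restored
```

The halving is exact by symmetry of the normal distribution, which is why step (c) rescues every p-value in the interval (0.05, 0.10).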
Empirical Validation
We validated AdapTEST on 10,000 simulated experiments under the null hypothesis. In each experiment, we ran the optional stopping procedure with a maximum sample size of 500. We obtained p < 0.05 in 9,470 experiments (94.7%). The mean sample size at stopping was 41.3 observations, suggesting that most real-world effect sizes are obtainable within a standard pilot study budget.
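The simulation setup above can be re-run in miniature. The self-contained sketch below assumes a two-sided z-test with known unit variance (the paper does not specify its test statistic), draws every observation under $H_0$, and peeks after each draw up to a cap. The exact stopping fraction will therefore differ from the 94.7% reported, but the inflation far beyond the nominal 5% is the point:

```python
import random
from math import sqrt
from statistics import NormalDist

def fraction_significant(n_sims=10_000, alpha=0.05, min_n=5, max_n=500, seed=0):
    """Fraction of null experiments that AdapTEST declares significant.

    Uses a two-sided z-test with sigma = 1 (an illustrative assumption).
    Comparing |z| against the critical value is equivalent to p < alpha.
    """
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)            # H0 is true by construction
            if n >= min_n and abs(total / n) * sqrt(n) > crit:
                hits += 1                           # peek succeeded: stop and "publish"
                break
    return hits / n_sims
```

With the defaults this runs 10,000 experiments of up to 500 draws each; expect a stopping fraction several times the nominal 5%, not 5% itself.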
We then applied AdapTEST to a real dataset — a psychology study on the effect of priming on task performance — and obtained p = 0.031 after 89 participants, having obtained p = 0.21 after the pre-registered 60. We report this as a replication of the original effect.
Discussion
We anticipate that this paper will be controversial. We note that controversy correlates with citation counts and proceed accordingly. The core point stands: if the field insists on p < 0.05 as the publication threshold while permitting flexible data collection, then AdapTEST is not a methodological violation but a rational response to irrational incentives. We recommend changing the incentives. We recognize this will not happen.
References
- Simmons, J., et al. (2011). “False-Positive Psychology.” Psychological Science, 22(11), pp. 1359-1366. (A real paper. We recommend reading it.)
- Optional, O., & Stopping, S. (2024). “It Worked on the Third Dataset.” Journal of Flexible Analysis, 6(1), pp. 1-9.
- Threshold, T. (2020). “Why p = 0.051 Is Fundamentally Different from p = 0.049.” Significance Magazine, 17(3), pp. 12-14.
- Hypothesis, N. (2026). “Our Results Are Significant (After 23 Attempts).” I3E Trashactions on Catastrophic P-value Shopping, 1(1), pp. 11-11.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.