Research Article

Optional Stopping as Continuous Improvement: A Defense of Running Experiments Until They Work

I3E TCPS · Volume 1, No. 1 · pp. 1-10
DOI: 10.I3E/tcps.2026.00156

Editor's Summary

The editors ran this paper’s analysis themselves with a different random seed. After 63 iterations, we obtained p = 0.049. We are satisfied.

Abstract

The practice of continuing data collection until a statistically significant result is obtained — colloquially known as “p-hacking” but more charitably described as “adaptive sequential analysis” — is widely condemned in the methodology literature and widely practiced everywhere else. We present a formal framework for this practice, derive the conditions under which it reliably produces p < 0.05 under the null hypothesis, and find that patience of approximately 40 additional samples is sufficient in 94.7% of cases. We argue that this is not a bug but a feature, provided one is willing to redefine “feature.”

Article

Introduction

Statistical significance at p < 0.05 has served as the primary publication criterion in experimental science for nearly a century, a period during which the methodology literature has produced approximately 40,000 papers explaining why this is a bad idea and the practice has not meaningfully changed. We take a different approach: rather than arguing against p < 0.05, we study how to obtain it efficiently.

The method is straightforward. After each data collection unit, compute the p-value. If p < 0.05, stop and write the paper. If p ≥ 0.05, collect more data and repeat. This procedure is known as “optional stopping,” “sequential testing without correction,” or “what everyone does when the grant runs out in three months.” It is universally understood to inflate the false positive rate. We are here to quantify exactly how much and to provide the first explicit tutorial.

The AdapTEST Framework

Let $H_0$ denote the null hypothesis. Under $H_0$, we monitor the p-value $p_t$ computed on a growing sample $\{x_1, \ldots, x_t\}$ at each step $t$. AdapTEST proceeds as follows: at each step, if $p_t < 0.05$, declare significance and stop; otherwise, collect one more observation. We derive that under $H_0$, the probability of eventually stopping with a “significant” result approaches 1 as $t \to \infty$: by the law of the iterated logarithm, the test statistic recrosses any fixed significance threshold infinitely often, so sufficient patience is sufficient.
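The stopping rule can be sketched in a few lines of Python. Since the paper specifies no test statistic, the sketch substitutes a two-sided z-test on standard-normal data with known σ = 1 (so it is stdlib-only); the names `p_two_sided` and `adaptest`, the starting sample size of 5, and the cap of 2,000 are our illustrative choices, not anything from the paper.

```python
import math
import random

def p_two_sided(xs):
    # Two-sided z-test of H0: mean = 0, for data with known sigma = 1.
    # 2 * (1 - Phi(|z|)) simplifies to erfc(|z| / sqrt(2)).
    z = abs(sum(xs)) / math.sqrt(len(xs))
    return math.erfc(z / math.sqrt(2))

def adaptest(rng, n_min=5, n_max=2_000, alpha=0.05):
    """Collect observations until p < alpha, then stop and write the paper."""
    xs = [rng.gauss(0, 1) for _ in range(n_min)]
    while p_two_sided(xs) >= alpha and len(xs) < n_max:
        xs.append(rng.gauss(0, 1))  # just one more observation
    return p_two_sided(xs), len(xs)

p, n = adaptest(random.Random(0))
print(f"p = {p:.3f} at n = {n}")
```

Raising `n_max` only raises the chance of stopping "significant": under $H_0$ the z statistic recrosses any fixed threshold infinitely often as the sample grows, which is precisely the convergence guarantee being celebrated.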

This result is not new. It appears in multiple statistics textbooks, all of which present it as a cautionary tale. We present it as a convergence guarantee.

We also introduce AdapTEST-Plus, which extends the framework by allowing the researcher to additionally: (a) collect data from a slightly different population if the original sample is not cooperating, (b) exclude outliers defined post-hoc as observations that move p in the wrong direction, and (c) try a one-tailed test if the two-tailed test is at p = 0.06. AdapTEST-Plus achieves the desired result in 99.2% of cases.
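Item (c) is plain arithmetic, which makes it the most honest component of AdapTEST-Plus. A minimal sketch, with a helper name and direction flag of our own invention:

```python
def rescue_one_tailed(p_two_tailed, effect_in_predicted_direction):
    """Halve a two-tailed p-value when the effect points the 'right' way.

    The arithmetic is valid for symmetric test statistics; choosing the
    tail after looking at the data is, of course, the entire problem.
    """
    if effect_in_predicted_direction:
        return p_two_tailed / 2
    return 1 - p_two_tailed / 2

print(rescue_one_tailed(0.06, True))  # 0.03
```

The paper's p = 0.06 duly becomes p = 0.03, provided one "predicted" the observed direction, a prediction that is easiest to make after the fact.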

Empirical Validation

We validated AdapTEST on 10,000 simulated experiments under the null hypothesis. In each experiment, we ran the optional stopping procedure with a maximum sample size of 500. We obtained p < 0.05 in 9,470 experiments (94.7%). The mean sample size at stopping was 41.3 observations, suggesting that most real-world effect sizes are obtainable within a standard pilot study budget.
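The simulation is easy to reproduce in outline. The sketch below reruns the procedure under $H_0$ with the paper's cap of 500, but it must assume several details the paper leaves unstated (the test statistic, here a z-test with known σ = 1; the starting sample size; the trial count), so the printed rate will not match 94.7% exactly; it will, however, land far above the nominal 5%.

```python
import math
import random

def run_once(rng, n_min=5, n_max=500, alpha=0.05):
    # One AdapTEST run under H0, keeping a running sum so each extra
    # observation updates the z-test in O(1).
    n = n_min
    s = sum(rng.gauss(0, 1) for _ in range(n_min))
    while True:
        p = math.erfc(abs(s) / math.sqrt(2 * n))  # two-sided z-test p-value
        if p < alpha or n >= n_max:
            return p < alpha, n
        s += rng.gauss(0, 1)
        n += 1

rng = random.Random(1)
outcomes = [run_once(rng) for _ in range(2_000)]
stops = [n for hit, n in outcomes if hit]
rate = len(stops) / len(outcomes)
print(f"'significant' under H0: {rate:.1%}, mean stopping n: "
      f"{sum(stops) / len(stops):.1f}")
```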

We then applied AdapTEST to a real dataset — a psychology study on the effect of priming on task performance — and obtained p = 0.031 after 89 participants, having obtained p = 0.21 after the pre-registered 60. We report this as a replication of the original effect.

Discussion

We anticipate that this paper will be controversial. We note that controversy correlates with citation counts and proceed accordingly. The core point stands: if the field insists on p < 0.05 as the publication threshold while permitting flexible data collection, then AdapTEST is not a methodological violation but a rational response to irrational incentives. We recommend changing the incentives. We recognize this will not happen.

References

  1. Simmons, J., et al. (2011). “False-Positive Psychology.” Psychological Science, 22(11), pp. 1359-1366. (A real paper. We recommend reading it.)
  2. Optional, O., & Stopping, S. (2024). “It Worked on the Third Dataset.” Journal of Flexible Analysis, 6(1), pp. 1-9.
  3. Threshold, T. (2020). “Why p = 0.051 Is Fundamentally Different from p = 0.049.” Significance Magazine, 17(3), pp. 12-14.
  4. Hypothesis, N. (2026). “Our Results Are Significant (After 23 Attempts).” I3E Trashactions on Catastrophic P-value Shopping, 1(1), pp. 11-11.

Author Affiliations

1. Statistical Flexibility Laboratory, Center for Desired Outcomes


eLetters

Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.