Clinical Trials · Trial Design · Methods Critique

Multiple Testing in Clinical Trials: When One Positive Endpoint Is Just the Loudest Coin Flip

June 11, 2026 · 16 min read

Anas H. Alzahrani, MD PhD MPH

Department of Preventive Medicine and Public Health

Faculty of Medicine, King Abdulaziz University

A trial does not become rigorous just because every table cell has a p-value. Once investigators test several endpoints, several doses, several subgroups, several time points, or several interim looks, chance gets more opportunities to produce something photogenic. The common mistake is to report the winner as if it were the only race that happened.

This is the practical meaning of multiple testing or multiplicity. It is not a ceremonial statistics paragraph for regulators. It is the reason a lone positive finding can be much less persuasive than it first appears, especially when the manuscript is vague about how many other analytic doors were tried first.

The Core Decision Rule

Before you celebrate a positive endpoint, ask a simpler question: how many clinically important chances did this study have to be positive?

Decision rule:

If a paper offers several endpoints, several subgroup stories, or several interim opportunities to win, the headline result is only trustworthy when the testing family and error-control plan were made explicit before the applause started.

That does not mean every exploratory analysis is forbidden. It means confirmatory claims and hypothesis-generating claims should stop pretending to be the same thing.

Where Multiplicity Actually Sneaks In

Several endpoints

Co-primary endpoints, key secondaries, symptom scales, biomarkers, and safety outcomes can all turn one trial into many shots at significance.

Several looks over time

Interim analyses, repeated week-by-week outcomes, and unscheduled peeks all draw on the same alpha budget unless the design allocates that budget across looks in advance.

Several ways to slice patients

Biomarker cutoffs, age bands, sex, baseline severity, and geography can produce a compelling subgroup story long after the main effect disappoints.

Several dose or model choices

Dose pooling, covariate sets, responder thresholds, and post hoc endpoint definitions can multiply opportunities even when the protocol looks tidy on the first page.
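To make the "several looks over time" problem concrete, here is a minimal simulation sketch. It assumes a trial with no true effect, equally sized blocks of data between looks, a known variance, and an unadjusted two-sided 0.05 threshold applied at every look. The function name `false_positive_rate` and all settings are illustrative assumptions, not part of any trial described here.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_trials=20000, alpha=0.05, seed=0):
    """Share of simulated null trials that cross the nominal threshold at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted two-sided cutoff
    hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each look adds one equal block of data. Under the null, the
            # block's standardized contribution is a standard normal draw.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5  # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1  # the trial "wins" at this look and stops
                break
    return hits / n_trials


# Compare a single final analysis with five unadjusted looks.
print("one look  :", false_positive_rate(1))
print("five looks:", false_positive_rate(5))
```

Under these assumptions the single-look rate sits near the nominal 0.05, while the five-look rate lands well above it. Group-sequential designs exist precisely to pull that second number back down by setting stricter thresholds at each look.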

Interactive multiplicity check

How easy is it to get one “positive” result just by looking in enough places?

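This reviewer tool treats each planned or unplanned hypothesis look as another chance for luck to impersonate a finding. The figures below assume every look is an independent test at a nominal 0.05 level with no true effect anywhere, so they are a rough teaching approximation, not a replacement for a full trial operating-characteristics simulation.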

Chance at least one false-positive appears: 56.0%
Expected nominal hits by luck: 0.80

| Quantity | Value | Why it matters |
| --- | --- | --- |
| Confirmatory hypotheses | 5 | Even planned families need an error-control story if more than one claim can win. |
| Exploratory looks | 11 | This is where many manuscripts quietly turn one trial into many opportunities for luck. |
| Chance of at least one false-positive confirmatory hit | 22.6% | A family with several formal hypotheses needs gatekeeping, alpha splitting, or another explicit plan. |
| Chance of at least one exploratory false-positive | 43.1% | Exploratory signals can teach, but they are weak foundations for decisive claims. |
| Bonferroni threshold for confirmatory family | 0.0100 | Crude but easy reminder that the usual `0.05` does not survive unlimited reuse. |
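The numbers in this check can be reproduced with three short formulas. The sketch below assumes independent tests, a nominal alpha of 0.05, and no true effects. With k looks, the chance of at least one false positive is 1 - (1 - alpha)^k, the expected number of chance hits is k × alpha, and the Bonferroni threshold is alpha divided by the number of confirmatory hypotheses. The function names are illustrative.

```python
def chance_of_any_false_positive(n_looks, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_looks


def expected_false_positives(n_looks, alpha=0.05):
    """Expected count of nominally significant results when nothing is real."""
    return n_looks * alpha


def bonferroni_threshold(n_hypotheses, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_hypotheses


confirmatory, exploratory = 5, 11  # the settings shown in the table above
total = confirmatory + exploratory

print(f"Any false positive (all looks): {chance_of_any_false_positive(total):.1%}")
print(f"Expected chance hits:           {expected_false_positives(total):.2f}")
print(f"Confirmatory family:            {chance_of_any_false_positive(confirmatory):.1%}")
print(f"Exploratory looks:              {chance_of_any_false_positive(exploratory):.1%}")
print(f"Bonferroni threshold:           {bonferroni_threshold(confirmatory):.4f}")
```

Real endpoints are usually correlated, which tends to make the true family-wise rate somewhat lower than the independence formula suggests. The direction of the problem does not change.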

Quick read

This setup creates enough multiplicity pressure that one nominally positive result should not be treated as decisive on its own.

Reviewer rule: if a paper has several clinically important endpoints, several subgroup stories, or several interim decision points, ask to see the testing family before believing the headline.

  • Hierarchy beats improvisation when confirmatory claims compete for the same alpha budget.
  • Exploratory findings are useful for design and replication, not for pretending a trial asked only one question.
  • A correction does not rescue a clinically incoherent endpoint family; it only controls one kind of error.

A Concrete Clinical Example

Imagine a randomized heart-failure trial of a new infusion. The protocol names one primary endpoint: 90-day cardiovascular death or hospitalization. By the end, the manuscript emphasizes a positive symptom-score difference at day 14, a nominally favorable biomarker change at day 30, and an apparent mortality benefit in patients with high baseline natriuretic peptide levels.

None of those findings is impossible. But if the paper never clarifies whether the symptom score was a key secondary endpoint, whether the biomarker analysis was inside a hierarchical sequence, or how many subgroup thresholds were tried first, the reader cannot tell whether the headline is discovery or residue from repeated looking.

Bad reflex

Promote the nicest p-value to the abstract and leave the rest of the family structure in the supplement.

What that usually means

The trial asked several questions, but the paper narrates only the one that smiled back.

What a stronger paper would do

Name the confirmatory family, show the alpha-allocation plan, and label the subgroup result as exploratory unless it was truly prespecified and protected.
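To see how much a declared family can change the reading, consider a sketch that applies the Holm step-down procedure to the three efficacy results in this example. The article reports no p-values for this imaginary trial, so the ones below are invented. Treating the three as a single family is also an assumption made purely for illustration. The function name `holm_rejections` is hypothetical.

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm step-down procedure.

    p_values: dict mapping hypothesis name -> p-value.
    Returns the names rejected while controlling the family-wise error rate.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = []
    for rank, (name, p_value) in enumerate(ordered):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p_value <= alpha / (m - rank):
            rejected.append(name)
        else:
            break  # once one hypothesis fails, all larger p-values fail too
    return rejected


# Invented p-values for the heart-failure example.
family = {
    "90-day CV death or hospitalization": 0.20,
    "symptom score, day 14": 0.02,
    "biomarker change, day 30": 0.04,
}
print(holm_rejections(family))
```

With these made-up numbers, two results are nominally below 0.05, yet none survives the family-level correction. Applying a correction like this after the fact does not turn a flexible analysis into a confirmatory one. It only shows how fragile the nominal hits are once the family is counted.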

When Correction Is Essential, Helpful, or Mostly Theater

| Situation | Why multiplicity matters | Practical response |
| --- | --- | --- |
| Several confirmatory endpoints can each support a success claim | Family-wise false-positive risk rises quickly. | Use a prespecified hierarchy, alpha splitting, gatekeeping, or another explicit confirmatory plan. |
| One primary endpoint plus descriptive safety and exploratory biomarkers | Readers may over-read secondary signals as efficacy proof. | Label exploratory work clearly and avoid confirmatory language. |
| Several subgroup stories after a weak overall result | Fishing pressure and low interaction power combine badly. | Demand interaction tests, protocol evidence, and replication. |
| Post hoc correction applied after flexible endpoint hunting | Correction can control one error rate while hiding design drift. | Ask what the endpoint family was before the results were seen. |
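For the subgroup row, the interaction test asks a different question from "is the effect significant in this subgroup?" It asks whether the effect in one subgroup differs from the effect in the other. A minimal sketch, assuming two disjoint subgroups, effect estimates on the log hazard ratio scale, and a normal approximation, is shown below. The estimates and standard errors are invented.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interaction_z_test(est_a, se_a, est_b, se_b):
    """Test whether two independent subgroup estimates differ from each other."""
    diff = est_a - est_b
    se_diff = (se_a ** 2 + se_b ** 2) ** 0.5  # valid only for disjoint subgroups
    z = diff / se_diff
    return z, two_sided_p(z)


# Hypothetical log hazard ratios: high vs. low baseline natriuretic peptide.
high_est, high_se = -0.45, 0.20
low_est, low_se = -0.05, 0.18

p_high_alone = two_sided_p(high_est / high_se)
z_int, p_int = interaction_z_test(high_est, high_se, low_est, low_se)

print(f"High-peptide subgroup alone: p = {p_high_alone:.3f}")
print(f"Interaction (high vs. low):  z = {z_int:.2f}, p = {p_int:.3f}")
```

In this invented case the high-peptide subgroup looks nominally significant on its own, while the interaction test does not show that the two subgroups differ. Interaction tests are typically underpowered, so a non-significant result is not proof of a uniform effect. It does mean that the subgroup story rests on weaker evidence than its p-value suggests.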

What Reviewers Should Ask Before Trusting a Lone Positive Result

  • What exact hypotheses were confirmatory, and where were they defined before unblinding?
  • How many endpoints, time points, subgroup looks, and interim analyses were realistically in play?
  • Was there a hierarchical testing sequence or alpha-allocation plan, and did the paper follow it?
  • Are nominally positive secondary or subgroup findings being discussed as if they independently prove benefit?
  • Would the clinical conclusion still survive if the prettiest p-value had belonged to a different endpoint?

Common Failure Modes

The endpoint shell game

The main endpoint disappoints, so a better-looking secondary outcome becomes the emotional center of the abstract.

Correction without coherence

A multiplicity method is named, but the paper never makes clear which endpoints were in the family or why that family reflects the real clinical question.

Exploratory upgraded to decisive

The discussion treats a subgroup or biomarker result as practice-changing even though it was not protected, replicated, or central to the original design.

Multiplicity ignored because the effect is plausible

Biological plausibility can support a finding, but it does not refund the alpha budget already spent.

Where Aqrab Fits

Multiplicity problems often survive peer review because each individual analysis sounds reasonable in isolation. The weakness appears only when someone asks how many such analyses were possible and which of them were supposed to carry confirmatory weight. That is exactly the kind of methodological audit Aqrab is built to do quickly.

If you want to wire those critique checks into your own review workflow, the developer tools are the better fit for repeated manuscript screening.

The Practical Bottom Line

A positive result is not automatically persuasive when it emerged from a crowd of nearby chances to be positive. Clinical researchers do not need to memorize every correction method. They do need to ask whether the study declared its testing family honestly and whether the conclusion matches the part of that family that was actually protected.

In short: if one endpoint wins after twenty opportunities, the right emotion is not celebration. It is curiosity about the nineteen that did not.
