Prediction Models · Clinical Utility · Methods Critique

Net Reclassification Improvement: When a New Biomarker Wins by Moving Patients Between the Wrong Boxes

June 15, 2026 · 15 min read

Anas H. Alzahrani, MD PhD MPH

Department of Preventive Medicine and Public Health

Faculty of Medicine, King Abdulaziz University

Prediction-model papers love a good rescue narrative. The base model discriminates decently. A new biomarker, imaging feature, or machine-learning layer gets added. Then the authors announce that patients were “reclassified correctly,” often with a positive net reclassification improvement and a tone suggesting the clinical argument is finished.

It usually is not. NRI can be informative, but it is also one of the easiest ways to make small model changes sound more decision-relevant than they really are. If the risk categories are arbitrary, if the treatment thresholds do not trigger real management changes, or if false upward movement in non-events is brushed aside, the metric can flatter a model upgrade that clinicians did not need.

The Core Decision Rule

Never accept a positive NRI as evidence of clinical value unless the risk categories map to concrete decisions and the paper shows what was gained and what was sacrificed in both events and non-events.

Decision rule:

Reclassification is only impressive when the boxes mean something. Moving patients between arbitrary bins is accounting, not utility.

What NRI Is Actually Counting

| Piece | What counts as good | Why it can still mislead |
| --- | --- | --- |
| Event NRI | Patients who truly develop the outcome move up into higher-risk categories. | It can look good even if the higher-risk bins do not change treatment or if too many future cases also move down. |
| Non-event NRI | Patients who stay event-free move down into lower-risk categories. | It gets ugly fast when many non-events are instead pushed upward and exposed to unnecessary workup or therapy. |
| Total NRI | The event and non-event components sum to a positive headline number. | A single total can hide that one side improved modestly while the other side became clinically worse. |

That last row is the trap. Once a total NRI is reported without its two ingredients and without a decision context, the metric becomes very easy to overread.
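To make the two ingredients concrete, here is a minimal sketch of how a categorical NRI is usually assembled from patient-level data: the share of events moving up minus the share moving down, plus the share of non-events moving down minus the share moving up. The function name `nri_components` and the toy cohort are illustrative assumptions, not taken from any particular paper or package.

```python
from typing import Sequence


def nri_components(old_cat: Sequence[int], new_cat: Sequence[int],
                   event: Sequence[bool]) -> dict:
    """Categorical NRI split into its event and non-event parts.

    Categories are ordinal integers (higher = higher risk).
    Returned values are proportions, not percentage points.
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, had_event in zip(old_cat, new_cat, event):
        if had_event:
            n_e += 1
            up_e += new > old      # event moved to a higher bin (good)
            down_e += new < old    # event moved to a lower bin (bad)
        else:
            n_n += 1
            up_n += new > old      # non-event moved up (bad)
            down_n += new < old    # non-event moved down (good)
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    event_nri = (up_e - down_e) / n_e
    nonevent_nri = (down_n - up_n) / n_n
    return {
        "event_nri": event_nri,
        "nonevent_nri": nonevent_nri,
        "total_nri": event_nri + nonevent_nri,
    }


# Hypothetical toy cohort: 4 events followed by 8 non-events
old = [1, 1, 1, 2, 1, 1, 0, 2, 0, 1, 1, 0]
new = [2, 2, 2, 1, 2, 2, 1, 1, 0, 1, 1, 0]
evt = [True] * 4 + [False] * 8
result = nri_components(old, new, evt)
print(result)
```

In this invented cohort the total comes out positive even though the non-event component is negative, which is exactly the pattern a headline number can hide.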

Why Risk Categories Matter More Than the Formula

Thresholds should trigger action

If 5%, 10%, and 20% risk bins do not change surveillance, biopsy, anticoagulation, or referral, then reclassification between them may be mathematically neat but clinically idle.

Categories can be chosen to flatter

The more arbitrary the cutoffs, the easier it is to make small probability shifts look like a big categorical upgrade.
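A small sketch shows how much the cutoffs can matter. The predicted risks below are invented for illustration, and the two cutoff schemes are assumptions chosen to make the point; the same tiny probability shifts produce a large total NRI under one scheme and none under the other.

```python
from bisect import bisect_right


def total_nri(old_risk, new_risk, event, cutoffs):
    """Total categorical NRI for one choice of risk cutoffs."""
    net_event = net_nonevent = 0
    n_event = sum(event)
    n_nonevent = len(event) - n_event
    for old, new, had_event in zip(old_risk, new_risk, event):
        # Category index = number of cutoffs at or below the risk
        move = bisect_right(cutoffs, new) - bisect_right(cutoffs, old)
        step = (move > 0) - (move < 0)  # +1 up, -1 down, 0 unchanged
        if had_event:
            net_event += step
        else:
            net_nonevent -= step  # for non-events, down is the good direction
    return net_event / n_event + net_nonevent / n_nonevent


# Hypothetical risks: every shift is at most 2 percentage points
old_risk = [0.09, 0.095, 0.098, 0.19, 0.04, 0.12, 0.15, 0.30]
new_risk = [0.11, 0.105, 0.102, 0.195, 0.045, 0.13, 0.14, 0.29]
event = [1, 1, 1, 1, 0, 0, 0, 0]

nri_a = total_nri(old_risk, new_risk, event, cutoffs=[0.10, 0.20])
nri_b = total_nri(old_risk, new_risk, event, cutoffs=[0.075, 0.20])
print(nri_a, nri_b)
```

Under the first scheme three events happen to straddle the 10% line, so they all count as upgrades. Shift the lower cutoff to 7.5% and nobody changes category, although the predicted risks are identical.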

False upward movement is not free

A model that scares more low-risk patients into higher bins may create overtesting and overtreatment while still posting a tidy overall NRI.

A Concrete Clinical Example

Case

A cardiac biomarker that improves NRI more than it improves care

Imagine a cardiovascular risk model used to decide whether middle-aged patients should receive preventive therapy intensification. Investigators add a new biomarker and celebrate a positive NRI because some future events move from the intermediate bin into a high-risk bin.

But the same update also moves many non-events upward, and the treatment threshold itself is weakly justified. If clinicians would probably treat most intermediate-risk patients anyway, the extra categorical drama has not bought much. The paper has improved its press release more than its decision logic.

Interactive NRI explorer

Reclassification can look exciting even when the clinical story is thin

The explorer varies the share of events and non-events moving between risk categories after adding a new biomarker or predictor. The snapshot below shows how quickly a flattering NRI can appear even before anyone proves the categories are meaningful or the decisions improve.

Total NRI: +9.0 pts (Events: +13.0 pts; Non-events: -4.0 pts)

Events moving up (22.0%): This is the part authors love to emphasize: true events getting pushed into bins that sound more alarming.

Events moving down (9.0%): This is the quiet penalty. If future cases move downward, the “improved” model may be hiding some patients who actually need action.

Non-events moving down (14.0%): This can help avoid unnecessary escalation if those lower bins actually change what clinicians do.

Non-events moving up (18.0%): This is where overtreatment pressure hides. A shiny NRI can coexist with too many false alarms.

| Component | Calculation | Interpretation |
| --- | --- | --- |
| Event NRI | 22.0% - 9.0% = +13.0 pts | Positive means more true events moved upward than downward. |
| Non-event NRI | 14.0% - 18.0% = -4.0 pts | Positive means more non-events were spared high-risk labeling than pushed into it. |
| Total NRI | +13.0 pts + (-4.0 pts) = +9.0 pts | Useful only if the risk categories are clinically justified and tied to real management changes. |

Reviewer cue

This is a modest NRI. It needs backup from calibration, decision consequences, and external validation.

Main warning

The apparent gain comes entirely from the event side (+13.0 pts). The non-event component is negative (-4.0 pts), meaning more event-free patients were pushed into higher-risk bins than were moved down.

If the paper never explains what happens at each risk threshold, the categories may be numerically tidy and clinically arbitrary. NRI is easiest to inflate when bins exist for publication, not for care.
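One way to see what the headline hides is to convert the two components back into patients. The event and non-event shares sit on different denominators, so the same percentage means very different head counts. The sketch below uses the shares from the calculation table (22.0% and 9.0% for events, 14.0% and 18.0% for non-events); the cohort size of 1,000 and the 10% event rate are assumptions added purely for illustration.

```python
def reclassification_counts(n, event_rate, ev_up, ev_down, ne_down, ne_up):
    """Translate NRI shares into net numbers of patients moved upward."""
    n_events = n * event_rate
    n_nonevents = n - n_events
    return {
        # Events net moved up: the benefit side of the ledger
        "events_net_up": n_events * (ev_up - ev_down),
        # Non-events net moved up: the false-alarm side of the ledger
        "nonevents_net_up": n_nonevents * (ne_up - ne_down),
    }


# Assumed cohort: 1,000 patients, 10% event rate (hypothetical)
counts = reclassification_counts(
    n=1000, event_rate=0.10,
    ev_up=0.22, ev_down=0.09,
    ne_down=0.14, ne_up=0.18,
)
print({k: round(v, 1) for k, v in counts.items()})
```

Under these assumptions a +9.0 pt total corresponds to roughly 13 events net moved up against roughly 36 non-events net moved up. With a rarer outcome the imbalance grows, because the non-event denominator dominates the cohort.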

Where Published NRI Claims Go Wrong

The bins are arbitrary

If authors cannot defend the clinical thresholds, then crossing them is not automatically meaningful.

Only total NRI is shown

Hiding the event and non-event pieces makes it easier to miss whether the “gain” came with a heavy overtreatment cost.

Calibration is weak or unstated

A model can reshuffle people between categories while still assigning poorly calibrated risks.
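A basic calibration check can be sketched in a few lines: within each risk category, compare the mean predicted risk with the observed event rate. The function name, the cutoffs, and the data below are hypothetical, and a real assessment would also use calibration plots and far larger samples.

```python
from bisect import bisect_right


def calibration_by_category(pred, outcome, cutoffs):
    """Mean predicted risk vs observed event rate in each risk category."""
    groups = {}
    for p, y in zip(pred, outcome):
        # Category index = number of cutoffs at or below the predicted risk
        groups.setdefault(bisect_right(cutoffs, p), []).append((p, y))
    rows = []
    for cat in sorted(groups):
        members = groups[cat]
        n = len(members)
        rows.append({
            "category": cat,
            "n": n,
            "mean_predicted": sum(p for p, _ in members) / n,
            "observed_rate": sum(y for _, y in members) / n,
        })
    return rows


# Hypothetical predictions and outcomes (1 = event)
pred = [0.02, 0.04, 0.08, 0.12, 0.15, 0.18, 0.25, 0.30, 0.40, 0.50]
outcome = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
rows = calibration_by_category(pred, outcome, cutoffs=[0.10, 0.20])
for row in rows:
    print(row)
```

If the observed rate in a bin sits far from the mean predicted risk, moving patients into or out of that bin says little about their actual risk, whatever the NRI reports.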

Decision consequences are missing

If the paper never says what changed in management, the reclassification exercise may be mostly a mathematical performance.

What Reviewers Should Demand Instead

| Question | Why it matters | Minimum acceptable answer |
| --- | --- | --- |
| Do the risk categories map to real decisions? | Without decision thresholds, reclassification can be numerically vivid and clinically empty. | A clear statement of what action changes across categories and why those cutoffs were chosen. |
| Are event and non-event NRI both reported? | The total can hide clinically asymmetric harm. | Both components, shown separately, with counts or proportions readers can audit. |
| Did calibration and external validation survive? | Reclassification without trustworthy risk estimates is a fragile improvement. | Calibration evidence and validation beyond the training sample. |
| Does the update improve decisions, not just categories? | Clinical utility lives downstream of management, not inside the formula. | Decision-curve evidence, treatment consequences, or another decision-relevant analysis. |
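For the last row, the quantity behind a decision curve is net benefit at a treatment threshold: true positives per patient minus false positives per patient, with false positives weighted by the odds of the threshold. The sketch below applies that standard formula to an invented cohort of 20 patients in which the updated model catches one more event but flags five more non-events; the data, the 20% threshold, and the variable names are all assumptions for illustration.

```python
def net_benefit(pred, outcome, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    n = len(pred)
    tp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y)
    fp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and not y)
    # False positives are weighted by the odds of the threshold
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical cohort: 2 events followed by 18 non-events
outcome = [1, 1] + [0] * 18
old_pred = [0.25, 0.15, 0.22, 0.21, 0.18, 0.19, 0.17, 0.18, 0.19] + [0.05] * 11
new_pred = [0.27, 0.22, 0.23, 0.22, 0.21, 0.22, 0.20, 0.21, 0.22] + [0.05] * 11

threshold = 0.20  # assumed treatment threshold
nb_old = net_benefit(old_pred, outcome, threshold)
nb_new = net_benefit(new_pred, outcome, threshold)
print(round(nb_old, 4), round(nb_new, 4))
```

In this toy cohort net benefit falls from about 0.025 to about 0.0125 after the update, even though an extra event now crosses the threshold. That is the kind of tradeoff a decision-relevant analysis exposes and a reclassification total does not.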

Practical Rules for Authors and Editors

  1. Start with the decision threshold, not the biomarker. If the threshold is arbitrary, the NRI inherits that weakness.
  2. Report the event and non-event components separately. A positive total can conceal overtreatment pressure.
  3. Pair NRI with calibration and decision analysis. Reclassification alone is not a license to claim clinical utility.
  4. Be suspicious of large improvements from tiny probability shifts across cutoffs. That is often a category artifact, not a clinical breakthrough.
  5. If you want a faster critique of a prediction paper, use tools that ask whether the thresholds, harms, and tradeoffs were ever defended. That is where Aqrab's review workflow is most useful, especially before a flattering metric hardens into a manuscript claim.

The Bottom Line

NRI is not worthless. It is just unusually vulnerable to being treated as a shortcut to clinical relevance. A model update can produce a respectable reclassification statistic while leaving the hard questions unanswered: Are the risks calibrated? Do the categories reflect decisions? Who gets moved upward unnecessarily? What actually changes in care?

If a paper cannot answer those cleanly, the right reaction is not awe. It is restraint. Prediction metrics should earn their interpretive glamour the same way treatments do: by showing that the movement matters where patients and clinicians live.
