Causal Inference · Clinical AI · Study Design

Prediction vs Causation: Why Your Best Risk Model Still Cannot Tell You What to Treat

May 10, 2026·15 min read

Anas H. Alzahrani, MD PhD MPH

Department of Preventive Medicine and Public Health

Faculty of Medicine, King Abdulaziz University

Clinical research has a recurring bad habit: build a strong risk model, achieve a handsome AUC, and then start talking as if the model discovered what clinicians should do. It did not. It discovered who tends to have the outcome.

Prediction and causation are related in the way maps and steering wheels are related. Both matter. Only one tells you where to turn.

The Core Distinction

Question type	What you are asking	What a good answer looks like
Prediction	Who is likely to experience the outcome?	Accurate ranking, calibration, external validity
Causation	What would happen if we changed treatment, policy, or exposure?	A defensible estimand under explicit intervention and bias assumptions

A prediction model can tell you that septic patients with lactate elevation, vasopressor use, and organ dysfunction are high risk. That is useful. It still does not tell you whether giving treatment A instead of treatment B will improve outcomes, because the variables that predict bad outcomes are often the same variables that make treatment choice confounded in the first place.

Why This Confusion Happens So Often

Because prediction feels operational. The model outputs a number. The number looks actionable. Dashboards adore it. Reviewers sometimes adore it too, especially if there is a confusion matrix nearby looking very competent.

The trap:

Variables associated with worse outcomes are not automatically levers whose modification improves outcomes. Sometimes they are causes. Sometimes they are consequences. Sometimes they are just excellent markers of how sick the patient already was.

In observational clinical data, treatment choice is usually entangled with prognosis, clinician judgment, access, contraindications, and timing. A model that predicts deterioration may simply be learning the same severity structure that drove clinicians to treat certain patients more aggressively. That is not policy guidance. That is confounding wearing machine learning cologne.
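A small simulation makes the problem concrete. The sketch below uses no real data: the severity prevalence, treatment probabilities, baseline risks, and the 25% relative risk reduction are all invented for illustration, and severity is the only confounder by construction. Under those assumptions, a treatment that truly helps looks harmful in the crude comparison, because sicker patients are treated more often.

```python
import random

def simulate(n=100_000, seed=0):
    """Simulate (severe, treated, died) for n hypothetical patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.30                          # assumed prevalence of severe illness
        treated = rng.random() < (0.80 if severe else 0.20)   # sicker patients are treated more often
        baseline = 0.40 if severe else 0.05                   # assumed untreated mortality risk
        risk_now = baseline * (0.75 if treated else 1.0)      # assumed true effect: 25% relative reduction
        died = rng.random() < risk_now
        rows.append((severe, treated, died))
    return rows

def risk(rows, treated, severe=None):
    """Observed mortality for a treatment group, optionally within one severity stratum."""
    group = [d for s, t, d in rows if t == treated and (severe is None or s == severe)]
    return sum(group) / len(group)

def naive_difference(rows):
    """Crude treated-minus-untreated risk difference, ignoring severity."""
    return risk(rows, True) - risk(rows, False)

def standardized_difference(rows):
    """Stratum-specific risk differences averaged over the severity distribution."""
    p_severe = sum(s for s, _, _ in rows) / len(rows)
    diff_severe = risk(rows, True, True) - risk(rows, False, True)
    diff_mild = risk(rows, True, False) - risk(rows, False, False)
    return p_severe * diff_severe + (1 - p_severe) * diff_mild

rows = simulate()
print(f"naive risk difference:        {naive_difference(rows):+.3f}")
print(f"standardized risk difference: {standardized_difference(rows):+.3f}")
```

With these invented parameters the population values are roughly +0.12 for the naive contrast and about -0.04 for the standardized one. Standardization recovers the right sign here only because the simulation contains a single, fully measured confounder; real clinical data rarely make that promise.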

A Simple Example: ICU Transfer Prediction Is Not a Treatment Policy

Suppose a hospital builds a model that predicts 48-hour ICU transfer from ward data. The model performs well. Patients flagged high risk are more likely to deteriorate.

Then comes the usual slide into trouble: “Therefore, the factors driving prediction can identify which patients benefit from early broad-spectrum antibiotics or aggressive monitoring.” Slow down.

• The model was trained to forecast risk, not estimate effects of alternative actions.
• High-risk patients may already be receiving more aggressive care.
• Some predictive features may be downstream of evolving disease severity or earlier treatment.
• The intervention itself may work differently across timing windows, units, or diagnostic subgroups.

The fact that risk is concentrated in one subgroup does not mean the treatment effect is concentrated there. Risk heterogeneity and treatment-effect heterogeneity are cousins, not twins.

Question triage

Decide whether you need prediction, causation, or two analyses pretending to be one.

Most bad methods sections start by asking the wrong question. The triage below shows the study logic you need when the aim is to choose between actions.

What this means

You are choosing between actions. That means you need an estimand tied to interventions, exchangeability assumptions, and a design that respects time zero.

Main warning

Do not let AUC cosplay as causal evidence. Treatment policy needs counterfactual reasoning, not just ranking patients by badness.

Method families to consider

• Target trial emulation
• G-methods
• Inverse probability weighting (IPW) / standardization
• Sensitivity analysis

Decision Rule: Ask “Predict What?” and “Intervene On What?”

If the outcome is something like mortality, readmission, or flare, a prediction model asks which patients are likely to experience it. A causal study asks what would happen under competing interventions: early steroid versus delayed steroid, intensive follow-up versus usual care, biologic A versus biologic B.

Prediction is appropriate when...

You need triage, prognosis, resource planning, or a validated risk score for future patients under roughly similar care patterns.

Causal inference is appropriate when...

You need to recommend an action, estimate treatment benefit or harm, compare strategies, or justify a clinical policy.

If your paper ends with “clinicians should intervene on patients with high predicted risk,” you owe readers an explicit argument for why the intervention changes outcomes, not just why the outcome is common in that group.

Common Failure Modes in AI-Flavored Clinical Papers

1. Feature importance dressed up as causal importance

SHAP values, permutation importance, and regression coefficients can show what the model relied on. They do not, by themselves, identify what will help if changed.
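To see why, consider a toy world in which a lab marker tracks illness severity but has no effect on the outcome. The sketch below is purely illustrative: the risk model, the noise level, and the idea of "forcing" the marker to a low value are assumptions made for the example, not a description of any real biomarker.

```python
import random

def simulate(n=50_000, seed=1, force_marker=None):
    """Return (marker, died) pairs; death depends on severity only, never on the marker."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()                       # latent illness severity in [0, 1]
        marker = severity + rng.gauss(0, 0.1)         # marker tracks severity with noise
        if force_marker is not None:
            marker = force_marker                     # "intervene": overwrite the marker value
        died = rng.random() < 0.05 + 0.5 * severity   # assumed risk model; marker plays no role
        rows.append((marker, died))
    return rows

def death_rate(rows):
    return sum(died for _, died in rows) / len(rows)

observed = simulate()
high = death_rate([r for r in observed if r[0] > 0.7])
low = death_rate([r for r in observed if r[0] < 0.3])
after_intervention = death_rate(simulate(force_marker=0.0))

print(f"mortality, high marker: {high:.3f}")
print(f"mortality, low marker:  {low:.3f}")
print(f"overall mortality, observed:          {death_rate(observed):.3f}")
print(f"overall mortality, marker forced low: {after_intervention:.3f}")
```

Any model trained on these data would lean heavily on the marker, and rightly so for prediction. Forcing the marker low for every patient leaves mortality unchanged, because the marker was never on the causal path.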

2. Post-treatment variables in the predictor set

Models often include downstream labs, treatment responses, or early complications. Great for prediction. Terrible if you later pretend the same variables define a baseline treatment strategy.

3. High-risk subgroup equals “treat here first”

The highest-risk patients may have the smallest absolute benefit if the disease is already far advanced, or the largest harm if the treatment is toxic. Risk alone does not settle that.
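The arithmetic is worth writing out. The numbers below are hypothetical, chosen only to show that a higher baseline risk does not guarantee a larger absolute benefit once the relative effect differs between strata.

```python
def absolute_risk_reduction(baseline_risk, relative_risk):
    """Absolute risk reduction = untreated risk minus treated risk."""
    return baseline_risk - baseline_risk * relative_risk

# Hypothetical strata: (label, untreated risk, relative risk under treatment)
strata = [
    ("moderate risk", 0.20, 0.75),
    ("highest risk", 0.60, 0.95),   # advanced disease, assumed to respond poorly
]

for label, baseline, rr in strata:
    arr = absolute_risk_reduction(baseline, rr)
    print(f"{label}: baseline {baseline:.2f}, ARR {arr:.3f}, NNT {1 / arr:.0f}")
```

Under these assumed inputs the moderate-risk stratum gains 5 percentage points and the highest-risk stratum gains 3, despite a baseline risk three times as high. The relative risks are inputs here, not findings; in a real study they are exactly what the causal analysis has to estimate.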

4. Temporal leakage that looks impressive until you notice the clock

If predictors incorporate information collected after a treatment decision point, the model may forecast well while being useless for real-time intervention.
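One practical guard is to build the feature set from a timestamped record and discard anything recorded after the decision point. The sketch below assumes a hypothetical `Observation` record and a hypothetical `features_at_decision` helper; the variable names, values, and times are made up, and a real EHR pipeline would also need to handle result-available times versus sample-collection times.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Observation:
    name: str
    value: float
    recorded_at: datetime

def features_at_decision(observations, decision_time):
    """Keep the latest value of each variable recorded at or before decision_time."""
    latest = {}
    for obs in sorted(observations, key=lambda o: o.recorded_at):
        if obs.recorded_at <= decision_time:
            latest[obs.name] = obs.value   # later pre-decision values overwrite earlier ones
    return latest

decision_time = datetime(2026, 1, 1, 8, 0)   # hypothetical treatment decision point
observations = [
    Observation("lactate", 2.1, datetime(2026, 1, 1, 6, 30)),
    Observation("lactate", 4.8, datetime(2026, 1, 1, 11, 0)),    # after the decision: excluded
    Observation("heart_rate", 112, datetime(2026, 1, 1, 7, 45)),
    Observation("creatinine", 1.9, datetime(2026, 1, 2, 6, 0)),  # next-day lab: excluded
]

print(features_at_decision(observations, decision_time))
```

Only the 06:30 lactate and the 07:45 heart rate survive the filter. A model trained without this cut would happily use the 11:00 lactate, and its validation metrics would look better for all the wrong reasons.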

Reviewer Red-Flag Table

If the paper says...	Ask immediately...	Why it matters
“The model identifies patients who would benefit from treatment.”	Was treatment effect actually estimated under a causal design?	Benefit claims require counterfactual comparisons, not just prediction accuracy.
“Important features reveal modifiable drivers.”	Could these variables be proxies, consequences, or colliders?	Association structure does not sort variables into causes versus markers.
“High-risk patients should be prioritized for intervention.”	What evidence shows treatment effect heterogeneity rather than just risk heterogeneity?	Prioritization is a policy claim, not a calibration metric.
“Real-world EHR model supports causal recommendations.”	Where are time zero, eligibility, intervention definition, confounding control, and sensitivity analysis?	Without design logic, “real-world” usually means “real-world confounding included at no extra charge.”

What to Do Instead

Start with the decision, not the model. If the real question is treatment choice, define the intervention, comparator, time zero, eligibility criteria, follow-up, outcome, and target estimand before you choose the algorithmic frosting.

• Separate prognostic and causal aims. Use one analysis for risk prediction and another for intervention effects when you genuinely need both.
• Map the causal structure. Draw the timeline and the DAG before you let software infer importance from convenience.
• Check whether the intervention is well-defined. “Better care” is not an intervention. “Start anticoagulation within 24 hours” is closer.
• Use causal methods that match the design problem. Standardization (g-computation), inverse probability weighting (IPW), target trial emulation, or sensitivity analyses may be needed; none of them is replaced by cross-validation.
• If you study heterogeneity, estimate heterogeneity of treatment effect. Risk strata can help, but only after the treatment contrast itself is credible.
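One lightweight way to enforce "decision first" is to write the design elements down as a structured record and refuse to proceed while any are blank. The sketch below is an assumption-laden illustration: `ProtocolSpec` and its `missing` method are hypothetical names, the field list simply mirrors the elements named above, and the 30-day mortality outcome is invented for the example.

```python
from dataclasses import dataclass, fields

@dataclass
class ProtocolSpec:
    """Design elements to fix before choosing any algorithm."""
    intervention: str = ""
    comparator: str = ""
    time_zero: str = ""
    eligibility: str = ""
    follow_up: str = ""
    outcome: str = ""
    estimand: str = ""

    def missing(self):
        """Names of design elements that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

draft = ProtocolSpec(
    intervention="Start anticoagulation within 24 hours",
    outcome="30-day mortality",   # hypothetical outcome chosen for illustration
)
print(draft.missing())
```

For this draft the check reports that comparator, time zero, eligibility, follow-up, and estimand are still undefined. That is typical of a project that started from the model rather than the decision.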

Where Aqrab Fits

This is exactly the kind of distinction Aqrab is built to pressure-test. Not “is there a model?” but “what question does the model answer, what assumptions does the paper smuggle in, and where does the methods section quietly swap prediction for intervention?”

If you are reviewing AI-assisted clinical research or drafting a protocol that needs cleaner causal logic, try Aqrab for a method critique pass, or explore the developer workflows if you want that scrutiny embedded upstream.

The Practical Bottom Line

A good prediction model tells you who is in trouble. A good causal analysis tells you what might help. Those are not interchangeable achievements.

Clinical research gets into trouble when it treats prediction as a shortcut to intervention. That shortcut usually runs straight through confounding, temporal leakage, and feature-importance theater.

So the next time a model with a glossy ROC curve starts making treatment recommendations, ask one impolite but necessary question: did this study estimate risk, or did it estimate what would happen if we acted differently? If the answer is the first one, keep the dashboard. Just do not mistake it for a policy trial in disguise.
