A Unified Suite of Agentic Benchmarks for Healthcare

Can AI agents do real clinical work?

MedCLI evaluates LLM agents on medically grounded tasks with explicit environments, tools, and instructions — spanning EHR workflows, medical imaging, time-series, and rare-disease reasoning. Every benchmark is packaged as a reproducible, leak-proofed agentic task.

7 benchmark families
8 frontier agents
Current leader: GPT-5.5

Leaderboard

Frontier LLM agents evaluated across the MedCLI suite, sorted by mean task-resolution score.

| # | Model (Agent) | Vendor | Resolved | Trial Match | CT | Data Quality | EHRSHOT | ETL | X-ray | Sweep $ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (Codex) | OpenAI | 0.53 | 0.78 | 0.37 | 0.00 | 0.72 | 1.00 | 0.33 | $15.3 |
| 2 | Claude Opus 4.8 (Claude Code) | Anthropic | 0.49 | 0.81 | 0.20 | 0.08 | 0.67 | 1.00 | 0.17 | $21.8 |
| 3 | Claude Opus 4.7 (Claude Code) | Anthropic | 0.45 | 0.44 | 0.20 | 0.17 | 0.78 | 1.00 | 0.13 | $26.0 |
| 4 | GPT-5.3 Codex (Codex) | OpenAI | 0.37 | 0.26 | 0.17 | 0.00 | 0.61 | 1.00 | 0.20 | $4.4 |
| 5 | GPT-5.4 (Codex) | OpenAI | 0.35 | 0.22 | 0.20 | 0.00 | 0.61 | 0.67 | 0.40 | $7.3 |
| 6 | Claude Sonnet 4.6 (Claude Code) | Anthropic | 0.35 | 0.37 | 0.00 | 0.00 | 0.61 | 1.00 | 0.10 | $19.0 |
| 7 | GPT-5.4 mini (Codex) | OpenAI | 0.25 | 0.22 | 0.13 | 0.00 | 0.39 | 0.67 | 0.10 | $3.6 |
| 8 | Claude Opus 4.6 (Claude Code) | Anthropic | 0.14 | 0.22 | 0.00 | 0.17 | 0.33 | 0.00 | 0.10 | $18.0 |

Score metric: mean reward, pooled across the 6 scored benchmarks, with 3 attempts per trial. Resolved is the mean task-resolution score across those 6 benchmarks; Tumor Area Selection (Pathology) reports F1 only and is not included. Sweep $ is the sum of the mean per-benchmark costs for one full pass.
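As a rough check on how Resolved relates to the per-benchmark columns, the sketch below recomputes it as the unweighted mean of the six scored benchmarks, using the values published in the leaderboard. The aggregation rule is an assumption rather than the suite's actual build code, and the names `SCORES`, `PUBLISHED_RESOLVED`, and `resolved` are illustrative. Under that assumption the recomputed values agree with the published column to two decimals.

```python
# Per-benchmark scores copied from the leaderboard, in column order:
# Trial Match, CT, Data Quality, EHRSHOT, ETL, X-ray.
SCORES = {
    "GPT-5.5":           [0.78, 0.37, 0.00, 0.72, 1.00, 0.33],
    "Claude Opus 4.8":   [0.81, 0.20, 0.08, 0.67, 1.00, 0.17],
    "Claude Opus 4.7":   [0.44, 0.20, 0.17, 0.78, 1.00, 0.13],
    "GPT-5.3 Codex":     [0.26, 0.17, 0.00, 0.61, 1.00, 0.20],
    "GPT-5.4":           [0.22, 0.20, 0.00, 0.61, 0.67, 0.40],
    "Claude Sonnet 4.6": [0.37, 0.00, 0.00, 0.61, 1.00, 0.10],
    "GPT-5.4 mini":      [0.22, 0.13, 0.00, 0.39, 0.67, 0.10],
    "Claude Opus 4.6":   [0.22, 0.00, 0.17, 0.33, 0.00, 0.10],
}

# The Resolved column as published.
PUBLISHED_RESOLVED = {
    "GPT-5.5": 0.53, "Claude Opus 4.8": 0.49, "Claude Opus 4.7": 0.45,
    "GPT-5.3 Codex": 0.37, "GPT-5.4": 0.35, "Claude Sonnet 4.6": 0.35,
    "GPT-5.4 mini": 0.25, "Claude Opus 4.6": 0.14,
}


def resolved(scores):
    """Unweighted mean over benchmarks (assumed aggregation rule)."""
    return sum(scores) / len(scores)


recomputed = {model: round(resolved(s), 2) for model, s in SCORES.items()}
leader = max(SCORES, key=lambda model: resolved(SCORES[model]))
```

Because each benchmark counts equally here, the single-trial ETL task carries the same weight as the 10-trial imaging tasks.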

Performance & Efficiency by Task

Each cell encodes three signals at once: color is the per-task resolution score, circle size is the average number of agent turns (a proxy for effort), and the green $ tier is the run cost. Text-centric tasks are largely tractable; imaging and quality-control tasks remain stubbornly hard.

[Figure: grid of tasks by model (agent). Tasks: xray report correction, tumor area selection pathology, ct abnormality, ehr data quality, ehr to meds etl, ehr event modelling, clinical trial matching. Models: GPT-5.5, GPT-5.3 Codex, GPT-5.4, GPT-5.4 mini (Codex); Opus 4.8, Opus 4.7, Sonnet 4.6, Opus 4.6 (Claude Code). Task modality key: image, text. Legend: color = task resolution score, 0.00 to 1.00 (higher is better); circle size = average turns used, 1 to 200 (smaller = more efficient); run cost tier (USD): $ < $1, $$ $1–3, $$$ $3–6, $$$$ ≥ $6.]
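The run-cost tiers in the legend amount to a simple bucketing rule. The sketch below is one way to express it; the function name `cost_tier` is hypothetical, and sending costs of exactly $1, $3, or $6 to the higher tier is an assumption, since the legend only pins down the outer bounds ("< $1" and "≥ $6").

```python
def cost_tier(usd: float) -> str:
    """Map a run cost in USD to the legend's tier label.

    Assumed half-open buckets: [0, 1) -> $, [1, 3) -> $$,
    [3, 6) -> $$$, [6, inf) -> $$$$.
    """
    if usd < 0:
        raise ValueError("cost must be non-negative")
    if usd < 1:
        return "$"
    if usd < 3:
        return "$$"
    if usd < 6:
        return "$$$"
    return "$$$$"
```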

About MedCLI

Existing healthcare benchmarks are often narrow in scope, inconsistent in setup, hard to reproduce, or outdated relative to modern agent abstractions. MedCLI is a suite of agentic healthcare benchmarks — newly created or redesigned for agent evaluation — that expose clear concepts of environment, tools, instructions, and task interfaces while demanding substantial medical competence. Each benchmark is packaged as a Harbor task with a standardized container environment, a declarative tool surface, agent-visible instructions, hidden test labels, and a deterministic verifier emitting both a binary pass/fail reward and richer diagnostic metrics.

Every task meets four bars: every reward-relevant behaviour is described in the instructions and checked by the verifier; the gold answer is never agent-visible; at least one obvious cheat path per task is audited and blocked or argued unprofitable; and verifier scripts are deterministic and reproducible across re-runs.
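To illustrate what a deterministic verifier with a binary reward plus diagnostic metrics might look like, here is a minimal sketch for a task whose answer is a binary label vector. It is not the Harbor verifier interface: the function name `verify`, the exact-match pass criterion, and the `label_accuracy` metric are all assumptions made for illustration.

```python
def verify(predicted: list[int], gold: list[int]) -> dict:
    """Score one prediction against hidden gold labels.

    Pure function of its inputs (no randomness, clock, or network),
    so repeated runs on the same inputs give the same result.
    """
    if not gold or len(predicted) != len(gold):
        # Malformed output fails outright instead of raising.
        return {"reward": 0.0, "metrics": {"error": "length mismatch"}}
    matches = sum(int(p == g) for p, g in zip(predicted, gold))
    return {
        # Binary reward: pass only if every label is correct (assumed rule).
        "reward": 1.0 if matches == len(gold) else 0.0,
        # Diagnostic metric: fraction of labels that match.
        "metrics": {"label_accuracy": matches / len(gold)},
    }


result = verify([1, 0, 1], [1, 0, 0])
```

In this example the agent gets two of three labels right, so the reward is 0.0 while the diagnostic metric still records partial progress.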

FAQ

What does the “Resolved” score mean?

It is the mean task-resolution score (mean reward) pooled across the scored benchmarks, with 3 attempts per trial. A benchmark's reward is binary per trial — the agent must get the task fully right — so the pooled score reflects end-to-end reliability rather than partial credit.
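A minimal sketch of the per-benchmark score, assuming each attempt yields a binary reward and the score is the fraction of passing attempts across all trials. The function name `benchmark_score` and the example rewards are hypothetical, not results from the suite.

```python
def benchmark_score(rewards_by_trial: dict[str, list[int]]) -> float:
    """Mean binary reward over every attempt of every trial."""
    attempts = [r for rewards in rewards_by_trial.values() for r in rewards]
    return sum(attempts) / len(attempts)


# Hypothetical rewards: 2 trials, 3 attempts each, 4 passes in total.
example = {"trial_a": [1, 1, 0], "trial_b": [0, 1, 1]}
score = benchmark_score(example)

# With a single trial and 3 attempts, only four scores are possible.
single_trial_scores = sorted(round(k / 3, 2) for k in range(4))
```

Under this reading, a one-trial benchmark can only score 0.00, 0.33, 0.67, or 1.00, which is consistent with the ETL column of the leaderboard.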

Why do imaging tasks score so low?

Text-centric tasks (EHR pipelines, trial retrieval) are largely tractable, with the best agents scoring 0.78 to 1.00. Imaging and quality-control tasks remain hard: the best score is 0.40 on X-ray report correction, 0.37 on chest CT, and 0.17 on data-quality auditing, and pathology has no pooled score at all. The bottleneck there is perceptual grounding, not search budget.

Why isn't Tumor Area Selection (Pathology) on the leaderboard?

That benchmark reports only per-class precision/recall/F1 and has no single pooled task-resolution reward, so it is listed among the benchmarks below but excluded from the pooled leaderboard average.
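For reference, per-class precision, recall, and F1 can be computed as in the sketch below. The function name `per_class_prf` and the class labels `"tumor"` and `"normal"` are illustrative assumptions; the benchmark's actual label set and scoring code are not described here.

```python
def per_class_prf(y_true: list[str], y_pred: list[str]) -> dict:
    """Per-class precision, recall, and F1 from paired labels."""
    out = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = {"precision": precision, "recall": recall, "f1": f1}
    return out


# Hypothetical tile labels: one tumor tile is missed.
metrics = per_class_prf(
    ["tumor", "tumor", "normal", "normal"],
    ["tumor", "normal", "normal", "normal"],
)
```

These per-class numbers do not collapse into a single pass/fail reward without an extra thresholding rule, which is why a benchmark scored this way has no natural place in a pooled average.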

How is the leaderboard generated?

The numbers come directly from the MedCLI paper's auto-generated baseline metrics. A build script pivots that data into the table at deploy time — the site is fully static and rebuilt on every push.
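The pivot step could look roughly like the sketch below. The input shape (one `(model, benchmark, score)` record per cell) and the function name `pivot` are assumptions; the real build script and its data format are not shown on this page.

```python
from collections import defaultdict


def pivot(records):
    """Turn long (model, benchmark, score) records into sorted table rows."""
    wide = defaultdict(dict)
    for model, benchmark, score in records:
        wide[model][benchmark] = score
    rows = []
    for model, scores in wide.items():
        mean = sum(scores.values()) / len(scores)
        rows.append({"model": model, **scores, "resolved": mean})
    # Highest mean score first, as on the leaderboard.
    rows.sort(key=lambda row: row["resolved"], reverse=True)
    return rows


# Two benchmarks only, so "resolved" here is not the published value.
rows = pivot([
    ("GPT-5.4 mini", "Trial Match", 0.22),
    ("GPT-5.4 mini", "ETL", 0.67),
    ("GPT-5.5", "Trial Match", 0.78),
    ("GPT-5.5", "ETL", 1.00),
])
```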

The Benchmark Suite

MedCLI is a composite of 7 benchmark families. Each is an individual agentic task tree.

xray_report_correction

Chest X-ray Report Correction

Imaging + text | Multimodal generation | 10 trials

Review a counterfactual draft Findings against the patient's imaging history and correct its clinically significant errors.

PhysioNet credentialed (MIMIC-CXR) | Best: 0.40

tumor_area_selection_pathology

Tumor Area Selection (Pathology)

Pathology WSI | Slide + tile prediction | 35 trials

Reason over gigapixel whole-slide images to decide tumor presence and select the set of tumor tiles on a fixed grid. Reports per-class F1 only.

Public (TCGA, CAMELYON16) | F1 only

ct_abnormality

Chest CT Abnormality

3D imaging | Visual finding detection | 10 trials

Interpret a volumetric non-contrast chest CT and emit a per-volume binary label vector of abnormality findings.

HuggingFace gated (CT-RATE) | Best: 0.37

ehr_data_quality

EHR Data-Quality Auditing

Tabular EHR | Data-quality auditing | 4 trials

Audit a structured EHR dataset and flag the rows that violate clinical plausibility, conformance, or concordance.

Public (MIMIC-IV demo + synthetic errors) | Best: 0.17

ehr_to_meds_etl

EHR → MEDS ETL

EHR (MEDS) | Pipeline customization | 1 trial

Inspect a real open-source ETL repository and produce a config that reshapes a standardized MEDS output cohort.

Public (MIMIC-IV demo) | Best: 1.00

ehrshot

EHRSHOT Event Modelling

Longitudinal EHR | Clinical event prediction | 6 trials

Learn a prediction rule from labelled longitudinal timelines and forecast new-onset diagnoses on a held-out test set, offline.

Redivis-gated (Stanford STARR) | Best: 0.78

clinical_trial_matching

Clinical Trial Matching

Text | Ranking / retrieval | 9 trials

Rank candidate clinical-trial NCT IDs by eligibility confidence for a patient admission note over a per-topic corpus.

Public (TREC-CT 2021) | Best: 0.81