MedCLI — Agentic Benchmarks for Healthcare

Leaderboard

Frontier LLM agents evaluated across the MedCLI suite, sorted by mean task-resolution score.

#	Model	Vendor	Resolved	Trial Match	CT	Data Quality	EHRSHOT	ETL	X-ray	Sweep $
1	GPT-5.5 Codex	OpenAI	0.53	0.78	0.37	0.00	0.72	1.00	0.33	$15.3
2	Claude Opus 4.8 Claude Code	Anthropic	0.49	0.81	0.20	0.08	0.67	1.00	0.17	$21.8
3	Claude Opus 4.7 Claude Code	Anthropic	0.45	0.44	0.20	0.17	0.78	1.00	0.13	$26.0
4	GPT-5.3 Codex Codex	OpenAI	0.37	0.26	0.17	0.00	0.61	1.00	0.20	$4.4
5	GPT-5.4 Codex	OpenAI	0.35	0.22	0.20	0.00	0.61	0.67	0.40	$7.3
6	Claude Sonnet 4.6 Claude Code	Anthropic	0.35	0.37	0.00	0.00	0.61	1.00	0.10	$19.0
7	GPT-5.4 mini Codex	OpenAI	0.25	0.22	0.13	0.00	0.39	0.67	0.10	$3.6
8	Claude Opus 4.6 Claude Code	Anthropic	0.14	0.22	0.00	0.17	0.33	0.00	0.10	$18.0

Score metric: mean reward (pooled across 6 scored benchmarks, 3 attempts per trial). Resolved is the unweighted mean of the six per-benchmark task-resolution scores. Sweep $ is the summed mean per-benchmark cost of one full pass.
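The Resolved column can be reproduced from the per-benchmark columns. The sketch below assumes Resolved is the unweighted mean of the six scored columns, which agrees with the rows of the table above to two decimals. The names `ROWS` and `resolved` are illustrative and are not taken from the MedCLI build script.

```python
from statistics import mean

# Per-benchmark scores copied from the leaderboard rows above, in column order:
# Trial Match, CT, Data Quality, EHRSHOT, ETL, X-ray.
ROWS = {
    "GPT-5.5 Codex": [0.78, 0.37, 0.00, 0.72, 1.00, 0.33],
    "Claude Opus 4.8 Claude Code": [0.81, 0.20, 0.08, 0.67, 1.00, 0.17],
    "Claude Opus 4.6 Claude Code": [0.22, 0.00, 0.17, 0.33, 0.00, 0.10],
}


def resolved(scores):
    """Unweighted mean of per-benchmark scores (assumed aggregation)."""
    return round(mean(scores), 2)


for model, scores in ROWS.items():
    print(f"{model}: {resolved(scores):.2f}")
```

Because the published per-benchmark values are themselves rounded, a recomputed mean could in principle differ from the published Resolved value by 0.01.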

About MedCLI

Existing healthcare benchmarks are often narrow in scope, inconsistent in setup, hard to reproduce, or outdated relative to modern agent abstractions. MedCLI is a suite of agentic healthcare benchmarks — newly created or redesigned for agent evaluation — that expose clear concepts of environment, tools, instructions, and task interfaces while demanding substantial medical competence. Each benchmark is packaged as a Harbor task with a standardized container environment, a declarative tool surface, agent-visible instructions, hidden test labels, and a deterministic verifier emitting both a binary pass/fail reward and richer diagnostic metrics.

Every task meets four bars: every reward-relevant behaviour is described in the instructions and checked by the verifier; the gold answer is never agent-visible; at least one obvious cheat path per task is audited and blocked or argued unprofitable; and verifier scripts are deterministic and reproducible across re-runs.

FAQ

What does the “Resolved” score mean?

It is the mean task-resolution score (mean reward) pooled across the scored benchmarks, with 3 attempts per trial. A benchmark's reward is binary per trial — the agent must get the task fully right — so the pooled score reflects end-to-end reliability rather than partial credit.
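As an illustration of how binary rewards become a fractional score, the sketch below pools pass/fail outcomes over every attempt of every trial in one benchmark. The outcomes are made up for the example and the name `benchmark_score` is hypothetical. With 6 trials and 3 attempts each there are 18 outcomes, so the score moves in steps of 1/18.

```python
def benchmark_score(rewards_by_trial):
    """Mean binary reward over all attempts of all trials in one benchmark."""
    attempts = [r for trial in rewards_by_trial for r in trial]
    if not attempts:
        raise ValueError("no attempts recorded")
    if any(r not in (0, 1) for r in attempts):
        raise ValueError("rewards must be binary (0 or 1)")
    return sum(attempts) / len(attempts)


# Hypothetical outcomes: 6 trials, 3 attempts each (1 = verifier passed).
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
]
print(f"{benchmark_score(example):.2f}")  # 14 passes out of 18 attempts
```

A trial that fails on every attempt pulls the score down by a full 3/18 here, which is why small benchmarks show coarse score steps.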

Why do imaging tasks score so low?

Text-centric tasks (EHR pipelines, trial retrieval) are largely tractable, with the best agents clearing 0.7–1.0. Imaging and quality-control tasks (chest X-ray, chest CT, pathology, data-quality auditing) remain stubbornly hard: the best leaderboard scores are 0.40 on X-ray, 0.37 on CT, and 0.17 on data quality, because the bottleneck is perceptual grounding, not search budget.

Why isn't Tumor Area Selection (Pathology) on the leaderboard?

That benchmark reports only per-class precision/recall/F1 and has no single pooled task-resolution reward, so it is listed among the benchmarks below but excluded from the pooled leaderboard average.
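For readers unfamiliar with these metrics, here is a minimal sketch of precision, recall, and F1 for a single class, computed over sets of selected tiles. The tile coordinates, the function name `precision_recall_f1`, and the choice to return 0.0 when a denominator is empty are assumptions made for illustration. The benchmark's actual verifier may define these cases differently.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one class of tiles."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Assumed convention: an empty denominator yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (row, col) tile coordinates on a fixed grid.
gold_tiles = {(0, 1), (0, 2), (1, 1), (1, 2)}
predicted_tiles = {(0, 1), (1, 1), (2, 2)}

p, r, f1 = precision_recall_f1(predicted_tiles, gold_tiles)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Three numbers per class do not collapse into one pass/fail reward without choosing a threshold, which is consistent with the benchmark being kept out of the pooled average.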

How is the leaderboard generated?

The numbers come directly from the MedCLI paper's auto-generated baseline metrics. A build script pivots that data into the table at deploy time — the site is fully static and rebuilt on every push.
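The pivot step might look roughly like the sketch below. The record layout `(model, benchmark, score)`, the function name `pivot`, and the in-memory input are all assumptions; the real build script and its input format are not described here. The scores are taken from two rows of the leaderboard.

```python
from collections import defaultdict

BENCHMARKS = ["Trial Match", "CT", "Data Quality", "EHRSHOT", "ETL", "X-ray"]


def pivot(records):
    """Turn (model, benchmark, score) records into rows sorted by Resolved."""
    by_model = defaultdict(dict)
    for model, benchmark, score in records:
        by_model[model][benchmark] = score
    rows = []
    for model, scores in by_model.items():
        # Raises KeyError if a model is missing a scored benchmark.
        values = [scores[b] for b in BENCHMARKS]
        rows.append({"model": model, "resolved": sum(values) / len(values), **scores})
    rows.sort(key=lambda row: row["resolved"], reverse=True)
    return rows


# Long-form records built from two leaderboard rows.
table = {
    "Claude Opus 4.6 Claude Code": [0.22, 0.00, 0.17, 0.33, 0.00, 0.10],
    "GPT-5.4 mini Codex": [0.22, 0.13, 0.00, 0.39, 0.67, 0.10],
}
records = [
    (model, benchmark, score)
    for model, scores in table.items()
    for benchmark, score in zip(BENCHMARKS, scores)
]

for rank, row in enumerate(pivot(records), start=1):
    print(rank, row["model"], f"{row['resolved']:.2f}")
```

Failing loudly on a missing benchmark, rather than averaging over whatever is present, keeps a partially evaluated model from appearing with an inflated or deflated mean.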

The Benchmark Suite

MedCLI is a composite of 7 benchmark families. Each is an individual agentic task tree.

xray_report_correction

Chest X-ray Report Correction

Imaging + text · Multimodal generation · 10 trials

Review a counterfactual draft Findings against the patient's imaging history and correct its clinically-significant errors.

PhysioNet credentialed (MIMIC-CXR) · best 0.40

tumor_area_selection_pathology

Tumor Area Selection (Pathology)

Pathology WSI · Slide + tile prediction · 35 trials

Reason over gigapixel whole-slide images to decide tumor presence and select the set of tumor tiles on a fixed grid. Reports per-class F1 only.

Public (TCGA, CAMELYON16) · F1 only

ct_abnormality

Chest CT Abnormality

3D imaging · Visual finding detection · 10 trials

Interpret a volumetric non-contrast chest CT and emit a per-volume binary label vector of abnormality findings.

HuggingFace gated (CT-RATE) · best 0.37

ehr_data_quality

EHR Data-Quality Auditing

Tabular EHR · Data-quality auditing · 4 trials

Audit a structured EHR dataset and flag the rows that violate clinical plausibility, conformance, or concordance.

Public (MIMIC-IV demo + synthetic errors) · best 0.17

ehr_to_meds_etl

EHR → MEDS ETL

EHR (MEDS) · Pipeline customization · 1 trial

Inspect a real open-source ETL repository and produce a config that reshapes a standardized MEDS output cohort.

Public (MIMIC-IV demo) · best 1.00

ehrshot

EHRSHOT Event Modelling

Longitudinal EHR · Clinical event prediction · 6 trials

Learn a prediction rule from labelled longitudinal timelines and forecast new-onset diagnoses on a held-out test set, offline.

Redivis-gated (Stanford STARR) · best 0.78

clinical_trial_matching

Clinical Trial Matching

Text · Ranking / retrieval · 9 trials

Rank candidate clinical-trial NCT IDs by eligibility confidence for a patient admission note over a per-topic corpus.

Public (TREC-CT 2021) · best 0.81

Can AI agents do real clinical work?