MedCLI evaluates LLM agents on medically grounded tasks with explicit environments, tools, and instructions — spanning EHR workflows, medical imaging, time-series, and rare-disease reasoning. Every benchmark is packaged as a reproducible, leak-proofed agentic task.
Frontier LLM agents evaluated across the MedCLI suite, sorted by mean task-resolution score.
| # | Model | Vendor | Resolved | Trial Match | CT | Data Quality | EHRSHOT | ETL | X-ray | Sweep $ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (Codex) | OpenAI | 0.53 | 0.78 | 0.37 | 0.00 | 0.72 | 1.00 | 0.33 | $15.3 |
| 2 | Claude Opus 4.8 (Claude Code) | Anthropic | 0.49 | 0.81 | 0.20 | 0.08 | 0.67 | 1.00 | 0.17 | $21.8 |
| 3 | Claude Opus 4.7 (Claude Code) | Anthropic | 0.45 | 0.44 | 0.20 | 0.17 | 0.78 | 1.00 | 0.13 | $26.0 |
| 4 | GPT-5.3 Codex (Codex) | OpenAI | 0.37 | 0.26 | 0.17 | 0.00 | 0.61 | 1.00 | 0.20 | $4.4 |
| 5 | GPT-5.4 (Codex) | OpenAI | 0.35 | 0.22 | 0.20 | 0.00 | 0.61 | 0.67 | 0.40 | $7.3 |
| 6 | Claude Sonnet 4.6 (Claude Code) | Anthropic | 0.35 | 0.37 | 0.00 | 0.00 | 0.61 | 1.00 | 0.10 | $19.0 |
| 7 | GPT-5.4 mini (Codex) | OpenAI | 0.25 | 0.22 | 0.13 | 0.00 | 0.39 | 0.67 | 0.10 | $3.6 |
| 8 | Claude Opus 4.6 (Claude Code) | Anthropic | 0.14 | 0.22 | 0.00 | 0.17 | 0.33 | 0.00 | 0.10 | $18.0 |
Score metric: mean reward (pooled across 6 scored benchmarks, 3 attempts per trial). Resolved is the mean task-resolution score across all benchmarks. Sweep $ is the summed mean per-benchmark cost of one full pass.
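The published Resolved values are consistent with an unweighted mean of the six per-benchmark columns. The sketch below checks this for the rank 1 and rank 8 rows. Whether the actual pipeline averages per benchmark or pools per task is not stated, so treat the `resolved` helper as an illustrative assumption, not the suite's scoring code.

```python
from statistics import mean

BENCHMARKS = ["Trial Match", "CT", "Data Quality", "EHRSHOT", "ETL", "X-ray"]

# Per-benchmark scores copied from the leaderboard, keyed by rank.
ROWS = {
    1: [0.78, 0.37, 0.00, 0.72, 1.00, 0.33],
    8: [0.22, 0.00, 0.17, 0.33, 0.00, 0.10],
}

def resolved(scores):
    """Unweighted mean of the per-benchmark scores, rounded to two decimals.

    Assumption: every benchmark carries equal weight.
    """
    if len(scores) != len(BENCHMARKS):
        raise ValueError("expected one score per scored benchmark")
    return round(mean(scores), 2)

for rank, scores in ROWS.items():
    print(rank, resolved(scores))
```

Both rows reproduce the Resolved column (0.53 and 0.14) under this equal-weight assumption.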
Each cell encodes three signals at once: color is the per-task resolution score, circle size is the average number of agent turns (a proxy for effort), and the green $ tier is the run cost. Text-centric tasks are largely tractable; imaging and quality-control tasks remain stubbornly hard.
Existing healthcare benchmarks are often narrow in scope, inconsistent in setup, hard to reproduce, or outdated relative to modern agent abstractions. MedCLI is a suite of agentic healthcare benchmarks — newly created or redesigned for agent evaluation — that expose clear concepts of environment, tools, instructions, and task interfaces while demanding substantial medical competence. Each benchmark is packaged as a Harbor task with a standardized container environment, a declarative tool surface, agent-visible instructions, hidden test labels, and a deterministic verifier emitting both a binary pass/fail reward and richer diagnostic metrics.
Every task meets four bars: every reward-relevant behaviour is described in the instructions and checked by the verifier; the gold answer is never agent-visible; at least one obvious cheat path per task is audited and blocked or argued unprofitable; and verifier scripts are deterministic and reproducible across re-runs.
Resolved is the mean task-resolution score (mean reward) pooled across the scored benchmarks, with 3 attempts per trial. Each trial's reward is binary (the agent must get the task fully right), so the pooled score reflects end-to-end reliability rather than partial credit.
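As a rough illustration of how binary per-trial rewards become a mean reward, the sketch below pools every attempt of every task and averages them. The task IDs and outcomes are hypothetical, and pooling at the attempt level is an assumption about the aggregation.

```python
from statistics import mean

# Hypothetical outcomes: task id -> binary reward for each of 3 attempts.
attempts = {
    "task-a": [1, 1, 1],  # solved every time
    "task-b": [1, 0, 0],  # solved once out of three
    "task-c": [0, 0, 0],  # never solved
}

def mean_reward(attempts_by_task):
    """Pool every attempt of every task and average the binary rewards."""
    pooled = [r for rewards in attempts_by_task.values() for r in rewards]
    return mean(pooled)

score = mean_reward(attempts)
print(f"{score:.2f}")
```

Because each reward is 0 or 1, a task solved once in three attempts contributes only a third of a fully reliable task, which is why the score tracks reliability rather than partial credit.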
Text-centric tasks (EHR pipelines, trial retrieval) are largely tractable, with the best agents scoring between 0.7 and 1.0. Imaging and quality-control tasks (chest CT, pathology, data-quality auditing) remain stubbornly hard: on the scored CT, X-ray, and Data Quality columns, no agent exceeds 0.40. The bottleneck is perceptual grounding, not search budget.
The whole-slide pathology benchmark reports only per-class precision/recall/F1 and has no single pooled task-resolution reward, so it is listed among the benchmarks below but excluded from the pooled leaderboard average.
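For readers unfamiliar with per-class metrics, here is a minimal sketch of how precision, recall, and F1 are computed for each class from paired labels. The class names and label lists are hypothetical, and this is not the benchmark's verifier.

```python
def per_class_prf(y_true, y_pred):
    """Return {class: (precision, recall, f1)} for paired label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    results = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        # Report 0.0 when a denominator is empty instead of dividing by zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[cls] = (precision, recall, f1)
    return results

# Hypothetical tile-level labels.
truth = ["tumor", "tumor", "normal", "normal", "normal"]
preds = ["tumor", "normal", "normal", "normal", "tumor"]
metrics = per_class_prf(truth, preds)
print(metrics)
```

These per-class triples do not collapse into one binary pass/fail reward without an extra thresholding rule, which illustrates why such a benchmark cannot be averaged into a pooled score directly.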
The numbers come directly from the MedCLI paper's auto-generated baseline metrics. A build script pivots that data into the table at deploy time — the site is fully static and rebuilt on every push.
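The actual build script is not shown here. A minimal sketch of such a pivot, assuming long-form records with hypothetical `model`, `benchmark`, and `score` fields, might look like this:

```python
from collections import defaultdict

# Hypothetical long-form metrics; the real schema is not documented here.
records = [
    {"model": "model-a", "benchmark": "ETL", "score": 1.00},
    {"model": "model-a", "benchmark": "CT", "score": 0.37},
    {"model": "model-b", "benchmark": "ETL", "score": 0.67},
    {"model": "model-b", "benchmark": "CT", "score": 0.20},
]

def pivot_to_markdown(rows, benchmarks):
    """Pivot long-form rows into a markdown table, best mean score first."""
    table = defaultdict(dict)
    for row in rows:
        table[row["model"]][row["benchmark"]] = row["score"]

    def mean_score(model):
        scores = list(table[model].values())
        return sum(scores) / len(scores)

    lines = [
        "| Model | " + " | ".join(benchmarks) + " |",
        "|" + "---|" * (len(benchmarks) + 1),
    ]
    for model in sorted(table, key=mean_score, reverse=True):
        cells = [
            f"{table[model][b]:.2f}" if b in table[model] else "n/a"
            for b in benchmarks
        ]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)

markdown = pivot_to_markdown(records, ["CT", "ETL"])
print(markdown)
```

Generating the table at build time keeps the page static: the numbers change only when the underlying metrics file changes and the site is rebuilt.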
MedCLI is a composite of 7 benchmark families. Each is an individual agentic task tree.
X-ray: Review a counterfactual draft Findings section against the patient's imaging history and correct its clinically significant errors.
Pathology (not on the leaderboard): Reason over gigapixel whole-slide images to decide tumor presence and select the set of tumor tiles on a fixed grid. Reports per-class F1 only.
CT: Interpret a volumetric non-contrast chest CT and emit a per-volume binary label vector of abnormality findings.
Data Quality: Audit a structured EHR dataset and flag the rows that violate clinical plausibility, conformance, or concordance.
ETL: Inspect a real open-source ETL repository and produce a config that reshapes a standardized MEDS output cohort.
EHRSHOT: Learn a prediction rule from labelled longitudinal timelines and forecast new-onset diagnoses on a held-out test set, offline.
Trial Match: Rank candidate clinical-trial NCT IDs by eligibility confidence for a patient admission note over a per-topic corpus.