AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation

Nima Fathi, Amar Kumar, Tal Arbel

McGill University • Mila – Quebec AI Institute

Preprint, 2025


Abstract

Agentic AI offers a path from single-task tools to integrated decision support in clinical imaging. We introduce AURA, a multi-modal medical agent that performs visual-linguistic explanation (VLE), report generation, grounding, VQA, segmentation, and counterfactual editing—while self-evaluating its outputs and autonomously selecting the right tool for each case. AURA coordinates modules such as RadEdit and PRISM to propose and test hypotheses through counterfactual simulation, aiming for faithful, interpretable, and clinically aligned reasoning.

Clinical teaser with VLE overlays
Teaser: AURA produces visual–linguistic explanations and counterfactuals to justify predictions across different tasks.

System Overview

AURA is a modular, multi-agent system. A Planner routes requests to specialist tools (Report, Grounding, Segmentation, Counterfactual), aggregates evidence, and produces a coherent explanation. A self-evaluation loop compares predictions against evidence maps and difference masks to detect failure modes and trigger refinement.

System overview: planner, tools, and feedback loop
Architecture: Planner + Tools (Report, Grounding, Segmentation, Counterfactual) with self-evaluation feedback.
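The control flow can be summarized as a small loop. The sketch below is purely illustrative: the planner, tool, and self-evaluation interfaces are hypothetical stand-ins for the actual modules, not the released implementation.

# Illustrative sketch of the plan -> act -> self-evaluate -> refine loop.
# The planner, tools, and self_evaluate callables are hypothetical interfaces.
def run_agent(request, image, planner, tools, self_evaluate, max_steps=5):
    """Route a request through specialist tools until self-evaluation passes."""
    evidence = []
    for _ in range(max_steps):
        tool_name, args = planner(request, image, evidence)  # choose the next specialist tool
        output = tools[tool_name](image, **args)              # Report, Grounding, Segmentation, or Counterfactual
        evidence.append((tool_name, output))
        passed, feedback = self_evaluate(request, evidence)   # check against evidence maps / difference masks
        if passed:
            break
        request = request + "\n[refine] " + feedback          # feed the detected failure mode back to the planner
    return evidence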

Visual–Linguistic Explanations (VLE)

AURA generates aligned text and spatial evidence overlays, justifying findings by highlighting the supporting regions and their uncertainty.

Examples of VLE overlays
Example VLEs on chest X-rays with explanatory captions.
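Rendering such an explanation amounts to pairing a generated sentence with a spatial evidence map. The following is a minimal sketch under simple assumptions (grayscale image, evidence map normalized to [0, 1]); it is illustrative only.

# Minimal sketch of rendering a visual-linguistic explanation; purely illustrative.
import numpy as np
import matplotlib.pyplot as plt

def show_vle(image: np.ndarray, evidence: np.ndarray, sentence: str) -> None:
    """Overlay a [0, 1] evidence map on a grayscale image, captioned with the explanation."""
    plt.imshow(image, cmap="gray")
    plt.imshow(evidence, cmap="jet", alpha=0.35)  # semi-transparent evidence heatmap
    plt.title(sentence, fontsize=9)
    plt.axis("off")
    plt.show()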

Results

Dataset & Implementation. We evaluate on the held-out CheXpert test split, matching the PRISM data split for a fair comparison. AURA runs as an inference-time agent (no fine-tuning), using Qwen2.5-Coder-32B-Instruct to orchestrate tools through programmatic function calls (SmolAgents). Experiments run on-premises on 2× NVIDIA A100 80 GB GPUs, with adaptive parallelization when multiple GPUs are available.
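As an illustration of this setup, the sketch below registers two hypothetical tools with a SmolAgents CodeAgent backed by Qwen2.5-Coder-32B-Instruct. The tool names, bodies, and prompt are placeholders, not the authors' implementation.

# Hypothetical sketch of SmolAgents-style tool orchestration; tool bodies are placeholders.
from smolagents import CodeAgent, TransformersModel, tool

@tool
def generate_report(image_path: str) -> str:
    """Generate a radiology report for a chest X-ray.

    Args:
        image_path: Path to the input image.
    """
    return "placeholder report"  # call the report-generation module here

@tool
def edit_counterfactual(image_path: str, instruction: str) -> str:
    """Produce a counterfactual edit of the image (e.g., via RadEdit or PRISM).

    Args:
        image_path: Path to the input image.
        instruction: Natural-language description of the desired edit.
    """
    return "placeholder path to edited image"  # call the editing module here

model = TransformersModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[generate_report, edit_counterfactual], model=model)
agent.run("Report the findings in study_001.png and show a counterfactual without cardiomegaly.")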

Counterfactual Editing Performance

We compare single-CF and ensemble baselines for RadEdit/PRISM against AURA's agent-driven generate→evaluate→select loop. Metrics: Subject Identity Preservation (SIP; mean L1 distance, lower is better), Counterfactual Prediction Gain (CPG; change in classifier score), Classifier Flip Rate (CFR; fraction of counterfactuals that flip the classifier decision), and SSIM (structural similarity to the original image).
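A minimal sketch of how these metrics could be computed per candidate, under one plausible formulation (the exact definitions and normalizations in the paper may differ):

# One plausible formulation of the evaluation metrics; definitions may differ from the paper.
import numpy as np
from skimage.metrics import structural_similarity

def sip(x: np.ndarray, x_cf: np.ndarray) -> float:
    """Subject Identity Preservation: mean L1 distance (lower preserves identity better)."""
    return float(np.mean(np.abs(x - x_cf)))

def cpg(p_orig: float, p_cf: float) -> float:
    """Counterfactual Prediction Gain: change in classifier score after the edit."""
    return p_cf - p_orig

def cfr(flipped: list[bool]) -> float:
    """Classifier Flip Rate: fraction of counterfactuals that flip the classifier decision."""
    return sum(flipped) / max(len(flipped), 1)

def ssim(x: np.ndarray, x_cf: np.ndarray) -> float:
    """Structural similarity between the original image and its counterfactual."""
    return float(structural_similarity(x, x_cf, data_range=float(x.max() - x.min())))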

Method             #CFs   CPG ↑   CFR ↑   SSIM ↑   SIP ↓
RadEdit              1    0.264   0.41    0.764    0.055
RadEdit-Ensemble     5    0.355   0.55    0.778    0.059
PRISM                1    0.418   0.67    0.648    0.081
PRISM-Ensemble       5    0.459   0.71    0.661    0.079
AURA (agent)         5    0.443   0.71    0.740    0.060

AURA internally explores RadEdit/PRISM settings, self-evaluates candidates, and selects the best CF without external post-processing.
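In pseudocode, this loop might look like the sketch below, which reuses the sip metric from the earlier sketch; the candidate generator, classifier, and the 0.5 identity weight are illustrative assumptions, not the paper's settings.

# Hypothetical generate -> evaluate -> select loop; generator, classifier, and weights are illustrative.
def select_counterfactual(image, target_label, generate_fn, classifier, n_candidates=5):
    """Generate several counterfactual candidates, score each, and keep the best."""
    p_orig = classifier(image)
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        cand = generate_fn(image, target_label)  # e.g. RadEdit or PRISM with varied settings
        gain = classifier(cand) - p_orig         # reward movement of the classifier score
        score = gain - 0.5 * sip(image, cand)    # penalize loss of subject identity (weight is illustrative)
        if score > best_score:
            best, best_score = cand, score
    return best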

Adaptive Explanations under Limited Knowledge

When inputs are ambiguous or underspecified, the fixed prompts used by editing tools often degrade output quality. AURA first gathers context (e.g., projected findings via the report and grounding tools), then proposes and evaluates multiple CFs to produce a faithful, clinically plausible explanation.

Qualitative examples under limited pathology knowledge
Qualitative examples (placeholders): AURA improves edits by adding context and self-evaluation before final selection.
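A compact sketch of this behavior, with hypothetical tool interfaces standing in for the report, grounding, and editing modules:

# Hypothetical sketch: enrich an underspecified request with report/grounding context
# before proposing counterfactuals. All function arguments are placeholder interfaces.
def explain_with_context(image, report_tool, grounding_tool, propose_cfs, evaluate):
    """Gather textual and spatial context, then propose and rank counterfactual explanations."""
    report = report_tool(image)                       # projected findings from the report tool
    regions = grounding_tool(image, report)           # spatial grounding of those findings
    candidates = propose_cfs(image, report, regions)  # context-conditioned counterfactual edits
    return max(candidates, key=evaluate)              # keep the highest-scoring explanation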

Components & Reasoning Loop

The Planner evaluates uncertainty and directs the next action (e.g., request segmentation or attempt an edit), then re-ranks candidate explanations through a scoring function combining coverage, faithfulness, and clinical plausibility.

Reasoning loop / pipeline diagram
Reasoning loop with iterative refinement and self-evaluation triggers.
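One way such a scoring function could combine the three criteria is a weighted sum, as in this illustrative sketch; the component scorers and weights are assumptions, not values from the paper.

# Illustrative re-ranking sketch; the component scorers and weights are hypothetical.
def rank_explanations(candidates, coverage_fn, faithfulness_fn, plausibility_fn,
                      weights=(0.4, 0.4, 0.2)):
    """Re-rank candidate explanations by weighted coverage, faithfulness, and plausibility."""
    w_cov, w_faith, w_plaus = weights
    def score(c):
        return w_cov * coverage_fn(c) + w_faith * faithfulness_fn(c) + w_plaus * plausibility_fn(c)
    return sorted(candidates, key=score, reverse=True)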

Citation

@article{fathi2025aura,
  title   = {AURA: A Multi-Modal Medical Agent for Understanding, Reasoning \& Annotation},
  author  = {Fathi, Nima and Kumar, Amar and Arbel, Tal},
  journal = {arXiv preprint arXiv:2507.16940},
  year    = {2025}
}