AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation
McGill University • Mila – Quebec AI Institute
Preprint, 2025
Abstract
Agentic AI offers a path from single-task tools to integrated decision support in clinical imaging. We introduce AURA, a multi-modal medical agent that performs visual-linguistic explanation (VLE), report generation, grounding, visual question answering (VQA), segmentation, and counterfactual editing, while self-evaluating its outputs and autonomously selecting the right tool for each case. AURA coordinates modules such as RadEdit and PRISM to propose and test hypotheses through counterfactual simulation, aiming for faithful, interpretable, and clinically aligned reasoning.
System Overview
AURA is a modular, multi-agent system. A Planner routes requests to specialist tools (Report, Grounding, Segmentation, Counterfactual), aggregates evidence, and produces a coherent explanation. A self-evaluation loop compares predictions against evidence maps and difference masks to detect failure modes and trigger refinement.
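As a concrete illustration, the dispatch-and-refine pattern can be sketched as below. The tool set, scoring threshold, and routing policy are hypothetical placeholders, not AURA's implementation; in AURA the routing decision is made by the LLM planner itself.

```python
# Minimal sketch of a planner's dispatch-and-refine loop (hypothetical).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    source: str   # which tool produced it
    payload: str  # report text, mask path, etc.
    score: float  # self-evaluation score in [0, 1]

@dataclass
class Planner:
    tools: dict[str, Callable]  # name -> tool(image) -> Evidence

    def run(self, image, max_rounds: int = 4, threshold: float = 0.8) -> list[Evidence]:
        evidence: list[Evidence] = []
        for _ in range(max_rounds):
            name = self.route(evidence)               # choose the next specialist tool
            evidence.append(self.tools[name](image))
            if all(e.score >= threshold for e in evidence):
                break                                 # every output passed self-evaluation
        return evidence

    def route(self, evidence: list) -> str:
        # Toy policy: draft a report first, then ground its claims.
        return "report" if not evidence else "grounding"
```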
Visual–Linguistic Explanations (VLE)
AURA generates text aligned with spatial evidence overlays, justifying each finding by highlighting the supporting image regions and their associated uncertainty.
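A minimal rendering sketch (illustrative, not AURA's code): blend an evidence map over a grayscale image, masking low-evidence pixels so only supporting regions are tinted.

```python
# Illustrative overlay of a spatial-evidence map on a grayscale image.
import matplotlib.pyplot as plt
import numpy as np

def show_vle(image: np.ndarray, evidence: np.ndarray, caption: str) -> None:
    """image: HxW grayscale in [0, 1]; evidence: HxW map in [0, 1]."""
    fig, ax = plt.subplots()
    ax.imshow(image, cmap="gray")
    # Hide pixels with weak evidence so only supporting regions are highlighted.
    ax.imshow(np.ma.masked_where(evidence < 0.2, evidence), cmap="jet", alpha=0.4)
    ax.set_title(caption)
    ax.axis("off")
    plt.show()
```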
Results
Dataset & Implementation. We evaluate on the held-out CheXpert test split, matching the PRISM data split for a fair comparison. AURA runs as an inference-time agent with no fine-tuning, using Qwen2.5-Coder-32B-Instruct to orchestrate tools through programmatic function calls in the SmolAgents framework. Experiments run on-premises on 2× NVIDIA A100 80GB GPUs, with adaptive parallelization when multiple GPUs are available.
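For concreteness, the orchestration pattern looks roughly like the sketch below. It follows smolagents' public CodeAgent/tool API (class names vary slightly across library versions), and the tool body is a placeholder rather than AURA's actual report model.

```python
# Sketch of LLM-driven tool orchestration with smolagents (placeholder tool).
from smolagents import CodeAgent, HfApiModel, tool

@tool
def generate_report(image_path: str) -> str:
    """Generate a draft radiology report for a chest X-ray.

    Args:
        image_path: Path to the input image on disk.
    """
    return "No acute cardiopulmonary abnormality."  # stand-in for the real model

agent = CodeAgent(
    tools=[generate_report],  # AURA also registers grounding, segmentation, CF tools
    model=HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct"),
)
print(agent.run("Report the findings in images/chest_001.png"))
```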
Counterfactual Editing Performance
We compare single-counterfactual (CF) and five-member ensemble baselines for RadEdit and PRISM against AURA's agent-driven generate→evaluate→select loop. Metrics: Counterfactual Prediction Gain (CPG; change in the target classifier's score, higher is better), Classifier Flip Rate (CFR; fraction of cases where the classifier's decision flips, higher is better), SSIM (structural similarity to the input, higher is better), and Subject Identity Preservation (SIP; mean L1 distance to the input, lower is better).
| Method | #CFs | CPG ↑ | CFR ↑ | SSIM ↑ | SIP ↓ |
|---|---|---|---|---|---|
| RadEdit | 1 | 0.264 | 0.41 | 0.764 | 0.055 |
| RadEdit-Ensemble | 5 | 0.355 | 0.55 | 0.778 | 0.059 |
| PRISM | 1 | 0.418 | 0.67 | 0.648 | 0.081 |
| PRISM-Ensemble | 5 | 0.459 | 0.71 | 0.661 | 0.079 |
| AURA (agent) | 5 | 0.443 | 0.71 | 0.740 | 0.060 |
AURA internally explores RadEdit/PRISM settings, self-evaluates candidates, and selects the best CF without external post-processing.
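A sketch of this loop under simplifying assumptions: the editor and classifier interfaces are hypothetical stand-ins for RadEdit/PRISM and the CheXpert classifier, the CPG sign assumes the edit removes the pathology, and the composite selection score is an illustrative weighting, not AURA's.

```python
# Generate -> evaluate -> select over counterfactual (CF) candidates.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate_cf(x: np.ndarray, cf: np.ndarray, clf) -> dict:
    """x, cf: HxW images in [0, 1]; clf(image) -> pathology probability."""
    return {
        "sip": float(np.abs(cf - x).mean()),                # L1 distance (lower better)
        "cpg": float(clf(x) - clf(cf)),                     # score drop after removal edit
        "cfr": float((clf(cf) >= 0.5) != (clf(x) >= 0.5)),  # did the decision flip?
        "ssim": float(ssim(x, cf, data_range=1.0)),
    }

def select_best(x: np.ndarray, editors: list, clf, per_editor: int = 3):
    """Generate candidates with each editor setting, score them, keep the best."""
    best, best_score = None, -np.inf
    for edit in editors:                            # e.g. RadEdit/PRISM, varied settings
        for _ in range(per_editor):
            cf = edit(x)
            m = evaluate_cf(x, cf, clf)
            score = m["cpg"] + m["ssim"] - m["sip"]  # illustrative efficacy/similarity trade-off
            if score > best_score:
                best, best_score = (cf, m), score
    return best
```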
Adaptive Explanations under Limited Knowledge
When inputs are ambiguous or underspecified, editing tools driven by fixed prompts often produce degraded outputs. AURA instead first gathers context (e.g., findings surfaced by its report and grounding tools), then proposes and evaluates multiple CFs to produce a faithful, clinically plausible explanation.
Components & Reasoning Loop
The Planner evaluates uncertainty and directs the next action (e.g., request segmentation or attempt an edit), then re-ranks candidate explanations through a scoring function combining coverage, faithfulness, and clinical plausibility.
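One plausible form of that re-ranking score, with illustrative weights and scorer signatures (the actual components and weights are assumptions, not taken from the paper):

```python
# Hypothetical re-ranking of candidate explanations by a weighted score.
from typing import Callable

def rank_explanations(
    candidates: list,
    coverage: Callable,      # fraction of findings addressed, in [0, 1]
    faithfulness: Callable,  # agreement with evidence maps / difference masks
    plausibility: Callable,  # clinical-plausibility score from a critic model
    weights: tuple[float, float, float] = (0.4, 0.4, 0.2),
) -> list:
    """Return candidates sorted by a weighted combination of the three criteria."""
    w_c, w_f, w_p = weights
    def score(e):
        return w_c * coverage(e) + w_f * faithfulness(e) + w_p * plausibility(e)
    return sorted(candidates, key=score, reverse=True)
```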
Citation
@article{fathi2025aura,
  title   = {AURA: A Multi-Modal Medical Agent for Understanding, Reasoning \& Annotation},
  author  = {Fathi, Nima and Kumar, Amar and Arbel, Tal},
  journal = {arXiv preprint arXiv:2507.16940},
  year    = {2025}
}