The Scan That Reads Itself

Key Facts

Prima published in Nature Biomedical Engineering, Feb 6, 2026 (DOI 10.1038/s41551-025-01608-0).
Trained on UM220K: 221,147 brain MRI studies from 170,000+ patients over two decades at U-M Health.
Mean diagnostic AUROC: 92.0% across 52 neurological conditions (range 76.4%–98.5%).
Outperformed OpenAI CLIP, Microsoft LLaVA, BioMedCLIP, and Med-Flamingo by more than 30% in diagnostic AUROC.
No FDA clearance, no clinical deployment; code and weights open-source on GitHub.

01 — Context: The Queue

On any given Friday evening at a rural hospital in Michigan, a brain MRI sits in a queue. The radiologist who will read it is not on site. According to the research team behind Prima, rural patients are two to five times more likely than urban patients to wait a full week for MRI results (P<0.001). Scans ordered on weekends face twice the likelihood of a three-day-or-longer delay (P<0.001). The bottleneck is not the machine. It is the human who reads the output.

This bottleneck is structural. According to a 2025 analysis in npj Health Systems, over 1,000 AI- and machine-learning-enabled medical devices have received FDA marketing authorization, with radiology accounting for approximately 76% of all clearances. But nearly all are narrow tools: single-task classifiers trained to detect one lesion type, flag one condition, or measure one structure. They answer specific questions. They do not read a scan the way a radiologist does: integrating sequences, weighing clinical context, producing a differential diagnosis across dozens of possible conditions. No foundation model had been validated in the peer-reviewed literature for comprehensive brain MRI interpretation.

Prior AI work in MRI has addressed different parts of the problem. FastMRI, a collaboration between Meta AI (then Facebook AI Research) and NYU Langone launched in 2018, used deep learning to reconstruct MRI images from 75% less raw data, making scans faster to acquire. That shortened the time a patient spends in the machine. It did not shorten the time the scan spends waiting to be read. The queue that matters most is not before the scan. It is after.

02 — Discovery: The Archive Becomes the Teacher

In February 2026, a team of twenty researchers at the University of Michigan published a model called Prima in Nature Biomedical Engineering (DOI: 10.1038/s41551-025-01608-0). Prima is a vision-language model that processes brain MRI studies and generates diagnostic assessments across 52 neurological conditions, including strokes, hemorrhages, tumors, dementia indicators, inflammatory and infectious diseases, and developmental malformations. It does this in under three seconds on a single GPU.

The model's distinguishing feature is not its architecture. It is its training data. Prima was built on the UM220K Neuroimaging Dataset: 221,147 MRI studies from more than 170,000 patients, comprising 5.6 million MRI sequences, 362 million individual slices, and 3.2 billion volume tokens. This is not internet-scraped data or a curated benchmark set. It is two decades of digitized radiology from a single health system — the brain scans that University of Michigan Health performed and archived between 2004 and 2024, each paired with the radiologist's report.

UM220K Neuroimaging Dataset (University of Michigan Health, 2004–2024)

Scale
MRI studies indexed: 221,147
Unique patients: 170,000+
MRI sequences: 5,600,000
Individual slices: 362,000,000
Volume tokens: 3,200,000,000

Training Pipeline
Paired with: 221,147 radiologist reports
Report summarization: GPT-3.5
Automated labeling: GPT-4 (94.0% accuracy)

Compute
Training time: 50 days
Hardware: 1 HPC node, 8× NVIDIA L40S
Model parameters: 56.6M
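The scale figures above imply some rough per-study averages that make the dataset easier to picture. The short sketch below only divides the article's own rounded totals; the variable names are illustrative, and because the inputs are rounded, the results should be read as approximate.

```python
# Figures quoted in the article for the UM220K dataset (rounded as published).
STUDIES = 221_147
SEQUENCES = 5_600_000
SLICES = 362_000_000
VOLUME_TOKENS = 3_200_000_000


def per_unit(total: int, units: int) -> float:
    """Average number of `total` items per one of `units`."""
    return total / units


sequences_per_study = per_unit(SEQUENCES, STUDIES)
slices_per_sequence = per_unit(SLICES, SEQUENCES)
tokens_per_study = per_unit(VOLUME_TOKENS, STUDIES)

print(f"sequences per study:     {sequences_per_study:.1f}")
print(f"slices per sequence:     {slices_per_sequence:.1f}")
print(f"volume tokens per study: {tokens_per_study:,.0f}")
```

On these numbers a typical study holds roughly 25 sequences of roughly 65 slices each, which is why a model has to summarize many sequences into one study-level reading.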

The training approach used a CLIP-style objective: aligning 3D MRI volumes with summarized radiology reports so the model learns to connect visual patterns with diagnostic language. The radiologists' reports were the labels. Michigan did not hire a team of annotators. It used the diagnostic expertise already embedded in twenty years of clinical records. GPT-3.5 stripped non-diagnostic information from the reports, leaving clean diagnostic summaries. GPT-4 handled automated labeling, achieving 94.0% annotation accuracy — comparable, according to the authors, to expert human annotators.
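To make "CLIP-style objective" concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of matched scan and report embeddings. It is not Prima's training code: the function names, the two-dimensional toy embeddings, and the temperature of 0.07 (the starting value used in the original CLIP work) are all assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched (image, text) pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled similarity between scan i and report j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # report-to-scan direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Hypothetical toy batch: two scans and two report summaries.
mri = [[1.0, 0.0], [0.0, 1.0]]
reports_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each report sits near its own scan
reports_swapped = [[0.1, 0.9], [0.9, 0.1]]   # each report sits near the other scan

aligned = clip_style_loss(mri, reports_aligned)
swapped = clip_style_loss(mri, reports_swapped)
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

The loss is small when each scan's embedding is closest to its own report and large when the pairing is wrong, which is the pressure that teaches the encoder to connect visual patterns with diagnostic language.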

The architecture itself is a hierarchical vision transformer that processes MRI data through three stages: volume tokenization, sequence-level learning, and study-level learning. The visual encoder has 56.6 million parameters. Training required 50 days on a single high-performance computing node with eight NVIDIA L40S GPUs — substantial, but within reach of any major academic medical center with its own HPC infrastructure.
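The three stages can be pictured as a shape-reducing pipeline: each volume becomes a list of tokens, each sequence becomes one vector, and each study becomes one vector. The sketch below is a deliberately simplified stand-in, assuming non-overlapping cubic patches and plain mean pooling where Prima uses learned transformer layers; the patch size and all function names are hypothetical.

```python
def tokenize_volume(volume, patch=2):
    """Stage 1: cut a [depth][height][width] volume into flattened cubic patches."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    tokens = []
    for z in range(0, depth - patch + 1, patch):
        for y in range(0, height - patch + 1, patch):
            for x in range(0, width - patch + 1, patch):
                tokens.append([
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(patch)
                    for dy in range(patch)
                    for dx in range(patch)
                ])
    return tokens


def mean_pool(vectors):
    """Average equal-length vectors element-wise (stand-in for a learned encoder)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode_study(study, patch=2):
    """Stages 2 and 3: one embedding per sequence, then one embedding per study."""
    sequence_embs = [mean_pool(tokenize_volume(vol, patch)) for vol in study]
    return mean_pool(sequence_embs)


def constant_volume(depth, height, width, value):
    """Build a toy volume filled with a single intensity value."""
    return [[[value] * width for _ in range(height)] for _ in range(depth)]


# Two hypothetical sequences with different shapes in the same study.
seq_a = constant_volume(4, 4, 4, 1.0)
seq_b = constant_volume(2, 4, 6, 3.0)
study_embedding = encode_study([seq_a, seq_b])

print(len(tokenize_volume(seq_a)), len(tokenize_volume(seq_b)))
print(study_embedding)
```

The point of the hierarchy is visible even in this toy: sequences of different sizes yield different token counts, yet the study still collapses to a single fixed-length vector that a diagnostic head could consume.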

The team spans five University of Michigan departments — Computer Science, Neurosurgery, Radiology, Neurology, and Computational Medicine — plus the University of Cologne's neurosurgery department. The senior author is Todd Hollon, M.D., a neurosurgeon at U-M Health who leads the Machine Learning in Neurosurgery Lab. The co-first authors are Yiwei Lyu, a postdoctoral fellow in Computer Science and Engineering, and Samir Harake, a data scientist in Hollon's lab. Funding came from the National Institute of Neurological Disorders and Stroke, the Chan Zuckerberg Initiative, the Frankel Institute for Heart and Brain Health, and several university-affiliated foundations.

03 — The Mechanism: Why Institutional Training Wins

According to the paper, Prima achieved a mean diagnostic AUROC of 92.0% across 52 radiologic diagnoses spanning neoplastic, inflammatory, infectious, and developmental conditions. AUROC — area under the receiver operating characteristic curve — measures a model's ability to distinguish between positive and negative cases across all classification thresholds. It is not a simple "accuracy" percentage. The full range of Prima's performance stretched from 76.4% for arachnoid cysts to 98.5% for Dandy-Walker malformations. Press coverage of the paper frequently led with a figure of "97.5% accuracy" or "up to 97.5%," drawn from the high end of individual diagnoses. The paper itself uses 92.0% as its headline metric. The low end — 76.4% — appeared in none of the press coverage reviewed for this story.
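The difference between AUROC and accuracy is easy to see in a small example. The sketch below computes AUROC directly from its pairwise definition, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores are invented for illustration and have no connection to the paper's data.

```python
def auroc(labels, scores):
    """Probability that a random positive case outscores a random negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at one fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical rare condition: 1 positive case among 10 studies.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0.0] * 10                                     # never flags anything
ranked = [0.8, 0.1, 0.2, 0.3, 0.9, 0.4, 0.1, 0.2, 0.3, 0.4]      # one false alarm

for name, scores in [("always negative", always_negative), ("ranked", ranked)]:
    print(f"{name}: accuracy={accuracy(labels, scores):.2f}, "
          f"AUROC={auroc(labels, scores):.3f}")
```

Both toy models are right on nine of ten studies, but the one that never flags anything has an AUROC of 0.5, no better than chance at ranking. That is why a mean AUROC and a headline "accuracy" figure should not be read as the same kind of number.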

The comparison that defines the paper's contribution is not the absolute number. It is the gap. According to the authors, Prima outperformed every baseline model tested by more than 30% in diagnostic AUROC. The baselines were not weak: they included OpenAI CLIP (base and large variants), Microsoft LLaVA (1.5-7B, Mistral, and Med variants), PubMedCLIP, BioMedCLIP, and Med-Flamingo. These represent the best available general-purpose and medical vision-language models. All were tested on the same diagnostic task. None came close. The gap is the thesis in numbers: a model trained on one institution's own diagnostic archive outperformed every model trained on the open internet or broad medical corpora by a wide margin.

Prima Diagnostic Performance

Mean AUROC across 52 diagnoses: 92.0%
Dandy-Walker malformations: 98.5%
Arachnoid cysts: 76.4% (never mentioned in press coverage)
Press headline: "97.5% accuracy"

Comparison (Same Task, Same Test)
OpenAI CLIP (large): < 62%
Microsoft LLaVA (Med): < 62%
BioMedCLIP: < 62%
PubMedCLIP: < 62%
Med-Flamingo: < 62%

Prima outperformed all general-purpose models by >30%

The prospective validation is unusually rigorous. The authors tested Prima on 29,435 MRI studies conducted over one year at Michigan Medicine — exceeding their calculated minimum sample size of 22,338. This is not a retrospective analysis of selected cases. It is every brain MRI that passed through one health system for twelve consecutive months.

Prima also demonstrated clinical utility beyond raw diagnosis. According to the paper, the model achieved an AUROC of 85.1% for neurosurgery referral recommendations and 89.1% for neurology referrals. For worklist prioritization (deciding which scans a radiologist should read first) it produced a Pearson correlation of 0.69. Inference takes under three seconds per study on a single GPU, with a throughput of approximately 6.5 studies per second. Together, these results suggest a system that could not only identify conditions but also triage which scans need human attention most urgently. Against the rural and weekend wait times established above, this is the payoff the research is aiming for.
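A worklist-prioritization score of this kind can be sketched in a few lines. The example assumes the correlation compares a model urgency score against some reference urgency rating for the same studies; the article does not spell out the two variables, so the field names, study IDs, and numbers below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def prioritize(worklist):
    """Order studies so the highest model urgency score is read first."""
    return sorted(worklist, key=lambda item: item["urgency"], reverse=True)


# Hypothetical queue: model urgency scores and a reference rating from 1 to 5.
worklist = [
    {"study_id": "S1", "urgency": 0.91, "reference": 5},
    {"study_id": "S2", "urgency": 0.12, "reference": 1},
    {"study_id": "S3", "urgency": 0.47, "reference": 2},
    {"study_id": "S4", "urgency": 0.66, "reference": 4},
]

r = pearson([w["urgency"] for w in worklist], [w["reference"] for w in worklist])
reading_order = [w["study_id"] for w in prioritize(worklist)]

print(f"correlation with reference rating: {r:.2f}")
print("reading order:", reading_order)
```

A correlation of 1.0 would mean the model's scores track the reference rating perfectly; the reported 0.69 indicates a substantial but imperfect agreement, which fits a tool meant to reorder a queue for human readers, not replace them.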

04 — The Limitation: One Hospital's Memory

Prima was trained entirely on data from a single institution. The UM220K dataset reflects University of Michigan Health's patients, scanning protocols, equipment manufacturers (Siemens, Philips, GE), reporting conventions, and disease prevalence. Prima learned to read MRIs the way Michigan's radiologists read them — because it learned from Michigan's radiologists. This is both the source of its strength and the central unresolved question.

Validated
Training data: 221,147 studies (U-M Health)
Prospective test: 29,435 studies (U-M Health, 1 yr)
Fairness (within U-M): No significant differences by race, sex, manufacturer, insurance
External (BraTS glioma): 98.0% Top-3 accuracy
External (UCSF/NYU metastasis): On par with prospective
External (acute stroke): On par with prospective

Unknown
Training data from: 0 other institutions
Prospective test at: 0 other hospitals
Community hospital protocols: Not tested
Disease prevalence: Reflects Michigan only
Out-of-distribution conditions: Not addressed
FDA status: No clearance. No submission disclosed.

Within U-M Health's patient population, the fairness results are strong. According to the paper, Prima showed no statistically significant performance differences across race, sex, MRI manufacturer, or insurance status. The one exception was age: performance differed for patients aged 34 to 61, a disparity the authors attribute to diagnostic diversity in that cohort. But these fairness results demonstrate algorithmic equity only within Michigan's population. They do not establish it universally. The external validation offers partial reassurance: 98.0% Top-3 accuracy on BraTS glioma region selection, performance on par with prospective results on UCSF and NYU metastasis datasets, and competitive acute stroke detection. Transfer learning applications showed promise too: brain age estimation with 5.6 years mean absolute error, autism spectrum prediction matching independent supervised benchmarks, and dementia and Alzheimer's prediction competitive with ADNI/OASIS baselines. These are encouraging. They are not a substitute for the core generalizability test: how does Prima perform on routine scans from a community hospital with different equipment, different protocols, and a different patient population?
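Two of the metrics in this paragraph, Top-3 accuracy and mean absolute error, are simple to state precisely. The sketch below shows the standard definitions on invented data; the region names and ages are hypothetical and are not taken from BraTS or from the paper.

```python
def top_k_accuracy(ranked_predictions, truths, k=3):
    """Fraction of cases whose true label appears among the top k ranked guesses."""
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths)
               if truth in preds[:k])
    return hits / len(truths)


def mean_absolute_error(predicted, actual):
    """Average size of the prediction error, ignoring its direction."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical ranked region guesses for three cases, best guess first.
ranked_regions = [
    ["region_a", "region_b", "region_c"],
    ["region_d", "region_a", "region_e", "region_c"],
    ["region_e", "region_c", "region_d", "region_a"],
]
true_regions = ["region_b", "region_a", "region_a"]

# Hypothetical brain-age predictions against chronological ages, in years.
predicted_ages = [34, 61, 47]
actual_ages = [30, 66, 47]

top3 = top_k_accuracy(ranked_regions, true_regions, k=3)
mae = mean_absolute_error(predicted_ages, actual_ages)
print(f"Top-3 accuracy: {top3:.3f}")
print(f"Mean absolute error: {mae:.1f} years")
```

Top-3 accuracy is more forgiving than a single-guess score, since the correct answer only has to appear somewhere in the first three candidates. A 98.0% Top-3 figure and a 92.0% mean AUROC therefore measure different things and are not directly comparable.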

The authors acknowledge this. Prima is in what they describe as the "initial stage" of clinical certification. It has not received FDA clearance. It has not been submitted for clearance, as far as the paper or any public statement discloses. It is not deployed in clinical practice. The code and model weights are open-source under an MIT license on GitHub, which means the architecture can be replicated. The dataset cannot. UM220K is Michigan's. Every other hospital that wants to follow this approach would need to build a model from its own archive.

05 — Signal: Can Every Hospital Build Its Own?

What Prima demonstrates is specific: a single research university with twenty years of digitized radiology records and modest high-performance computing resources (eight GPUs, fifty days) can build a specialist diagnostic AI that outperforms every general-purpose model by a wide margin. The ingredients are not exotic compute or a proprietary architecture. They are institutional data and time. The authors report consistent performance improvements with larger training datasets, suggesting foundation model scaling properties. If this holds, hospitals with larger or more diverse archives would build even stronger models.

The open-source release lowers one barrier. The code is on GitHub. The weights are available. Any institution can study the architecture, adapt the training pipeline, and attempt replication. But the bottleneck is not the method. It is the data. UM220K represents a specific institutional asset — two decades of paired scans and reports from radiologists whose clinical culture, diagnostic vocabulary, and decision patterns are baked into every label. Building an equivalent dataset requires having generated and stored that data in the first place. Most community hospitals do not have twenty years of digitized radiology at this scale. Many do not have the HPC infrastructure either.

One gap is worth noting: every quote in the coverage of Prima comes from the research team or University of Michigan institutional staff. No independent radiologist or AI researcher has publicly evaluated Prima's claims. The peer review process at Nature Biomedical Engineering provides editorial and reviewer oversight as a partial proxy. The open-source release enables independent replication. But as of this writing, no external evaluation has been published. Hollon has described Prima as a "co-pilot for interpreting medical imaging studies" and, more ambitiously, as "ChatGPT for medical imaging." These framings are aspirational, not descriptive of current capability. Prima is a research result published in a peer-reviewed journal. It is not a product. It is not deployed. Whether the approach scales beyond one institution's walls — whether the template can travel even if the dataset cannot — is the question the paper raises but cannot yet answer.

∞

What If?

Imagine two hundred academic medical centers in the United States each follow Prima's template. Each trains a foundation model on its own radiology archive. Johns Hopkins's model learns from Hopkins patients — disproportionately referred cases, high acuity, urban demographics. Mayo's model learns from Mayo patients — older, whiter, upper Midwestern. A safety-net hospital in the Bronx trains on its archive — younger patients, higher rates of untreated chronic disease, older scanners, shorter reports. Each model is excellent inside its own walls. Each scores above 90% AUROC on its own prospective data. None has been validated against the others. A patient transfers from a rural clinic in Alabama to a tertiary center in Boston. The Alabama model flagged nothing. The Boston model flags a probable glioma. Who is right? There is no arbitration layer. There is no shared validation standard. There is no national benchmark for institutional diagnostic AI. The FDA has cleared over a thousand narrow medical devices but has no framework for certifying a model whose training data is one hospital's memory. Peer review does not scale to every institution's bespoke model. Insurance does not yet reimburse AI-assisted reads differently by institution. The very thing that makes these models powerful — they learned from the institution that will use them — is the thing that makes them incomparable. Training a model on your own patients is a structural choice: it bakes in your protocols, your demographics, your disease prevalence, your radiologists' linguistic habits, and their blind spots. The model does not know it is biased toward its training population, because from its perspective, the training population is the world. 
When a patient's scan crosses institutional lines, the model that reads it at the receiving hospital has never seen a scan like this one — or more subtly, it has seen thousands like it, but from a different population, and the difference is not visible in any output. And in the gap between two confident, institution-specific diagnoses, a patient waits for an answer that no system is designed to reconcile.
