Two AI Doctor Chatbots Top Doctors in Nature Trials

Two AI doctor chatbots, MIRA and AMIE, matched or beat doctors in Nature patient management trials. Neither system is ready for real patients yet.

Published

June 27, 2026

Logan Pierce

Two AI doctor chatbots, MIRA from German researchers and AMIE from Google, matched or beat physicians in patient management trials published this month in Nature. Neither team says the technology is ready for real clinics, and outside experts warn the gap between a controlled lab test and a working hospital ward remains wide.

Both papers were published in Nature on 17 June 2026. They describe how far conversational AI has come inside simulated clinical workflows, and both teams stress that real-world testing is the next step.

Meet MIRA and AMIE

MIRA stands for Medical Intelligence for Reasoning and Action. Developed primarily by researchers at Germany’s Heidelberg University Hospital and the Dresden University of Technology, with support from a Google team, the system runs inside a sandboxed copy of a real electronic health record where it can interview a simulated patient, order laboratory tests and imaging, draft a differential diagnosis, and write prescriptions. The MIRA paper in Nature on autonomous medical AI describes the system as an autonomous agent using 11 specialised tools with more than 85,000 options to work through a full clinical workflow.

AMIE, the Articulate Medical Intelligence Explorer, first appeared in 2024 as a diagnostic chatbot. The new AMIE paper in Nature on conversational disease management extends the system across multi-visit cases. The model is built on Google’s Gemini family and pairs an empathetic dialogue agent for patient-facing conversations with a deep-thinking management reasoning agent that cross-references drug formularies and clinical guidelines. Mike Schaekermann and Alan Karthikesalingam, the senior authors, frame AMIE as a step toward using conversational AI as a tool for ongoing care. Both systems are research projects, not products, and neither has been deployed in a real clinic.

The Numbers From the Simulated Wards

Both teams put their systems against human doctors in head-to-head comparisons, and the numbers lean toward the machines. MIRA scored 88.9% diagnostic accuracy across 574 emergency department cases drawn from the MIMIC-IV database, and 87.8% accuracy in a 311-case matched comparison where the system was scored against the same physicians handling the same cases. A panel of board-certified physicians reached 78.1% accuracy on those matched cases, while a mixed-seniority group of residents and attending doctors scored 71.1%.

System	Comparator	Test size	Headline result
MIRA	Board-certified physicians	311 matched cases	87.8% vs 78.1%
MIRA	Mixed-seniority physicians	311 matched cases	87.8% vs 71.1%
MIRA	None (MIMIC-IV dataset)	574 emergency cases	88.9% diagnostic accuracy
AMIE	21 primary care physicians	100 multi-visit scenarios	Non-inferior in management reasoning; higher on preciseness and guideline alignment

The Nature paper says MIRA performed at or above the level of both physician groups on diagnosis and treatment quality, adhered to clinical guidelines, and showed strong medication safety across renal dosing, drug interactions, allergies, QT risk, and opioid risk. Follow-up evaluations showed high recall on admission decisions. The system was not perfect on every antibiotic choice or treatment detail, and the authors are explicit that MIRA is not a replacement for human clinicians. Both teams stressed the same point in different ways: their systems are tools to support doctors, not substitutes.

AMIE was tested in a different setup. In a blinded Objective Structured Clinical Examination, the system was compared with 21 primary care physicians across 100 multi-visit case scenarios in five medical specialties, with cases designed around UK NICE guidance and BMJ Best Practice. Specialist physicians rated AMIE non-inferior to the doctors in management reasoning and higher than the doctors on preciseness of treatments and alignment with clinical guidelines. On a separate medication-reasoning benchmark the team calls RxQA, AMIE outperformed the doctors on the harder questions.

What an AI Doctor Can Actually Do

MIRA’s headline capability is action. The system can request blood tests, order imaging studies, schedule procedures, prescribe medications, and recommend hospital admissions, all while keeping a coherent chat going with a simulated patient whose responses are anchored to the documented history of present illness. The patient agent inside the simulation was tested for stability against rephrased questions and held up at 99.4% content consistency. Across 933 audited conversations, the agent never prematurely disclosed a documented diagnosis.

AMIE works more like an extended clinical conversation. The system is built to handle multi-visit cases, tracking a patient’s progression over time and adjusting its management plan as new information arrives. It pulls in up-to-date clinical practice guidelines and drug formularies to ground its recommendations, and Google’s write-up of AMIE’s disease management trials describes the goal as giving physicians more time with patients by absorbing the structured reasoning burden.

The Catch: These Were Simulated Patients

The performance numbers come with a long list of caveats that both teams underline. MIRA was tested on retrospective emergency department cases, with patient responses generated by a separate AI agent rather than by real people in distress. AMIE was tested with trained patient actors following scripted scenarios. Neither system has touched a real hospital, a real exam room, or a real patient with ambiguous symptoms.

Jakob Kather, the senior author of the MIRA paper, said future work is needed to establish generalisation in real-world studies. The patient agent used in MIRA’s evaluation was stable but not flawless: when researchers asked it semantically equivalent questions phrased differently, it produced fully consistent answers 99.4% of the time. It never leaked a premature diagnosis in 933 audited conversations, but it did mention prior diagnostic workup in 31 cases, most often in pancreatic cancer encounters.

Both teams said their systems belong in the research lab, not the clinic, in language that stressed collaboration over replacement. The Nature press framing on both papers was equally cautious, noting that if AI agents can carry out such tasks, they may be able to assist physicians with routine work and possibly address physician shortages in some regions of the world, but only after years of additional work.

Both systems were tested on narrow condition sets, which limits what the results can tell us. MIRA was run on eight specific diagnoses drawn from emergency department cases, including appendicitis, pneumonia, and pancreatic cancer. AMIE was tested across five medical specialties using scenarios built around UK NICE guidance. Both papers acknowledge that performance on these selected conditions does not guarantee performance on the much broader range of cases encountered in everyday clinical practice.

What the Skeptics Want Next

Outside experts raised sharper warnings about the gap between lab and clinic. Robert Ranisch, a professor of medical ethics at the University of Potsdam, called the MIRA study an exciting and methodologically well-designed contribution, then added that it examined AI performance under laboratory conditions, and that many promising AI systems have so far fallen short once real patients, clinicians, incomplete data, and different IT systems came into play. Dr Dominic Oliver, a postdoctoral researcher in psychiatry at Oxford, said in the collection of independent expert reactions to both studies that simulated patients are constructed with known truths, and that real patients often struggle to describe which symptoms matter. Oliver added that psychiatry, in particular, may be poorly served by text-only AI, since symptoms there are subjective and depend on tone, posture, and observation, not typed answers alone. Alfonso Valencia, an ICREA professor and director of life sciences at the Barcelona Supercomputing Centre, noted that MIRA is open-source while AMIE is closed, which means independent researchers cannot evaluate AMIE on their own.

The results demonstrate the potential that AI agents hold for medicine. The key question for further development is how we can integrate such innovations safely, transparently and for the benefit of patients into clinical practice.

Uwe Platzbecker, medical director of Dresden University Hospital, made the comment, according to the German news agency dpa. German health policy specialists pushed the same point. Reinhard Busse, head of healthcare management at the Berlin University of Technology, said an AI agent’s ability to map clinical processes in a structured way does not in itself translate into better care or cost savings in practice. Kerstin Denecke, a researcher at the Bern University of Applied Sciences, listed obstacles including the state of healthcare data, regulatory approval timelines, unclear lines of responsibility, and the need for representative studies on the risks of such systems. Earlier benchmarks of chatbots on vaccine questions have found clinical rules are the weak spot for medical LLMs.

Eugen Brysch, who heads the German Patient Protection Foundation, warned that despite the time savings AI can offer, the doctor-patient conversation will continue to be indispensable, particularly for older patients. Brysch also cautioned against growing dependence on non-European companies. The Federal Medical Association in Germany has previously voiced concern about the loss of human touch when AI assistants are deployed, and researchers there are now studying whether the interpersonal and emotional aspects of care are being pushed into the background. Google has launched a nationwide study to assess AMIE in real-world virtual care settings, but safely integrating such systems into real hospitals is work that has not yet started.

Frequently Asked Questions

Could an AI doctor chatbot prescribe me real medication today?

No. Both MIRA and AMIE are research systems. They have not been approved by any medical regulator, and neither has been tested on real patients outside controlled simulations. Any chatbot that already prescribes or orders tests for real users is operating outside the scope of the published research.

Did AI doctor chatbots actually beat human doctors?

In their respective simulations, yes. MIRA scored 87.8% diagnostic accuracy against a matched physician panel that scored 78.1%, a gap that was statistically significant at p < 0.001. AMIE was rated non-inferior to 21 primary care physicians in management reasoning and was rated higher on guideline alignment and treatment preciseness. Both results came from simulated or scripted patient cases, not real clinical encounters.

What are the biggest risks of AI doctor chatbots?

The systems are text-only, so they cannot see a patient’s face, hear a cough, or perform a physical exam. They can also miss context a trained clinician would catch, especially in psychiatry, where symptoms are subjective. Outside experts also warn that even a high-performing AI can give confidently wrong answers in unfamiliar cases, and that regulatory approval, liability rules, and integration with existing hospital IT systems all remain unresolved.

Could AI doctor chatbots help with doctor shortages?

Both papers suggest that scenario, with caveats. The Nature press framing notes that AI agents could assist physicians in routine tasks and possibly address physician shortages in some regions. The technology is not there yet, and any deployment would need prospective studies on real patients and clear rules for who is responsible when the AI is wrong.

Should I trust medical advice from a chatbot today?

Public health agencies and medical regulators advise against replacing a clinician with a general-purpose chatbot for diagnosis or treatment. For non-urgent questions, official health service websites and accredited telehealth providers are safer than an unrestricted AI chatbot. Anyone with acute symptoms should contact emergency services or a licensed clinician.

Disclaimer: This article discusses research on experimental AI medical systems. It is for informational purposes only and does not constitute medical advice. Figures and study results are accurate as of publication on 27 June 2026. Consult a qualified healthcare professional for any medical concerns.