AI Chatbots Ace Vaccine Questions but Stumble on Clinical Rules

An npj Vaccines benchmark of 13 LLMs found 86% accuracy on English vaccine questions, with lower scores in Chinese and on dengue and RSV, and recurring errors on clinical rules.

AI chatbots answer most vaccine questions correctly, hitting 86% accuracy on average in English across 13 large language models, but they consistently miss the clinical details that drive real vaccination decisions, a 10 June 2026 study in npj Vaccines has found. The benchmark, called VaxEval, ran 1,886 multiple-choice questions spanning 14 vaccines and three languages through the leading chatbots on the market, including GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, and DeepSeek-V3. The result, according to the full VaxEval benchmark and methodology published by the research team, is a strong average that conceals sharp failures on dosing intervals, contraindications, and age-based eligibility, the very rules patients need to get right. A chain-of-thought prompt, the technique that asks the model to “think step by step,” produced 21% lower odds of a correct answer than a direct prompt, a finding the authors flag as opposite to expectations. The paper’s authors are explicit that multiple-choice accuracy does not establish clinical reliability, and they call for structured safeguards before deployment in any health-related setting.

How VaxEval Tested 13 AI Models

The VaxEval team built its 1,886 questions from six health authorities: the World Health Organization, the US Centers for Disease Control and Prevention, the United Nations Children’s Fund, the Africa Centres for Disease Control and Prevention, the American Medical Association, and Immunize.org, with additional material from peer-reviewed scientific literature. The benchmark covers 14 vaccines, ranging from influenza and COVID-19 to respiratory syncytial virus, dengue, and meningococcal disease, and was administered in English, Spanish, and Chinese. Reference answers were verified against the source documents, and every question went through quality control.

The researchers scored each model on whether it selected the pre-specified reference answer. They ran each of the 13 models under three prompting conditions: zero-shot (a direct question), few-shot (a small set of worked examples first), and chain-of-thought (an instruction to reason step by step). Statistical analysis used mixed-effect logistic regression to compare models, languages, prompting styles, and vaccine types. The 13 models in the test were:

  1. GPT-4.5
  2. GPT-4o
  3. GPT-4
  4. GPT-3.5-Turbo
  5. Claude 3 Opus
  6. Gemini 1.5 Pro
  7. Llama-4 Maverick
  8. DeepSeek-V3
  9. Grok-3
  10. Qwen 2.5
  11. GLM-4
  12. Reka Core
  13. Yi-Lightning
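The three prompting conditions and the scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the VaxEval team's code: the `Question` fields, the function names, and the exact prompt wording are all assumptions made for the example, and the paper's actual templates may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice item. Field names are hypothetical, not VaxEval's schema."""
    stem: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # pre-specified reference letter


def format_item(q: Question) -> str:
    """Render the question stem followed by its lettered options."""
    lines = [q.stem]
    lines += [f"{letter}. {text}" for letter, text in sorted(q.options.items())]
    return "\n".join(lines)


def build_prompt(q: Question, style: str, examples: list[Question] | None = None) -> str:
    """Build the prompt text for one of the three conditions (wording is assumed)."""
    item = format_item(q)
    if style == "zero_shot":
        # Direct question, no examples and no reasoning instruction.
        return f"{item}\nAnswer with a single letter."
    if style == "few_shot":
        if not examples:
            raise ValueError("few_shot needs at least one worked example")
        # Worked examples with their reference answers come first.
        shots = "\n\n".join(f"{format_item(e)}\nAnswer: {e.answer}" for e in examples)
        return f"{shots}\n\n{item}\nAnswer:"
    if style == "chain_of_thought":
        # Same question, plus an instruction to reason before answering.
        return f"{item}\nThink step by step, then give the final answer as a single letter."
    raise ValueError(f"unknown prompting style: {style}")


def is_correct(model_choice: str, q: Question) -> bool:
    """A response counts as correct only if it matches the reference letter."""
    return model_choice.strip().upper() == q.answer.upper()
```

In use, each question would be rendered once per condition, sent to each model, and the returned letter passed to `is_correct`. Extracting that letter from a free-text reply is left out here because the article does not describe how the paper handled it.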

The Top of the Leaderboard

GPT-4o led the field at 90.3% overall accuracy, followed closely by Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6%. The numbers sit within a single percentage point of each other, a reminder that the contest among the top three is tight enough to flip on a re-run.

At the group level, newer flagship models had 57% higher odds of producing a correct answer than earlier-generation systems, an odds ratio of 1.57 with a 95% confidence interval of 1.50 to 1.65 (P < .001). The pattern held across vaccines and languages, though the paper notes that GPT-4o, which was classified as an earlier model in this study, still posted the highest individual score. Across all 13 models, the mean accuracy was 86.0% in English, 83.7% in Spanish, and 80.0% in Chinese.
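An odds ratio is easy to misread as a change in accuracy, so a short worked example may help. The sketch below converts a baseline accuracy to odds, multiplies by the odds ratio, and converts back. The baseline values are hypothetical, chosen only for illustration, and the calculation ignores the random effects and other covariates in the paper's regression model.

```python
def apply_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Shift a baseline probability by an odds ratio and return the new probability."""
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    odds = baseline_accuracy / (1.0 - baseline_accuracy)  # probability -> odds
    new_odds = odds * odds_ratio                           # apply the odds ratio
    return new_odds / (1.0 + new_odds)                     # odds -> probability


# Hypothetical baselines; 1.57 is the odds ratio reported for newer models.
for baseline in (0.70, 0.80, 0.90):
    shifted = apply_odds_ratio(baseline, 1.57)
    print(f"baseline {baseline:.0%} -> {shifted:.1%}")
```

With a hypothetical 80% baseline, odds of 4 become 6.28, which is an accuracy of about 86%. So 57% higher odds translates to a gain of a few percentage points at this level, not a 57% jump in accuracy.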

Prompting style shifted the numbers too. Few-shot prompts, where the model sees a handful of worked examples before being asked to answer, increased the likelihood of a correct response by 17% compared with zero-shot prompts (OR 1.17, P < .001). Few-shot was the only prompting style that helped.

| Model | Overall accuracy |
| --- | --- |
| GPT-4o | 90.3% |
| Llama-4 Maverick | 90.2% |
| DeepSeek-V3 | 89.6% |

Where the Clinical Errors Cluster

The 86% headline average hides large variation by vaccine. Accuracy ranged from 90.5% on influenza to 76.4% on dengue, a 14-point spread that the field-wide mean flattens. The pattern follows public visibility: vaccines that get heavy public-health communication score well, while those with thinner public profiles drag the average down.

The strongest performance came on influenza (90.5%), hepatitis A (89.5%), HPV (88.4%), and COVID-19 (85.3%). These are the vaccines that appear most often in news coverage, public service campaigns, and patient handouts. Models perform better on widely discussed vaccines that are heavily represented in public health communication, the study notes.

The weakest came on dengue (76.4%), pneumococcal disease (77.7%), RSV (80.6%), and meningococcal disease (81.7%). A patient using a chatbot to ask whether an RSV vaccine is appropriate for a specific age group, or how many doses of a dengue vaccine are required, faces roughly a one-in-five to one-in-four chance of a wrong answer, going by the benchmark error rates of 19.4% for RSV and 23.6% for dengue.

Researchers sampled 150 incorrect responses and broke them down by error type. Nearly half came from overgeneralization, broad statements the model produced without regard to vaccine-specific requirements. Other common failures included incorrect dosing intervals, misidentification of contraindications, wrong recommendations for age-based eligibility, and an inability to tell two vaccine types apart. These are the categories that matter at the pharmacy counter, the pediatrician’s office, and the travel clinic.

| Vaccine | Accuracy |
| --- | --- |
| Influenza | 90.5% |
| Hepatitis A | 89.5% |
| HPV | 88.4% |
| COVID-19 | 85.3% |
| Meningococcal | 81.7% |
| RSV | 80.6% |
| Pneumococcal | 77.7% |
| Dengue | 76.4% |
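The per-vaccine figures above translate directly into error rates. The following sketch does that arithmetic, using only the accuracy values reported in this article. The variable and function names are illustrative.

```python
# Accuracy by vaccine, in percent, as reported in the article.
accuracy_by_vaccine = {
    "Influenza": 90.5,
    "Hepatitis A": 89.5,
    "HPV": 88.4,
    "COVID-19": 85.3,
    "Meningococcal": 81.7,
    "RSV": 80.6,
    "Pneumococcal": 77.7,
    "Dengue": 76.4,
}


def error_rate(accuracy_pct: float) -> float:
    """Benchmark error rate in percent, rounded to one decimal place."""
    return round(100.0 - accuracy_pct, 1)


# Gap between the best and worst vaccine, in percentage points.
spread = round(max(accuracy_by_vaccine.values()) - min(accuracy_by_vaccine.values()), 1)

for vaccine, acc in accuracy_by_vaccine.items():
    err = error_rate(acc)
    # "1 in N" is another way to read the same error rate.
    print(f"{vaccine}: {err}% wrong, about 1 in {100.0 / err:.1f}")
print(f"Spread between best and worst: {spread} points")
```

These are error rates on multiple-choice benchmark items, so they are best read as a rough guide to relative risk across vaccines, not as a prediction of how often a live chat will go wrong.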

The Chain-of-Thought Surprise

Chain-of-thought prompting, which asks the model to reason step by step before giving a final answer, was associated with 21% lower odds of a correct response compared with zero-shot prompting (OR 0.79, P < .001). The result runs against the usual assumption that “showing your work” improves accuracy on structured tasks. The study did not test why, and the authors call for further work, but the finding is consistent with the overgeneralization pattern that turned up in the error analysis. On factual, multiple-choice vaccine questions, a verbose reasoning trace may simply give the model more room to drift.

For builders wiring LLMs into health tools, the implication is direct: turning chain-of-thought on by default may make the answers worse, not better. Few-shot prompting is the only one of the three styles that improved accuracy in the paper.

The Language Gap and Its Caveat

Mean accuracy across the 13 models was 86.0% in English, 83.7% in Spanish, and 80.0% in Chinese, a six-point spread between the highest- and lowest-resource language in the benchmark. Vaccine misinformation travels across language borders, so a six-point drop in Chinese is not a footnote. It is the difference between a usable assistant and one that gets a fifth of its answers wrong.

The authors flag an important caveat: the Spanish and Chinese datasets were independently constructed rather than direct translations of the English questions. Many of the apparent language differences are linked to variations in dataset composition, item difficulty, topic distribution, and source composition, not to inherent language bias in the models themselves. Read that way, the spread may say more about which questions the team managed to gather in each language than about how the models perform across languages. An earlier study of ChatGPT and the CDC on vaccine info raised similar questions about whether English-language training data is masking equity gaps in non-English answers.

Question category mattered as much as language. Models hit 93.0% accuracy on misconceptions and corrections, 90.0% on prevention-related questions, and 87.2% on regulatory or monitoring systems. The weaker categories, dosing or recommendation questions (82.5%), cost and accessibility (82.6%), effectiveness and benefits (86.3%), and basic vaccine-type identification (82.5%), are the ones that determine whether a person actually gets the right shot on the right day.

The strongest numbers came on the myth-checking and prevention questions the public actually asks. The weakest numbers came on questions that decide behavior: “When do I get my next dose?” and “Which version of this vaccine do I need?”

  • Mean accuracy, English: 86.0%
  • Mean accuracy, Spanish: 83.7%
  • Mean accuracy, Chinese: 80.0%
  • Best category (misconceptions and corrections): 93.0%
  • Worst categories (dosing or recommendation questions; vaccine-type identification): 82.5% each

Why the Authors Hold Back From Endorsement

The VaxEval paper’s conclusion is measured. Modern LLMs possess strong vaccine knowledge and can accurately answer most vaccine questions across multiple languages, the authors write, but the remaining error rates are concentrated in the clinical guidance that affects what a patient actually does.

They call for rigorous evaluation, structured guardrails, and targeted refinement before deploying LLMs for vaccine communication. Multiple-choice accuracy does not establish clinical reliability or readiness for real-world vaccine counseling without prospective validation and context-specific safety evaluation. Outside the benchmark, one policy view on AI and shifting US vaccine guidance has framed chatbots as potential helpers for families sorting through changing recommendations, a use case this benchmark does not address and the VaxEval authors do not endorse.

The study leaves several practical questions unanswered. It does not test how models handle back-and-forth conversations, how they cite their sources, or how they perform in live clinical settings. The authors flag each of these for future work, and the field has yet to answer them.

“Their remaining error rates highlight the need for careful oversight, continuous evaluation, and structured safeguards before widespread deployment in health-related settings.”

That line is from the paper’s own conclusions, written by the VaxEval team led by Leesa Lin of the London School of Hygiene and Tropical Medicine, with co-authors from the University of Hong Kong and Hong Kong’s Laboratory of Data Discovery for Health. The funder, the InnoHK initiative of Hong Kong’s Innovation and Technology Commission, had no role in the study design or the decision to publish, and the authors declare no competing interests.

What a Patient Should Do With These Results

For everyday questions about what a vaccine is, what it protects against, or whether a misconception is true, the leading chatbots are accurate enough to be useful. For questions about when to get a dose, whether a specific condition rules out a shot, or whether a vaccine is right for a particular age, the error rate is high enough that a clinician should confirm the answer. A useful rule of thumb: GPT-4o is a better reference for “what is the HPV vaccine?” than for “which HPV vaccine doses have I had, and when is my next one due?”

The benchmark measures multiple-choice recall, not conversation. A chatbot that scores 90% on a 1,886-question test can still mislead in a chat, where context drifts, follow-ups compound, and the model can sound confident about a wrong number. Treat the chatbot as a starting point rather than a final answer, double-check against a primary source like a national health agency, and ask a clinician about anything that affects a real appointment. Research on AI chatbots and vaccine decision-making at Duke Kunshan makes a similar point: conversational tools can support vaccine literacy, but they are not a substitute for clinical judgment.

Frequently Asked Questions

How accurate are AI chatbots on vaccine questions?

Across 13 large language models, the VaxEval benchmark found a mean accuracy of 86.0% in English, 83.7% in Spanish, and 80.0% in Chinese on 1,886 multiple-choice questions covering 14 vaccines. GPT-4o led the field at 90.3% overall accuracy, with Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6% close behind.

Which AI model scored highest on vaccine questions?

GPT-4o finished first at 90.3% overall accuracy, followed by Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6%, according to the npj Vaccines study published on 10 June 2026. The three top models were within a single percentage point of each other.

Where do AI models get vaccine questions wrong?

Errors clustered in clinical guidance: dosing intervals, contraindications, age-based eligibility, and the ability to distinguish between vaccine types. Nearly half of the 150 incorrect responses the researchers sampled were overgeneralizations, broad statements produced without vaccine-specific context. The weakest vaccine category was dengue at 76.4%.

Does asking the AI to “think step by step” improve vaccine answers?

No. Chain-of-thought prompting was associated with 21% lower odds of a correct answer compared with zero-shot prompting (OR 0.79, P < .001), a result the authors describe as opposite to what they expected. Few-shot prompting, by contrast, increased the odds of a correct response by 17%.

Should I ask an AI chatbot about vaccines instead of seeing a doctor?

The paper’s authors are explicit that multiple-choice accuracy does not establish clinical reliability. They recommend careful oversight, continuous evaluation, and structured safeguards before any deployment in health-related settings, and call for prospective validation before these tools are trusted with real vaccine counseling.

Disclaimer: This article is for informational purposes only and is not a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of a qualified healthcare provider with any questions you may have regarding vaccines or your personal health. The figures cited are accurate as of the publication date of 12 June 2026 and may not reflect subsequent updates to the underlying models or the study itself.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and freelance bylines covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
