Ask Dr. Maia · Issue 2, Part 1 · May 12, 2026
This is Part 1 of a two-part issue. Part 1 covers what is happening and whether the tools work. Part 2 (publishing May 18) covers who is most affected, where regulators stand, and how to use these tools without getting hurt. A patient-facing companion guide arrives Friday May 15.
ChatGPT Health, Claude, and Copilot: a physician review
Five of the largest technology companies launched dedicated consumer AI health products in the first quarter of 2026. Most of them run on the same general-purpose model that was already in your pocket. Here is what changes when you turn on the Health features, what stays exactly the same, and whether the evidence says the tools work.
It is 2:17 in the morning. A 24-year-old in rural Alabama, uninsured, has been having chest tightness for six hours. She opens her phone. The nearest emergency department is 45 minutes away. She has no primary care physician. Her ChatGPT Health tab is open, and it is free.
According to OpenAI’s January 2026 report, roughly 70% of health-related conversations with ChatGPT happen outside standard clinic hours. More than 580,000 healthcare-related messages per week come from hospital deserts, defined as areas more than a 30-minute drive from a general medical facility. The population asking a chatbot about chest tightness at 2 a.m. is not, on average, the population with a cardiologist on speed-dial and a $0 copay. It is the population with the fewest alternatives, and therefore the population for whom an error in the answer carries the heaviest cost.
That is not a reason to condemn these tools. It is a reason to examine them honestly.
What’s actually happening
The speed of this launch cycle is worth slowing down to absorb. According to KFF’s April 2026 report, five major technology companies launched or significantly expanded consumer-facing AI health products in the first quarter of 2026 alone: ChatGPT Health (OpenAI, January 2026), Claude for Healthcare (Anthropic, January 2026), Copilot Health (Microsoft, March 2026), Perplexity Health (Perplexity, March 2026), and Amazon Health AI (Amazon, January 2026; expanded March 2026).
All five invite users to connect medical records, lab results, and wearable data for personalized health guidance. All five use general-purpose large language models as their core engine. The clinical validation record for each of them, at the time of launch, was either thin or absent.
The usage numbers are not small. OpenAI reported that more than 40 million people ask ChatGPT health questions every day, with health-related queries making up more than 5% of all global messages. The March 2026 KFF Tracking Poll on Health Information and Trust found that 32% of U.S. adults have used AI chatbots for health information in the past year, roughly equal to the share who use social media for health. A concurrent Pew Research Center survey found that 22% of Americans get health information from AI chatbots at least sometimes, and among adults ages 18 to 29, that share rises to 32%.
The KFF poll also found that 41% of health AI users have uploaded personal medical information, including test results and doctor’s notes, into an AI tool. That behavior is happening largely without FDA oversight and, in most cases, without HIPAA protections, because consumer health AI applications fall outside HIPAA when used outside a covered entity’s system.
The question this issue is here to answer: when a general-purpose AI adds a Health tab, what actually changes?
The technical reality of the “Health” button
The honest answer, based on what the platforms have disclosed, is: less than the branding implies.
ChatGPT Health runs on the same underlying GPT model as standard ChatGPT. The Health tab adds a privacy-isolated conversation environment (health data is stored separately and, per OpenAI, is not used for model training), structured integrations with medical records through b.well’s network of approximately 2.2 million U.S. healthcare providers, and connectivity to wellness apps including Apple Health, MyFitnessPal, and Function Health. The system prompt carries health-specific guidance developed with more than 260 physicians, according to OpenAI. The base model, however, is the same general-purpose model that already existed. There has been no publicly disclosed fine-tuning on clinical outcomes, no prospective trial of clinical safety, and no FDA clearance or classification for any indication.
Perplexity Health similarly adds a personalized layer: connected medical records via b.well, wearable data via the Terra API, a Health hub dashboard, and AI-generated summaries. It is available to Pro and Max subscribers in the U.S. Its product page explicitly states it is designed for informational use and is not intended to diagnose conditions or replace professional medical care.
Copilot Health (Microsoft) integrates wearable data from more than 50 devices and electronic health records from more than 50,000 U.S. hospitals through HealthEx. Microsoft describes it as a step toward “medical superintelligence” and says its answers are verified by more than 230 physicians from 24 countries. Whether that verification is prospective clinical validation or post-hoc review is not publicly specified.
The pattern across all five platforms is similar: a general-purpose model, a health-specific interface, record-connectivity features, physician review of some kind, and a disclaimer that the tool is not for diagnosis or treatment. None has published a prospective clinical safety trial. None has FDA clearance for any clinical indication.
The Journal of Multidisciplinary Healthcare paper on the AI health arms race puts it plainly: “commercial deployment exceeds regulatory development” across every one of these platforms.
The clinical accuracy reality
There is now peer-reviewed evidence about how general LLMs perform on health questions. The picture is uneven, and the unevenness falls in a predictable and dangerous place.
The emergency triage failure
The most important single study to understand is a structured stress test published in Nature Medicine in February 2026. Researchers designed 60 clinician-authored clinical vignettes across 21 domains, each run under 16 factorial conditions, yielding 960 total responses from ChatGPT Health. The results followed an inverted-U pattern: ChatGPT Health performed adequately for mid-acuity conditions and worst at the clinical extremes.
Among gold-standard emergency conditions, the system undertriaged 52% of cases. Among nonurgent presentations, it overtriaged 35%, recommending a higher level of care than the cases warranted. To translate the first number into clinical language: ChatGPT Health was most dangerous precisely where a mistake is most likely to kill someone. In specific examples, the system directed patients with diabetic ketoacidosis or impending respiratory failure to 24-48 hour follow-up rather than the emergency department. It correctly triaged classic emergencies like stroke and anaphylaxis.
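To make “undertriage” concrete: an undertriage rate is the share of cases where the model recommends a lower level of care than the gold-standard acuity calls for. Here is a minimal sketch of that computation in Python, using a hypothetical ordinal acuity scale and toy data; this is an illustration, not the study’s actual schema or code.

```python
# Illustrative ordinal acuity scale: higher = more urgent.
# Labels and data are hypothetical, not the study's.
ACUITY = {"self_care": 0, "routine_followup": 1, "urgent_care": 2, "emergency": 3}

def undertriage_rate(cases, gold_level):
    """Share of cases at a given gold-standard acuity where the
    model recommended a lower acuity than the gold standard."""
    relevant = [c for c in cases if c["gold"] == gold_level]
    if not relevant:
        return None
    under = sum(1 for c in relevant
                if ACUITY[c["model"]] < ACUITY[c["gold"]])
    return under / len(relevant)

# Toy example: two emergency vignettes, one undertriaged.
cases = [
    {"gold": "emergency", "model": "routine_followup"},  # undertriage
    {"gold": "emergency", "model": "emergency"},         # correct
]
print(undertriage_rate(cases, "emergency"))  # 0.5
```

Overtriage is the mirror image: the model recommends a higher acuity than the gold standard warrants, a failure that wastes resources and alarms patients rather than endangering them outright.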
There was another finding worth noting. Crisis-intervention messages, the safety guardrail intended to activate when a user expresses suicidal ideation, were inconsistent. The system was more likely to activate crisis messaging when patients described suicidal thoughts without a specific method than when they described a specific method. That is the opposite of the clinically correct pattern.
One interpretive caveat is worth flagging. A critique by Navarro et al. (Semantic Scholar, March 2026) argued that the evaluation format, not model capability, may drive some of the triage failures, because users often do not provide enough context in real conversations. That is a legitimate methodological debate. It does not change the bottom line: a system serving millions of users in hospital deserts is performing below acceptable clinical thresholds for emergency recognition, by whatever mechanism.
The general accuracy picture
On general health accuracy, the picture is more mixed. A February 2026 study in npj Digital Medicine benchmarked multiple ChatGPT model versions on care-seeking accuracy and found average accuracy of approximately 70%, relatively stable across model generations. That is meaningful for routine questions. It is insufficient for high-stakes clinical decisions.
The largest real-world dataset comes from the Copilot Health study, published in npj Digital Medicine in April 2026. Researchers analyzed 617,827 classified health conversations between general public users and Microsoft Copilot in January 2026. Forty percent were general, non-personalized health information and education queries. On mobile devices, however, the most common intent after general information was “Symptom Questions and Health Concerns,” at 15.9% of mobile conversations, compared to 6.9% on desktop. Emotional well-being queries were also higher on mobile (5.1% vs. 3.0%). A further slice of queries concerned healthcare system navigation: finding providers, dealing with insurance, and handling paperwork.
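For readers who want the mechanics: the device split is a straightforward share-by-segment tally over classified conversations. A sketch follows, with hypothetical intent labels and toy counts rather than the study’s taxonomy or data.

```python
from collections import Counter

# Hypothetical (device, intent) pairs; labels are illustrative,
# not the study's actual classification scheme or data.
conversations = [
    ("mobile", "symptom_question"), ("mobile", "general_info"),
    ("mobile", "emotional_wellbeing"), ("desktop", "general_info"),
    ("desktop", "insurance_paperwork"), ("desktop", "general_info"),
]

def intent_shares(convos, device):
    """Fraction of a device's conversations falling under each intent."""
    subset = [intent for dev, intent in convos if dev == device]
    counts = Counter(subset)
    return {intent: round(n / len(subset), 3) for intent, n in counts.items()}

print(intent_shares(conversations, "mobile"))   # symptom_question: 0.333, ...
print(intent_shares(conversations, "desktop"))  # general_info: 0.667, ...
```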
The mobile-vs.-desktop split is the access data in action. Mobile users ask about symptoms; desktop users do research and paperwork. The population using a phone, often after hours, for symptom questions is the population with limited clinical alternatives.
The safety guardrail paradox
A preprint published to arXiv in April 2026 by Suhas and colleagues adds a counterintuitive dimension to the safety picture. Researchers evaluated four LLMs across 250 mental health simulations and found that standard AI safety training, the guardrail mechanism added to protect users, itself produced psychological deterioration in more than one-third of cases. Refusal responses, hedging language, and deflection behaviors designed to protect vulnerable users instead increased distress when users were in genuine psychological need. The same paper reports that only 16% of deployed LLM-based mental health chatbot interventions have undergone rigorous clinical efficacy testing (arXiv preprint 2604.23445, April 2026; not yet peer reviewed as of this issue’s publication).
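Operationally, a finding like “deterioration in more than one-third of cases” comes from comparing a distress measure before and after each simulated exchange and tallying the sessions that worsened. A schematic sketch, with entirely hypothetical scores and response labels, not the preprint’s actual instrument or data:

```python
from collections import defaultdict

# Hypothetical sessions: a distress score (0-10) before and after the
# model's response, plus the response style. All values are invented
# for illustration.
sessions = [
    {"before": 7, "after": 8, "response": "refusal"},
    {"before": 6, "after": 5, "response": "supportive"},
    {"before": 8, "after": 9, "response": "deflection"},
]

deteriorated = sum(1 for s in sessions if s["after"] > s["before"])
print(f"overall deterioration rate: {deteriorated / len(sessions):.0%}")

# Break the rate out by response style to see whether guardrail
# behaviors (refusal, deflection) account for the deterioration.
by_style = defaultdict(lambda: [0, 0])  # style -> [worsened, total]
for s in sessions:
    by_style[s["response"]][1] += 1
    if s["after"] > s["before"]:
        by_style[s["response"]][0] += 1
for style, (worse, total) in by_style.items():
    print(f"{style}: {worse}/{total} worsened")
```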
The clinical logic is not complicated. Telling a distressed person “I am not a therapist, please seek professional help” is not a neutral response when professional help is inaccessible or unavailable. It is a refusal with a cost. The “just add guardrails” solution does not resolve the problem; it may create new ones.
What Part 2 takes up next week
Part 1 has covered what is happening and whether the tools work. The honest answer is that they sometimes do, especially for routine information, and sometimes fail in exactly the places where failure matters most.
Part 2, publishing Monday May 18, takes up the next set of questions. Who is most affected by these failures, both in who is using the tools and in who is underrepresented in the data the tools were trained on. Where regulators stand, including the EU AI Act high-risk health deadline, 82 days from this issue. A tier-by-tier verdict using the HAIRA framework. And concrete guidance for patients, families, and clinicians on how to use these tools without getting hurt.
In the meantime, a patient-facing companion guide publishes Friday May 15. It is a checklist you can use mid-conversation to tell whether the AI you are using is giving you health advice grounded in real clinical infrastructure, or just talk dressed up to look like it.
If you are in crisis right now
This section is here because health content reaches people at hard moments. If you are in a medical or mental health emergency right now, please stop here and use one of the resources below. Everything else in this newsletter can wait.
Medical Emergency: Call 911 immediately. If you are experiencing chest pain, difficulty breathing, signs of stroke, or any symptom that may require emergency care, do not search for AI advice. Call 911.
988 Suicide and Crisis Lifeline: Call or text 988 (available 24/7, free, confidential, in English and Spanish). 988lifeline.org
Crisis Text Line: Text HOME to 741741 (available 24/7). crisistextline.org
Nurse Advice Line: If you have insurance, call the number on the back of your card and ask for the nurse advice line. Available 24/7 for most major insurers. A licensed nurse can help you determine whether your symptoms need emergency care, urgent care, or can wait for a scheduled appointment.
SAMHSA National Helpline: Call 1-800-662-4357 (free treatment referrals, 24/7). samhsa.gov/find-help/national-helpline
Veterans Crisis Line: Dial 988, then press 1, or text 838255. veteranscrisisline.net
Take care of yourself. Take care of each other.
Dr. Maia
Ask Dr. Maia is educational content. It is not medical advice and does not create a doctor-patient relationship. If you are in a medical emergency, call 911. If you are in crisis, call or text 988 in the US. © 2026 Ask Dr. Maia. All rights reserved. To unsubscribe, click here.
Sources cited in Part 1
OpenAI health usage report, January 2026. Axios: 40 million people turn to ChatGPT for health care
KFF Tracking Poll on Health Information and Trust: Use of AI for Health Information and Advice. February-March 2026 (n=1,343). KFF
KFF: Companies Expand AI Health Offerings, Even as Accuracy Questions Remain. April 23, 2026. KFF
Pew Research Center: Health information from social media and AI rated more convenient than accurate. April 7, 2026. Pew
Hugo H, et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. 2026. nature.com/articles/s41591-026-04297-7
Costa-Gomes B, Tolmachev P, Taysom E, et al. Public use of a generalist LLM chatbot for health queries. npj Digital Medicine. 2026. nature.com/articles/s44360-026-00117-x
Ben Shitrit I, et al. Evaluating the accuracy of ChatGPT model versions for care-seeking advice. npj Digital Medicine. 2026. nature.com/articles/s43856-026-01466-0
Ahmed MM, Othman ZK. The AI health arms race: a critical perspective on Big Tech and the widening equity gap. Journal of Multidisciplinary Healthcare. 2026. Dove Medical Press
Suhas et al. AI safety training can be clinically harmful. arXiv:2604.23445. April 2026. (preprint, peer review status unconfirmed as of publication) arxiv.org/abs/2604.23445
Forbes: Everything About ChatGPT Health. January 8, 2026. forbes.com
Perplexity Health: product announcement. March 19, 2026. perplexity.ai/hub/blog/introducing-perplexity-health
Full citation list, including HAIRA framework citation and Part 2 sources, appears in Part 2.
Part 2, “ChatGPT Health, Claude, and Copilot: who carries the risk, and how to use them safely,” publishes Monday May 18, 2026.