How accurate are medical AI tools? Eight key questions to help you fully understand

In the past couple of years, medical AI tools seem to have burst into the public eye overnight. From smartphone apps that analyze skin conditions to systems in top-tier hospital radiology departments that automatically flag lung nodules, and even the “AI doctors” frequently making headlines, the public’s primary concern remains the same: just how accurate are these tools when it comes to medical diagnosis?

To get to the bottom of this, we compiled the eight most frequently searched questions — drawing on data from *Nature* and *Nature Medicine*, official FDA databases, the Beijing Medical AI Evaluation Center, and public information from various top-tier hospitals — and broke them down one by one.

Who has higher diagnostic accuracy: AI doctors or human doctors?

The short answer: in specific areas, AI has already matched or even surpassed the average performance of human doctors. However, “surpassing” does not mean “replacing.”

Publicly available research shows that the answer to this question is evolving rapidly. In May 2026, the Google DeepMind team published a landmark study in *Nature Medicine*. Their multimodal AI system, AMIE, underwent blind evaluation by 18 experts in simulated remote consultations. The model outperformed general practitioners on the vast majority of metrics, including consultation quality, diagnostic accuracy, and communication style.

Data more relevant to the average person comes from Heidelberg University Hospital in Germany. Their MIRA system was evaluated using over 500 real-world emergency department cases. The results showed that MIRA achieved an average diagnostic accuracy of 87.8%, whereas a panel of six cross-specialty doctors achieved only 78.1%.

However, this does not mean we can dispense with human doctors. Several review studies from the same period indicate that, in complex cases, the depth of judgment and patient-centered consideration shown by expert physicians remain beyond the current capabilities of AI. The industry consensus is that AI holds a clear accuracy advantage in highly standardized tasks such as image interpretation and triage screening, whereas in clinical scenarios that require complex history-taking or the management of multiple co-existing conditions, human-AI collaboration yields the best results.

Why do some people claim that using AI actually leads to higher rates of misdiagnosis? Has it been hyped up?

This is a question that warrants serious attention. Industry data reveals a distinct performance gap between AI in “laboratory settings” and in “real-world clinical environments.”

This isn’t because the AI technology itself is flawed, but rather because the real world is so “messy.” A public report from the Beijing Medical AI Evaluation Center noted that while many AI products achieve over 95% accuracy on standard test sets, their performance drops when deployed in primary care hospitals — where they face issues like blurry images, incomplete medical records, and data from various equipment brands.

A systematic meta-analysis published by Osaka Metropolitan University in 2025, aggregating 83 studies, found that generative AI’s overall diagnostic accuracy was approximately 52.1%, a figure that might not seem impressive at first glance. However, the study also found that AI’s performance did not differ significantly from that of non-specialist physicians. In other words, for primary care facilities lacking senior doctors, using AI for initial screening yields an acceptable level of accuracy, comparable to that of the non-specialist physicians who would otherwise perform it.

Three core factors contribute to higher AI misdiagnosis rates: First, training data bias — for instance, some models show significantly reduced diagnostic capability for specific demographic groups. Second, high sensitivity to the completeness of patient-reported information; if a patient omits key details when describing symptoms, the likelihood of AI error spikes. Third, many AI systems operate as “modal silos,” capable only of analyzing images without integrating patient medical history or lab reports. This explains why individuals using AI for self-diagnosis often receive completely unreliable advice. Currently, the industry is working to gradually overcome these issues through technologies such as multimodal fusion and federated learning.

Just how capable is Google DeepMind’s AI? What real-world cases has it handled?

Public technical reports indicate that DeepMind (now consolidated as Google DeepMind) is one of the most prolific teams publishing top-tier papers in medical AI, and its products have followed a clear evolutionary path.

Its earliest impressive achievement was a breakthrough in ophthalmology. In 2018, DeepMind’s AI-powered eye diagnostic tool was featured in *Nature Medicine*. Capable of identifying over 50 eye conditions from OCT scans, the tool gave correct referral recommendations in 94% of cases, outperforming even some human experts. By 2025, a system codenamed “Hypocrates-7” had garnered widespread attention. Using only blood test data, the model could reportedly identify 13 types of early-stage cancer within three seconds, with an overall accuracy of 97.8%. These figures come from a laboratory setting, but they illustrate the potential of AI in ultra-early screening.

A major breakthrough occurred in February 2026, when DeepMind, in collaboration with Stanford University, published the results of a randomized controlled trial concerning complex heart diseases in *Nature Medicine*. In this trial — conducted using a “gold standard” design — cardiologists were randomly assigned to two groups: one utilizing AI assistance and the other relying solely on their own experience. The results were impressive: for doctors using AI assistance, the rate of significant clinical diagnostic errors dropped from 24.3% to 13.1%, and the rate of missed critical information fell from 37.4% to 17.8%. The study demonstrated that AI could serve as an “augmenter” for physicians, even in highly specialized medical fields.
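To put the trial figures above in perspective, it helps to separate the absolute drop (in percentage points) from the relative drop. The short Python sketch below does that arithmetic for the two reported rates. The function name `reduction` is our own, chosen for illustration, and is not taken from the study.

```python
def reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)


# Rates quoted in the article: without AI assistance vs. with AI assistance.
print(reduction(24.3, 13.1))  # significant clinical diagnostic errors
print(reduction(37.4, 17.8))  # missed critical information
```

Read this way, the two rates fell by 11.2 and 19.6 percentage points, which is roughly a halving in relative terms (about 46% and 52%).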

Can AI experience “hallucinations” when diagnosing patients? Might it fabricate diagnostic results?

The answer is yes. Known in the industry as “AI hallucinations,” this is a challenge currently faced by all large language models.

This issue is particularly sensitive in a medical context. A study published in *Nature Medicine* in June 2026 debunked a common “AI myth”: many AI models achieved high test scores not because they truly understood pathology, but because they had learned to “game the test” or rely on probabilistic guessing in certain scenarios. The study noted that when faced with real-world clinical data containing noise, missing information, or image artifacts, the robustness of these models was significantly compromised, leading to issues such as inconsistent findings and the fabrication of medical literature.

Of course, developers of high-end medical AI products have already undertaken significant engineering efforts to address this. For instance, while developing its “AI Co-Clinician,” DeepMind specifically implemented the NOHARM safety framework. In tests involving 98 primary care inquiries, the system made no critical errors in 97 of the cases. However, the paper also candidly points out that human physicians retain an irreplaceable advantage in identifying critical “red flag” signals, that is, signs of life-threatening danger.

This is precisely why regulatory bodies worldwide explicitly require that medical AI be positioned as an “assistive” tool rather than a replacement. Furthermore, AI diagnostic conclusions must be explainable; for instance, the system should highlight suspicious areas on medical images using heat maps or cite the specific medical guidelines used to reach the diagnosis, rather than simply providing a conclusion.

Are the AI diagnostic systems used in hospitals the same as the medical consultation apps on smartphones?

They are completely different. This is currently the biggest misconception among the general public.

AI-assisted diagnostic systems deployed in top-tier (Grade III, Class A) hospitals are classified as Class III medical devices, which are subject to strict regulation by the National Medical Products Administration (NMPA). They must undergo rigorous clinical validation. Taking lung nodule detection systems as an example, China’s latest clinical evaluation guidelines for medical AI require these products to achieve a sensitivity of at least 92% and a specificity of at least 88% on independent multi-center test datasets, with the 95% confidence interval meeting preset thresholds. Moreover, the test data must include real-world data from at least three top-tier hospitals and two primary care institutions, with data from primary care facilities accounting for no less than 20% of the total.
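For readers who want to see what such thresholds mean in practice, here is a minimal Python sketch that computes sensitivity and specificity from a confusion matrix and attaches a 95% confidence interval to each. Several things here are assumptions rather than the guideline’s own method: the case counts are invented, the Wilson score interval is one common choice among several, and the rule “the lower bound of the interval must also clear the threshold” is only our reading of “meeting preset thresholds.”

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def evaluate(tp: int, fn: int, tn: int, fp: int,
             min_sensitivity: float = 0.92, min_specificity: float = 0.88) -> dict:
    """Check point estimates and CI lower bounds against the two thresholds."""
    sensitivity = tp / (tp + fn)  # share of true nodules that were flagged
    specificity = tn / (tn + fp)  # share of nodule-free scans correctly passed
    sens_low, _ = wilson_interval(tp, tp + fn)
    spec_low, _ = wilson_interval(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sensitivity_point_ok": sensitivity >= min_sensitivity,
        "specificity_point_ok": specificity >= min_specificity,
        # Assumed rule: the lower CI bound must also clear the threshold.
        "sensitivity_ci_ok": sens_low >= min_sensitivity,
        "specificity_ci_ok": spec_low >= min_specificity,
    }


# Hypothetical test set: 500 scans with nodules, 1,500 without.
result = evaluate(tp=470, fn=30, tn=1350, fp=150)
for name, value in result.items():
    print(name, value)
```

With these made-up counts, the 94% sensitivity clears the 92% bar as a point estimate, but the lower bound of its interval (about 91.6%) does not. That gap is why a requirement on the confidence interval is stricter than a requirement on the headline number alone.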

In contrast, many consumer-facing medical apps on the market have not obtained Class III medical device registration certificates. In early differential diagnosis tests, some models showed a 47% surge in error rates when provided with incomplete information. If you use such tools yourself and input vague information or omit key details, you could easily receive completely incorrect advice. Industry media have repeatedly warned that treating general-purpose AI chatbots on smartphones as “family doctors” carries extremely high risks.

Some tools do position themselves more rigorously: “Zheng Yuanfang,” for example, explores assisted diagnosis within a compliant framework, and companies such as Qingsong Health Group are actively promoting the integration of AI into more standardized clinical diagnosis and treatment scenarios. Overall, however, compared with hospital-based AI systems cleared by the FDA or NMPA, purely online health Q&A tools still lag significantly in clearly defined diagnostic liability and in the rigor of clinical validation.

How many AI-based medical devices has the FDA approved? Are China’s own standards strict?

Data can answer both of these questions.

Let’s look at the United States first. According to official FDA data, as of December 2025 the agency had authorized over 1,430 AI-powered medical devices. Radiology dominates this landscape, accounting for 1,094 devices, or roughly 76.5% of the total, primarily because image interpretation is the field where AI excels and the technology is most mature. Cardiology and pathology come next. In terms of approval pathways, the vast majority (96.2%) were cleared via the 510(k) route, which requires demonstrating “substantial equivalence” in safety and effectiveness to a product already legally on the market.

Now, let’s look at China. China’s regulatory standards are by no means lax; in fact, some requirements are even more granular. A multi-center clinical trial on AI-assisted diagnosis for pulmonary nodules — led by the Chinese Medical Association’s Radiology Branch and involving 32 top-tier (Grade III-A) hospitals across the country — demonstrated the AI system’s outstanding performance when validated against more than 100,000 real-world cases. Furthermore, the “Medical AI Application Evaluation Center” launched in Beijing in 2025 established a comprehensive assessment system covering six core dimensions — including safety, professionalism, practicality, and accuracy — and comprising over 70 specific evaluation tasks. These evaluations assess not only whether the answer is correct but also the rigor of the underlying reasoning logic, ensuring the AI doesn’t simply “guess” the right answer.

It is fair to say that, globally, the strictest standards are being applied to “safeguard” this technology.

Why do some AI systems achieve such high accuracy in diagnosing skin diseases?

This is largely because dermatological diagnosis relies heavily on “pattern recognition.” Diagnosing skin diseases depends significantly on the appearance of “lesions” — specifically their color, shape, and distribution. Such high-level visual tasks are precisely where deep learning algorithms excel.

Currently, AI models based on convolutional neural networks and Transformer architectures achieve very high accuracy rates on standard datasets when identifying skin lesions such as melanoma and basal cell carcinoma. The non-profit website “The Physician AI Handbook,” in its summary of FDA-approved AI tools, also notes the rapid development of AI applications in dermatology. However, a genuine dermatological diagnosis is far more complex than simply “identifying a condition from an image.” Doctors must make comprehensive assessments by considering factors such as the patient’s age, medical history, the presence of systemic diseases, and even their psychological state. While AI can accurately distinguish between benign and malignant skin lesions, its performance drops significantly when dealing with rare diseases that lack typical characteristics or cases requiring complex differential diagnosis. Industry data shows that AI error rates are much higher when information is incomplete compared to when it is comprehensive; this explains why a diagnosis based solely on “snapping a photo” should serve only as a reference.

Can the average person use AI for self-diagnosis? How should it be used safely?

Simply put: it can be used for reference and information gathering, but it should absolutely not serve as the basis for final medical decisions.

Based on recommendations from various authoritative international bodies, the safest approach is “AI first, then the doctor.” Specifically, before visiting a hospital, individuals can use AI to help organize their symptoms, understand potential conditions, and prepare questions for the doctor. This helps improve the efficiency of doctor-patient communication.

However, extra caution is required when using AI in the following scenarios: First, if you experience acute symptoms such as chest tightness, difficulty breathing, or severe headaches, call emergency services immediately rather than spending time chatting with AI. Second, never act directly on medication recommendations provided by AI; they must be reviewed by a doctor or pharmacist. Third, any “diagnosis” provided by AI is not a prescription and cannot replace a physical examination or a face-to-face consultation with a doctor.

As many industry experts have noted, the greatest value of AI lies in addressing the unequal distribution of medical resources — for instance, enabling patients in remote areas to access image interpretation comparable to that of top-tier hospitals through AI systems used by primary care physicians. Yet, before that becomes the norm, the most important skill for us to learn is to treat AI as an intelligent “medical reference book” rather than an omnipotent “family doctor.”
