What The Pitt Gets Surprisingly Right About AI in Healthcare

In the second episode of season two of The Pitt, a scene stopped us cold. Dr. Al-Hashimi demonstrates a new AI charting app to her team—fast, efficient, impressive—until a resident spots something troubling. The AI app has prescribed the wrong medication for a patient.
“AI … almost intelligent,” a medical student jokes.
The attending brushes it off: “Generative AI is 98% accurate at present.” Later she adds: “You must always proofread.”
Hard to accept when one missed check can cost a patient their life.
The scene is fictional, but the risks are not. AI errors in healthcare rarely look like dramatic system failures. More often, they emerge quietly: unclear benchmarks, models used outside their domain, degraded context, or fragile infrastructure.
The show unintentionally surfaces several failure modes that are already well known in healthcare AI research and real-world deployments. Here are four that we’ve been working to address at Clinia.
Many of these issues are discussed in more depth on our blog, where we publish technical articles on healthcare AI, from evaluation methods and benchmarks to model safety and clinical deployment challenges.
When Dr. Al-Hashimi cites a 98% accuracy rate, she offers no detail on what was tested, on which patients, or under what conditions. That number could come from a generic benchmark with little connection to the clinical task where the model is actually being used.
In machine learning, “accuracy” can refer to many different metrics depending on the task. A model might perform well on a classification benchmark but still miss the one piece of information that matters for a patient.
This becomes especially important in healthcare information retrieval.
Standard search metrics prioritize ranking: returning the best result at the top of the list. But in clinical contexts, buried information can be just as critical. A note about a past allergic reaction to penicillin might appear deep in the chart, yet missing it could directly affect treatment decisions.
In that case, the key problem isn’t ranking but recall: ensuring that critical information is not missed. Performance metrics only matter if they reflect real clinical use.
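The gap between the two framings is easy to show with toy numbers. Below is a minimal sketch (with made-up document names) of precision@k versus recall@k on a chart search where the penicillin-allergy note is buried at rank 9: a top-heavy metric looks perfect while recall exposes the miss.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of ALL relevant documents found in the top-k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

# Toy chart search: two notes actually matter for this patient.
relevant = {"note_recent_labs", "note_penicillin_allergy"}

# The allergy note is buried at rank 9.
ranked = [
    "note_recent_labs", "note_vitals", "note_billing", "note_imaging",
    "note_discharge", "note_nursing", "note_consult", "note_triage",
    "note_penicillin_allergy", "note_admin",
]

print(precision_at_k(ranked, relevant, 1))  # 1.0 -> the top result looks perfect
print(recall_at_k(ranked, relevant, 5))     # 0.5 -> the allergy note is missed
```

A system evaluated only on top-of-list metrics would score flawlessly here while dropping the one note that changes the prescription.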
→ To learn more about evaluation metrics in healthcare search, see: Rethinking How We Measure Search Quality in Healthcare
In the show, the AI charting app confuses neurology with urology, hallucinates patient details such as an appendectomy, and presents incorrect information with the same confidence as correct information.
This is what happens when a general-purpose model is deployed in a domain it wasn’t designed for.
Healthcare terminology is uniquely dense and precise. The same concept can appear as clinical jargon, abbreviations, or plain patient language. A model must be able to interpret all of them correctly, often in contexts where small differences matter.
For example, abbreviations like MS could refer to multiple sclerosis, mitral stenosis, or morphine sulfate depending on the clinical context. Domain-specific training and evaluation make a significant difference here.
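To make the ambiguity concrete, here is a deliberately simplified sketch of context-based disambiguation using keyword cues. The sense inventory and cue words are invented for illustration; production systems rely on trained models over curated medical vocabularies, not hand-written lists.

```python
# Hypothetical sense inventory for one abbreviation. Real systems learn
# these associations from clinical text rather than hard-coding them.
SENSES = {
    "MS": {
        "multiple sclerosis": {"demyelinating", "lesions", "relapse", "mri"},
        "mitral stenosis": {"valve", "murmur", "echocardiogram", "stenosis"},
        "morphine sulfate": {"mg", "dose", "pain", "administered"},
    }
}

def disambiguate(abbrev, context):
    """Pick the sense whose cue words best overlap the surrounding text."""
    words = set(context.lower().split())
    scores = {sense: len(cues & words) for sense, cues in SENSES[abbrev].items()}
    return max(scores, key=scores.get)

print(disambiguate("MS", "Patient with relapse of demyelinating lesions on MRI"))
# -> multiple sclerosis
```

Even this toy version shows why context matters: the same two letters resolve to a neurological disease, a heart condition, or an opioid depending on the words around them.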
At Clinia, our health-grade models are trained on biomedical literature and clinical data and validated by medical experts across more than 70 specialties. When benchmarked on CURE, our Knowledge Embedder V2 outperforms general-purpose models by a measurable margin.
In healthcare, that difference can affect whether the right information surfaces when clinicians need it most.
For further reading:
→ Introducing heMTEB, an open-source benchmark for health information retrieval
→ Clinia Unveils Updated Health-Grade Models for Trusted and Scalable Health Workflows
→ Building Better AI Models with Medical Experts and Linguists
The Pitt also hints at a subtler risk: the faster clinicians adopt AI tools, the harder it becomes to work without them. But dependency isn't the only concern. Something quieter can happen inside the AI itself, and blind reliance on a degrading system may be the most dangerous risk of all.
As more information flows through the system, AI can start losing track of what matters. Earlier context gets buried and critical details may be dropped.
This is what we call context rot: the gradual degradation of AI output quality as conversations grow longer and more information accumulates, like a whiteboard that keeps getting written on without ever being erased. The earliest notes don't disappear, but they become harder to read, easier to overlook.
For example, an early note about a medication allergy might no longer influence later answers if it falls outside the model's effective context window or becomes diluted among newer information.
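The mechanics can be sketched in a few lines. This is a naive sliding-window model of a context limit, with a crude word-based token count and invented messages; real systems tokenize differently and degrade more gradually, but the failure shape is the same: the oldest note silently falls out.

```python
def effective_context(messages, max_tokens):
    """Naive sliding window: keep only the most recent messages that fit."""
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = len(msg.split())  # crude token count, for illustration only
        if used + tokens > max_tokens:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

# An early critical note, followed by a long stream of routine updates.
conversation = ["ALLERGY: severe reaction to penicillin"] + [
    f"Routine update {i}: vitals stable, labs pending" for i in range(50)
]

window = effective_context(conversation, max_tokens=200)
print(any("penicillin" in m for m in window))  # False: the allergy note fell out
```

Nothing in this failure looks like an error to the user: the model simply answers from the context it still has.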
Larger context windows or better prompts are insufficient to address this problem. It requires continuous evaluation: structured ways to detect when answers begin to drift before they cause harm.
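One simple form such an evaluation can take is a canary check: after each generated summary, verify that safety-critical facts from the chart still appear. The facts and summaries below are invented, and string matching is a stand-in for the semantic checks a real evaluation pipeline would run.

```python
# Hypothetical safety-critical facts pulled from the patient's chart.
CRITICAL_FACTS = {"penicillin allergy", "on warfarin"}

def drift_check(summary):
    """Return the critical facts missing from a generated summary."""
    return {fact for fact in CRITICAL_FACTS if fact not in summary.lower()}

ok = drift_check("Pt stable. Penicillin allergy noted; on warfarin, INR 2.4.")
bad = drift_check("Pt stable. Continue current meds; recheck labs in AM.")
print(ok)   # set() -> every critical fact still carried forward
print(bad)  # both critical facts have dropped out of the summary
```

The point is not the string match but the discipline: checks like this run on every output, so drift is caught by the system rather than by a clinician's proofreading.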
To dive deeper:
→ Why "Context Rot" is Quietly Degrading Search and Summarization in Healthcare AI
→ Clinia's Evaluation Framework for Safer Answers
Later in the season, The Pitt turns to a different kind of threat, one that goes beyond AI itself.
After a cyberattack hits a nearby institution, the hospital preemptively shuts down its electronic infrastructure. No records, no patient data, and no AI. The team is left scrambling with pen and paper, unable to access critical information about patients already in their care.
It’s a dramatic scenario, but the underlying vulnerability is real. Healthcare AI systems operate on some of the most sensitive personal data that exists. A breach both exposes private information and disrupts care when clinicians need reliable systems the most.
This is why third-party verification matters. A SOC 2 Type II certification, for instance, means an independent auditor has verified that security controls operated effectively over a sustained observation period, not just at a single point in time. At Clinia, we hold ourselves to that standard.
→ Clinia Renews SOC 2 Type II Compliance
The Pitt suggests that clinicians should simply “proofread” AI systems.
But in healthcare, safety cannot rely solely on catching mistakes after the fact. The goal is to build systems where critical errors are less likely to occur in the first place.
That requires more than better prompts or bigger models. It requires:
benchmarks that reflect real clinical workflows,
domain-specific models trained for healthcare,
continuous evaluation of model behavior,
infrastructure that protects sensitive data,
and human oversight at every stage.
The show doesn’t answer whether AI belongs in healthcare. In reality, that question has already been settled. The real challenge is building AI systems that clinicians can rely on, not just proofread. Because in healthcare, 98% accuracy isn’t reassuring when you might be the 2%.