Clinia’s Evaluation Framework for Safer Answers

When we ask an AI system a question, it can sometimes give an answer that sounds convincing but isn’t backed up by the right information. In everyday situations, this might not matter much, but in health care—where accuracy and trust are essential—the stakes are much higher.
One promising approach to reduce this risk is called Retrieval-Augmented Generation (RAG). Instead of relying only on what it has memorized during training, a RAG system first searches trusted sources—like medical guidelines or scientific articles—and then uses that information to generate its response. In other words, it doesn’t just invent an answer; it looks things up before speaking.
This design makes RAG especially valuable in health care, where knowledge is constantly evolving and every answer must be both reliable and safe. But it also highlights why careful evaluation is necessary: we need to ensure the system retrieves the right sources, interprets them correctly, and communicates its conclusions in a way health professionals can trust. Evaluating RAG in this context is about much more than performance—it’s about ensuring technology truly supports better decisions and, ultimately, better care.
Most standard AI evaluation methods come from older natural language processing (NLP) research. Metrics like BLEU or ROUGE compare the AI’s answer to a reference answer by counting overlapping words. This approach works reasonably well for tasks like translation or summarization, but it quickly shows its limits for generative AI in health care (Novikova et al., 2017).
First, these metrics only measure word overlap, not whether the answer is actually correct or meaningful.
They cannot detect when the AI hallucinates information, uses the wrong tone, produces unclear text, or even generates potentially harmful advice.
They also tend to penalize answers that are phrased differently, even if they are equally correct.
Most importantly, they were never designed for high-risk medical content, where accuracy, clarity, and safety of each answer matter more than word similarity.
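To make the limitation concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score (a deliberate simplification of the real metric, which adds stemming and other refinements). A paraphrased but correct answer scores poorly, while an answer that copies the reference wording yet contains a clinical error scores highly.

```python
# Simplified ROUGE-1-style unigram overlap (illustrative only).
def unigram_f1(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped count of shared words between candidate and reference.
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if not overlap:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "Beta blockers commonly cause fatigue, dizziness and a slow heart rate."
paraphrase = "Tiredness, light-headedness and bradycardia are frequent with beta blockers."
wrong = "Beta blockers commonly cause fatigue, dizziness and a faster heart rate."

print(unigram_f1(paraphrase, reference))  # low score, despite being clinically correct
print(unigram_f1(wrong, reference))       # high score, despite the clinical error
```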
For this reason, we need a different approach. At Clinia, we have designed our own set of medical and linguistic criteria to evaluate responses from both the content perspective (Is the information correct, complete, and safe?) and the form perspective (Is it clear, respectful, and useful to the reader?). This ensures that our evaluation reflects trustworthiness and patient safety.
These criteria safeguard the semantic validity and scientific robustness of model outputs, which are the foundation of trustworthiness in health care AI.
In health care, even slightly off-topic content can waste time or mislead. Therefore, responses must directly address the user's query. We evaluate relevance on a graded scale:
✅ Relevant – The response answers the question fully and precisely.
🟡 Contextually related – The response relates to the topic but doesn’t address the exact question.
❌ Irrelevant – The content is off-topic or misleading.
As an example, consider the question What is hypoglossal nerve stimulation? and the following three responses:
✅ Hypoglossal nerve stimulation is a medical treatment used for obstructive sleep apnea (OSA). It involves the use of an implanted device that stimulates the hypoglossal nerve, which controls tongue movement. By stimulating this nerve during sleep, the device helps to keep the airway open, reducing apneic events and improving breathing.
🟡 Obstructive sleep apnea is a condition where the airway becomes blocked during sleep, causing breathing interruptions. Various treatments exist for this condition, including lifestyle changes, hypoglossal nerve stimulation, CPAP machines, and surgical options.
❌ The amygdala is a small, almond-shaped cluster of nuclei located deep within the temporal lobes of the brain.
This example highlights how relevance is not just about correctness, but about usefulness in context. A response that goes straight to the point enables faster, safer decision-making, while tangential or unrelated content risks distracting the user from what truly matters.
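The graded scale above can also be expressed as a small rubric in code. The sketch below is illustrative only: the enum values mirror the three grades, while the judge prompt and the `judge` callable are assumptions, not Clinia's actual implementation.

```python
# A minimal sketch of the three-level relevance rubric described above.
from enum import Enum

class Relevance(Enum):
    RELEVANT = 2              # answers the question fully and precisely
    CONTEXTUALLY_RELATED = 1  # on topic, but misses the exact question
    IRRELEVANT = 0            # off-topic or misleading

JUDGE_PROMPT = """Question: {question}
Response: {response}
Grade the response as RELEVANT, CONTEXTUALLY_RELATED, or IRRELEVANT."""

def grade_relevance(question: str, response: str, judge) -> Relevance:
    """`judge` is any callable that maps a prompt to one of the three labels."""
    label = judge(JUDGE_PROMPT.format(question=question, response=response))
    return Relevance[label.strip().upper()]
```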
When we ask similar questions, a model should give answers that make sense together and provide the same medical guidance. For example:
What are the side effects of beta blockers?
What risks come with using beta blockers?
Both questions should lead to consistent, reliable information—no contradictions or surprises.
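A consistency check of this kind can be sketched as follows. The `generate_answer` and `contradicts` callables are placeholders (for the RAG pipeline and, say, an NLI-style contradiction detector) and are assumptions of this sketch.

```python
# Illustrative consistency probe: ask paraphrased questions and flag
# answer pairs that a contradiction checker marks as inconsistent.
from itertools import combinations

PARAPHRASES = [
    "What are the side effects of beta blockers?",
    "What risks come with using beta blockers?",
]

def consistency_issues(generate_answer, contradicts):
    answers = {q: generate_answer(q) for q in PARAPHRASES}
    return [
        (q1, q2)
        for (q1, a1), (q2, a2) in combinations(answers.items(), 2)
        if contradicts(a1, a2)
    ]
```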
Corroborated Claims: All factual statements must be verifiable across multiple reputable sources. Single-source reliance is avoided unless the topic is rare or emerging. Uncertainty is explicitly noted when consensus is lacking.
Hallucination Prevention: Models must be grounded in factual retrieval, with sentence-level referencing. Outputs without verifiable sources are rejected. Beyond missing references, we also screen for fabricated details, misleading associations, or overconfident claims that extend beyond the evidence. Even subtle hallucinations—such as inventing plausible-sounding drug interactions, misattributing study results, or overgeneralizing from a narrow source—are rejected. The standard is simple: any detail that cannot be backed by evidence is immediately filtered out.
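Sentence-level grounding can be illustrated with a simple filter: every sentence of a draft answer must be attributable to at least one retrieved passage, or the answer is rejected. The `is_supported` check (for example, an entailment model) is an assumption of this sketch, not a description of our production system.

```python
# Sketch of sentence-level grounding against retrieved passages.
import re

def grounded(answer: str, passages: list[str], is_supported) -> bool:
    # Split the draft answer into sentences on terminal punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    # Every sentence must be supported by at least one retrieved passage.
    return all(
        any(is_supported(sentence, passage) for passage in passages)
        for sentence in sentences
    )
```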
All passages in the response that include clinically relevant information—such as etiology, outcomes, prognosis, or potential treatments—must be supported by references to the retrieved articles that informed the answer. The credibility of these references is essential. In clinical contexts, citing a blog or generic website is unacceptable; medical professionals require primary evidence or trusted secondary sources. This helps ensure:
timely medical guidance (e.g., COVID-19 protocols, drug recalls).
transparency, so users can assess the temporal relevance of each statement.
To make responses easy to read and reliable, we ensure they follow a clear and consistent structure. All outputs must follow a predictable format (see the sketch after this list), including:
primary explanation or recommendation
alternate or edge-case explanations
summary (if applicable)
disclaimer (if necessary)
references
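One possible way to encode this structure is a typed response object. The field names below are assumptions drawn from the outline above, not a published schema.

```python
# Illustrative response schema mirroring the expected output structure.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StructuredResponse:
    primary_explanation: str                              # main explanation or recommendation
    alternates: list[str] = field(default_factory=list)   # alternate or edge-case explanations
    summary: Optional[str] = None                         # short recap, if applicable
    disclaimer: Optional[str] = None                       # safety caveat, if necessary
    references: list[str] = field(default_factory=list)   # cited sources
```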
The way an answer is written matters as much as the information it provides. The language should read naturally; it must be clear, fluent, and grammatically correct. It should also avoid awkward phrasing or distracting typos, which can undermine trust in the response.
Responses should be concise, avoiding unnecessary repetition and long-winded explanations. Clear, focused answers help users quickly find the information they need, which is especially important for busy health professionals making time-sensitive decisions.
Here’s an example in practice, showing how an answer might look for a question about the common symptoms of a cold.
✅ Common symptoms of a cold include rhinorrhea, pharyngitis, coughing, and sneezing.
❌ The common symptoms of a cold, also known as an upper respiratory tract infection, typically include rhinorrhea, which is the medical term for a runny nose, and pharyngitis, which is when the throat feels sore or irritated. Patients often experience coughing, which is a reflex action to clear the airways, and sneezing, which is a sudden involuntary expulsion of air from the nose and mouth. These symptoms are generally associated with viral infections and can vary in severity from mild to more pronounced.
The contrast between the two responses highlights how well-structured answers, like the first one, convey the same essential information more efficiently, making it easier for users to understand and act quickly.
All content should strike the right balance between caution and clarity. This means avoiding language that is too forceful or prescriptive, while still providing helpful, actionable information when appropriate. For example, compare these two ways of explaining a treatment:
❌ We recommend immediate steroid treatment.
✅ High-dose steroids may be considered, but should be discussed with a clinician.
The second version guides the reader without overstepping, showing the right mix of caution and clarity.
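A crude automated screen for overly prescriptive phrasing might look like the following. The phrase list is a small illustrative assumption, not an official rule set, and such a screen would complement rather than replace human review.

```python
# Illustrative screen for overly prescriptive phrasing.
PRESCRIPTIVE_PHRASES = ("we recommend", "you must", "always take", "never take")

def too_prescriptive(answer: str) -> bool:
    lowered = answer.lower()
    return any(phrase in lowered for phrase in PRESCRIPTIVE_PHRASES)

print(too_prescriptive("We recommend immediate steroid treatment."))   # True
print(too_prescriptive("High-dose steroids may be considered, but "
                       "should be discussed with a clinician."))        # False
```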
As our product is designed for both patients and health practitioners, our generated responses should match the complexity expected by the user:
For a Layperson, responses should use simple phrasing and plain-language definitions.
For a Clinician, clinical terminology and references to studies or metrics are expected.
For example, consider the question What are the potential side effects of taking beta blockers? This question can be addressed at different levels of complexity depending on the audience:
For a Patient: Beta blockers can make you feel tired, dizzy, or lightheaded. They might also cause trouble breathing, slow your heart rate, make you feel nauseous, or make your hands and feet feel colder.
For a Medical Expert: Beta blockers may cause side effects such as bradycardia, hypotension, and fatigue. Some patients might experience bronchospasm, especially those with a history of asthma or COPD. There can also be gastrointestinal disturbances, such as nausea, and peripheral vasoconstriction leading to cold extremities.
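In practice, audience adaptation can be approximated by conditioning generation on an audience-specific instruction. The instruction wording and the `generate` callable below are illustrative assumptions, not our actual prompts.

```python
# Sketch of audience-conditioned generation.
AUDIENCE_INSTRUCTIONS = {
    "layperson": "Use plain language and define any medical terms simply.",
    "clinician": "Use standard clinical terminology and cite relevant evidence.",
}

def answer_for(question: str, audience: str, generate) -> str:
    """`generate` is any callable that maps a prompt to a response string."""
    instruction = AUDIENCE_INSTRUCTIONS[audience]
    return generate(f"{instruction}\n\nQuestion: {question}")
```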
Another way to tailor a response to the user is to adapt to regional variations. Regional (or diatopic) variations are language differences that arise from geographic or cultural contexts. For the sake of consistency, the regional variation of the response should match that of the question—in other words, it should align with the user’s location. For example, a response should be written in British English for users in the UK and in American English for users in the United States.
Here is an example of different English variations of the same sentence:
British English: The patient is scheduled for an operation tomorrow and will need to be monitored in theatre. Make sure he has his blood group checked and has been fitted with a cannula.
American English: The patient is scheduled for surgery tomorrow and will need to be monitored in the operating room. Make sure his blood type is checked and he has been fitted with an IV.
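One simple way to illustrate this adaptation is a locale-specific term map. The entries below are taken from the example above and are only a tiny illustrative sample; real localization would be far more nuanced than string replacement.

```python
# Illustrative UK-to-US terminology mapping for the example sentence above.
UK_TO_US_TERMS = {
    "theatre": "the operating room",
    "blood group": "blood type",
    "cannula": "an IV",
}

def localize_to_us(text: str) -> str:
    for uk_term, us_term in UK_TO_US_TERMS.items():
        text = text.replace(uk_term, us_term)
    return text
```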
Evaluating generative AI in health care goes far beyond traditional metrics like word overlap. At Clinia, we focus on both what the AI says and how it says it, ensuring that answers are accurate, safe, clear, and tailored to the user’s needs. By combining rigorous checks on content with attention to clarity, tone, and adaptability to different audiences, our framework helps make AI a reliable partner in clinical decision-making.
Novikova, J., Dušek, O., Cercas Curry, A., & Rieser, V. (2017). Why We Need New Evaluation Metrics for NLG. Proceedings of EMNLP 2017.