Clinia’s Evaluation Framework for Safer Answers

When we ask an AI system a question, it can sometimes give an answer that sounds convincing but isn’t backed up by the right information. In everyday situations, this might not matter much, but in health care—where accuracy and trust are essential—the stakes are much higher.
One promising approach to reduce this risk is called Retrieval-Augmented Generation (RAG). Instead of relying only on what it has memorized during training, a RAG system first searches trusted sources—like medical guidelines or scientific articles—and then uses that information to generate its response. In other words, it doesn’t just invent an answer; it looks things up before speaking.
This design makes RAG especially valuable in health care, where knowledge is constantly evolving and every answer must be both reliable and safe. But it also highlights why careful evaluation is necessary: we need to ensure the system retrieves the right sources, interprets them correctly, and communicates its conclusions in a way health professionals can trust. Evaluating RAG in this context is about much more than performance—it’s about ensuring technology truly supports better decisions and, ultimately, better care.
Most standard AI evaluation methods come from older natural language processing (NLP) research. Metrics like BLEU or ROUGE compare the AI’s answer to a reference answer by counting overlapping words. This approach works reasonably well for tasks like translation or summarization, but it quickly shows its limits for generative AI in health care (Novikova et al., 2017).
First, these metrics only measure word overlap, not whether the answer is actually correct or meaningful.
They cannot detect when the AI hallucinates information, uses the wrong tone, produces unclear text, or even generates potentially harmful advice.
They also tend to penalize answers that are phrased differently, even if they are equally correct.
Most importantly, they were never designed for high-risk medical content, where accuracy, clarity, and safety of each answer matter more than word similarity.
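To make the limitation concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score (a deliberate simplification of the real metric, which adds stemming and other refinements). A paraphrased but correct answer scores poorly, while an answer that copies the reference wording yet contains a clinical error scores highly.

```python
# Simplified ROUGE-1-style unigram overlap (illustrative only).
def unigram_f1(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped count of shared words between candidate and reference.
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if not overlap:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "Beta blockers commonly cause fatigue, dizziness and a slow heart rate."
paraphrase = "Tiredness, light-headedness and bradycardia are frequent with beta blockers."
wrong = "Beta blockers commonly cause fatigue, dizziness and a faster heart rate."

print(unigram_f1(paraphrase, reference))  # low score, despite being clinically correct
print(unigram_f1(wrong, reference))       # high score, despite the clinical error
```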
For this reason, we need a different approach. At Clinia, we have designed our own set of medical and linguistic criteria to evaluate responses from both the content perspective (Is the information correct, complete, and safe?) and the form perspective (Is it clear, respectful, and useful to the reader?). This ensures that our evaluation reflects trustworthiness and patient safety.
These criteria safeguard the semantic validity and scientific robustness of model outputs, which are the foundation of trustworthiness in health care AI.
In health care, even slightly off-topic content can waste time or mislead. Therefore, responses must directly address the user's query. We evaluate relevance on a graded scale:
✅ Relevant – The response answers the question fully and precisely.
🟡 Contextually related – The response relates to the topic but doesn’t address the exact question.
❌ Irrelevant – The content is off-topic or misleading.
As an example, consider the question What is hypoglossal nerve stimulation? and the following three responses:
✅ Hypoglossal nerve stimulation is a medical treatment used for obstructive sleep apnea (OSA). It involves the use of an implanted device that stimulates the hypoglossal nerve, which controls tongue movement. By stimulating this nerve during sleep, the device helps to keep the airway open, reducing apneic events and improving breathing.
🟡 Obstructive sleep apnea is a condition where the airway becomes blocked during sleep, causing breathing interruptions. Various treatments exist for this condition, including lifestyle changes, hypoglossal nerve stimulation, CPAP machines, and surgical options.
❌ The amygdala is a small, almond-shaped cluster of nuclei located deep within the temporal lobes of the brain.
This example highlights how relevance is not just about correctness, but about usefulness in context. A response that goes straight to the point enables faster, safer decision-making, while tangential or unrelated content risks distracting the user from what truly matters.
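The graded scale above can also be expressed as a small rubric in code. The sketch below is illustrative only: the enum values mirror the three grades, while the judge prompt and the `judge` callable are assumptions, not Clinia's actual implementation.

```python
# A minimal sketch of the three-level relevance rubric described above.
from enum import Enum

class Relevance(Enum):
    RELEVANT = 2              # answers the question fully and precisely
    CONTEXTUALLY_RELATED = 1  # on topic, but misses the exact question
    IRRELEVANT = 0            # off-topic or misleading

JUDGE_PROMPT = """Question: {question}
Response: {response}
Grade the response as RELEVANT, CONTEXTUALLY_RELATED, or IRRELEVANT."""

def grade_relevance(question: str, response: str, judge) -> Relevance:
    """`judge` is any callable that maps a prompt to one of the three labels."""
    label = judge(JUDGE_PROMPT.format(question=question, response=response))
    return Relevance[label.strip().upper()]
```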
When we ask similar questions, a model should give answers that make sense together and provide the same medical guidance. For example:
What are the side effects of beta blockers?
What risks come with using beta blockers?
Both questions should lead to consistent, reliable information—no contradictions or surprises.
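A consistency check of this kind can be sketched as follows. The `generate_answer` and `contradicts` callables are placeholders (for the RAG pipeline and, say, an NLI-style contradiction detector) and are assumptions of this sketch.

```python
# Illustrative consistency probe: ask paraphrased questions and flag
# answer pairs that a contradiction checker marks as inconsistent.
from itertools import combinations

PARAPHRASES = [
    "What are the side effects of beta blockers?",
    "What risks come with using beta blockers?",
]

def consistency_issues(generate_answer, contradicts):
    answers = {q: generate_answer(q) for q in PARAPHRASES}
    return [
        (q1, q2)
        for (q1, a1), (q2, a2) in combinations(answers.items(), 2)
        if contradicts(a1, a2)
    ]
```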
Corroborated Claims: All factual statements must be verifiable across multiple reputable sources. Single-source reliance is avoided unless the topic is rare or emerging. Uncertainty is explicitly noted when consensus is lacking.
Hallucination Prevention: Models must be grounded in factual retrieval, with sentence-level referencing. Outputs without verifiable sources are rejected. Beyond missing references, we also screen for fabricated details, misleading associations, or overconfident claims that extend beyond the evidence. Even subtle hallucinations—such as inventing plausible-sounding drug interactions, misattributing study results, or overgeneralizing from a narrow source—are rejected. The standard is simple: any detail that cannot be backed by evidence is immediately filtered out.
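Sentence-level grounding can be illustrated with a simple filter: every sentence of a draft answer must be attributable to at least one retrieved passage, or the answer is rejected. The `is_supported` check (for example, an entailment model) is an assumption of this sketch, not a description of our production system.

```python
# Sketch of sentence-level grounding against retrieved passages.
import re

def grounded(answer: str, passages: list[str], is_supported) -> bool:
    # Split the draft answer into sentences on terminal punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    # Every sentence must be supported by at least one retrieved passage.
    return all(
        any(is_supported(sentence, passage) for passage in passages)
        for sentence in sentences
    )
```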
All passages in the response that include clinically relevant information—such as etiology, outcomes, prognosis, or potential treatments—must be supported by references to the retrieved articles that informed the answer. The credibility of these references is essential. In clinical contexts, citing a blog or generic website is unacceptable; medical professionals require primary evidence or trusted secondary sources. This helps ensure:
timely medical guidance (e.g., COVID-19 protocols, drug recalls).
transparency, so users can assess the temporal relevance of each statement.
To make responses easy to read and reliable, we ensure they follow a clear and consistent structure. All outputs must follow a predictable format (see the sketch after this list), including:
primary explanation or recommendation
alternate or edge-case explanations
summary (if applicable)
disclaimer (if necessary)
references
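One possible way to encode this structure is a typed response object. The field names below are assumptions drawn from the outline above, not a published schema.

```python
# Illustrative response schema mirroring the expected output structure.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StructuredResponse:
    primary_explanation: str                              # main explanation or recommendation
    alternates: list[str] = field(default_factory=list)   # alternate or edge-case explanations
    summary: Optional[str] = None                         # short recap, if applicable
    disclaimer: Optional[str] = None                       # safety caveat, if necessary
    references: list[str] = field(default_factory=list)   # cited sources
```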
The way an answer is written matters as much as the information it provides. The language should read naturally; it must be clear, fluent, and grammatically correct. It should also avoid awkward phrasing or distracting typos, which can undermine trust in the response.
Responses should be concise, avoiding unnecessary repetition and long-winded explanations. Clear, focused answers help users quickly find the information they need, which is especially important for busy health professionals making time-sensitive decisions.
Here’s an example in practice, showing how an answer might look for a question about the common symptoms of a cold.
✅ Common symptoms of a cold include rhinorrhea, pharyngitis, coughing, and sneezing.
❌ The common symptoms of a cold, also known as an upper respiratory tract infection, typically include rhinorrhea, which is the medical term for a runny nose, and pharyngitis, which is when the throat feels sore or irritated. Patients often experience coughing, which is a reflex action to clear the airways, and sneezing, which is a sudden involuntary expulsion of air from the nose and mouth. These symptoms are generally associated with viral infections and can vary in severity from mild to more pronounced.
The contrast between the two responses highlights how well-structured answers, like the first one, convey the same essential information more efficiently, making it easier for users to understand and act quickly.
All content should strike the right balance between caution and clarity. This means avoiding language that is too forceful or prescriptive, while still providing helpful, actionable information when appropriate. For example, compare these two ways of explaining a treatment:
❌ We recommend immediate steroid treatment.
✅ High-dose steroids may be considered, but should be discussed with a clinician.
The second version guides the reader without overstepping, showing the right mix of caution and clarity.
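A crude automated screen for overly prescriptive phrasing might look like the following. The phrase list is a small illustrative assumption, not an official rule set, and such a screen would complement rather than replace human review.

```python
# Illustrative screen for overly prescriptive phrasing.
PRESCRIPTIVE_PHRASES = ("we recommend", "you must", "always take", "never take")

def too_prescriptive(answer: str) -> bool:
    lowered = answer.lower()
    return any(phrase in lowered for phrase in PRESCRIPTIVE_PHRASES)

print(too_prescriptive("We recommend immediate steroid treatment."))   # True
print(too_prescriptive("High-dose steroids may be considered, but "
                       "should be discussed with a clinician."))        # False
```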
As our product is designed for both patients and health practitioners, our generated responses should match the complexity expected by the user:
For a Layperson, responses should use simple phrasing and plain-language definitions.
For a Clinician, clinical terminology and references to studies or metrics are expected.
For example, consider the question What are the potential side effects of taking beta blockers? This question can be addressed at different levels of complexity depending on the audience:
For a Patient: Beta blockers can make you feel tired, dizzy, or lightheaded. They might also cause trouble breathing, slow your heart rate, make you feel nauseous, or make your hands and feet feel colder.
For a Medical Expert: Beta blockers may cause side effects such as bradycardia, hypotension, and fatigue. Some patients might experience bronchospasm, especially those with a history of asthma or COPD. There can also be gastrointestinal disturbances, such as nausea, and peripheral vasoconstriction leading to cold extremities.
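In practice, audience adaptation can be approximated by conditioning generation on an audience-specific instruction. The instruction wording and the `generate` callable below are illustrative assumptions, not our actual prompts.

```python
# Sketch of audience-conditioned generation.
AUDIENCE_INSTRUCTIONS = {
    "layperson": "Use plain language and define any medical terms simply.",
    "clinician": "Use standard clinical terminology and cite relevant evidence.",
}

def answer_for(question: str, audience: str, generate) -> str:
    """`generate` is any callable that maps a prompt to a response string."""
    instruction = AUDIENCE_INSTRUCTIONS[audience]
    return generate(f"{instruction}\n\nQuestion: {question}")
```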
Another way to tailor a response to the user is to adapt to regional variations. Regional (or diatopic) variations are language differences that arise from geographic or cultural contexts. For the sake of consistency, the regional variation of the response should match that of the question—in other words, it should align with the user’s location. For example, a response should be written in British English for users in the UK and in American English for users in the United States.
Here is an example of different English variations of the same sentence:
British English: The patient is scheduled for an operation tomorrow and will need to be monitored in theatre. Make sure he has his blood group checked and has been fitted with a cannula.
American English: The patient is scheduled for surgery tomorrow and will need to be monitored in the operating room. Make sure his blood type is checked and he has been fitted with an IV.
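One simple way to illustrate this adaptation is a locale-specific term map. The entries below are taken from the example above and are only a tiny illustrative sample; real localization would be far more nuanced than string replacement.

```python
# Illustrative UK-to-US terminology mapping for the example sentence above.
UK_TO_US_TERMS = {
    "theatre": "the operating room",
    "blood group": "blood type",
    "cannula": "an IV",
}

def localize_to_us(text: str) -> str:
    for uk_term, us_term in UK_TO_US_TERMS.items():
        text = text.replace(uk_term, us_term)
    return text
```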
Evaluating generative AI in health care goes far beyond traditional metrics like word overlap. At Clinia, we focus on both what the AI says and how it says it, ensuring that answers are accurate, safe, clear, and tailored to the user’s needs. By combining rigorous checks on content with attention to clarity, tone, and adaptability to different audiences, our framework helps make AI a reliable partner in clinical decision-making.
Novikova, J., Dušek, O., Cercas Curry, A., & Rieser, V. (2017). Why We Need New Evaluation Metrics for NLG. Proceedings of EMNLP 2017.