
GRADE Meets AI: Towards Trustworthy Human–Machine Collaboration

20/11/2025 | Gordon Guyatt

Artificial intelligence (AI) is rapidly reshaping how we conduct medical research, from synthesizing evidence to developing clinical guidelines. Tasks that once took months, or even years, can now be accelerated using AI tools. These innovations span literature screening, data extraction, risk of bias assessment, and even the generation of certainty of evidence ratings. None are yet sufficiently user-friendly, trustworthy, or affordable to be in widespread use, but the promise is nevertheless clear: faster processes, reduced costs, and near real-time updates.

In our recent BMJ series on the Core GRADE approach, we outlined the structured steps required to assess the certainty of evidence and to move from evidence to recommendations. These processes are widely used in systematic reviews, health technology assessments, and clinical practice guidelines. Yet, as we embrace AI to rate certainty of evidence and move from evidence to recommendations, a crucial challenge emerges: how do we preserve thoughtful, transparent, and patient-centered judgment in an environment increasingly powered by AI?

An increasing body of research is exploring the use of AI to automate components of the GRADE process. While AI tools, particularly large language models guided by human-crafted prompts, show potential to streamline technical tasks, they cannot, at least not yet, replace the nuanced, context-sensitive deliberation that Core GRADE requires. Our group is working on producing an AI instrument that will conduct trustworthy assessments of certainty of evidence for systematic reviews of paired interventions (the issue the Core GRADE series addressed).

In our work thus far, it has become evident that intense AI–human collaboration is necessary: there is much that the systematic reviewer must specify and check. Such elements include:

- the original PICO question;
- the source of information for baseline risk estimates;
- the threshold for rating certainty of evidence;
- if the chosen threshold is the minimal important difference (MID), the specification of the MID for every outcome;
- if the threshold is the null, the determination from forest plots of whether the point estimate is close enough to the null to warrant modifying the target to little or no effect;
- the choice of a fixed- or random-effects statistical model;
- the threshold for evidence dominated by high risk of bias studies when deciding whether to rate down for risk of bias;
- the specification of subgroup analyses, including the hypothesized group with the larger effect – and the list goes on.

In the end, we will require AI to justify all its key decisions, allowing a human check to ensure all has gone well. Clearly, AI will greatly facilitate the process, but success will require major input from a human with a deep understanding of GRADE methods – or at least Core GRADE methods. Moving from evidence to recommendations demands an integration of certainty of evidence, values and preferences, feasibility, costs, and equity – dimensions that remain well beyond the current reach of AI, even with the most advanced models. That may change, but for now, achieving trustworthy AI-aided certainty of evidence ratings – with the goal specified as AI-facilitated rather than AI taking over – is a sufficient challenge.

 

Gordon Guyatt
Affiliation: McMaster University - Canada

 
