Living Systematic Review Methods

Systematic reviews are the most comprehensive tool we have to describe what is known and unknown about the clinical effectiveness of a given health care intervention. They also show how a single piece of research fits into the larger body of existing research.

A “living” systematic review is a newer and evolving method designed to keep the evidence up-to-date using systematic and predefined surveillance strategies. We use well-established methods to comprehensively identify, select, critique, and summarize findings from different studies that answer a specific clinical question.

Before starting each systematic review, we develop and post a protocol with details about how we will conduct the review (e.g., Post-Traumatic Stress Disorder). Each review undergoes peer review by at least 2 experts in the field. Consistent with national and international standards for conducting systematic reviews, the STEM team defines the PICOTS (population, interventions, comparisons, outcomes, timing, settings, and study designs) for each review.

Our Technical Expert Panel (TEP) helps ensure we ask the right questions and use appropriate inclusion and exclusion criteria. We also conduct comprehensive literature searches, title and abstract review, full-text review, and risk of bias (study quality) assessment, and we evaluate the certainty of the body of evidence for each outcome. The latter 2 aspects are described in more detail below.

Updating Our Living Systematic Reviews 

Once the STEM team has completed a systematic review, we will begin the “living” phase of the review. We will conduct surveillance on each topic and search for new eligible studies at predefined intervals, such as every 3 months. The surveillance interval will vary based on topic-specific information (e.g., number of eligible ongoing studies, number of eligible included studies).

If surveillance identifies no new studies, or only a few studies that do not affect our overall findings or conclusions, we will not conduct a full update (i.e., update the visual abstract, summary text, and full report). Instead, we will write a brief summary describing our search period, the sources we searched, the studies we identified, and why they were insufficient to warrant a full update. If present, this surveillance summary will be displayed at the top of the relevant review page.

If surveillance identifies 1 or more studies that we determine are meaningful (e.g., they change the conclusions of our previous review), we will conduct a full update of our systematic review, which means we will incorporate the new evidence into our visual abstract, summary text, and full report. The surveillance process will be iterative for the life cycle of each of our living systematic reviews.
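The surveillance decision described above can be sketched in a few lines of Python. This is only an illustration, not the STEM team's actual tooling: the names `NewStudy`, `is_meaningful`, and `decide_surveillance_action` are hypothetical, and the judgment of whether a study is meaningful is assumed to be made by reviewers and simply recorded as a flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class SurveillanceAction(Enum):
    SURVEILLANCE_SUMMARY = "brief surveillance summary"
    FULL_UPDATE = "full update"


@dataclass
class NewStudy:
    citation: str
    # Hypothetical flag: the reviewers' judgment that the study is
    # meaningful (e.g., it would change the review's conclusions).
    is_meaningful: bool


def decide_surveillance_action(new_studies: Sequence[NewStudy]) -> SurveillanceAction:
    """Choose the outcome of one surveillance cycle."""
    # Any meaningful study triggers a full update of the review.
    if any(study.is_meaningful for study in new_studies):
        return SurveillanceAction.FULL_UPDATE
    # Zero new studies, or only studies that do not affect the
    # conclusions, lead to a brief surveillance summary instead.
    return SurveillanceAction.SURVEILLANCE_SUMMARY
```

For example, a cycle that finds no new studies returns `SurveillanceAction.SURVEILLANCE_SUMMARY`, while a cycle that finds even one study flagged as meaningful returns `SurveillanceAction.FULL_UPDATE`.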

Risk of Bias (Study Quality)

Flaws in a study’s design or the way it is reported can reduce confidence in its findings. Using standardized instruments, we evaluate the quality, also known as “risk of bias,” of each study. When conducting a systematic review, we use 2 independent assessors with a third senior researcher resolving disagreements, when needed.
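As a minimal sketch of the dual-assessment workflow, the function below returns the agreed rating when both assessors match and otherwise defers to the senior researcher. The function name, the rating labels as lowercase strings, and the error handling are assumptions made for illustration; the source does not describe how ratings are recorded.

```python
from typing import Optional

# Rating labels used in this sketch (an assumption about how ratings are stored).
RATINGS = ("low", "moderate", "high")


def reconcile_risk_of_bias(
    assessor_a: str,
    assessor_b: str,
    senior_rating: Optional[str] = None,
) -> str:
    """Return the final risk-of-bias rating for one study."""
    for rating in (assessor_a, assessor_b):
        if rating not in RATINGS:
            raise ValueError(f"unknown rating: {rating!r}")
    # When the 2 independent assessors agree, their rating stands.
    if assessor_a == assessor_b:
        return assessor_a
    # On disagreement, a third senior researcher makes the final call.
    if senior_rating is None:
        raise ValueError("assessors disagree; a senior rating is required")
    if senior_rating not in RATINGS:
        raise ValueError(f"unknown rating: {senior_rating!r}")
    return senior_rating
```

Raising an error when the assessors disagree and no senior rating is supplied keeps an unresolved disagreement from silently producing a final rating.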

Randomized Controlled Trials

Low-risk-of-bias RCTs generally include a clear description of the population, setting, intervention, and comparison groups; a random and concealed allocation of participants to study groups; low dropout rates; and intention-to-treat analyses (i.e., an analysis based on group assignment at baseline).

Moderate-risk-of-bias RCTs have incomplete information about methods that might mask important limitations or other biases such as moderate dropout rates.

High-risk-of-bias RCTs have clear flaws that could introduce significant bias, which might include an insufficient approach for randomization or allocation concealment, high rates of attrition without intention-to-treat analysis, or differences in participant characteristics between groups at baseline.

Cohort Studies

Low-risk-of-bias cohort studies include a sample representative of the source population, have low loss to follow-up, and measure and consider relevant confounding factors (e.g., age, income, health status).

Moderate-risk-of-bias cohort studies might not have measured all relevant confounding factors or adjusted for them in statistical analyses, have loss to follow-up that could bias findings, consist of a sample not representative of the source population, or have potential conflicts of interest that are not addressed.

High-risk-of-bias cohort studies have clear and serious bias that would affect findings, which might include not adjusting for all major confounders or having high loss to follow-up.

Case-Control Studies

Low-risk-of-bias case-control studies include appropriate and clear consideration and selection of cases and controls, valid measures of exposures in both groups, and statistical adjustment for all major confounding variables.

Moderate-risk-of-bias case-control studies might not have measured all relevant confounding factors or adjusted for them in statistical analyses, might include controls not fully representative of cases, or might not have addressed potential conflicts of interest.

High-risk-of-bias case-control studies have clear and serious bias that would affect findings, which might include not adjusting for all major confounders or selecting controls from a population that differs substantially from the cases.

The STEM research team will also consider relevant conflicts of interest in the development of research studies, such as the source of funding and authors' relationships with organizations.

Grading of Recommendations, Assessment, Development, and Evaluation (GRADE)

Several studies often examine the effects of a given treatment on a specific outcome. We evaluate and rate the certainty of evidence for each outcome in the review. Certainty of evidence describes how well the entire body of evidence answers questions about an intervention’s effect on specific outcomes.

We determine certainty of evidence through:

  • Consistency of findings across studies;
  • Methodological quality of individual studies;
  • Directness of the populations and outcomes studied to the ones likely to be important in clinical practice;
  • Precision of effect estimates (often measured by the confidence interval of the summary estimate in a meta-analysis); and
  • Publication bias.
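To illustrate the precision criterion, the sketch below computes a pooled estimate and its 95% confidence interval using inverse-variance weighting. The fixed-effect model is an assumption chosen for brevity; the source does not say which meta-analytic model is used, and a random-effects model would generally give a wider interval. Effect estimates are assumed to be on an additive scale (e.g., mean differences or log risk ratios), and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def pooled_estimate_with_ci(
    effects: Sequence[float],
    variances: Sequence[float],
    z: float = 1.96,  # normal quantile for an approximate 95% interval
) -> Tuple[float, float, float]:
    """Return (pooled estimate, lower bound, upper bound)."""
    if len(effects) == 0 or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # Each study is weighted by the inverse of its variance, so more
    # precise studies contribute more to the summary estimate.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, pooled - z * standard_error, pooled + z * standard_error
```

A narrow interval around the pooled estimate suggests a precise estimate, while a wide interval suggests an imprecise one.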

When conducting a systematic review, we use 2 independent raters for each GRADE rating, with a third senior researcher resolving disagreements, when needed. Each GRADE rating reflects the level of certainty that the body of evidence provides for an outcome:

  • High: Raters are very certain the estimate of the effect of the intervention on the outcome lies close to the true (unbiased) effect. When an outcome is rated as “High,” additional studies are very unlikely to change the estimate of the effect of the intervention on that outcome.
  • Moderate: Raters are moderately certain in the estimate of the effect of the intervention on the outcome. The true effect is likely to be close to the estimate of the effect, but there is a possibility it is different. When an outcome is rated as “Moderate,” additional studies might change the estimate of the effect of the intervention on the outcome, but are unlikely to change the direction of the effect.
  • Low: Raters have low certainty in the estimate of the effect of the intervention on the outcome. The true effect might be substantially different from the estimate of the effect. When an outcome is rated as “Low,” additional studies will likely change the estimate of the effect of the intervention on the outcome, and could change the direction of the effect.
  • Very low (also called insufficient by other organizations): Raters have very little certainty in the estimate of the effect of the intervention on the outcome. When an outcome is rated as “Very low,” additional studies will very likely change the estimate of the effect and could change its direction.

When using GRADE to evaluate outcomes from RCTs, a rater will start at a rating of “High” certainty of evidence and can downgrade the evidence for:

  •  Study risk of bias: The quality of the eligible studies for the outcome. If there are several studies with high risk of bias, the outcome might be downgraded 1 level (from high to moderate certainty of evidence).
  •  Imprecision: The variation or spread in the data as generally indicated by a 95% confidence interval. If the 95% confidence interval is wide, the rater might downgrade the outcome 1 level.
  •  Indirectness: The generalizability of the body of evidence for the outcome to the intended population. For example, findings on cannabis used in studies completed in the past might not be applicable to modern recreational cannabis. In this scenario, the rater might downgrade the outcome 1 level.
  •  Inconsistency: The between-study differences in the estimate of the effect of the intervention. This might be shown through measures of heterogeneity in a meta-analysis (e.g., I² statistic) or through observed clinical heterogeneity between studies. An example of clinical heterogeneity would be differences in participants' health status between studies. If heterogeneity is observed and cannot be explained by other study factors, then the rater might downgrade the outcome 1 level.
  •  Publication bias: A bias in which positive studies (i.e., showing a significant benefit of an intervention) are more likely to be published. If a funnel plot or a review of the literature shows that smaller studies and negative studies (i.e., showing no significant benefit of an intervention) are missing from the literature base, then the rater might downgrade the outcome 1 level.

For any of the reasons above, if the issue is particularly severe, then the rater might downgrade the outcome by 2 levels instead of 1 (e.g., from high to low certainty of evidence).
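The inconsistency criterion above mentions the I² statistic. As a hedged illustration, the sketch below uses the commonly cited definition based on Cochran's Q, where I² = max(0, (Q − df) / Q) × 100 and df is the number of studies minus 1. The source does not state how I² is calculated or which threshold would prompt a downgrade, so the function name and inputs (effect estimates with their variances) are assumptions.

```python
from typing import Sequence


def i_squared(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Return I² as a percentage (0 to 100) for a set of study estimates."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least 2 studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # Inverse-variance weights and the weighted mean effect.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    degrees_of_freedom = len(effects) - 1
    if q <= 0:
        return 0.0
    # Share of total variation beyond what chance alone would explain.
    return max(0.0, (q - degrees_of_freedom) / q) * 100.0
```

Identical study estimates give an I² of 0, and the value approaches 100 as the estimates diverge relative to their variances. A numeric I² does not replace the rater's judgment about whether heterogeneity can be explained by other study factors.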

GRADE criteria can also be used to upgrade the certainty of the evidence for an outcome. Upgrading is generally reserved for observational studies. Unlike RCTs, which start at “High,” observational studies start at “Low” certainty of evidence. The following are criteria to consider for upgrading the evidence:

  •  Large effect: The magnitude of the effect of the intervention on an outcome. If the magnitude of the effect is large (e.g., a risk ratio greater than 3), then a rater might increase the certainty of evidence by 1 level.
  •  Dose-response relationship: A scenario in which the outcome increases or decreases as the dose or frequency of the intervention increases or decreases. If this type of relationship is observed, then a rater might increase the certainty of evidence by 1 level.
  •  All plausible confounding: A scenario where all relevant confounding variables (i.e., a variable that distorts the relationship between the intervention and outcome) are accounted for in a statistical analysis. If this is observed in the analysis, then a rater might increase the certainty of evidence by 1 level.

As with downgrading, if one of the above criteria for upgrading is particularly strong, then a rater might upgrade the rating by 2 levels.
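The starting levels and the downgrading and upgrading steps can be combined into a simple tally. The sketch below is an illustration under stated assumptions, not an official GRADE algorithm: the function name, the domain keys, and the clamping of the result to the range from “Very low” to “High” are choices made here, and in practice each adjustment reflects a rater's judgment and is not a mechanical rule.

```python
from typing import Dict, Optional

LEVELS = ["very low", "low", "moderate", "high"]

# Starting points described in the text: RCTs start at "high",
# observational studies start at "low".
START_INDEX = {"rct": 3, "observational": 1}

DOWNGRADE_DOMAINS = {
    "risk_of_bias", "imprecision", "indirectness",
    "inconsistency", "publication_bias",
}
UPGRADE_DOMAINS = {"large_effect", "dose_response", "plausible_confounding"}


def grade_certainty(
    study_design: str,
    downgrades: Optional[Dict[str, int]] = None,
    upgrades: Optional[Dict[str, int]] = None,
) -> str:
    """Tally a certainty rating from per-domain adjustments of 0, 1, or 2 levels."""
    if study_design not in START_INDEX:
        raise ValueError(f"unknown study design: {study_design!r}")
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    for name, allowed in (("downgrade", DOWNGRADE_DOMAINS), ("upgrade", UPGRADE_DOMAINS)):
        adjustments = downgrades if name == "downgrade" else upgrades
        for domain, levels in adjustments.items():
            if domain not in allowed:
                raise ValueError(f"unknown {name} domain: {domain!r}")
            if levels not in (0, 1, 2):
                raise ValueError("each adjustment must be 0, 1, or 2 levels")
    index = START_INDEX[study_design]
    index -= sum(downgrades.values())
    index += sum(upgrades.values())
    # Assumption of this sketch: the rating cannot leave the 4-level scale.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]
```

For example, `grade_certainty("rct", downgrades={"imprecision": 2})` returns `"low"`, and `grade_certainty("observational", upgrades={"large_effect": 1})` returns `"moderate"`.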

GRADE can be applied to outcomes with data from a meta-analysis or through a qualitative (or narrative) synthesis.