Evaluation of evidence grades in psychiatry and psychotherapy guidelines

Background Information regarding the distribution of evidence grades in psychiatry and psychotherapy guidelines is lacking. Based on the German evidence- and consensus- based (S3) psychiatry and psychotherapy and the Scottish Intercollegiate Guidelines Network (SIGN) treatment guidelines, we aimed to specify how guideline recommendations are composed and to what extent recommendations are evidence-based. Methods Data was collected from all published evidence- and consensus-based S3-classified psychiatry and psychotherapy guidelines. As control conditions, data from German neurology S3-classified guidelines as well as data from recent SIGN guidelines of mental health were extracted. Two investigators reviewed the selected guidelines independently, extracted and analysed the numbers and levels of recommendations. Results On average, 45.1% of all recommendations are not based on strong scientific evidence in German guidelines of psychiatry and psychotherapy. A related pattern can be confirmed for SIGN guidelines, where the mean average of recommendations with lacking evidence is 33.9%. By contrast, in the German guidelines of neurology the average of such recommendations is 16.5%. A total of 24.5% of all recommendations in the guidelines of psychiatry and psychotherapy are classified as level A recommendations, compared to 31.6% in the field of neurology and 31.1% in the SIGN guidelines. Related patterns were observed for B and 0 level recommendations. Conclusion Guidelines should be practical tools to simplify the decision-making process based on scientific evidence. Up to 45% of all recommendations in the investigated guidelines of psychiatry and psychotherapy are not based on strong scientific evidence. The reasons for this high number remain unclear. Possibly, only a limited number of studies answer clinically relevant questions. Our findings thereby question whether guidelines should include non-evidence-based recommendations to be methodologically stringent and whether specific processes to develop expert-opinion statements must be implemented.


Background
Evidence-based medicine aims to integrate the best research evidence with clinical expertise and patient values [1]. In this regard, guidelines are systematically developed tools to assist physicians and other health-care professionals in the decision-making process and to support the transparency of medical decisions. Apart from summarising the current medical evidence, guidelines are instruments to weigh the risk-benefit ratios of diagnostic procedures and treatments. Guidelines are also instruments of quality management of the healthcare system and viewed to improve quality and effectiveness of clinical as well as costs of healthcare [2].
In Germany, the Association of the Scientific Medical Societies (AWMF) started coordinating the development of guidelines in 1995. Since 1997 guidelines are cost-free and publicly available online [3]. In contrast to other countries, clinical guidelines in Germany are not statefunded. For example, in the UK the National Institute of Clinical Excellence (NICE) orders the development of a clinical guideline from a group of independent experts and finances the process [4]. In Germany, the development is only financed by the subscriptions of the publishing medical societies [3] and by the cost-free work of the people involved. Because of the costly and timeintensive development process of high-quality clinical guidelines, different options for funding are currently in discussion. The objective would be an independent and sustainable financial support [5] to accelerate the development process, to consider the most recent study results and to evaluate the guidelines. Despite an important legislative initiative to improve the funding situation [6], it is not foreseeable that there will be a fundamental change regarding the way how guidelines are developed in Germany.
Regardless of any financial consideration, the development process of German guidelines follows standard operation procedures, developed by the AWMF aiming at providing the same level of quality across guidelines. More than 20 years ago clinical guidelines in Germany were not subject to standardisation, so the AWMF started an extensive program in 1998 to improve their quality [3]. For this purpose, the German Rating Instrument for Guideline Valuation (DELBI) has been developed based on the Appraisal of the Guidelines for Research and Evaluation instrument (AGREE, DELBI domains 1 to 6) and was adapted for the German healthcare system (DELBI domain 7) to minimise differences concerning quality and underlying methodology across guidelines and to ensure inter-comparability [7]. Similar efforts were made by the Scottish Intercollegiate Guidelines Network (SIGN) as summarised in the SIGN guidelines developer's handbook.
In that regard, three major classifications for guidelines can be distinguished in Germany. Groups of experts develop in an informal consensus ('expert recommendations'), which results in so called S1-classified guidelines. By comparison S2-classified guidelines need a formal multidisciplinary consensus. Since 2004, the differentiation between S2e-classified ('evidence-based') and S2k-classified ('consensus-based') exists. S2kguidelines are not evidence-based and recommendations are not graded, but the consensus process is standardised and has to follow standard operating procedures. For the development of S2e-guidelines, a systematic research process and the selection and evaluation of evidence are required. S3-classified guidelines are evidence-and consensus-based and have the highest methodological quality as well as the most expensive development process. After a structured literature research and rating of the evidence, they have to fulfil different DELBI criteria [3]. AWMF guidelines use either the Oxford Centre for Evidence Based Medicine (OCEBM) or the SIGN grading systems and AWMF S3-guidelines underlie a structured consensus process which finally leads to the recommendations (grades A, B, 0 and CCP, see Table 1). AWMF was one of the funding organisations of the Guidelines International network and from the international perspective, AWMF S3-guidelines are rooted in one of the most elaborated development processes.
Since 1998 the total number of published medical guidelines in Germany has been growing continuously, but over the last years the proportion of S1 to S3 guidelines has changed. In 2002, 445 S1-guidelines, 121 S2-guidelines and 17 S3-guidelines were available. In 2016, 108 S3-guidelines were available and 374 guidelines were newly registered [8].
Since 2014 the number of newly registered guidelines has decreased, but with a continuously increasing number of high-quality guidelines [8,9] which might confirm that the process of improving the quality of guidelines in general has been successful so far.
High methodological quality is essential to guarantee recommendations based on scientific evidence with practical relevance and to minimise possible conflict of interest biases. Moreover, the amount of recommendations and their respective grading levels are important from the view of usability. However, the way of how to grade evidence and to develop guidelines differs between countries. Nonetheless, most guidelines use the original or the adapted version of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) process [10].
Despite important efforts to harmonise and improve the guideline development processes, guidelines do have a relevant proportion of non-evidence-based recommendations that are presumably expert opinions. In 2014 one study of the Scottish Intercollegiate Guidelines Network (SIGN) (somatic disciplines and mental health) evaluated the 42 national guidelines with respect to the quantity and quality of the given recommendations. The authors observed that only 24% of all recommendations were based on high scientific evidence (denoted as level A recommendations). Remarkably, more than a third of all recommendations were expert-based recommendations and more than a half of the recommendations were based on limited scientific evidence [11]. Moreover, these results highlight that the number of recommendations with low or absent evidence increases with the total amount of recommendations.
The challenge to develop high-quality guidelines in psychiatry and psychotherapy is all the greater compared to somatic disciplines due to the implementation of patient reported outcomes. So called "soft endpoints", which are composed of patient and observer reported outcomes collected via self-or observer-rated clinical rating scales, questionnaires, or interviews [12]. Moreover, psychiatric disorders are subject to a high complexity of diagnosis, treatment, extensive involvement of patients and relatives and outcome measurements. One could speculate that the desire of the guideline developers to deal with this high complexity could possibly lead to a high amount of not evidence-based recommendations. To test this hypothesis, we conducted an analysis of distribution of recommendation grades in evidence-and consensus-based German guidelines of psychiatry and psychotherapy compared to national guidelines of neurology and to the internationally established SIGN guidelines.

Methods
This study focuses on all available German S3-guidelines in psychiatry and psychotherapy in their respective most recent versions (status July 2019). The S3-guidelines of neurology were investigated to serve as a first control group to allow a comparison to a medical field with an overlap in diagnoses or treatments. The recent SIGN guidelines of mental health (status July 2019) were investigated as second control group taking into account the study from Baird and colleagues published in 2014 [11]. Moreover, SIGN guidelines are based on related development procedures to German S3-guidelines in evaluating and grading evidence allowing for a direct comparison. Indeed, there are differences between the SIGN-and the German S3-guideline developing process. The most important difference is the composition of the guideline developing group: the SIGN Executive discusses which specialties and professions should be represented on the guideline development group with the topic proposer(s). This process is advised from the appropriate specialty subgroup(s) and SIGN Council. SIGN guideline development groups vary in size depending on the scope of the topic under consideration, but generally comprise between 15 and 25 members. In contrast, the AWMF only proposes that all relevant interest groups should be represented, without recommendations about the size of the developing group. The respective medical societies are responsible for assembling the guideline group and developing the guideline. The AWMF provides the methodological framework for developing the guidelines and serves as the publisher.
To identify guidelines of interest, we screened the AWMF database, the webpages of the publishing medical societies and the SIGN webpage for all guidelines covering psychiatric or neurological disorders. Two independent investigators (MH and LL) selected and evaluated all AWMF S3-guidelines published by the German Society for Psychiatry and Psychotherapy (DGPPN) and the German association for Neurology (DGN) and the respective SIGN guidelines. If investigators had different results, these results were discussed with a third investigator (AH) until a final consensusdecision was reached.
15 from 17 guidelines, published by the DGPPN were S3-guidelines. The two excluded guidelines were Table 1 Grade of recommendation and underlying levels of evidence (SIGN and OCEBM); following the SIGN and the OCEBM grading system the levels of evidence are rated in four or five categories. Results of studies can be extrapolated or interpolated so that lower levels of evidence can underlie higher grades of recommendations and conversely guidelines classified as S1 and S2: the S1-guideline for disorders of sexual preferences and the S2-guideline for personality disorders. We included the two content related S3-guidelines "Functional body complaints" and "Post-traumatic stress disorder", published only by the AWMF, resulting in a total of 17 S3-guidelines of psychiatry and psychotherapy. The DGN published a total of 96 guidelines with only five S3-guidelines. Apart from recent guidelines (n = 18), we included expired guidelines and guidelines under review as well (n = 4). In contrast to the SIGN guidelines, where expired guidelines are automatically removed after 10 years, the German guidelines are still available and used in clinical practice years after their expiration (e.g. see guideline "Obsessive compulsive disorder", which expired 2018 and is now "under revision"). The dementia guideline as a joint publication of DGPPN and DGN was listed twice but was analysed as psychiatric guideline. 5 of 48 currently available SIGN guidelines were guidelines covering the domain of mental health. Taking the information of the SIGN 50 developer's handbook into consideration, these five guidelines were methodologically comparable to German S3-guidelines [13][14][15][16][17]. Please see Fig. 1 and Table 1 for more information reading the selection process.
Guidelines were analysed regarding the amount of recommendations, the distribution of recommendation grades, the size of the responsible consensus groups (the latter only applicable to German S3-guidelines). An analysis of the size of the SIGN consensus groups was not performed due to the mentioned differences in the development process and in the composition of these groups. Since guidelines sometimes applied different wording for grading non-evidence-based recommendations, we summarised those recommendations as "clinical consensus points" (CCP). This label included the terms "Klinischer Konsensus Punkt" (clinical consensus point), "Expertenkonsens" (expert consensus) and "GCP" (Good Clinical Practice). The recommendation levels A, B and 0 were consistently used in all analysed guidelines (also referring to Table 1 for more detailed information on the grading system). However, the term "statement" was used for all recommendations or declarations especially mentioned and highlighted in the guidelines without having a level of evidence or grading of recommendations. In contrast to A, B, 0 or CCP recommendations statements neither involve a recommendation grade nor are they based on scientific evidence. Nevertheless, they are based on an expert consensus and can be seen as additional material to the recommendations. Statements were counted separately and not included in the statistics. In some guidelines, recommendations with two gradings like 'A/CCP' were available. In such cases, both recommendation categories were taken into account for further analyses.

Statistics
We based all analyses on the long versions of the guidelines. The rational for using the long version was, that we initially expected that the long and short versions would include the same number of recommendations. However, as detailed below, we detected inconsistencies regarding the number of recommendations in seven of the German S3-guidelines. All analyses were computed using IBM SPSS 25 (IBM, Armonk, NY, USA) with a significance level of α = 0.05. Descriptive statistics include means, standard deviations and maximum and minimum counts. Mean relative frequencies of the four recommendation grades (A, B, 0, CCP) for each of the three study groups separately (AWMF psychiatry and psychotherapy, AWMF neurology, SIGN mental health) were calculated based on the individual relative frequencies of recommendation grades within each guideline (absolute count of a given recommendation grade (e.g. A or CCP)/absolute total count of all recommendations). Explorative Pearson correlation analyses were conducted to investigate correlations between the size of the consensus group and the relative frequencies of a given recommendation grade and between the amount of CCP recommendations and the total of recommendations in German S3-guidelines.

Results
In July 2019, 18 out of the 22 German guidelines were available as a recent or updated version: 14 of them were guidelines of psychiatry and psychotherapy and four guidelines of neurology. The four remaining guidelines were under revision. 5 of the available and current 48 SIGN guidelines were guidelines covering the domain of mental health. For 14 of the German S3-guidelines a short version was available. In 7 out of 11 [18][19][20][21][22][23][24] of the guidelines of psychiatry and psychotherapy and in three out of three neurology guidelines [25][26][27] we detected discrepancies in the amount of recommendations between long and short versions. Four of the German guidelines of psychiatry and psychotherapy and one neurology guideline are based on the SIGN grading system. The remaining guidelines used the OCEBM system to grade evidence.
Only five of the 22 German S3-guidelines, 2 from psychiatry and psychotherapy and 3 from neurology, had exclusively level A, B and 0 recommendations and no CCP recommendation [27][28][29][30][31]. However, the guideline of functional body complaints provided only CCP recommendations [23]. In the SIGN guidelines of mental health one out of five guidelines had no CCP recommendation (guideline on non-pharmaceutical management of depression).  Exploratory correlation analyses between the number CCP recommendations and the total amount of recommendations showed a statistically significant positive correlation (r = 0.625, p = 0.007) for the German guidelines of psychiatry and psychotherapy, but no statistically significant correlation for the neurology guidelines (r = 0.835, p = 0.078). For the SIGN guidelines of mental health, the analyses showed also a statistically significant positive correlation (r = 0.978, p = 0.004). Analysing the German S3 guidelines of psychiatry and neurology together, as well as all guidelines included in this study, the number of CCP recommendations correlated significantly with the total number of recommendations. (r = 0.706, p < 0.001; r = 0.745, p < 0.001).
For the AWMF guidelines of psychiatry and psychotherapy only the exploratory correlation analyses between the size of the consensus group and the relative frequencies of grade 0 recommendations showed a statistically significant negative correlation (r = − 0.516; p = 0.034). All other correlation analyses for both AWMF groups separately, did not show significant correlations between the relative frequency of a given recommendation grade and the size of the guideline group. Analysing all AWMF guidelines (N = 22), a statistically significant negative correlation for grades B (r = − 0.498 and p = 0.018) and 0 (r = − 0.632, p = 0.002) and a statistically positive correlation for CCP recommendations (r = 0.591; p = 0.004) with the size of the guideline group were detected. These correlations indicate that in guidelines with increasing size of the consensus groups, the number of CCP recommendations increases as well. For B and 0 recommendations, the reverse relationship was observed. For grade A recommendations, no significant correlation was detected (r = − 0.176; p = 0.435).

Conclusion
Our study shows for the first time that the relative mean frequency of recommendations in the German psychiatry and psychotherapy guidelines, which are not rooted in strong scientific evidence amounts to 45.1%. In neurology guidelines this frequency is only 16.5%. This observation of a high frequency of CCP recommendations in the German psychiatry and psychotherapy guidelines is even higher than in the SIGN guidelines of mental health with a relative mean frequency of 33.9%. Independent of the discipline, the relative mean frequency of CCP recommendations correlates positively with the size of the consensus group in the evaluated German guidelines. In general, an increasing number of recommendations correlates positively with a higher number of CCP recommendations.
The reasons for these phenomena remain elusive, but one could build several hypotheses to explain the high proportion of CCP recommendations in psychiatry and psychotherapy guidelines. First, one could assume that for some clinically relevant questions respective clinical trials and subsequent meta-analyses are yet lacking. In this regard, one might argue that the responsible guideline groups aim to ask and try to answer such questions that cannot be addressed ethically in clinical trials but are of particular importance for clinical practice. Such scenarios include e.g. suicidality, coercive treatment or the need of supervision as part of the therapeutic process. Being aware that we only investigated two different medical disciplines, one could speculate, that the differences in the numbers of evidence-and consensusbased guidelines between psychiatry/psychotherapy and neurology (only five of 96 were classified as an S3guideline) could indicate that in neurology the responsible society tries to avoid the development of such timeand money-consuming guidelines. Concerning the SIGN guidelines, the mean frequency of 33.9% of CCP recommendations shows that the observation of a high number of non-evidence-based recommendations in guidelines of psychiatry and psychotherapy seems to be an international phenomenon. Based on the additional analyses of the correlation between the total of recommendations and CCP recommendations, the findings of Baird et al. seem to affect only the German guidelines of psychiatry and psychotherapy. Analysing all SIGN guidelines Baird et al., detected a positive correlation between the increasing number of recommendations and the proportion of CCP recommendations. In our analyses we could confirm this relationship showing that this is an international phenomenon. This overall effect and the limited scientific evidence in a magnitude of psychiatry and psychotherapy guidelines are critical, because such effects may support the self-confidence and interests of expert groups [32]. The impact of conflict of interest on guideline recommendations is evident and has been extensively investigated [33,34]. A relevant risk of bias can be assumed based on our analyses of the number of recommendations in the S3-guidelines of psychiatry and psychotherapy and the SIGN guidelines. This conclusion cannot be generalized for those recommendations that were based on respective scientific evidence (mainly recommendations for pharmacological and psychotherapeutic interventions). One could speculate, that our exploratory finding of a positive correlation of the size of the consensus group and the relative frequency of CCP recommendations could support a possible relationship between financial and non-financial (e.g. membership in specific societies, affiliation to a specific therapeutic school, offering of specific courses) conflicts of interest and the approval of non-evidence-based recommendations. The greater the size of the consensus group, the more different groups of interest are represented. A high number of different stakeholders may possibly lead to an influence of the CCP recommendations by the different groups to maintain their own interests. However, we did not investigate the conflict of interest statements in the included guidelines, but taking into account earlier findings that in all German S3-guidelines only 76% contain a detailed declaration of the conflicts of interests and in total 55% of the participants of the guideline groups report any conflict of interest [32], one could speculate about a relationship between conflicts of interest and the amount of non-evidence-based recommendations. However, we could not establish this effect when both disciplines were analysed separately.
Additionally, it appears of importance to differentiate between "expert opinion" and "CCP" recommendation: "Expert opinion" describes a level of evidence, not a grade of recommendation. Even a CCP recommendation does not automatically mean, "lack of scientific evidence" nor it is only "eminence based" [35]. Though in the grading system the level of evidence on which CCP's are based, is "no level of evidence". This rather means, there is lack of literature research or the results of the investigated studies do not fully address the question of the guideline and have to be adjusted. For the most questions, studies or at least case reports exist. This points to some kind of a scientific background, which could be transferred to the questions of the guideline. One could speculate, that for recommendations where the guideline group has to transfer results, case reports or sometimes even practical knowledge in terms of "good clinical practice" room for interpretation is left. In these cases, the personal view of every member of a guideline group is more emphasized.
In this context, it may be possible that small groups of experts and stakeholders will drive the practice of a specialty in one direction. At the same time, respective high-quality clinical trials or meta-analyses in psychiatry and psychotherapy beyond studies regarding efficacy and effectiveness of pharmacological and psychotherapeutic interventions are lacking. In our view, there seems to be two possible options to address the outlined issue of conflicts of interest: The first option would be, not to provide any recommendations for clinical scenarios where no evidence is available yet. However and as discussed before, treatment guidelines without any nonevidence-based recommendations while having a higher methodological standard, would potentially ignore many clinically important scenarios and questions. The latter may result in a guideline that might be less useful in clinical practice due to the lack of importance and applicability for a variety of questions. Moreover, one should bear in mind that the absence of evidence does not equal the evidence of absence. For example, until today no randomised-controlled trial has been conducted to show that using a parachute is more effective to save lives than not using parachutes [36]. The second option would be to accept the aforementioned potential risk of bias and include recommendations based on expert consensus, as these can address clinically meaningful questions and scenarios. Obviously, such recommendations can be useful where clinical trials and meta-analyses are missing to facilitate the clinical decision-making process and to contextualise the decision-making of individuals with the views of experts. By contrast, a high number of such recommendations can foster uncertainty in the clinical decision-making process [37] due to the aforementioned increase in the risk of bias. Situations where recommendations based on expert consensus are necessary and helpful are not rare, neither in psychiatry and psychotherapy nor in other medical disciplines. Such areas are e.g. situations of ultra-treatment resistance, areas that are difficult to study (e.g. suicidality, coercive treatment), the management of rare side-effects, the combination treatment with different pharmacological compounds or the evaluation of long-term effectiveness and tolerability. In these cases, standard operating procedures (SOPs) on how to develop and approve non-evidence-based recommendations should be implemented for the guideline developing process. Such SOPs that provide a process for the development of CCP recommendations are missing in the AWMF and SIGN guideline developer's handbooks. These SOPs could e.g. describe exactly the development and composition of each CCP recommendation and include the total number of CCP recommendations (e.g. not more than 15% or 20% of all recommendations) to force guideline developers to limit these recommendations to a minimal clinically relevant amount or the need to involve independent international experts in the process (e.g. using the Delphi method).
Another aspect is that the composition of the guidelinedeveloping group could influence the amount of CCP recommendations. Following Baird and Lawrence [11], a high amount of recommendations based on expert opinions could be seen as a reflection how professional groups deal with uncertainty [11]. Moreover, they speculate [11] that expert groups may view their opinion as clinical experts being more authoritative than scientific evidence [38]. The more experts from different fields and interest groups are involved, the more differing opinions might exist and the more uncertainty might arise. Thus, the important development towards multi-professional guideline groups, involving stakeholders, relatives and other involved persons may foster the inflation of non-evidence-based recommendations. This may be especially true, as in many of the German S3-guidelines in psychiatry and psychotherapy involved groups the same persons are responsible for different guidelines (e.g. being a delegate for guidelines in general). Moreover, in contrast to e.g. SIGN, AWMF gives no restriction concerning the maximum size of the guideline group. Our general finding that the more groups are involved in a guideline, the more non-evidencebased recommendations are included supports to the described dilemma. Potential solutions could be to limit the size of the guideline groups, to limit the possibility for one person to participate in several guideline-development processes, to define standards of the qualifications of the involved person and to develop between-guideline standards for the here raised important issues.
Our analysis showed also that in some guidelines the number of recommendations differs between the short and the long version. Furthermore, in some guidelines the grading differed between both versions (e.g. in the guideline of Obsessive Compulsive Disorder recommendation number 4-10 is graded as CCP in the short version and as a statement in the long version [19]). Such findings question whether short versions of guidelines are necessary and might motivate guideline developers to pay attention to such discrepancies.
As a limitation, we only included guidelines of neurology as control group from Germany, which consisted only of five S3-guidelines. By comparison, analyses such as the one presented by Baird and Lawrence for the SIGN guidelines are not available for Germany. Thus, we cannot derive statements with respect to other medical disciplines. Moreover, with a total of 96 the number of neurology guidelines is very high and we assume that the only available five S3 guidelines may have had a stricter methodology approach than the remaining 91 resulting in an overestimation of the frequency of evidence-based recommendations. Secondly, our correlative findings are only significant when both disciplines are analysed. Another limitation is, that we solely analysed the current published SIGN guidelines of mental health for an international comparison, which consisted also only in five guidelines. Moreover, four of the German guidelines were out of date, while we used only current SIGN guidelines. This must be considered also as a potential limitation.
In summary, we were able to show a high frequency of non-evidence-based recommendations in S3-guidelines of psychiatry and psychotherapy. Furthermore, we could demonstrate that the size of the involved group seems to affect and promote the development of such recommendations. As guidelines should be practical tools to simplify the clinical decision-making processes based on scientific evidence in times of multiple treatment options, our findings support a view that new strategies to deal with clinically relevant questions where no evidence is available are needed. For reasons of practical applicability, guideline developers in general should be careful about the included amount of recommendations and the scientific evidence on which the recommendations are based.