In this secondary exploratory meta-analysis of the Cipriani dataset we tested whether the placebos of newer antidepressants were more effective than the placebos of the older drugs amitriptyline and trazodone. These two drugs, together with clomipramine, have been shown to be less well tolerated than the newer-generation antidepressants [2, 18]. Based on the unblinding of investigators documented in various studies [8, 10, 11], we hypothesized that outcome-assessors in trials of these older drugs were more frequently unblinded due to the drugs’ marked and observable side effects. Consequently, we assumed that the unblinded outcome-assessors would, consciously or unconsciously, underrate the response to placebos for these older drugs. In line with our reasoning, we found that the amitriptyline- and trazodone-placebos were rated less effective than the placebos of the newer, better tolerated antidepressants, such as the SSRIs (citalopram, escitalopram, fluoxetine, sertraline), the SNRIs (duloxetine, desvenlafaxine, venlafaxine), and in particular the atypical noradrenergic and specific serotonergic antidepressant (NaSSA) mirtazapine. Because trial methodology, sample characteristics and the rate of positive trials have changed considerably over time [29, 30], we also controlled for important covariates such as study center, dosing schedule, study length, sample size, study year, publication status and sponsorship. After this adjustment, the inferiority of the amitriptyline-placebo remained significant only in relation to the mirtazapine-placebo (all other 95% CrIs included zero, although they still indicated lower response), whereas the differences between the trazodone-placebo and the new-generation placebos remained significant (95% CrIs excluding zero).
Our findings are compatible with the hypothesis that, due to unblinding, outcome-assessors may have overestimated the average drug-placebo difference for the older antidepressant drugs amitriptyline and trazodone. Other studies also support the view that unblinding may drive exaggerated response ratings for antidepressants relative to placebo. For instance, Khan and colleagues [14] found that the average response to depression treatments was higher when outcome-assessors were unblinded. The meta-analysis by Moncrieff and colleagues [15] found that the response to TCAs was poor when compared to active placebos (d = 0.17). Likewise, a meta-analysis by Greenberg and colleagues [13] found that the clinician-rated response to TCAs was small (d = 0.25) in “blinder” three-arm trials, which contained an active control in addition to a placebo control. Moreover, in these three-arm trials the response to the TCAs was close to zero (d = 0.06) when assessed with patient self-reports, suggesting that outcome-assessors see drug-placebo differences that the patients themselves do not perceive.
The present findings are important for the interpretation of the comparative response to different antidepressants as provided by Cipriani and colleagues [2]. In their supplement, Cipriani and colleagues [2] reported that adjusting for the probability of receiving placebo increased the response to amitriptyline from OR = 2.13 to a striking OR = 3.16 (a 48% increase). Similarly, for trazodone, this adjustment resulted in an increase from OR = 1.51 to OR = 1.97 (a 30% increase). These findings illustrate that the average treatment response for both amitriptyline and trazodone increased substantially when the drugs were compared to placebo in a two-arm trial, presumably because including a placebo arm makes it much easier for outcome-assessors to detect which participants received the investigational drug than in an active-controlled trial.
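The percentage increases quoted above follow directly from the reported odds ratios; a minimal sketch (the helper function `pct_increase` is ours, not from the original analysis) reproduces the arithmetic:

```python
# Sketch: verify the percentage increases implied by the adjusted vs.
# unadjusted odds ratios reported in the supplement of Cipriani et al. [2].

def pct_increase(or_unadjusted: float, or_adjusted: float) -> float:
    """Percentage increase of the adjusted over the unadjusted odds ratio."""
    return (or_adjusted / or_unadjusted - 1) * 100

# Amitriptyline: OR 2.13 -> 3.16
print(f"amitriptyline: {pct_increase(2.13, 3.16):.0f}%")  # 48%
# Trazodone: OR 1.51 -> 1.97
print(f"trazodone: {pct_increase(1.51, 1.97):.0f}%")      # 30%
```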
Consistent with our hypothesis that unblinding of outcome-assessors in trials of older drugs biases the average drug-placebo difference, a meta-analysis [31] of the placebo response has shown that the average placebo response in 2005 was more than twice as large as the placebo response in 1980 when assessed by outcome-assessors. However, no change over time was found for patient self-ratings [31], which again bolsters the evidence detailed above that outcome-assessors rate drug-placebo differences differently from what patients personally perceive [13]. It is also important to stress that while the placebo response increased considerably during the 1980s [32], since about 1991 the average placebo response has remained largely constant at around 35–40% when changes in trial design features are taken into account [20, 33].
We see no reason to assume that trials of SSRI, SNRI, or NaSSA antidepressants are free of unblinding, although the bias is presumably less pronounced because the newer drugs are better tolerated than TCAs [18]. For example, mirtazapine, which has a unique dual mode of action as a noradrenergic and specific serotonergic antidepressant [34], has sedating effects due to its affinity for histamine receptors at low plasma concentrations [35]. This antihistamine effect, however, is offset at higher doses by increased noradrenergic transmission, which reduces the sedation [36,37,38]. Mirtazapine is further considered to have a lower risk of the anticholinergic or serotonin-related adverse effects often associated with other antidepressants (such as sexual dysfunction, nausea, etc.), even lower than SSRIs, and may actually alleviate certain side effects when taken in conjunction with other antidepressants [39,40,41].
Nevertheless, new-generation antidepressants also cause side effects [42], which is why dropout rates due to adverse events are higher for new-generation antidepressants than for placebo (though still lower than the dropout rates of older antidepressants) [2]. Experienced clinicians may thus still be able to guess correctly whether a participant receives placebo or active treatment. Consistent with this, in their re-analysis of the Hypericum Depression Trial, Chen et al. [43] showed that clinicians were better at correctly guessing placebo than sertraline or hypericum. In addition, side effects were more pronounced among participants for whom the clinicians guessed active treatment (indicating unblinding due to side effects), and improvements on active treatment relative to placebo were larger when the clinicians guessed active treatment. We therefore suggest that unblinding bias is also an issue in trials of newer antidepressants, although it is probably less pronounced than in trials of the more poorly tolerated older antidepressants.
Finally, it is important to note that our analysis cannot fully rule out alternative explanations. For instance, instead of unblinding, another reason could be changes in trial protocols over time. To name just one example, inclusion and exclusion criteria of antidepressant trials have become more restrictive over time, meaning that trial participants are increasingly unrepresentative [44]. Although controlling for study year certainly reduces this confounding effect in part, it cannot remove it altogether. To confirm our hypothesis, a preregistered prospective study is required. Given that the side effects observable to an outcome assessor even when not reported by the patient (e.g., dry mouth, tremor, drowsiness, somnolence) are presumably those causing unblinding, it would be worthwhile to examine whether these specific side effects (relative to less detectable side effects such as sexual dysfunction and lack of appetite) lead to correct identification of the treatment received and whether they are negatively correlated with depression ratings in the placebo arm.
The main implication of our study is that unblinding should be systematically assessed and reported in antidepressant trials. This would make it possible to statistically control for unblinding effects and to conduct a confirmatory study as detailed above. If our hypothesis holds, it would imply that inert placebos are a poor control, and thus the use of active placebos should be reconsidered. Another implication would be that efficacy rankings based on NMA must be interpreted with caution.
Limitations
A limitation of the present analysis is that it was not based on a written protocol, but merely followed the findings of Naudet and colleagues [19].
Another limitation inherent in the present data set is that the placebos can only be interpreted based on their comparisons with the corresponding antidepressants to which they are bound in the network. Here, we focused on the single-comparison placebos, since the double-comparison placebos are hard to interpret and are therefore presented only in the supplement. It should thus be kept in mind that the 24% of trials that also included double comparisons were excluded from the present interpretation.
Another limitation concerns the evidence summarized in this special placebo NMA: all comparisons between placebos rely on indirect evidence only, not on a mixture of direct and indirect comparisons as for most of the antidepressants, although in mixed treatment comparisons a major part of the evidence is also often based on indirect evidence [45]. The consistency hypothesis, which assumes that effects from direct and indirect comparisons are the same, can therefore not be verified. Because it is impossible to verify this hypothesis in the placebo context, one cannot be sure of the validity of the comparisons, considering that indirect comparisons may not be robust and are prone to vibration of effects [46].
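The mechanics of an indirect comparison can be sketched as follows: direct effects along a path through the network add up (with sign), and so do their variances, which is why long chains between placebos are less precise than direct comparisons. All numbers and the path below are hypothetical, chosen only to illustrate the principle:

```python
# Sketch: chaining direct effects into an indirect comparison in an NMA.
# In the placebo network, two placebos are connected only via their
# anchoring drugs, so the path may involve several links.

def indirect_effect(path_effects, path_variances):
    """Combine direct effects along a network path.

    Effects add (with sign), and so do their variances, so longer
    chains yield less precise indirect estimates.
    """
    effect = sum(path_effects)
    variance = sum(path_variances)
    return effect, variance

# Hypothetical path: placebo_A -> drug_A -> drug_B -> placebo_B
effect, variance = indirect_effect([0.30, -0.10, -0.25], [0.02, 0.03, 0.02])
print(f"indirect effect: {effect:.2f}, variance: {variance:.2f}")
```

Note that the indirect variance (0.07) exceeds any single direct variance on the path, which is one reason indirect-only comparisons are vulnerable to vibration of effects.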
A methodological limitation is the problem of multiplicity in the present NMA. Standard NMA models usually do not account for multiple comparisons in estimating relative treatment effects, which might lead to exaggerated and overconfident statements regarding relative treatment effects. The present analysis therefore applied the Bayesian approximation described by Efthimiou and White [26] to reduce this problem, in which treatment effects are modelled as exchangeable and estimates are hence shrunk away from large values.
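The shrinkage behaviour under exchangeability can be illustrated with the standard normal-normal conjugate update, where each estimate is pulled toward the common mean in proportion to its imprecision. This is a simplified sketch of the general idea, not the Efthimiou and White [26] model itself, and all numbers are illustrative:

```python
# Sketch: shrinkage under an exchangeable (hierarchical normal) model.
# Each observed effect is pulled toward the common mean; noisier
# estimates are shrunk more. Values are illustrative only.

def shrink(estimate: float, se: float, prior_mean: float, prior_sd: float) -> float:
    """Posterior mean for a normal likelihood with a normal prior."""
    w = prior_sd**2 / (prior_sd**2 + se**2)  # weight given to the data
    return w * estimate + (1 - w) * prior_mean

# A large but imprecise effect is shrunk substantially toward the mean ...
print(shrink(0.8, se=0.4, prior_mean=0.0, prior_sd=0.2))   # ~0.16
# ... while a precise estimate is barely moved.
print(shrink(0.8, se=0.05, prior_mean=0.0, prior_sd=0.2))  # ~0.75
```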
A more general limitation is the reliance on the similarity assumption, namely that all trials are similar enough to be pooled together. Cipriani et al. [2] considered this assumption to be valid, but some unmeasured characteristics might still have influenced our findings, such as differences between in- and outpatients or any other surrogate of depression severity at study entry.