The efficacy of duloxetine: A comprehensive summary of results from MMRM and LOCF_ANCOVA in eight clinical trials

Background: A mixed-effects model repeated measures approach (MMRM) was specified as the primary analysis in the Phase III clinical trials of duloxetine for the treatment of major depressive disorder (MDD). Analysis of covariance using the last observation carried forward approach to impute missing values (LOCF_ANCOVA) was specified as a secondary analysis. Previous research has shown that MMRM and LOCF_ANCOVA yield identical endpoint results when no data are missing, while MMRM is more robust to biases from missing data and thereby provides superior control of Type I and Type II error compared with LOCF_ANCOVA. We compared results from MMRM and LOCF_ANCOVA analyses across eight clinical trials of duloxetine in order to investigate how the choice of primary analysis may influence interpretations of efficacy.
Methods: Results were obtained from the eight acute-phase clinical trials that formed the basis of duloxetine's New Drug Application for the treatment of MDD. All 202 mean change analyses from the 20 rating scale total scores and subscales specified a priori in the various protocols were included in the comparisons.
Results: In 166/202 comparisons (82.2%), MMRM and LOCF_ANCOVA agreed with regard to the statistical significance of the differences between duloxetine and placebo. In 25/202 cases (12.4%), MMRM yielded a significant difference when LOCF_ANCOVA did not, while in 11/202 cases (5.4%), LOCF_ANCOVA produced a significant difference when MMRM did not. In 110/202 comparisons (54.4%) the p-value from MMRM was lower than that from LOCF_ANCOVA, while in 69/202 comparisons (34.2%), the p-value from LOCF_ANCOVA was lower than that from MMRM. In the remaining 23 comparisons (11.4%), the p-values from LOCF_ANCOVA and MMRM were equal when rounded to the 3rd decimal place (usually as a result of both p-values being < .001). For the HAMD17 total score, the primary outcome in all studies, MMRM yielded 9/12 (75%) significant contrasts, compared with 6/12 (50%) for LOCF_ANCOVA. The expected success rate was 80%.
Conclusions: Important differences exist between MMRM and LOCF_ANCOVA. Empirical research has clearly demonstrated the theoretical advantages of MMRM over LOCF_ANCOVA. However, interpretations regarding the efficacy of duloxetine in MDD were unaffected by the choice of analytical technique.


Background
Treatment effects are often evaluated by comparing change over time in outcome measures. However, valid analyses of longitudinal data can be problematic, particularly if some data are missing for reasons related to the outcome measure [1,2]. Since the problem of missing data is almost ever-present in clinical trials, numerous methods for handling missingness have been proposed, examined, and implemented [3].
A common method of analyzing clinical trial data is to use analysis of variance or analysis of covariance (ANOVA or ANCOVA) with missing data imputed by the last observation carried forward approach (LOCF_ANCOVA). The popularity of LOCF_ANCOVA may be due to its simplicity, and also the belief that violations of the restrictive assumptions inherent to LOCF_ANCOVA lead to a conservative analysis [4]. Considerable advances in statistical methodology, and in our ability to implement these methods, have been made in recent years. Thus, methods that require less restrictive assumptions than LOCF_ANCOVA are now readily implemented. For example, likelihood-based repeated measures approaches have a number of theoretical and practical advantages for analysis of longitudinal data with dropout [4].
One such method, termed MMRM (Mixed Model Repeated Measures [5]), has been studied extensively in the context of neuropsychiatric clinical trials [6][7][8][9]. In these studies, MMRM was found to be more robust to biases from missing data than LOCF_ANCOVA, and thereby provided superior control of Type I and Type II errors. The LOCF_ANCOVA method was shown to underestimate treatment group differences in some scenarios, while overestimating differences in others. When no data were missing, the two methods yielded identical results.
The MMRM approach was specified as the primary analysis in the Phase III clinical trials of duloxetine for the treatment of major depressive disorder (MDD), while LOCF_ANCOVA was specified as a secondary analysis. In the present investigation, we provide a comprehensive summary of results from MMRM and LOCF_ANCOVA in the eight acute-phase clinical trials that formed the basis for duloxetine's New Drug Application (NDA) for MDD. The primary objective of this investigation was to determine whether differences in results between MMRM and LOCF_ANCOVA influenced conclusions regarding the efficacy of duloxetine.

Data
The data source for this investigation was the eight acute-phase clinical trials in which duloxetine was compared with placebo in the treatment of MDD. Relevant details of these studies are highlighted in Table 1.
Results are summarized from all rating scale total scores, subscales, and global assessments that were specified a priori in the various protocols to be analyzed for mean change from baseline to endpoint, and were collected at more than one postbaseline time point (Table 2). Efficacy measures that were assessed only at baseline and endpoint were not included in this summary because repeated measures analyses were not possible for these outcomes. Thus, the present investigation included every rating scale total score and subscale from every clinical trial relevant to duloxetine's NDA for an indication in major depression. In total, 20 efficacy and health outcome variables were included in the summary of MMRM and LOCF_ANCOVA. Some of the eight trials included multiple dose arms; therefore, some outcomes were assessed in as many as 12 comparisons with placebo.
Comparisons of MMRM and LOCF_ANCOVA focused on contrasts between duloxetine and placebo. However, six of the studies also included known effective antidepressants approved for marketing in the United States and other countries. Contrasts between duloxetine and the active comparators are not included in this summary since these results may draw attention to the drug versus drug results and detract from the primary focus of comparing MMRM with LOCF_ANCOVA.

Statistical analysis
This summary makes no attempt to provide formal statistical comparisons of results from MMRM and LOCF_ANCOVA. Previous research has demonstrated conclusively that in the absence of missing data the two methods yield identical endpoint contrasts, while differences do exist in the presence of subject dropout [6][7][8][9]. Furthermore, formal statistical comparisons are typically applied to random samples obtained from larger populations in order to assess the uncertainty associated with the sampling. However, the eight studies included in this summary are not a sample, but rather represent all of the acute-phase, double-blind, placebo-controlled trials of duloxetine. Thus, there is no uncertainty associated with sampling. Consequently, results from the two methods need only be summarized in order to assess how differences between the methods may influence overall conclusions regarding the efficacy of duloxetine.
Three overall summary measures were used to compare results from the two analytic techniques: 1) With regard to statistical significance of the difference between duloxetine and placebo, the proportion of outcomes showing agreement between MMRM and LOCF_ANCOVA was compared with the proportion of outcomes for which MMRM and LOCF_ANCOVA yielded disparate results; 2) The proportion of outcomes for which MMRM yielded the lowest p-value was compared with the corresponding proportion for LOCF_ANCOVA; 3) The number of outcomes for which "substantial evidence of efficacy" was demonstrated. In regulatory settings, the criterion for substantial evidence of efficacy is frequently the demonstration of a statistically significant advantage over placebo in two or more studies. This criterion was utilized here to define substantial evidence of efficacy for a particular outcome.
[Table 2, partially recovered. Efficacy measures: (entry not recovered) [28]; Hamilton Anxiety Rating Scale total score [29]; Clinical Global Impression of Severity [26]; Patient Global Impression of Improvement [26]; Visual Analog Scale for pain [30] (overall pain severity, headaches, back pain, shoulder pain, time in pain while awake, interference with daily activities); Somatic Symptom Inventory [31] (26-item total score; 28-item total score, which includes 2 additional questions on painful physical symptoms); Quality of Life in Depression Scale total score [32].]
The frequency of lower p-values provides a "fine-tuned" measure of sensitivity of the two analytic methods. However, in certain cases such an assessment may actually be misleading. For example, to distinguish between p = .800 and p = .810, or between p = .002 and p = .003, implies that the methods yielded different results when in fact the similarities far outweigh the differences. Hence, it is equally appropriate to simply categorize based upon the presence or absence of a significant difference. Furthermore, given the large number of outcomes assessed across the eight studies, it would not be surprising to see the two methods disagree with regard to statistical significance on at least a small number of outcomes. Therefore, perhaps the most clinically meaningful summary measure is the number of outcomes for which substantial evidence of efficacy was demonstrated.
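The three summary measures can be sketched in a few lines of Python. This is an illustration only, not the tabulation code used for the trials. The 0.05 threshold, the strict `<` comparison, the `Comparison` record, and all function names are assumptions introduced here. The example p-values are the Week 9 values reported in the Case Studies section; the HAMD17 result in Study 1 was reported only as p < .001, so 0.0009 is a placeholder.

```python
from dataclasses import dataclass

ALPHA = 0.05  # assumed significance level; not stated in this section


@dataclass(frozen=True)
class Comparison:
    """One duloxetine-versus-placebo contrast for one outcome in one study."""
    outcome: str
    study: int
    p_mmrm: float
    p_locf: float


def agreement_counts(comparisons, alpha=ALPHA):
    """Measure 1: agreement and disagreement on statistical significance."""
    counts = {"agree": 0, "mmrm_only": 0, "locf_only": 0}
    for c in comparisons:
        sig_mmrm, sig_locf = c.p_mmrm < alpha, c.p_locf < alpha
        if sig_mmrm == sig_locf:
            counts["agree"] += 1
        elif sig_mmrm:
            counts["mmrm_only"] += 1
        else:
            counts["locf_only"] += 1
    return counts


def lower_p_counts(comparisons, digits=3):
    """Measure 2: which method gave the lower p-value after rounding."""
    counts = {"mmrm_lower": 0, "locf_lower": 0, "equal": 0}
    for c in comparisons:
        p_m, p_l = round(c.p_mmrm, digits), round(c.p_locf, digits)
        if p_m < p_l:
            counts["mmrm_lower"] += 1
        elif p_l < p_m:
            counts["locf_lower"] += 1
        else:
            counts["equal"] += 1
    return counts


def substantial_evidence(comparisons, method, alpha=ALPHA, min_studies=2):
    """Measure 3: outcomes significant in at least `min_studies` distinct studies."""
    studies_by_outcome = {}
    for c in comparisons:
        p = c.p_mmrm if method == "mmrm" else c.p_locf
        if p < alpha:
            studies_by_outcome.setdefault(c.outcome, set()).add(c.study)
    return {o for o, s in studies_by_outcome.items() if len(s) >= min_studies}


# Week 9 p-values from the two case studies (Studies 1 and 2).
EXAMPLE = [
    Comparison("HAMD17 total score", 1, 0.0009, 0.0009),  # placeholder for p < .001
    Comparison("HAMD17 total score", 2, 0.024, 0.048),
    Comparison("VAS overall pain", 1, 0.055, 0.019),
    Comparison("VAS overall pain", 2, 0.135, 0.037),
]
```

Calling `agreement_counts(EXAMPLE)`, `lower_p_counts(EXAMPLE)`, and `substantial_evidence(EXAMPLE, "mmrm")` reproduces the three measures for these four contrasts. Counting distinct studies rather than contrasts keeps a multi-dose study from satisfying the two-study criterion on its own.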
Three outcomes were selected for more detailed presentation of results: the 17-item Hamilton Rating Scale for Depression (HAMD17) total score, the HAMD17 Maier subscale, and the Visual Analog Scale (VAS) for overall pain severity. The HAMD17 total score was an obvious choice as it was the primary outcome in all studies. The other outcomes were selected since they are frequently focal points in manuscripts and presentations regarding duloxetine's efficacy. Finally, we provide case studies to help explain how and why results from MMRM and LOCF_ANCOVA may differ.
MMRM and LOCF_ANCOVA analyses were specified in the Phase III duloxetine protocols as follows. In LOCF_ANCOVA analyses, change from baseline to the last observation was the dependent variable. Treatment and investigative site were included as categorical independent variables, and baseline severity was included as a covariate. In MMRM analyses, change from baseline to all postbaseline times was the dependent variable. Independent variables included the fixed, categorical effects of investigative site, treatment, time, and the treatment-by-time interaction, along with the continuous covariates of baseline severity and the baseline severity-by-time interaction. Parameters were estimated using Restricted Maximum Likelihood with the Newton-Raphson algorithm. The protocols specified an algorithm for choosing the best fitting covariance structure. In all cases an unstructured matrix provided the best fit. Hence, within-patient errors were modeled using an unstructured covariance matrix.
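The difference between the two dependent variables can be illustrated with a minimal Python sketch. It covers only how the response is constructed, not the ANCOVA or mixed-model fit itself. The function names and the patient values are hypothetical and do not come from the trials.

```python
def locf_change_from_baseline(baseline, postbaseline):
    """Change from baseline to the last observed postbaseline value (LOCF).

    `postbaseline` is ordered by visit, with None marking a missed visit.
    Returns None when the patient has no postbaseline observation.
    """
    observed = [value for value in postbaseline if value is not None]
    if not observed:
        return None
    return observed[-1] - baseline


def observed_changes(baseline, postbaseline):
    """Per-visit changes as used by a repeated measures model.

    Missing visits stay missing (None); nothing is imputed.
    """
    return [None if value is None else value - baseline for value in postbaseline]


# Hypothetical patient: baseline score 22, dropped out after the second visit.
baseline_score = 22
visits = [18, 15, None, None]

locf_change = locf_change_from_baseline(baseline_score, visits)
visit_changes = observed_changes(baseline_score, visits)
```

Here the LOCF response is a single value (the change at the second visit, treated as the endpoint change), whereas the repeated measures response keeps both observed changes and leaves the last two visits missing.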

Results
The protocols for the eight studies in the duloxetine NDA specified a priori a total of 202 mean change analyses for the 20 rating scale total scores or subscales. The frequency of significant outcomes and the frequency of higher/lower p-values for each analytic technique are summarized in Table 3. MMRM and LOCF_ANCOVA agreed with regard to substantial evidence of efficacy for 18 of the 20 outcomes, with each analysis yielding substantial evidence for 15 outcomes. That is, both methods found substantial evidence of efficacy for 14 outcomes; neither method found substantial evidence for four outcomes; and each method found substantial evidence when the other did not for one outcome (Table 3).
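The outcome-level counts in the preceding paragraph form a two-by-two cross-classification, and a short Python check shows how the totals follow from the four cells. The variable names are introduced here for illustration; the four cell counts are taken from the text.

```python
# Cells of the 2x2 table for "substantial evidence of efficacy" (from the text).
both_methods = 14   # substantial evidence from MMRM and from LOCF_ANCOVA
neither_method = 4  # substantial evidence from neither method
mmrm_only = 1       # MMRM yes, LOCF_ANCOVA no
locf_only = 1       # LOCF_ANCOVA yes, MMRM no

total_outcomes = both_methods + neither_method + mmrm_only + locf_only
outcomes_in_agreement = both_methods + neither_method
mmrm_total = both_methods + mmrm_only
locf_total = both_methods + locf_only
```

The cells sum to the 20 outcomes, the two agreeing cells give the 18 outcomes in agreement, and each method's marginal total is 15.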
In 166/202 outcomes (82.2%), MMRM and LOCF_ANCOVA agreed with regard to the statistical significance of the difference between duloxetine and placebo. In 25 cases (12.4%) MMRM yielded a significant difference whereas LOCF_ANCOVA did not, while in 11 cases (5.4%) LOCF_ANCOVA yielded a significant difference when MMRM did not.
Both methods tended to yield significance more frequently in depression rating scales and subscales than in outcomes related to somatic and painful physical symptoms. The studies were generally underpowered for these secondary somatic and pain outcomes owing to the greater variance in change scores for these outcomes. For example, the variance in VAS overall pain severity was approximately nine-fold greater than the variance in HAMD17 total scores, leading to a three-fold greater standard error.
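The step from a nine-fold variance to a three-fold standard error follows because the standard error scales with the square root of the variance. As a rough sketch, assume a simple two-group comparison with equal group sizes n and a common change-score variance σ², ignoring the covariates and site effects in the actual models:

```latex
\[
\mathrm{SE}\left(\bar{d}_{\text{drug}} - \bar{d}_{\text{placebo}}\right)
  = \sqrt{\frac{2\sigma^{2}}{n}}
\quad\Longrightarrow\quad
\frac{\mathrm{SE}_{\text{VAS}}}{\mathrm{SE}_{\text{HAMD}_{17}}}
  = \sqrt{\frac{\sigma^{2}_{\text{VAS}}}{\sigma^{2}_{\text{HAMD}_{17}}}}
  \approx \sqrt{9} = 3 .
\]
```

With the same sample size, a treatment difference on the VAS therefore has to be about three times as large in raw units to reach the same test statistic.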
In 110 of the 202 outcomes (54.4%) the p-value from MMRM was lower than that from LOCF_ANCOVA, while in 69 cases (34.2%) the p-value from LOCF_ANCOVA was lower than that from MMRM. In the remaining 23 cases (11.4%) the p-values from LOCF_ANCOVA and MMRM were equal when rounded to the 3rd decimal place (usually as a result of both p-values being < .001).
More detailed results from the three focus outcomes (HAMD17 total score, HAMD17 Maier subscale, and VAS overall pain) are presented in Table 4. In the case of the HAMD17 total score, the advantage of duloxetine over placebo in mean change from baseline to endpoint from MMRM analyses was greater than the corresponding advantage from LOCF_ANCOVA in 9/12 comparisons (Table 4). In 9/12 comparisons the p-value from MMRM was lower than that from LOCF_ANCOVA, while LOCF_ANCOVA yielded a smaller p-value in one case, and p-values were identical in the two remaining cases. In 3/12 comparisons, MMRM yielded a significant difference when LOCF_ANCOVA did not, but in no instance did LOCF_ANCOVA produce a significant difference when MMRM did not. When averaging results across all eight studies, the advantage of duloxetine over placebo in HAMD17 [...]
Similar results were obtained for the Maier subscale. The advantage of duloxetine over placebo in mean change from baseline to endpoint from MMRM was greater than that from LOCF_ANCOVA in 9/12 comparisons (Table 4). The p-value from MMRM was lower than that from LOCF_ANCOVA in 6/12 comparisons, while LOCF_ANCOVA produced a lower p-value in 2 cases, and p-values were identical in the remaining four cases. In 2/12 comparisons MMRM yielded a significant difference when LOCF_ANCOVA did not, while there was one instance in which LOCF_ANCOVA yielded a significant difference when MMRM did not. Averaged over all eight studies, the advantage of duloxetine over placebo in mean Maier subscale score was 1.5 from MMRM analyses compared with 1.3 from LOCF_ANCOVA. Thus, the advantage of duloxetine over placebo based on LOCF_ANCOVA results was approximately 87% as large as the advantage from MMRM.
For VAS overall pain severity, the advantage of duloxetine over placebo from MMRM analyses was greater than the corresponding advantage from LOCF_ANCOVA in 2/10 comparisons (Table 4). The p-value from MMRM was lower than that from LOCF_ANCOVA in 2/10 comparisons, while in the remaining 8 comparisons the p-value from LOCF_ANCOVA was lower than that from MMRM. In 1 comparison MMRM yielded a significant difference when LOCF_ANCOVA did not, while in 2 comparisons LOCF_ANCOVA yielded a significant difference when MMRM did not. Over all eight studies, the average advantage of duloxetine over placebo in VAS overall pain severity was 3.9 from MMRM analyses compared with 5.4 from LOCF_ANCOVA. Thus, the advantage of duloxetine based on LOCF_ANCOVA results was approximately 138% as large as the advantage from MMRM.
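The two relative-advantage percentages quoted for the Maier subscale and VAS overall pain are simple ratios of the averaged treatment contrasts. A minimal Python sketch of that arithmetic follows; the function name is introduced here for illustration, and the inputs are the averaged advantages reported in the text.

```python
def relative_advantage(locf_advantage, mmrm_advantage):
    """LOCF_ANCOVA advantage expressed as a percentage of the MMRM advantage."""
    return 100.0 * locf_advantage / mmrm_advantage


# Averaged advantages of duloxetine over placebo across the eight studies.
maier_percent = relative_advantage(locf_advantage=1.3, mmrm_advantage=1.5)
vas_percent = relative_advantage(locf_advantage=5.4, mmrm_advantage=3.9)
```

Rounded to whole percentages, these give 87% for the Maier subscale and 138% for VAS overall pain, matching the figures above. A value below 100% means LOCF_ANCOVA produced the smaller contrast, and a value above 100% means it produced the larger one.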

Case Studies
Mean changes in HAMD17 total score and VAS overall pain severity from two studies (Studies 1 and 2) are used to further illustrate MMRM and LOCF_ANCOVA (analysis of covariance with missing data imputed via last observation carried forward) results. Results from these studies were originally reported by Detke et al [10,11]. In both studies the advantage of duloxetine over placebo in HAMD17 total score tended to increase over time whereas duloxetine's advantage in VAS overall pain was greatest at intermediate visits (Tables 5 and 6).
In the case of the HAMD17 total score, advantages for duloxetine over placebo at endpoint (Week 9) from MMRM in Studies 1 and 2 were 4.86 (p < .001) and 2.17 (p = .024), respectively. The corresponding advantages from LOCF_ANCOVA were 3.80 (p < .001) and 1.73 (p = .048). Although the differences were significant for both methods in both studies, MMRM yielded treatment contrasts that were approximately 25% greater than those from LOCF_ANCOVA. For VAS overall pain, the advantages of duloxetine over placebo at endpoint from MMRM in Studies 1 and 2 were 5.88 (p = .055) and 4.40 (p = .135), respectively. The corresponding advantages from LOCF_ANCOVA were 6.91 (p = .019) and 5.17 (p = .037). In both studies, the endpoint differences were significant from LOCF_ANCOVA, but not from MMRM. The LOCF_ANCOVA treatment contrasts were approximately 15% greater than those from MMRM.
Standard errors from LOCF_ANCOVA were approximately 5% smaller than the Week 9 standard errors from MMRM for both the HAMD17 total score and VAS overall pain.

Discussion
In many areas of clinical research, the impact of missing data can be profound [2,12-14]. Traditional approaches to analyses of data from clinical trials with dropouts, such as LOCF_ANCOVA, have focused on ease of implementation and interpretation. However, simple methods rely upon assumptions that are often unrealistic. For example, LOCF_ANCOVA assumes that patient dropout is completely random, i.e., it is unrelated to the outcome being analyzed. Hence, in an analysis of efficacy data, LOCF_ANCOVA assumes that patients do not drop out due to lack of efficacy. The LOCF_ANCOVA approach also assumes that, for those patients who drop out, their observations would not have changed had they stayed in the trial. When these assumptions do not hold true, estimates of treatment effects and associated standard errors may be biased [2-4,6,7,15-17].
Considerable advances in statistical methodology, and in our ability to implement these methods, have been made in recent years. Methods such as MMRM, which require less restrictive assumptions regarding missing data, may now be easily implemented with standard software [4,5,18,19].
No universally superior approach to analysis of longitudinal data exists. However, a series of studies [6][7][8][9] demonstrated empirically what may have been anticipated from statistical theory, namely that the MMRM approach, while providing no guarantee of immunity from bias due to subject dropout, was a sensible analytic choice for many clinical trial scenarios. MMRM has repeatedly been shown to provide adequate control of Type I (false positive) and Type II (false negative) errors in a wide variety of situations modeled after neuropsychiatric clinical trials.
In these head-to-head comparisons involving 456,000 data sets, the LOCF_ANCOVA approach did not perform as well as MMRM. We therefore specified MMRM as the primary analysis and LOCF_ANCOVA as a secondary analysis in the Phase III clinical trials of duloxetine in the treatment of MDD.
Similar results regarding control of Type I and Type II error for LOCF_ANCOVA and mixed-effects model analyses have been obtained independently [16,20-22]. Furthermore, following an independent investigation of data from two placebo- and active comparator-controlled duloxetine trials, in which treatments were coded A, B, C, etc. to blind analysts to the treatment names, Molenberghs et al [4] concluded that MMRM analysis was a sensible choice for those data. The theoretical differences between MMRM and LOCF_ANCOVA have been summarized [5,18], established empirically [6][7][8][9], and proven mathematically [4]. However, we are unaware of any previous investigations of how these differences manifest themselves in efficacy assessments of a new medicinal product.
The VAS pain results highlight a limitation of endpoint analyses of any type, namely that they provide only a snapshot view of the response profile. From LOCF_ANCOVA analysis, one can only conclude that drug was superior to placebo at endpoint. However, MMRM analysis reveals that drug had a significant effect early in the trials, but that advantage was somewhat transitory as the placebo group tended to "catch up" over time. In order to understand the response profile of a drug, the entire longitudinal profile should be considered [2]. From MMRM, the entire profile can be assessed from the same analysis that provided the primary result (the contrast at endpoint).
In the duloxetine database, results from MMRM and LOCF_ANCOVA were in general agreement regarding substantial evidence of efficacy and frequency of significant differences. However, MMRM tended to be more sensitive to drug-placebo differences for outcomes related to overall depressive symptoms and core emotional symptoms of depression, with mean advantages over placebo that were 10% to 20% greater than those from LOCF_ANCOVA. Even so, MMRM did not universally increase duloxetine's advantage over placebo in comparison to results from LOCF_ANCOVA. For example, in somatic and painful physical symptom outcomes, results from LOCF_ANCOVA showed mean advantages over placebo that were approximately 40% greater than those from MMRM.
Therefore, while the overall conclusions regarding the efficacy of duloxetine were unaffected by the choice of analytic method, this should not mask the important differences between MMRM and LOCF_ANCOVA. The advantages of MMRM and similar methods over LOCF_ANCOVA have been conclusively demonstrated in many studies and are evident in the duloxetine data.
Khan et al [23] compiled a database from FDA summaries of efficacy for all antidepressants approved between 1985 and 2000. Fewer than half of the studies, which were analyzed using LOCF_ANCOVA as the primary analysis, found significant advantages for drug over placebo. These studies were generally anticipated to have at least 80% power and, if the analysis worked as expected, the success rate would be 80%. Therefore, LOCF_ANCOVA was less sensitive to the drug effects than anticipated. In the duloxetine database, MMRM yielded a 75% success rate for the primary outcome measure (HAMD17 total score), while LOCF_ANCOVA produced a 50% success rate, in comparison to the expected rate of 80%. While many factors may reduce the success rate in Phase III clinical trials, the use of a statistical method with known inflation of Type II error (false negative results) is an obvious suspect.
An unduly high rate of false negative results could be especially problematic in early phases of drug development where only one or two chances exist to make the correct decision regarding the efficacy of a drug. It is noteworthy that one of the instances in which MMRM yielded a significant difference on the primary outcome (HAMD17 total score) when LOCF_ANCOVA did not was in the Phase II study (Study 3).
Also consider that across all therapeutic areas only about 50% of the molecules that enter Phase III testing receive regulatory approval. Many factors may reduce the success rate of Phase III development. However, the use in Phase II of a statistical method with known inflation of Type I error (false positive results) is an obvious suspect. Thus, the unexpectedly low success rate in Phase III is consistent with the conclusion that LOCF_ANCOVA inflates Type I error (as a result of an unduly high rate of false positive results in Phase II).
Hence, results from the duloxetine NDA are consistent with research suggesting a move away from LOCF_ANCOVA and other simple analytic techniques to methods such as MMRM that are more robust to the biases from missing data.

Conclusion
Important differences exist between MMRM and LOCF_ANCOVA. Research has clearly demonstrated the advantages of MMRM over LOCF_ANCOVA. However, interpretations regarding the efficacy of duloxetine in MDD were unaffected by the choice of analytical technique.