External validation of the international risk prediction algorithm for major depressive episode in the US general population: the PredictD-US study
© The Author(s). 2016
Received: 17 February 2016
Accepted: 14 July 2016
Published: 22 July 2016
Multivariable risk prediction algorithms are useful for making clinical decisions and for health planning. While prediction algorithms for new onset of major depression in the primary care attendees in Europe and elsewhere have been developed, the performance of these algorithms in different populations is not known. The objective of this study was to validate the PredictD algorithm for new onset of major depressive episode (MDE) in the US general population.
A longitudinal study was conducted using approximately 3 years of follow-up data from a nationally representative sample of the US general population. A total of 29,621 individuals who participated in Waves 1 and 2 of the US National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) and who did not have an MDE in the past year at Wave 1 were included. The PredictD algorithm was directly applied to the selected participants. MDE was assessed by the Alcohol Use Disorder and Associated Disabilities Interview Schedule, based on the DSM-IV criteria.
Among the participants, 8 % developed an MDE over three years. The PredictD algorithm had acceptable discriminative power (C statistic = 0.708, 95 % CI: 0.696, 0.720), but poor calibration (p < 0.001) with the NESARC data. In the European primary care attendees, the algorithm had a C statistic of 0.790 (95 % CI: 0.767, 0.813) with perfect calibration.
The PredictD algorithm had acceptable discrimination, but its calibration was poor in the US general population despite re-calibration. Therefore, based on these results, the use of PredictD for predicting individual risk of MDE in the US general population is not encouraged at the current stage. More independent validation research is needed.
Major depression is a prevalent mental disorder in the general population and imposes a considerable burden on society [1–3]. According to the Global Burden of Disease study, major depression is a leading cause of disability at all ages worldwide. By 2030, major depression is expected to rank first in disease burden in high-income countries. The average lifetime and 12-month prevalence of major depression were 14.6 % and 5.5 %, respectively, in high-income countries. In the US general population, the lifetime prevalence of major depression was 16 %.
The prevalence of major depression is influenced by incidence and episode duration. Major depression is highly recurrent in general populations and clinical settings. It is well recognized that the risk of recurrence increases with the number of previous episodes. Preventing new or incident cases of major depression can reduce the overall disease burden of major depression on society. One of the challenges in the prevention of major depression is its multi-factorial etiology. In past decades, population-based studies across the world have identified a number of risk factors for major depression, including age, sex, educational level, marital status, employment status, ethnicity, living alone or with others, physical illness, lifetime depression, stress, financial strain, self-rated physical and mental health, alcohol use, childhood adversity, major life events, poor social support and experiences of discrimination on grounds of sex, age, ethnicity, appearance, disability, or sexual orientation [8–12].
For the purpose of early identification and early intervention, health professionals and policy makers need tools that can accurately identify individuals who are at high risk of developing major depression in the future so that preventive actions can be taken. In the clinical setting, predictive risk algorithms are embedded in clinicians’ daily practice as the primary tool to estimate individuals’ risks of future disease. Multivariable risk prediction algorithms have been developed for first onset, new onset [9, 13] and recurrent major depressive episode (MDE) in different populations and settings. The PredictD algorithm was developed in primary care attendees in 6 European countries who were between the ages of 18 and 75 and who did not have MDE in the past 6 months. The algorithm was developed to predict individuals’ risks of MDE in the next 12 months. Because the predictive performance of a model based on the development data is often optimistic, it is important that the developed model is validated in different populations, in different geographic regions or in different time periods [14, 15]. Such validation addresses the accuracy of a model in individuals from a different but plausibly related population. However, most reports evaluating prediction models focus on the issue of internal validity, leaving the important issue of external validity behind. The PredictD international algorithm had good performance in the development data. It was validated with Chilean data as part of the PredictD study. To our knowledge, the algorithm has not been validated in populations besides those in the PredictD study. In the present study, the objective was to validate the PredictD algorithm in the US general population.
Study design and population
We used data from the longitudinal cohort of the US National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). The NESARC was a nationally representative survey of the US general population funded by the National Institute on Alcohol Abuse and Alcoholism. Wave 1 of the NESARC was conducted between 2001 and 2002 and included 43,093 respondents aged 18 years and older. Wave 2 was conducted between 2004 and 2005, about 3 years after Wave 1; 34,653 participants of the original Wave 1 sample completed interviews at Wave 2. Of these 34,653 participants, we included the 29,621 who were aged 18 to 75 years and who did not have MDE in the past year at Wave 1, so that the sample resembled that of the PredictD study. The design and field procedures of the NESARC have been described in detail elsewhere [16, 17]. The NESARC data were collected using face-to-face computer-assisted interviews by trained lay interviewers. As the current study was a secondary analysis of public use data, ethics review was waived by the Conjoint Health Research Ethics Review Board of the University of Calgary.
Assessment of mental disorders
MDE and other Axis-I and Axis-II mental disorders were assessed using the Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS), a fully structured diagnostic interview based on the DSM-IV criteria that can be used by trained lay interviewers [18, 19]. Lifetime and past-year diagnoses were assessed at Wave 1. At Wave 2, diagnoses since Wave 1 were assessed.
Educational status was categorized as beyond secondary education, secondary/high school, primary/no education, or trade/other education.
For the predictor “Difficulties in paid and unpaid work”, the NESARC did not include questions about work stress as measured by the Job Content Questionnaire in the PredictD study. As a proxy predictor, we used the answers to questions about experiencing difficulties with a boss or co-workers and about being fired or laid off in the past 12 months. The proxy was dichotomized as having or not having difficulties in paid or unpaid work.
Physical component score (PCS) measures physical quality of life in the past month. It was assessed by the Medical Outcomes Study Short Form (SF-12, version 2) in both the NESARC and the PredictD study.
Mental component score (MCS) measures mental quality of life in the past month. It was assessed by the Medical Outcomes Study Short Form (SF-12, version 2) in both the NESARC and the PredictD study. The PCS and MCS scores were standardized, ranging from 0 to 100.
History of depression in first-degree relatives was assessed as part of the AUDADIS. As in the PredictD study, the NESARC participants were asked whether their biological parents and siblings ever had depression (yes/no).
Experience of discrimination was assessed using NESARC questions about discrimination on the grounds of physical disability, race-ethnicity, gender, sexual orientation, religion and being overweight. These questions were asked at Wave 2 and covered two time periods: the past 12 months, and prior to the past 12 months. In this study, we assumed that people’s experience of discrimination did not change significantly over a short period of time (e.g., 2 years). Therefore, we used participants’ answers about experience of discrimination prior to the past 12 months as an indicator of discrimination. As in the PredictD study, the experience of discrimination was categorized into three levels: no discrimination, discrimination in one of the above areas, and discrimination in more than one area.
Country: As we validated the PredictD model in the US population, we entered “0” for the coefficient of “country”, assuming that the NESARC participants were similar to the UK sample.
Table: Risk factors in the PredictD algorithm and the regression coefficients after shrinkage. Levels in factor listed: beyond secondary education; difficulties in paid and unpaid work (no difficulties or often supported; difficulties without support); each point on SF-12 subscale score (two rows); first-degree relative with emotional problem; discrimination in one area; discrimination in more than one area. Coefficient values are not reproduced here.
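To make concrete what "directly applying" a published logistic risk algorithm involves, the following minimal Python sketch computes a logit score and converts it to a predicted risk. The intercept, the coefficient values and the predictor names are invented placeholders for illustration; they are not the PredictD coefficients, and the predictor list is only a subset chosen to mirror the factors described above. The "country" coefficient is set to 0, matching the assumption stated above.

```python
import math

# Hypothetical values for illustration only: these are NOT the published
# PredictD coefficients, and the predictor names are invented labels.
INTERCEPT = -3.0
COEFFICIENTS = {
    "beyond_secondary_education": -0.2,
    "work_difficulties_without_support": 0.4,
    "pcs_score": -0.01,   # per point on the SF-12 physical component score
    "mcs_score": -0.04,   # per point on the SF-12 mental component score
    "family_history": 0.3,
    "discrimination_one_area": 0.3,
    "discrimination_multiple_areas": 0.5,
    "country": 0.0,       # set to 0, i.e. treated like the reference country
}


def logit_score(predictors):
    """Return Z = intercept + sum(coefficient * predictor value)."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in predictors.items()
    )


def predicted_risk(predictors):
    """Convert the logit score Z into a probability via the inverse logit."""
    z = logit_score(predictors)
    return 1.0 / (1.0 + math.exp(-z))


# One made-up participant.
example_person = {
    "beyond_secondary_education": 1,
    "work_difficulties_without_support": 0,
    "pcs_score": 50,
    "mcs_score": 45,
    "family_history": 1,
    "discrimination_one_area": 0,
    "discrimination_multiple_areas": 0,
    "country": 1,
}
print(f"Predicted risk: {predicted_risk(example_person):.4f}")
```

With a zero coefficient, the "country" term contributes nothing to Z, so every participant is scored as if they belonged to the reference category.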
We applied the prediction model directly to the selected NESARC participants, with and without re-calibration. Re-calibration is a method of adjusting an existing model to predict risk in a new setting. It involves estimating only two new parameters, which are expected to produce reasonable predictions beyond the dataset used for re-calibration. The logit risk score (Z) was re-calibrated to predict onset of MDE by fitting a logistic model with Z as the predictor variable, i.e. the intercept (a) and slope (b) were estimated for the model logit(p) = a + bZ, where p is the probability of MDE onset.
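The re-calibration step can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' Stata code: it fits a logistic model with the original logit score Z as the only predictor, using plain Newton-Raphson iterations, and returns the two re-calibration parameters. The function names and the toy data are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(z_scores, outcomes, iterations=25):
    """Fit logit(p) = a + b*Z by Newton-Raphson and return (a, b).

    z_scores: logit risk scores from the original model.
    outcomes: 0/1 indicators of the observed event.
    """
    a, b = 0.0, 1.0  # starting point: the original model, unchanged
    for _ in range(iterations):
        # Score vector (g) and information matrix (h) of the log-likelihood.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for z, y in zip(z_scores, outcomes):
            p = sigmoid(a + b * z)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * z
            h_aa += w
            h_ab += w * z
            h_bb += w * z * z
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system h * delta = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Toy data: the original scores under-predict the observed event rates.
z_scores = [-3.0] * 100 + [-2.0] * 100 + [-1.0] * 100 + [0.0] * 100
outcomes = (
    [1] * 10 + [0] * 90 + [1] * 20 + [0] * 80
    + [1] * 40 + [0] * 60 + [1] * 60 + [0] * 40
)
a, b = recalibrate(z_scores, outcomes)
recalibrated_risks = [sigmoid(a + b * z) for z in z_scores]
```

After fitting, the mean re-calibrated risk equals the observed event rate in the data used for fitting, which is why re-calibration reduces the overall gap between predicted and observed risk even when it cannot repair miscalibration within every risk group.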
We assessed model performance by discrimination and calibration. Discrimination is the ability of a prediction model to separate those who experienced the outcome event from those who did not. We quantified discrimination by calculating the C statistic, which for a binary outcome is identical to the area under the receiver operating characteristic (ROC) curve (AUC). Calibration measures how closely the predicted outcomes agree with the actual outcomes (i.e., accuracy). For this we used the Hosmer–Lemeshow (H–L) χ2 statistic, which compares the mean predicted and the observed risks; a large P-value (i.e., greater than 0.05) indicates good calibration.
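As an illustration of what the C statistic measures, the sketch below computes it directly from its definition: the proportion of all event/non-event pairs in which the person who had the event received the higher predicted risk, with ties counted as one half. This brute-force version is for exposition only and is not the routine used in the study; the function name is an assumption.

```python
def c_statistic(predicted, outcomes):
    """Concordance (C) statistic for a binary outcome.

    Compares every person with the event against every person without it.
    Runs in O(events * non_events) time, so it suits small examples only.
    """
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for event_risk in events:
        for non_event_risk in non_events:
            if event_risk > non_event_risk:
                concordant += 1.0
            elif event_risk == non_event_risk:
                concordant += 0.5  # a tie counts as half a concordant pair
    return concordant / (len(events) * len(non_events))


# Three of the four event/non-event pairs are ranked correctly.
print(c_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation, which gives context to the values of 0.708 and 0.790 reported in this paper.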
We also assessed the calibration by grouping individuals into deciles of risk and visually comparing the observed and the predicted risk, so that the overall calibration and the areas with over- or under-prediction could be identified. We re-calibrated the algorithm to improve the agreement between the predicted and observed risks. All analyses were performed using Stata release 13 (Stata Corp. LP, USA).
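The grouping by risk described above, and the H–L statistic built on it, can be sketched as follows. This is a simplified illustration rather than a reproduction of the Stata routine: it sorts individuals by predicted risk, splits them into equal-sized groups, and sums the squared differences between observed and expected counts. It assumes all predicted risks lie strictly between 0 and 1, and it leaves the P-value (read from a χ2 reference distribution) to a statistics package.

```python
def hosmer_lemeshow(predicted, outcomes, groups=10):
    """Return (chi2, table) for a Hosmer-Lemeshow style comparison.

    table holds one (group_size, observed_events, expected_events) row per
    risk group, ordered from lowest to highest predicted risk.
    """
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    chi2 = 0.0
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # fewer people than groups
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        table.append((size, observed, expected))
        # Contributions from events and from non-events in this group.
        chi2 += (observed - expected) ** 2 / expected
        chi2 += (observed - expected) ** 2 / (size - expected)
    return chi2, table


# Toy example with two groups in which the model under-predicts.
predicted = [0.25] * 4 + [0.75] * 4
outcomes = [1, 1, 0, 0, 1, 1, 1, 1]
chi2, table = hosmer_lemeshow(predicted, outcomes, groups=2)
for size, observed, expected in table:
    print(size, observed, expected)
```

Listing observed against expected counts per group shows where the model over- or under-predicts, which a single summary statistic cannot.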
Table: Demographic characteristics of the US and European population. Rows listed: age (year), mean (SD); married or living together, 16 532 (55.8); separated or divorced; household status: not living alone; employed/full time student; unable to work; born in country of residence. Other cell values are not reproduced here.
External validation of the PredictD study in the US population
We validated the PredictD algorithm for the new onset of MDE over three years in the US NESARC sample. The validation results showed that the PredictD algorithm had acceptable discrimination (C = 0.708) but poor calibration in the US general population. When the PredictD algorithm was applied in the NESARC, it underestimated the risk of MDE overall and in high-risk groups. The PredictD algorithm was independently validated at Chilean sites as part of the PredictD study. To our knowledge, the current study was the first attempt to validate the PredictD algorithm in a different population. The absolute differences between the mean predicted and the observed risk of MDE were reduced with re-calibration.
In prediction research, external validation is necessary because prediction models tend to perform better in the data on which they were developed than in new data. This difference in performance might be an indication of the optimism in the apparent performance in the derivation set. The C-index provides a standardized way of comparing discriminative power across settings that use different measurement units, while the distance between the predicted outcome and the actual outcome (i.e., calibration) is central to quantifying overall model performance. The PredictD multivariable algorithm seemed to perform reasonably well in terms of discrimination in the US general population (C = 0.708), which is consistent with the C statistic when the algorithm was applied to the Chilean data (C = 0.710) in the PredictD study [13, 22, 23]. Although the agreement between the predicted and the observed risk was improved with re-calibration, the overall calibration of the PredictD algorithm was still poor with the NESARC data. Similarly, when the PredictD algorithm was validated with the Chilean data, poor calibration was also indicated.
The difference in the C statistics between the PredictD study and this validation may be due to many factors. First, the PredictD algorithm was developed to predict the risk of MDE in the next 12 months, whereas in the NESARC it was validated for predicting the risk of MDE over three years. Second, the PredictD model was developed in primary care attendees, among whom the incidence of MDE might be high. In the present study, we validated the PredictD algorithm in a general population sample. Third, the PredictD model included a predictor of “country” (i.e., United Kingdom (reference), Spain, Slovenia, Estonia, the Netherlands, and Portugal). To validate the algorithm, a value for “country” needs to be entered. The present validation study used the same coefficient as for the UK, assuming the NESARC participants were similar to the UK sample. Fourth, we used ‘experiencing difficulties with boss or co-workers and being fired or laid off’ as a proxy for ‘difficulties in paid and unpaid work’, which might partly explain the difference in C statistics. Finally, the differences in model performance may be due to different distributions of predictors in the European and American populations.
The PredictD algorithm might perform as well in the general population as in the primary care setting, but the calibration with the NESARC data was poor. In risk prediction research, calibration should receive more attention because it determines the model’s potential clinical utility, in combination with the model’s discriminative ability [24–27]. The validation results showed that direct application of the PredictD algorithm would underestimate the risk of MDE in the NESARC participants, leading to more false negatives. With re-calibration, the performance of the PredictD algorithm improved but was still poor. This indicated that re-calibration and/or re-estimation might be needed to achieve optimal performance prior to applying a risk prediction algorithm in a new population. Different predictors included in the prediction algorithm may also contribute to poor calibration. Wang et al.’s prediction algorithm for first onset of MDE among NESARC participants had excellent calibration. The model included predictors such as childhood adversities, traumatic experience, past panic attack, generalized anxiety disorder symptoms, and suicidal behavior. In the PredictD model, these predictors were not important factors for MDE in the primary care attendees.
Adding other risk factors when training these models may refine risk assessment and improve the accuracy of the PredictD model in the general population. Furthermore, the development of sex-specific prediction algorithms for MDE might be important as the predictors for the risk of MDE and their predicted values may differ by sex.
The strength of this study is that the NESARC data were population-based and the sample size was large. To our knowledge, this is the first time that the PredictD algorithm has been validated in a general population sample outside of Europe. This study also had limitations. The NESARC relied on self-report, so reporting and recall biases were possible. Such biases may also contribute to the inconsistencies in the predictive power of some factors in different populations. However, the instruments used in the NESARC have been validated and standardized, as were those in the PredictD study.
The PredictD algorithm has acceptable discrimination, but its calibration was poor in the US general population. Despite re-calibration, the PredictD algorithm underestimated the risk of MDE in the NESARC sample. Therefore, based on these results, the use of PredictD in the US general population is not encouraged at the current stage. In psychiatry, there have been many attempts to develop risk prediction algorithms. However, the developed tools need to be independently validated in different populations to ensure the generalizability of the models. More independent validation research is needed.
MDE: major depressive episode; NESARC: National Epidemiologic Survey on Alcohol and Related Conditions; AUDADIS: Alcohol Use Disorder and Associated Disabilities Interview Schedule; DSM-IV: Diagnostic and Statistical Manual of Mental Disorders, 4th edition; PCS: physical component score; MCS: mental component score; SF-12: Medical Outcomes Study Short Form; H-L χ2: Hosmer-Lemeshow chi-square test
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) is funded by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) with supplemental support from the National Institute on Drug Abuse (NIDA). This research was supported by an operating grant from the Canadian Institutes of Health Research (grant number: MOP-114970, PI: JianLi Wang).
This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (grant number: MOP-114970, PI: JianLi Wang). However, CIHR plays no role in the design of the study, interpretation of the data and writing of the manuscript.
Availability of data and materials
Shared upon request.
YTN participated in the conception, design of the study, performed the statistical analysis and wrote the manuscript. YL participated in the design of the study and contributed to the writing of the manuscript. JW participated in the conception, design of the study and contributed to the writing of the manuscript. All authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
As the current study used publicly available data, ethics review was waived by the Conjoint Health Research Ethics Review Board of the University of Calgary.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Kessler RC, Barber C, Beck A, Berglund P, Cleary PD, McKenas D, et al. The World Health Organization Health and Work Performance Questionnaire (HPQ). J Occup Environ Med. 2003;45:156–74.
- Mathers CD, Loncar D. Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med. 2006;3:e442.
- Bromet E, Andrade LH, Hwang I, Sampson NA, Alonso J, de Girolamo G, et al. Cross-national epidemiology of DSM-IV major depressive episode. BMC Med. 2011;9:90.
- Whiteford HA, Degenhardt L, Rehm J, Baxter AJ, Ferrari AJ, Erskine HE, et al. Global burden of disease attributable to mental and substance use disorders: findings from the Global Burden of Disease Study 2010. Lancet. 2013;382:1575–86.
- Kessler RC, Bromet EJ. The epidemiology of depression across cultures. Annu Rev Public Health. 2013;34:119–38.
- Bockting CL, Spinhoven P, Koeter MW, Wouters LF, Schene AH, Group DELTAS. Prediction of recurrence in recurrent depression and the influence of consecutive episodes on vulnerability for depression: a 2-year prospective study. J Clin Psychiatry. 2006;67:747–55.
- Wang JL, Patten S, Sareen J, Bolton J, Schmitz N, MacQueen G. Development and validation of a prediction algorithm for use by health professionals in prediction of recurrence of major depression. Depress Anxiety. 2014;31:451–7.
- Wang J, Sareen J, Patten S, Bolton J, Schmitz N, Birney A. A prediction algorithm for first onset of major depression in the general population: development and validation. J Epidemiol Community Health. 2014;68:418–24.
- Wang JL, Manuel D, Williams J, Schmitz N, Gilmour H, Patten S, et al. Development and validation of prediction algorithms for major depressive episode in the general population. J Affect Disord. 2013;151:39–45.
- Djernes JK. Prevalence and predictors of depression in populations of elderly: a review. Acta Psychiatr Scand. 2006;113:372–87.
- Anstey KJ, von Sanden C, Sargent-Cox K, Luszcz MA. Prevalence and risk factors for depression in a longitudinal, population-based study including individuals in the community and residential care. Am J Geriatr Psychiatry. 2007;15:497–505.
- Patten SB, Wang JL, Williams JV, Lavorato DH, Khaled SM, Bulloch AG. Predictors of the longitudinal course of major depression in a Canadian population sample. Can J Psychiatry. 2010;55:669–76.
- King M, Walker C, Levy G, Bottomley C, Royston P, Weich S, et al. Development and validation of an international risk prediction algorithm for episodes of major depression in general practice attendees: the PredictD study. Arch Gen Psychiatry. 2008;65:1368–76.
- Chekroud AM, Zotti RJ, Shehzad Z, Gueorguieva R, Johnson MK, Trivedi MH, et al. Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry. 2016;3:243–50.
- Kessler RC, van Loo HM, Wardenaar KJ, Bossarte RM, Brenner LA, Cai T, et al. Testing a machine-learning algorithm to predict the persistence and severity of major depressive disorder from baseline self-reports. Mol Psychiatry 2016: doi: 10.1038/mp.2015.198. In press.
- Grant BF, Goldstein RB, Chou SP, Huang B, Stinson FS, Dawson DA, et al. Sociodemographic and psychopathologic predictors of first incidence of DSM-IV substance use, mood and anxiety disorders: results from the Wave 2 National Epidemiologic Survey on Alcohol and Related Conditions. Mol Psychiatry. 2009;14:1051–66.
- Hasin DS, Goodwin RD, Stinson FS, Grant BF. Epidemiology of major depressive disorder: results from the National Epidemiologic Survey on Alcoholism and Related Conditions. Arch Gen Psychiatry. 2005;62:1097–106.
- Ruan WJ, Goldstein RB, Chou SP, Smith SM, Saha TD, Pickering RP, et al. The alcohol use disorder and associated disabilities interview schedule-IV (AUDADIS-IV): reliability of new psychiatric diagnostic modules and risk factors in a general population sample. Drug Alcohol Depend. 2008;92:27–36.
- American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 4th ed. Washington, DC: American Psychiatric Association; 1994.
- Jenkinson C, Layte R, Jenkinson D, Lawrence K, Petersen S, Paice C, et al. A shorter form health survey: can the SF-12 replicate results from the SF-36 in longitudinal studies? J Public Health Med. 1997;19:179–86.
- Steyerberg E. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer; 2009.
- Bellon JA, de Dios LJ, Moreno B, Monton-Franco C, GildeGomez-Barragan MJ, Sanchez-Celaya M, et al. Psychosocial and sociodemographic predictors of attrition in a longitudinal study of major depression in primary care: the predictD-Spain study. J Epidemiol Community Health. 2010;64:874–84.
- King M, Bottomley C, Bellon-Saameno J, Torres-Gonzalez F, Svab I, Rotar D, et al. Predicting onset of major depression in general practice attendees in Europe: extending the application of the predictD risk algorithm from 12 to 24 months. Psychol Med. 2013;43:1929–39.
- Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35:1925–31.
- Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and markers: evaluation of predictions and classifications. Rev Esp Cardiol. 2011;64:788–94.
- Steyerberg EW, Harrell Jr FE. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245–7.
- Van Calster B, Steyerberg EW, Harrell FE. Risk Prediction for Individuals. JAMA. 2015;314:1875.