
Measurement invariance testing of the PHQ-9 in a multi-ethnic population in Europe: the HELIUS study



In Western European countries, the prevalence of depressive symptoms is higher among ethnic minority groups, compared to the host population. We explored whether these inequalities reflect variance in the way depressive symptoms are measured, by investigating whether items of the PHQ-9 measure the same underlying construct in six ethnic groups in the Netherlands.


A total of 23,182 men and women aged 18–70 of Dutch, South-Asian Surinamese, African Surinamese, Ghanaian, Turkish or Moroccan origin were included in the HELIUS study and had answered at least one of the PHQ-9 items. We conducted multiple group confirmatory factor analyses (MGCFA), with increasingly stringent model constraints (i.e. assessing configural, metric, strong and strict measurement invariance (MI)), and regression analysis, to confirm comparability of PHQ-9 items across ethnic groups.


A one-factor model, where all nine items reflect a single underlying construct, showed acceptable model fit and was used for MI testing. In each subsequent step, change in goodness-of-fit measures did not exceed 0.015 (RMSEA) or 0.01 (CFI). Moreover, strict invariance models showed good or acceptable model fit (Men: RMSEA = 0.050; CFI = 0.985; Women: RMSEA = 0.058; CFI = 0.979), indicating between-group equality of item clusters, factor loadings, item thresholds and residual variances. Finally, regression analysis did not indicate potential ethnicity-related differential item functioning (DIF) of the PHQ-9.


This study provides evidence of measurement invariance of the PHQ-9 regarding ethnicity, implying that the observed inequalities in depressive symptoms cannot be attributed to DIF.



Depression is one of the leading causes of disease burden worldwide, and its prevalence is expected to increase further [1]. In 2010, major depressive disorder (MDD) accounted globally for 8.2% of years lived with disability. The prevalence of depression differs across demographic groups. For example, meta-analyses showed that individuals with low socioeconomic status (SES) have a higher risk of suffering from depression than individuals with high SES, and that the disease more often has a chronic course in the low SES group [2, 3]. Ethnic inequalities in depression have also been reported in European countries: ethnic minority groups show an increased risk of poor mental health in general, and depression in particular, compared to the host population [4,5,6,7].

Increased depression rates among ethnic minority populations are of particular concern, since increases in migration were observed for most western European countries over the past decades [8]. Migrants from non-EU countries face the largest challenges, with regard to socioeconomic conditions and health [8]. For example, in the Netherlands, considerable differences in the prevalence of depression and depressive symptoms were reported for Moroccan and Turkish immigrants, compared to the Dutch population [5, 9]. The 1-month prevalence of depressive disorders was 4% in adults of Dutch ethnic origin, whereas it was 7% in adults of Moroccan origin and even 15% among adults of Turkish origin [5].

A key question that emerges from the ethnic variation in prevalence rates is whether it reflects actual differences in the occurrence of depression or whether it is due to differences in interpretation or presentation of depressive symptoms in the questionnaire. It is not unlikely that true differences in prevalence rates exist, as depression among ethnic minority groups may be caused by difficulties experienced during or after migration [8, 10, 11]. Perceived ethnic discrimination, for instance, has been shown to contribute to depressive symptoms among ethnic minority groups [9]. Also, the higher prevalence of physical health problems among ethnic minority groups, compared with those of Dutch origin [12], may contribute to increased levels of depression among the minority groups. However, after accounting for differences in SES, perceived ethnic discrimination or physical disorders and limitations, ethnic differences in depression rates are still observed [7, 9, 13]. It is therefore important to explore the possibility that the interpretation or presentation of depressive symptoms, as assessed by a questionnaire, differs by ethnic background. The current study aims to explore whether differential item functioning may be an explanation for the observed ethnic variation in depression rates.

Differential item functioning (DIF) may occur when people from different ethnic groups respond to questions about their mental health in a different way. Depressive symptoms – such as feelings of sadness or disappointment – occur in all cultures and ethnic groups [14]. However, the way they are experienced and expressed may differ across cultures. For example, Chinese people may report feelings of boredom, pain or fatigue, rather than sadness [14]. Non-Western populations in general are often claimed to ‘somatize’ their mood disturbances [15], although other studies have shown that somatizing is rather a global tendency [16,17,18].

In order to draw conclusions regarding ethnic differences in the prevalence of depressive symptoms, one should verify whether items of a depression questionnaire measure the same concept in all groups, i.e. confirm that the questionnaire is measurement invariant. Measurement invariance implies that individuals’ characteristics which are not part of the construct of interest, such as gender or ethnicity, do not affect individual item scores, other than via the construct of depression [19, 20]. If the assumption of measurement invariance is violated, this implies that the items function differently across ethnic groups. For example, if two individuals of Turkish and Dutch ethnic origin who have a similar level of underlying depression are asked whether they have experienced fatigue, they should have the same probability of responding ‘more than half of the days’. If the expression of fatigue as a symptom of depression is more common for a Turkish individual, this may influence his total depressive symptom score, despite an equal underlying level of depression. Since DIF in one or multiple items would affect the extent to which these items correlate with the remaining items, establishing measurement invariance warrants a systematic analysis of the correlational patterns across items.

Following DSM-5 guidelines, a diagnosis of depression is based on the experience of at least five out of nine symptoms of depression in the last two weeks. We use the PHQ-9 (Patient Health Questionnaire) to assess depressive symptoms in this study, which has the advantage that the items precisely reflect these nine DSM-5 symptoms [21]: depressed mood, anhedonia, trouble sleeping, feeling tired, change in appetite, guilt or worthlessness, trouble concentrating, feeling slowed down or restless, or suicidal thoughts. The PHQ-9 is a well-known and often used measure of depressive symptoms and can be used to assess (significant) depressed mood [22], or as a continuous measure with scores ranging from 0 to 27 [21, 23].

Teresi et al. performed a review on DIF studies in depression measures, which were mainly - but not exclusively – focused on the Center for Epidemiological Studies Depression Scale (CES-D). They found that several items of these scales showed DIF with regard to demographic characteristics, such as age, gender and ethnicity [24]. However, findings on ethnicity-related DIF were inconsistent and none of the reviewed studies examined DIF across ethnic groups in Europe. More recently, Hirsch et al. studied measurement invariance of the PHQ-9 in a selection of about 350 primary care patients with at least one chronic disease. They compared Russian immigrants living in Germany with native-born Russians living in Russia and native-born Germans living in Germany, and concluded that the PHQ-9 measured the level of depressive symptoms in a similar way in these groups [25]. Baas et al. compared about 300 patients of Dutch Surinamese ethnic origin with patients of Dutch origin. They found that in women the PHQ-9 was measurement invariant for ethnicity, but in men, it was only partially measurement invariant [26].

There is a need for further research on whether the PHQ-9 measures depressive symptoms in a similar way across ethnic groups in the general population. Apart from the fact that they compared only two ethnic groups, the two studies mentioned above included patients with a high risk of depression [26], or with at least one chronic disease [25]. This does not provide evidence on whether the PHQ-9 assesses depressive symptoms similarly across ethnic groups in the general population, which is the usual source of prevalence estimates. The current study aimed to address this need by examining ethnicity-related measurement invariance of the PHQ-9, using data from the Dutch HELIUS study. Various ethnic groups, representative of current migrant groups in Europe (Turkish, Moroccan, South-Asian Surinamese, African Surinamese and Ghanaian origin), were included. This epidemiological study included large random samples of these five groups and a comparison group of Dutch ethnic origin (~24,000 in total and 2500–4600 per group), drawn from the general population of Amsterdam. Measurement invariance regarding ethnicity was assessed separately for men and women, since there is a consistent gender difference in the prevalence of depressive symptoms [27, 28].



The aim and design of the HELIUS (HEalthy LIfe in an Urban Setting) study have been described in detail elsewhere [29, 30]. In brief, the HELIUS study is a multi-ethnic cohort study conducted in Amsterdam, the Netherlands. Subjects were randomly selected, stratified by ethnicity, from the Amsterdam municipality register, and were sent an invitation letter (and a reminder after 2 weeks) by mail. We were able to contact 55% of those invited (55% among Dutch, 62% among Surinamese, 57% among Ghanaians, 46% among Turks, 48% among Moroccans), either by response card or after a home visit by an ethnically-matched interviewer. Of those, 50% agreed to participate (participation rate; 60% among Dutch, 51% among Surinamese, 61% among Ghanaians, 41% among Turks, 43% among Moroccans). Therefore, the overall response rate was 28%, with some variation across ethnic groups. After a positive response, participants received a confirmation letter of the appointment for the physical examination, including a digital or paper version of the questionnaire (depending on the preference of the subject). Participants who were unable to complete the questionnaire themselves were offered assistance from a trained ethnically-matched interviewer. The Medical Ethics Committee of the Academic Medical Center (AMC) approved the study protocols. Written informed consent was obtained from all participants involved in the study.
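The overall response rate follows from multiplying the contact rate by the participation rate. A minimal sketch of that arithmetic, using only the percentages quoted above; the dictionary and function names are illustrative, and the per-group products are derived here rather than reported by the study:

```python
# Contact and participation percentages as quoted in the text.
# "Surinamese" covers both Surinamese groups, as in the source.
contact = {"Dutch": 55, "Surinamese": 62, "Ghanaian": 57, "Turkish": 46, "Moroccan": 48}
participation = {"Dutch": 60, "Surinamese": 51, "Ghanaian": 61, "Turkish": 41, "Moroccan": 43}


def response_rate(contact_pct, participation_pct):
    """Response rate (%) = share contacted x share of contacted who agreed."""
    return contact_pct * participation_pct / 100


overall = response_rate(55, 50)  # 27.5, reported as 28% after rounding
by_group = {g: response_rate(contact[g], participation[g]) for g in contact}

for group, rate in by_group.items():
    print(f"{group}: {rate:.1f}%")
```

Under this calculation the derived group-level rates range from roughly 19% to 35%, which is the variation across ethnic groups that the text refers to.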

Of the 23,942 participants who filled in the HELIUS questionnaire, we excluded 586 respondents who did not belong to the six largest ethnic groups and an additional 174 respondents who did not fill in any of the PHQ-9 items. Respondents excluded because of missing data were most often of Ghanaian origin, and more often had a low or unknown education level, compared to included respondents. All respondents who missed some but not all items were retained in the measurement invariance analyses (n = 463), but some were excluded when computing PHQ-9 sum scores (for details we refer to the Measurements section). The majority of those missed only one item (n = 396), while the mean number of completed items ranged from 7.5 to 8 across ethnic groups.

The final sample consisted of 23,182 respondents of Dutch origin (n = 4635), South-Asian Surinamese origin (n = 3355), African Surinamese origin (n = 4428), Ghanaian origin (n = 2444), Turkish origin (n = 4028) and Moroccan origin (n = 4292).
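The sample flow can be checked with a few lines of arithmetic. This sketch only restates the counts given above; the variable names are illustrative:

```python
# Counts quoted in the text above.
questionnaire_respondents = 23_942
other_ethnic_groups = 586  # not in the six largest ethnic groups
no_phq9_items = 174        # answered none of the PHQ-9 items

group_sizes = {
    "Dutch": 4635,
    "South-Asian Surinamese": 3355,
    "African Surinamese": 4428,
    "Ghanaian": 2444,
    "Turkish": 4028,
    "Moroccan": 4292,
}

final_sample = questionnaire_respondents - other_ethnic_groups - no_phq9_items
print(final_sample, sum(group_sizes.values()))  # 23182 23182
```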


The Dutch version of the PHQ-9 was included in the HELIUS questionnaire [21]. All nine items have four response categories: 0 “not at all”, 1 “on several days”, 2 “on more than half of the days” and 3 “nearly every day”. Total sum scores range from 0 to 27. A participant was considered to have depressed mood when the sum score was greater than 9, and significant depressed mood when one or both of items 1 and 2 were answered with at least ‘more than half of the days’ and at least 5 of the 9 items were answered ‘more than half of the days’ or ‘nearly every day’. The final item (suicidal ideation) already counted if answered with ‘several days’ [22]. Missing item scores were handled only for the purpose of calculating the sum score. If one item was missing, we replaced it by the mean score of the other items and calculated the sum score as usual. If more than one item was missing, the sum score was not calculated (missing). In the subsequent measurement invariance analyses, all participants with missing items were included and missing items were not replaced.
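The scoring rules above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, missing items are represented as `None`, and the diagnostic algorithm is applied to complete responses only because the text does not say how it was handled with missing items.

```python
from typing import Optional, Sequence


def phq9_sum_score(items: Sequence[Optional[int]]) -> Optional[float]:
    """Sum score (0-27); a single missing item is replaced by the mean of the rest."""
    if len(items) != 9:
        raise ValueError("expected nine item scores")
    observed = [x for x in items if x is not None]
    n_missing = 9 - len(observed)
    if n_missing > 1:
        return None  # sum score not calculated
    if n_missing == 1:
        return sum(observed) + sum(observed) / 8  # mean of the other eight items
    return float(sum(observed))


def depressed_mood(items: Sequence[Optional[int]]) -> Optional[bool]:
    """Depressed mood: sum score greater than 9."""
    score = phq9_sum_score(items)
    return None if score is None else score > 9


def significant_depressed_mood(items: Sequence[int]) -> bool:
    """Item 1 or 2 scored >= 2, and at least five symptoms present.

    Assumes complete responses (an assumption of this sketch).
    """
    core = items[0] >= 2 or items[1] >= 2
    # Items 1-8 count at >= 2; item 9 (suicidal ideation) already counts at >= 1.
    n_symptoms = sum(x >= 2 for x in items[:8]) + (items[8] >= 1)
    return core and n_symptoms >= 5


example = [2, 2, 2, 2, 1, 0, 0, 0, 1]
print(phq9_sum_score(example), depressed_mood(example), significant_depressed_mood(example))
```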

Item 8 of the PHQ-9 originally contained two questions combined in a single item (“Moving or speaking so slowly that other people could have noticed? Or the opposite — being so fidgety or restless that you have been moving around a lot more than usual”), which appeared very difficult to answer when we pre-tested the HELIUS questionnaire. Therefore, in the HELIUS questionnaire, item 8 is divided into 2 items. For all analyses, these items were first combined into a single item, to make this item resemble the one from the original instrument. In all 9 items, we collapsed adjacent response categories in case they contained <5% of the sample, to ensure that endorsement rates were high enough for measurement invariance analyses. This resulted in one dichotomous item (item 9), four items with three categories (items 2, 6, 7 and 8) and four items with four categories (items 1, 3, 4 and 5) (Table 1).
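A minimal sketch of the category-collapsing step is given below. The text only states that adjacent categories containing less than 5% of the sample were collapsed; the merge direction used here (a sparse category is folded into the category below it, working from the top) and the function name are assumptions of this sketch.

```python
from collections import Counter


def collapse_sparse_categories(responses, min_share=0.05):
    """Return a mapping old_category -> new_category.

    Assumption: a category holding less than `min_share` of the responses
    is merged into the adjacent lower category, starting from the highest.
    A sparse lowest category is not handled in this sketch.
    """
    counts = Counter(responses)
    n = len(responses)
    groups = [[c] for c in sorted(counts)]  # each group becomes one new category
    i = len(groups) - 1
    while i > 0:
        share = sum(counts[c] for c in groups[i]) / n
        if share < min_share:
            groups[i - 1].extend(groups.pop(i))
        i -= 1
    return {c: new for new, group in enumerate(groups) for c in group}


# Hypothetical response distribution for an item with few high scores.
responses = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
mapping = collapse_sparse_categories(responses)
print(mapping)  # {0: 0, 1: 1, 2: 1, 3: 1}
```

With this hypothetical distribution the item becomes dichotomous, which is the kind of outcome described for item 9.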

Table 1 Item responses (%) of the PHQ-9

Ethnicity was defined according to the country of birth of the participants as well as that of their parents [31]. Specifically, a participant was considered of non-Dutch ethnicity if either of the following criteria was fulfilled: (1) born outside the Netherlands and at least one parent born outside the Netherlands (i.e., first generation); or (2) born in the Netherlands, but both parents born outside the Netherlands (i.e., second generation). In addition, as the Surinamese population consists of different ethnic groups which cannot be distinguished from each other on the basis of country of birth, self-reported ethnicity was used to determine Surinamese subgroups (either African or South-Asian origin). In order to be sure that the respondents report their geographic origin, rather than the group they feel belonging to, the question on self-identification was phrased in objective terms [31].
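The two country-of-birth criteria can be written as a small decision function. This is a sketch with hypothetical names; it returns `None` when neither criterion for non-Dutch ethnicity is met, and it does not cover the self-report step used to distinguish the Surinamese subgroups.

```python
from typing import Optional, Tuple


def migration_generation(born_in_nl: bool, parents_born_in_nl: Tuple[bool, bool]) -> Optional[str]:
    """Apply the two criteria for non-Dutch ethnicity described above."""
    n_parents_abroad = sum(not p for p in parents_born_in_nl)
    # (1) born abroad and at least one parent born abroad
    if not born_in_nl and n_parents_abroad >= 1:
        return "first generation"
    # (2) born in the Netherlands and both parents born abroad
    if born_in_nl and n_parents_abroad == 2:
        return "second generation"
    return None  # neither criterion for non-Dutch ethnicity is met


print(migration_generation(False, (False, True)))  # first generation
print(migration_generation(True, (False, False)))  # second generation
print(migration_generation(True, (True, False)))   # None
```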

Overall, there were three different modes of questionnaire completion: internet (43%), paper version (31%), or paper version with interviewer assistance (26%). Participants in the Dutch and in both Surinamese groups completed the questions in Dutch. Of the Ghanaian and Turkish subsamples, 78% and 32%, respectively, completed the questions in English or Turkish. Of the Moroccan subsample, about 33% were assisted by an interviewer who filled in the questionnaire in Dutch, but who often spoke Moroccan Arabic or Berber with the respondent. Unfortunately, no detailed information was available on the language of the interview of Moroccan respondents. Sensitivity analyses were performed to examine whether the PHQ-9 was measurement invariant regarding language (English vs. Dutch and Turkish vs. Dutch) and interview mode (internet, paper, or interview).

Statistical analysis

Multiple group confirmatory factor analysis (MGCFA)

Multiple group confirmatory factor analysis (MGCFA) was applied to investigate measurement invariance, because it enables the assessment of measurement invariance at different hierarchic levels, and in multiple groups at the same time [20]. In all analyses, ethnic minority groups were compared with the Dutch ethnic origin reference group.

MGCFA is a special case of confirmatory factor analysis (CFA), which requires a prespecified measurement model to be tested. Several studies assume or have shown that the PHQ-9 is unidimensional, indicating that all items measure the symptoms of a single underlying construct (depression) [21, 26, 32, 33]. However, others could not replicate its unidimensional structure [34,35,36,37], and provided evidence for a somatic and a non-somatic component, for instance [37]. To improve model fit, previous researchers have added residual covariances [32, 34] or excluded one or more items [34, 35]. Since there is inconsistency in the best fitting factor model, we first verified the unidimensionality of the PHQ-9, by comparing the fit of three models: 1) a one-factor model, 2) a two-factor model with items 3, 4, 5, 7 and 8 loading on factor 1 (somatic) and items 1, 2, 6 and 9 loading on factor 2 (non-somatic), and 3) a two-factor model based on exploratory factor analysis (EFA). The best of these three models was used as the baseline model for subsequent measurement invariance tests.

Testing of measurement invariance

Four hierarchic levels of measurement invariance were tested [20]. Each level implies that more constraints are added to the model (i.e. parameters are equally estimated across groups), with the fit of the model with more constraints being compared to the fit of the less constrained model (i.e. for the non-reference group these parameters were set free). If the more constrained model does not fit significantly worse in comparison with the model that has fewer constraints, this indicates measurement invariance at the tested level.

At the least stringent level, configural invariance indicates that the clustering of items and the factors that they represent is similar across groups. This was investigated by evaluating model fit of the baseline model separately for all ethnic groups. Metric invariance entails the similarity of factor loadings, and was tested by comparing a model that constrained all factor loadings to be equal across groups, with a configural model where factor loadings were freely estimated across groups. If metric invariance holds, the items load on the latent construct to the same extent for all groups. Strong (or scalar) invariance additionally entails the equality of item thresholds. If strong invariance holds this is evidence that there is no additive response bias, indicating that item responses are not systematically higher or lower in one group compared with the other group(s). Finally, strict invariance is the most stringent level and reflects that the residual variances, or error terms, of each item are similar across groups.
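The four levels can be summarized in the latent response formulation commonly used for ordinal items. The notation below is an assumption of this sketch (it is not taken from the study), and identification constraints are omitted: item j in group g has a continuous latent response with loading λ, the observed category is determined by thresholds τ, and ε is the residual.

```latex
% Latent response model for person i, item j, group g (assumed notation)
\begin{align*}
  y^{*(g)}_{ij} &= \lambda^{(g)}_{j}\,\eta^{(g)}_{i} + \varepsilon^{(g)}_{ij}, \\
  y^{(g)}_{ij} &= c \quad \text{if} \quad \tau^{(g)}_{j,c} < y^{*(g)}_{ij} \le \tau^{(g)}_{j,c+1}.
\end{align*}

% Nested constraints across groups
\begin{align*}
  \text{Configural:} &\quad \text{same item-factor pattern; } \lambda, \tau, \operatorname{Var}(\varepsilon) \text{ free across groups} \\
  \text{Metric:}     &\quad \lambda^{(g)}_{j} = \lambda_{j} \\
  \text{Strong:}     &\quad \lambda^{(g)}_{j} = \lambda_{j}, \;\; \tau^{(g)}_{j,c} = \tau_{j,c} \\
  \text{Strict:}     &\quad \lambda^{(g)}_{j} = \lambda_{j}, \;\; \tau^{(g)}_{j,c} = \tau_{j,c}, \;\;
                        \operatorname{Var}\bigl(\varepsilon^{(g)}_{ij}\bigr) = \theta_{j}
\end{align*}
```

Each level keeps the constraints of the previous one and adds a new set, which is why the models can be compared as a nested sequence.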

For all MGCFA analyses we applied Weighted Least Squares Means and Variance adjusted (WLSMV) estimation with theta parameterization in Mplus version 7.4 for statistical analysis with latent variables [38], in which the items were treated as ordinal variables [39]. For each successive step of MI testing, we applied the parameterization described in the Mplus manual [38].

Assessment of goodness-of-fit

Goodness-of-fit statistics were estimated for each model and standard criteria were used to evaluate them. The χ² statistic indicates the discrepancy between the covariance matrix of the observed data and the one that is predicted by the factor model. This statistic is sensitive to sample size and often rejects a good fitting model [40, 41]. Therefore, and because it is recommended to use several indices simultaneously [42], we additionally evaluated RMSEA (Root Mean Square Error of Approximation) and CFI (Comparative Fit Index) values, which are less sensitive to sample size [43]. A better model fit is indicated by a low RMSEA value and a high CFI value. RMSEA values lower than 0.08 or 0.05 indicate acceptable and good model fit, respectively. CFI values higher than 0.95 and 0.97 indicate acceptable and good model fit, respectively [44].
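The cut-offs above translate into a simple classification. In this sketch the function names and the label "poor" for values outside the acceptable range are illustrative choices; the example values are the fit indices of the strict invariance models reported in this article.

```python
def classify_rmsea(rmsea: float) -> str:
    """RMSEA below 0.05 is good, below 0.08 acceptable."""
    if rmsea < 0.05:
        return "good"
    if rmsea < 0.08:
        return "acceptable"
    return "poor"  # label chosen for this sketch


def classify_cfi(cfi: float) -> str:
    """CFI above 0.97 is good, above 0.95 acceptable."""
    if cfi > 0.97:
        return "good"
    if cfi > 0.95:
        return "acceptable"
    return "poor"  # label chosen for this sketch


# Strict invariance models as reported in the abstract and results.
for label, rmsea, cfi in [("Men", 0.050, 0.985), ("Women", 0.058, 0.979)]:
    print(label, classify_rmsea(rmsea), classify_cfi(cfi))
```

Applied to the reported values, both models come out as acceptable on RMSEA and good on CFI.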

Differences between successive measurement invariance models were tested using the DIFFtest procedure in Mplus. Similarly to χ², the DIFFtest is influenced to a large extent by sample size, and thus often rejects good fitting models [41]. Therefore, we also evaluated ΔCFI and ΔRMSEA between the more and less constrained models. Only a few simulation studies have reported cut-offs that indicate significant measurement non-invariance, and none of those examined more than two groups [40, 41, 45]. We decided to apply the most conservative cut-offs. Declines in CFI larger than 0.01 and increases in RMSEA larger than 0.015 indicated a significant worsening of fit [40, 41].
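The decision rule for comparing a more constrained model with a less constrained one can be sketched as below. The function name and the example fit values are hypothetical, and flagging a worsening when either criterion is exceeded is an assumption of this sketch, since the text does not state whether one or both must be exceeded.

```python
def fit_worsened(less_constrained, more_constrained,
                 max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Flag a relevant worsening of fit between two nested models.

    Each model is a dict with keys "cfi" and "rmsea".
    Assumption: exceeding either cut-off counts as worsening.
    """
    cfi_drop = less_constrained["cfi"] - more_constrained["cfi"]
    rmsea_rise = more_constrained["rmsea"] - less_constrained["rmsea"]
    return cfi_drop > max_cfi_drop or rmsea_rise > max_rmsea_rise


# Hypothetical fit values, for illustration only.
configural = {"cfi": 0.990, "rmsea": 0.045}
metric = {"cfi": 0.985, "rmsea": 0.050}
print(fit_worsened(configural, metric))  # False
```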

Impact of DIF on demographic health inequalities

With MGCFA we tested whether differences in overall factor structure were present. This method may be less powerful to detect DIF of individual items, because all item parameters are constrained across groups at the same time. We therefore performed additional tests which were targeted at individual items to explore more subtle levels of DIF which may remain undetected by the MGCFA approach. In case significant DIF at the item level was found, we examined the impact that adjustment for this DIF had on the magnitude of inequalities in depressive symptoms.

First, we conducted regression analysis to detect significant DIF at the item level. To that end, we first saved individual factor scores from each strict invariance model. With logistic regression, we predicted each dichotomized item score with the corresponding factor score and saved the residuals. The residuals represent the variation in item scores not explained by the underlying factor. Subsequently, we performed linear regression with the residuals as the dependent variable, and ethnicity and ethnicity*factor score as independent variables. This was done to conduct one overall test for uniform DIF (analogous to strong invariance) and non-uniform DIF (analogous to metric MI), respectively [46]. The explained variance (R²) of this model represents the predictive value of ethnicity for the item score, over and above the predictive value of the underlying factor, and was interpreted as indicative of DIF. Items with an R² of 2% or higher and significant regression coefficients for the predictors ethnicity or ethnicity*factor (p-value below 0.05) were selected as items with DIF [46].
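The two-step regression can be illustrated on simulated data. Everything in this sketch is an assumption made for illustration: the data are simulated rather than taken from HELIUS, only two groups are compared, the simulated latent score stands in for the saved factor score, and only the R² criterion is computed (the significance test of the coefficients is left out). The code uses the standard library only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of P(y = 1) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def dif_r2(factor, item, group):
    """R^2 of the model: residual ~ group + group * factor (two groups)."""
    b0, b1 = fit_logistic(factor, item)
    resid = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(factor, item)]
    grand_mean = sum(resid) / len(resid)
    sst = sum((r - grand_mean) ** 2 for r in resid)
    # With a 0/1 group dummy, least squares on [1, group, group * factor]
    # fits the reference group by its mean and the other group by its own line.
    ref = [r for r, g in zip(resid, group) if g == 0]
    foc = [(x, r) for x, r, g in zip(factor, resid, group) if g == 1]
    ref_mean = sum(ref) / len(ref)
    sse = sum((r - ref_mean) ** 2 for r in ref)
    mx = sum(x for x, _ in foc) / len(foc)
    mr = sum(r for _, r in foc) / len(foc)
    slope = sum((x - mx) * (r - mr) for x, r in foc) / sum((x - mx) ** 2 for x, _ in foc)
    sse += sum((r - (mr + slope * (x - mx))) ** 2 for x, r in foc)
    return 1.0 - sse / sst


def simulate(n, shift, seed):
    """Simulate a dichotomous item; `shift` adds uniform DIF for group 1."""
    rng = random.Random(seed)
    factor, item, group = [], [], []
    for i in range(n):
        g = i % 2
        theta = rng.gauss(0.0, 1.0)
        p = sigmoid(1.5 * theta + shift * g)
        factor.append(theta)
        group.append(g)
        item.append(1 if rng.random() < p else 0)
    return factor, item, group


r2_no_dif = dif_r2(*simulate(4000, shift=0.0, seed=1))
r2_dif = dif_r2(*simulate(4000, shift=2.0, seed=1))
print(f"no DIF: R2 = {r2_no_dif:.4f}; uniform DIF: R2 = {r2_dif:.4f}")
```

In the simulated item without DIF, group membership explains almost none of the residual variance, whereas a large built-in shift pushes R² well above the 2% criterion.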

Second, if DIF in any of the items was found, we returned to the MGCFA analysis and estimated the impact of adjusting for this DIF. We aimed to compare ethnic inequalities in factor scores from models that did and did not adjust for DIF. Factor scores from the previously described strict invariance models were regarded as unadjusted for DIF. Adjustment for DIF was done by adapting the strict invariance model so that for items with DIF all threshold constraints across groups were set free. Using means and variances of unadjusted and adjusted factor scores, we estimated two sets of standardized mean differences (Cohen’s d) across ethnic groups. We evaluated whether the 95% confidence intervals around the d values unadjusted and adjusted for DIF overlapped, which would indicate that the statistically significant DIF that was observed had low impact on the magnitude of demographic health inequalities. Cohen’s d was calculated using the pooled SD as the denominator; conventional thresholds were used to interpret effect sizes as small (d = 0.2), medium (d = 0.5) and large (d = 0.8) [47].
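The effect size comparison can be sketched as follows. Cohen's d with a pooled SD is as described above; the confidence interval uses a common large-sample approximation of the standard error of d, which is an assumption of this sketch because the text does not state how the intervals were computed. Function names and the example values are hypothetical.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with the pooled SD as denominator."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d (large-sample standard error; assumed here)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical factor score summaries for two groups of 100 respondents.
d = cohens_d(0.5, 1.0, 100, 0.0, 1.0, 100)
low, high = d_confidence_interval(d, 100, 100)
print(f"d = {d:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Computing this once with unadjusted and once with DIF-adjusted factor scores, and passing both intervals to `intervals_overlap`, mirrors the comparison described in the text.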


Sample characteristics

Table 2 shows the demographic characteristics and distribution of the PHQ-9 in each ethnic group, and by gender. In both genders, PHQ-9 sum scores were highest among respondents with Turkish ethnic origin and lowest among the group with Ghanaian ethnic origin. A similar pattern of ethnic differences emerged for the prevalence of (significant) depressed mood.

Table 2 Sample characteristics by ethnicity

Measurement invariance analyses

Three different factor models were compared to obtain an adequate baseline model for further analysis (Table 3): a one-factor model, a two-factor model based on the literature, and a two-factor model based on EFA. The EFA two-factor model was slightly different from, and had better fit than, the two-factor model that was examined in previous studies. Although the two-factor models generally showed better fit than the one-factor model, we decided to continue with the one-factor model for two reasons. First, in both two-factor models the two factors showed a high correlation, indicating that they reflect two largely overlapping constructs. Second, the one-factor model had good model fit according to CFI, and also adequate model fit according to RMSEA after residual covariances (between items 1 and 2, items 3 and 4 and items 7 and 8) were added to the model. The fit of this one-factor model is shown for each ethnic group and gender in Table 4. Model fit was better in men than in women, but in all groups RMSEA and CFI values were indicative of acceptable or good model fit.

Table 3 Comparing the model fit of one-factor and two-factor models
Table 4 Model fit of the baseline one-factor model in each subgroup

Results from the MGCFA are shown in Table 5. Adding constraints for equal factor loadings, item thresholds and residual variances did not lead to significantly reduced model fit, compared to the least constrained (configural) model. The final strict measurement invariance models for both men and women showed adequate model fit (Men: RMSEA = 0.050; CFI = 0.985; Women: RMSEA = 0.058; CFI = 0.979), while ΔRMSEA and ΔCFI for increasingly stringent tests of measurement invariance never exceeded the critical values of 0.015 and 0.01, respectively. Since model fit, according to RMSEA, differed more between ethnic groups among women than among men (Table 4), we examined whether this was due to DIF with respect to gender in some but not in other ethnic groups. However, the results showed that this was not the case: items of the PHQ-9 were measurement invariant for gender in all ethnic groups (Table 6).

Table 5 Measurement invariance tests regarding ethnicity, by gender

The additional regression analyses, targeted at individual items, revealed no items with DIF related to ethnicity (Tables 7, 8 and 9). Furthermore, sensitivity analyses confirmed that the PHQ-9 was measurement invariant with regard to language and interview mode (Tables 10 and 11).


Measurement invariance of the PHQ-9 regarding ethnicity was examined in a population-based sample including over 23,000 participants. Our results indicated that the PHQ-9 was measurement invariant across groups with Dutch, South-Asian Surinamese, African Surinamese, Ghanaian, Turkish and Moroccan ethnic origin. As such, the observed ethnic differences in PHQ-9 scores may be attributed to true differences in depressive symptoms, and not to factors related to the measurement of these symptoms.

Our results should be interpreted in view of some limitations. Firstly, non-response to this study may in particular be a concern in those with the poorest mental health, the lowest proficiency in the Dutch language, or in the least acculturated individuals. These factors may influence how the PHQ-9 is responded to, and as such non-response may influence the generalizability of our results. Secondly, this study investigated ethnicity-related DIF for the PHQ-9 and the results can therefore not be generalized to other demographic characteristics or to other depression instruments. For example, Schrier et al. found DIF in five items of the CIDI when comparing respondents with Turkish and Dutch origin in the Netherlands [48]. In addition, in their review Teresi et al. concluded that several items of depression scales showed DIF with regard to demographic or health characteristics [24]. None of the reviewed studies examined DIF across ethnic groups in Europe, however.

Our selection of statistical approaches and criteria for significance and relevance may have influenced the conclusions that were drawn. We applied MGCFA, which has been shown to perform well in detecting different levels of DIF [49], using model fit parameters that were recommended in previous studies [40, 41, 44, 45]. However, little is known about which criteria should be used when sample sizes are large, or when more than two groups are compared at the same time. We recommend that more research is done in this field, to guide researchers regarding which methods and criteria for significant and relevant DIF should or should not be applied. In the current study the results of both MGCFA analysis and logistic regression analysis pointed in the same direction, which strengthens our conclusion about the absence of ethnicity-related DIF for items of the PHQ-9.

Empirical evidence of measurement invariance is essential for making valid health comparisons across demographic groups. Our results imply that the ethnic inequalities in depressive symptoms, that were observed in our study as well as in other studies [5, 6], reflect true differences, and are not likely the result of measurement bias. Thus, the PHQ-9 can be used to make comparisons regarding the prevalence of (significant) depressed mood in groups with different ethnic background in the Netherlands. The Dutch had the lowest prevalence of 3% for significant depressed mood, and the Turks had the highest rate (11% in men, 15% in women), with the rates for the other ethnic minority groups lying in between. Interestingly, the GBD 2010 data indicate that these ethnic minority groups (except Ghanaians) have lower MDD prevalence in their countries of origin [1], which may suggest that adverse circumstances in the host societies (e.g., ethnic discrimination, acculturative stress) might be at play here.

The pattern of ethnic inequalities in (significant) depressed mood that we observed is somewhat similar to what was found by de Wit et al., who used the CIDI (Composite International Diagnostic Interview) to assess depression. They reported the 1-month prevalence of depressive disorders (MDD or dysthymia) in respondents with Surinamese (1%), Dutch (4%), Moroccan (7%) and Turkish ethnic origin (15%) [5]. The pattern of inequalities (increased prevalence in ethnic minorities, with the lowest rates in Ghanaians and African Surinamese, and higher rates in Turks, Moroccans and South-Asian Surinamese) suggests that migration-related factors may be ethnic-specific, and that ethnic minority groups should not be combined without taking the differences between these groups into account [50]. Future studies could be designed to investigate to what extent genetic vs. cultural variation contributes to these ethnic differences in the prevalence of depression.

To our knowledge, our study is the first to assess ethnicity-related measurement invariance of the PHQ-9 in a population-based sample. Previous studies on ethnicity-related DIF included people with at least one chronic disease [25], with HIV [32], or with a high risk of depression [26, 33, 51]. In two studies this was done by administering the full PHQ-9 only if respondents endorsed at least one of the key items, for example anhedonia and depressed mood [33, 51]. In particular, the inclusion of high-risk patients provides less information on the ethnic diversity among respondents who do not have a high level of depression but nevertheless might respond differently to the questionnaire. This influences the rates of depression that are found in the general population. Moreover, this study assessed measurement invariance in a variety of ethnic groups that are representative of migrant groups in Europe. In a previous study, Baas et al. compared two ethnic groups in the Netherlands, both including individuals with a high risk of depression [26]. They found that the item on psychomotor problems (item 8) had a higher factor loading and threshold among Surinamese men, compared to Dutch men. This item originally contains two parts (moving or speaking slowly, or being fidgety and restless), which appeared very difficult to answer when we pre-tested the questionnaire. In the HELIUS questionnaire item 8 was therefore divided into 2 items (see Table 1), and it might be that this adaptation has led to the absence of reporting differences between ethnic groups, whereas they were present in the study by Baas et al.

A strong point of this study is that we were able to additionally study possible DIF due to language and interview mode, given the heterogeneity in our sample regarding these factors. We compared Turks who completed the PHQ-9 in Turkish vs. Dutch, and Ghanaians who completed the PHQ-9 in English vs. Dutch. In addition, we compared groups who completed the questionnaire through the internet, on paper, or with the help of an interviewer. We found that the PHQ-9 was measurement invariant regarding language and interview mode. This result is reassuring and confirms the applicability of the PHQ-9 in different samples and settings.
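The stepwise comparison behind a conclusion of this kind can be sketched in a few lines. The following is a minimal illustration, not the authors' analysis code: it assumes the cut-offs stated in the abstract (a change of at most 0.015 in RMSEA and at most 0.01 in CFI between successive models), and the class `Fit`, the function `highest_invariance_level` and all fit values in the usage example are hypothetical names and numbers introduced here for illustration only.

```python
from dataclasses import dataclass

# Cut-offs as stated in the article's abstract: a more constrained model is
# retained when RMSEA rises by at most 0.015 and CFI falls by at most 0.01.
MAX_DELTA_RMSEA = 0.015
MAX_DELTA_CFI = 0.01


@dataclass(frozen=True)
class Fit:
    """Fit indices of one nested invariance model (hypothetical container)."""
    level: str
    rmsea: float
    cfi: float


def highest_invariance_level(models):
    """Return the most constrained level whose change in fit stays within the cut-offs.

    `models` must be ordered from least to most constrained
    (configural, metric, strong, strict). The configural model is assumed
    to have acceptable fit on its own; that is not checked here.
    """
    supported = models[0].level
    for previous, current in zip(models, models[1:]):
        delta_rmsea = current.rmsea - previous.rmsea  # worse fit = higher RMSEA
        delta_cfi = previous.cfi - current.cfi        # worse fit = lower CFI
        if delta_rmsea > MAX_DELTA_RMSEA or delta_cfi > MAX_DELTA_CFI:
            break
        supported = current.level
    return supported


# Illustrative values only; these are NOT results from the HELIUS study.
example = [
    Fit("configural", rmsea=0.050, cfi=0.990),
    Fit("metric", rmsea=0.052, cfi=0.988),
    Fit("strong", rmsea=0.055, cfi=0.985),
    Fit("strict", rmsea=0.056, cfi=0.983),
]
print(highest_invariance_level(example))  # prints "strict"
```

The function stops at the first step where either index worsens by more than its cut-off, which mirrors the idea that each level of invariance is only meaningful if the preceding, less constrained level holds.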


With the growing ethnic diversity in European populations, there is a need for evidence on the reliability of instruments used to study the mental health of ethnic minority groups. The PHQ-9 is often used to measure depressive symptoms in clinical practice or for research purposes. This study provides evidence for measurement invariance of the PHQ-9 in an ethnically diverse sample in the Netherlands. This implies that the items of the PHQ-9 function similarly in people with South-Asian Surinamese, African Surinamese, Ghanaian, Turkish and Moroccan ethnic backgrounds as in those of Dutch ethnic origin. Moreover, we showed that language (Turkish vs. Dutch in Turks, and English vs. Dutch in Ghanaians) and interview mode (interview, paper, or internet) did not result in measurement bias, indicating that the PHQ-9 can be used in a variety of settings to compare the level of depressive symptoms across ethnic groups. In conclusion, differences in depression scores and rates of depression across ethnic groups are unlikely to be due to assessment bias, which suggests that the contribution of other factors, such as migration history and migration status, should be explored in future studies.



Abbreviations

CFA: Confirmatory Factor Analysis
CFI: Comparative Fit Index
CIDI: Composite International Diagnostic Interview
DIF: Differential Item Functioning
MDD: Major Depressive Disorder
MGCFA: Multiple Group Confirmatory Factor Analysis
MI: Measurement Invariance
PHQ-9: Patient Health Questionnaire-9
RMSEA: Root Mean Square Error of Approximation
SES: Socioeconomic status


References

1. Ferrari AJ, Charlson FJ, Norman RE, Patten SB, Freedman G, Murray CJL, Vos T, Whiteford HA. Burden of depressive disorders by country, sex, age, and year: findings from the global burden of disease study 2010. PLoS Med. 2013;10(11):e1001547.
2. Lorant V, Deliège D, Eaton W, Robert A, Philippot P, Ansseau M. Socioeconomic inequalities in depression: a meta-analysis. Am J Epidemiol. 2003;157(2):98–112.
3. Fryers T, Melzer D, Jenkins R. Social inequalities and the common mental disorders. Soc Psychiatry Psychiatr Epidemiol. 2003;38(5):229–37.
4. Tinghög P, Hemmingsson T, Lundberg I. To what extent may the association between immigrant status and mental illness be explained by socioeconomic factors? Soc Psychiatry Psychiatr Epidemiol. 2007;42(12):990–6.
5. de Wit MAS, Tuinebreijer WC, Dekker J, Beekman A-JTF, Gorissen WHM, Schrier AC, Penninx BWJH, Komproe IH, Verhoeff AP. Depressive and anxiety disorders in different ethnic groups. Soc Psychiatry Psychiatr Epidemiol. 2008;43(11):905–12.
6. Missinne S, Bracke P. Depressive symptoms among immigrants and ethnic minorities: a population based study in 23 European countries. Soc Psychiatry Psychiatr Epidemiol. 2012;47(1):97–109.
7. Levecque K, Lodewyckx I, Vranken J. Depression and generalised anxiety in the general population in Belgium: a comparison between native and immigrant groups. J Affect Disord. 2007;97(1):229–39.
8. Rechel B, Mladovsky P, Ingleby D, Mackenbach JP, McKee M. Migration and health in an increasingly diverse Europe. Lancet. 2013;381(9873):1235–45.
9. Ikram UZ, Snijder MB, Fassaert TJL, Schene AH, Kunst AE, Stronks K. The contribution of perceived ethnic discrimination to the prevalence of depression. The European Journal of Public Health. 2015;25(2):243–8.
10. Bhugra D. Migration and mental health. Acta Psychiatr Scand. 2004;109(4):243–58.
11. Bhugra D. Migration and depression. Acta Psychiatr Scand. 2003;108:67–72.
12. Agyemang C, Denktas S, Bruijnzeels M, Foets M. Validity of the single-item question on self-rated health status in first generation Turkish and Moroccans versus native Dutch in the Netherlands. Public Health. 2006;120(6):543–50.
13. Van der Wurff F, Beekman A, Dijkshoorn H, Spijker J, Smits C, Stek M, Verhoeff A. Prevalence and risk-factors for depression in elderly Turkish and Moroccan migrants in the Netherlands. J Affect Disord. 2004;83(1):33–41.
14. Kleinman A, Good B. Culture and depression. N Engl J Med. 2004;351:951–2.
15. Spijker J, van der Wurff FB, Poort EC, Smits CHM, Verhoeff AP, Beekman ATF. Depression in first generation labour migrants in Western Europe: the utility of the Center for Epidemiologic Studies Depression Scale (CES-D). International Journal of Geriatric Psychiatry. 2004;19(6):538–44.
16. Kirmayer LJ. Cultural variations in the clinical presentation of depression and anxiety: implications for diagnosis and treatment. J Clin Psychiatry. 2001;62:22–30.
17. Simon GE, Goldberg DP, Von Korff M, Üstun TB. Understanding cross-national differences in depression prevalence. Psychol Med. 2002;32(04):585–94.
18. Simon GE, VonKorff M, Piccinelli M, Fullerton C, Ormel J. An international study of the relation between somatic symptoms and depression. N Engl J Med. 1999;341(18):1329–35.
19. Groenvold M, Bjorner JB, Klee MC, Kreiner S. Test for item bias in a quality of life questionnaire. J Clin Epidemiol. 1995;48(6):805–16.
20. Gregorich SE. Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Med Care. 2006;44(11 Suppl 3):S78.
21. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13.
22. Wittkampf KA, Naeije L, Schene AH, Huyser J, van Weert HC. Diagnostic accuracy of the mood module of the patient health questionnaire: a systematic review. Gen Hosp Psychiatry. 2007;29(5):388–95.
23. Kroenke K, Spitzer RL, Williams JB, Löwe B. The patient health questionnaire somatic, anxiety, and depressive symptom scales: a systematic review. Gen Hosp Psychiatry. 2010;32(4):345–59.
24. Teresi JA, Ramirez M, Lai J-S, Silver S. Occurrences and sources of differential item functioning (DIF) in patient-reported outcome measures: description of DIF methods, and review of measures of depression, quality of life and general health. Psychol Sci Q. 2008;50(4):538–8.
25. Hirsch O, Donner-Banzhoff N, Bachmann V. Measurement equivalence of four psychological questionnaires in native-born Germans, Russian-speaking immigrants, and native-born Russians. J Transcult Nurs. 2013;24(3):225–35.
26. Baas KD, Cramer AO, Koeter MW, van de Lisdonk EH, van Weert HC, Schene AH. Measurement invariance with respect to ethnicity of the patient health Questionnaire-9 (PHQ-9). J Affect Disord. 2011;129(1):229–35.
27. Kessler RC, McGonagle KA, Swartz M, Blazer DG, Nelson CB. Sex and depression in the National Comorbidity Survey I: lifetime prevalence, chronicity and recurrence. J Affect Disord. 1993;29(2–3):85–96.
28. Nolen-Hoeksema S, Larson J, Grayson C. Explaining the gender difference in depressive symptoms. J Pers Soc Psychol. 1999;77(5):1061.
29. Snijder MB, Galenkamp H, Prins M, Derks EM, Peters RJ, Zwinderman AH, Stronks K. Cohort Profile: the Healthy Life in an Urban Setting (HELIUS) study in Amsterdam, the Netherlands. BMJ Open. 2017. in press.
30. Stronks K, Snijder MB, Peters RJ, Prins M, Schene AH, Zwinderman AH. Unravelling the impact of ethnicity on health in Europe: the HELIUS study. BMC Public Health. 2013;13(1):1–10.
31. Stronks K, Kulu-Glasgow I, Agyemang C. The utility of ‘country of birth’ for the classification of ethnic groups in health research: the Dutch experience. Ethnicity & Health. 2009;14(3):255–69.
32. Crane P, Gibbons L, Willig J, Mugavero M, Lawrence S, Schumacher J, Saag M, Kitahata M, Crane H. Measuring depression levels in HIV-infected patients as part of routine clinical care using the nine-item patient health questionnaire (PHQ-9). AIDS Care. 2010;22(7):874–85.
33. Huang FY, Chung H, Kroenke K, Delucchi KL, Spitzer RL. Using the patient health questionnaire-9 to measure depression among racially and ethnically diverse primary care patients. J Gen Intern Med. 2006;21(6):547–52.
34. Forkmann T, Gauggel S, Spangenberg L, Brähler E, Glaesmer H. Dimensional assessment of depressive severity in the elderly general population: psychometric evaluation of the PHQ-9 using Rasch analysis. J Affect Disord. 2013;148(2):323–30.
35. Kendel F, Wirtz M, Dunkel A, Lehmkuhl E, Hetzer R, Regitz-Zagrosek V. Screening for depression: Rasch analysis of the dimensional structure of the PHQ-9 and the HADS-D. J Affect Disord. 2010;122(3):241–6.
36. Beard C, Hsu K, Rifkin L, Busch A, Björgvinsson T. Validation of the PHQ-9 in a psychiatric sample. J Affect Disord. 2016;193:267–73.
37. Elhai JD, Contractor AA, Tamburrino M, Fine TH, Prescott MR, Shirley E, Chan PK, Slembarski R, Liberzon I, Galea S. The factor structure of major depression symptoms: a test of four competing models using the patient health Questionnaire-9. Psychiatry Res. 2012;199(3):169–73.
38. Muthén LK, Muthén BO. Mplus User’s Guide. Seventh edition. Los Angeles, CA: Muthén & Muthén; 1998–2015.
39. Muthén B, Asparouhov T. Latent variable analysis with categorical outcomes: multiple-group and growth modeling in Mplus. Mplus web notes. 2002;4(5):1–22.
40. Chen FF. Sensitivity of goodness of fit indexes to lack of measurement invariance. Struct Equ Model. 2007;14(3):464–504.
41. Cheung GW, Rensvold RB. Evaluating goodness-of-fit indexes for testing measurement invariance. Struct Equ Model. 2002;9(2):233–55.
42. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Model Multidiscip J. 1999;6(1):1–55.
43. Fan X, Sivo SA. Sensitivity of fit indices to model misspecification and model types. Multivar Behav Res. 2007;42(3):509–29.
44. Schermelleh-Engel K, Moosbrugger H, Müller H. Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods of psychological research online. 2003;8(2):23–74.
45. Meade AW, Johnson EC, Braddy PW. Power and sensitivity of alternative fit indices in tests of measurement invariance. J Appl Psychol. 2008;93(3):568.
46. Bjorner JB, Kosinski M, Ware JE Jr. Calibration of an item pool for assessing the burden of headaches: an application of item response theory to the headache impact test (HIT™). Qual Life Res. 2003;12(8):913–33.
47. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. New Jersey: Lawrence Erlbaum Associates; 1988.
48. Schrier AC, de Wit MAS, Rijmen F, Tuinebreijer WC, Verhoeff AP, Kupka RW, Dekker J, Beekman ATF. Similarity in depressive symptom profile in a population-based study of migrants in the Netherlands. Soc Psychiatry Psychiatr Epidemiol. 2010;45(10):941–51.
49. Stark S, Chernyshenko OS, Drasgow F. Detecting differential item functioning with confirmatory factor analysis and item response theory: toward a unified strategy. J Appl Psychol. 2006;91(6):1292.
50. Swinnen SGHA, Selten J-P. Mood disorders and migration: meta-analysis. Br J Psychiatry. 2007;190(1):6–10.
51. Hepner KA, Morales LS, Hays RD, Edelen MO, Miranda J. Evaluating differential item functioning of the PRIME-MD mood module among impoverished black and white women in primary care. Womens Health Issues. 2008;18(1):53–61.



We acknowledge the AMC Biobank for their support in biobank management and high-quality storage of collected samples. We are most grateful to the participants of the HELIUS study and the management team, research nurses, interviewers, research assistants and other staff who have taken part in gathering the data of this study.


The HELIUS study is conducted by the Academic Medical Center Amsterdam and the Public Health Service of Amsterdam. Both organisations provided core support for HELIUS. The HELIUS study is also funded by the Dutch Heart Foundation (2010 T084), the Netherlands Organization for Health Research and Development (ZonMw: 200500003), the European Union (FP-7: 278901), and the European Fund for the Integration of non-EU immigrants (EIF: 2013EIF013). The study reported here was additionally supported by a grant from the Netherlands Organisation for Scientific Research (NWO: 319–20-002). The funders had no role in study design, data collection, analysis, interpretation of data, or in writing the manuscript.

Availability of data and materials

The data that support the findings of this study are available from the HELIUS research cohort, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Dr. Snijder is the Data Collection Coordinator of HELIUS and may be contacted with further questions. Researchers interested in further collaboration with HELIUS may consult the study's website.

Ethics approval and consent to participate

The HELIUS study is conducted in accordance with the 1964 Helsinki Declaration and has been approved by the Academic Medical Center (AMC) Ethical Review Board. Written informed consent was obtained from all participants involved in the study.

Author information




HG, KS, MBS and EMD conceived and designed the work. HG performed the analyses and drafted the manuscript. All authors critically revised the manuscript for important intellectual content and approved the version to be published.

Corresponding author

Correspondence to Henrike Galenkamp.

Ethics declarations

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Table 6 Measurement invariance tests regarding gender, by ethnic group
Table 7 Results linear regression analysis on item-specific DIF
Table 8 Linear regression: Residuals predicted by ethnicity
Table 9 Linear regression: Residuals predicted by ethnicity and ethnicity*factor scores
Table 10 Sensitivity analyses: model fit in separate groups regarding interview mode and language
Table 11 Sensitivity analyses: measurement invariance analyses regarding interview mode and language

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Galenkamp, H., Stronks, K., Snijder, M.B. et al. Measurement invariance testing of the PHQ-9 in a multi-ethnic population in Europe: the HELIUS study. BMC Psychiatry 17, 349 (2017).



  • Measurement invariance
  • Differential item functioning
  • Confirmatory factor analysis
  • PHQ-9
  • Depressive symptoms
  • HELIUS study