Measurement invariance testing of the PHQ-9 in a multi-ethnic population in Europe: the HELIUS study

Background: In Western European countries, the prevalence of depressive symptoms is higher among ethnic minority groups, compared to the host population. We explored whether these inequalities reflect variance in the way depressive symptoms are measured, by investigating whether items of the PHQ-9 measure the same underlying construct in six ethnic groups in the Netherlands.
Methods: A total of 23,182 men and women aged 18–70 of Dutch, South-Asian Surinamese, African Surinamese, Ghanaian, Turkish or Moroccan origin were included in the HELIUS study and had answered at least one of the PHQ-9 items. We conducted multiple group confirmatory factor analyses (MGCFA), with increasingly stringent model constraints (i.e. assessing Configural, Metric, Strong and Strict measurement invariance (MI)), and regression analysis, to confirm comparability of PHQ-9 items across ethnic groups.
Results: A one-factor model, where all nine items reflect a single underlying construct, showed acceptable model fit and was used for MI testing. In each subsequent step, change in goodness-of-fit measures did not exceed 0.015 (RMSEA) or 0.01 (CFI). Moreover, strict invariance models showed good or acceptable model fit (Men: RMSEA = 0.050; CFI = 0.985; Women: RMSEA = 0.058; CFI = 0.979), indicating between-group equality of item clusters, factor loadings, item thresholds and residual variances. Finally, regression analysis did not indicate potential ethnicity-related differential item functioning (DIF) of the PHQ-9.
Conclusions: This study provides evidence of measurement invariance of the PHQ-9 regarding ethnicity, implying that the observed inequalities in depressive symptoms cannot be attributed to DIF.


Background
Depression is one of the leading causes of disease burden worldwide, and its prevalence is only expected to increase further [1]. In 2010, major depressive disorder (MDD) accounted globally for 8.2% of years lived with disability. The prevalence of depression differs across demographic groups. For example, meta-analyses showed that individuals with low socioeconomic status (SES) have a higher risk of suffering from a depression, compared to individuals with high SES, while the disease more often has a chronic course in the low SES group [2,3]. Ethnic inequalities in depression have also been reported in European countries: ethnic minority groups show an increased risk of poor mental health in general, and depression in particular, compared to the host population [4][5][6][7].
Increased depression rates among ethnic minority populations are of particular concern, since increases in migration were observed for most western European countries over the past decades [8]. Migrants from non-EU countries face the largest challenges, with regard to socioeconomic conditions and health [8]. For example, in the Netherlands, considerable differences in the prevalence of depression and depressive symptoms were reported for Moroccan and Turkish immigrants, compared to the Dutch population [5,9]. The 1-month prevalence of depressive disorders was 4% in adults of Dutch ethnic origin, whereas it was 7% in adults of Moroccan origin and even 15% among adults of Turkish origin [5].
A key question that emerges from the ethnic variation in prevalence rates is whether it reflects actual differences in the occurrence of depression or whether it is due to differences in interpretation or presentation of depressive symptoms in the questionnaire. It is not unlikely that true differences in prevalence rates exist, as depression among ethnic minority groups may be caused by difficulties experienced during or after migration [8,10,11]. Perceived ethnic discrimination, for instance, has been shown to contribute to depressive symptoms among ethnic minority groups [9]. Also, the higher prevalence of physical health problems among ethnic minority groups, compared with those of Dutch origin [12], may contribute to increased levels of depression among the minority groups. However, after accounting for differences in SES, perceived ethnic discrimination or physical disorders and limitations, ethnic differences in depression rates are still observed [7,9,13]. It is therefore important to explore the possibility that the interpretation or presentation of depressive symptoms, as assessed by a questionnaire, differs by ethnic background. The current study aims to explore whether differential item functioning may be an explanation for the observed ethnic variation in depression rates.
Differential item functioning (DIF) may occur when people from different ethnic groups respond to questions about their mental health in a different way. Depressive symptoms, such as feelings of sadness or disappointment, occur in all cultures and ethnic groups [14]. However, the way they are experienced and expressed may differ across cultures. For example, Chinese people may report feelings of boredom, pain or fatigue, rather than sadness [14]. Non-Western populations in general are often claimed to 'somatize' their mood disturbances [15], although other studies have shown that somatizing is rather a global tendency [16][17][18].
In order to draw conclusions regarding ethnic differences in the prevalence of depressive symptoms, one should verify whether items of a depression questionnaire measure the same concept in all groups, i.e. confirm that the questionnaire is measurement invariant. Measurement invariance implies that individuals' characteristics which are not part of the construct of interest, such as gender or ethnicity, do not affect individual item scores, other than via the construct of depression [19,20]. If the assumption of measurement invariance is violated, this implies that the items function differently across ethnic groups. For example, if two individuals of Turkish and Dutch ethnic origin who have a similar level of underlying depression are asked whether they have experienced fatigue, they should have the same probability of responding 'more than half of the days'. If the expression of fatigue as a symptom of depression is more common for a Turkish individual, this may influence his total depressive symptom score, despite an equal underlying level of depression. Since DIF in one or multiple items would affect the extent to which these items correlate with the remaining items, establishing measurement invariance warrants a systematic analysis of the correlational patterns across items.
Following DSM-5 guidelines, a diagnosis of depression is based on the experience of at least five out of nine symptoms of depression in the last two weeks. We use the PHQ-9 (Patient Health Questionnaire) to assess depressive symptoms in this study, which has the advantage that the items precisely reflect these nine DSM-5 symptoms [21]: depressed mood, anhedonia, trouble sleeping, feeling tired, change in appetite, guilt or worthlessness, trouble concentrating, feeling slowed down or restless, or suicidal thoughts. The PHQ-9 is a well-known and often used measure of depressive symptoms and can be used to assess (significant) depressed mood [22], or as a continuous measure with scores ranging from 0 to 27 [21,23].
Teresi et al. performed a review on DIF studies in depression measures, which were mainly, but not exclusively, focused on the Center for Epidemiological Studies Depression Scale (CES-D). They found that several items of these scales showed DIF with regard to demographic characteristics, such as age, gender and ethnicity [24]. However, findings on ethnicity-related DIF were inconsistent and none of the reviewed studies examined DIF across ethnic groups in Europe. More recently, Hirsch et al. studied measurement invariance of the PHQ-9 in a selection of about 350 primary care patients with at least one chronic disease. They compared Russian immigrants living in Germany with native-born Russians living in Russia and native-born Germans living in Germany, and concluded that the PHQ-9 measured the level of depressive symptoms in a similar way in these groups [25]. Baas et al. compared about 300 patients of Dutch Surinamese ethnic origin with patients of Dutch origin. They found that in women the PHQ-9 was measurement invariant for ethnicity, but in men, it was only partially measurement invariant [26].
There is a need for further research on whether the PHQ-9 measures depressive symptoms in a similar way across ethnic groups in the general population. Apart from the fact that they compared only two ethnic groups, the two studies mentioned above included patients with a high risk of depression [26], or with at least one chronic disease [25]. This does not provide evidence on whether the PHQ-9 assesses depressive symptoms similarly across ethnic groups in the general population, which is the usual approach to obtain prevalence rates in the population. The current study aimed to address this need, by examining ethnicity-related measurement invariance of the PHQ-9, using data from the Dutch HELIUS study. Various ethnic groups, representative of current migrant groups in Europe (Turkish, Moroccan, South-Asian Surinamese, African Surinamese and Ghanaian origin), were included. This epidemiological study included large random samples of these five groups and a comparison group of Dutch ethnic origin (~24,000 in total and 2500-4600 per group), drawn from the general population of Amsterdam. Measurement invariance regarding ethnicity was assessed separately for men and women, since there is a consistent gender difference in the prevalence of depressive symptoms [27,28].

Sample
The aim and design of the HELIUS (HEalthy LIfe in an Urban Setting) study have been described in detail elsewhere [29,30]. In brief, the HELIUS study is a multiethnic cohort study conducted in Amsterdam, the Netherlands. Subjects were randomly, stratified by ethnicity, selected from the Amsterdam municipality register, and were sent an invitation letter (and a reminder after 2 weeks) by mail. We were able to contact 55% of those invited (55% among Dutch, 62% among Surinamese, 57% among Ghanaians, 46% among Turks, 48% among Moroccans), either by response card or after a home visit by an ethnically-matched interviewer. Of those, 50% agreed to participate (participation rate; 60% among Dutch, 51% among Surinamese, 61% among Ghanaians, 41% among Turks, 43% among Moroccans). Therefore, the overall response rate was 28% with some variations across ethnic groups. After a positive response, participants received a confirmation letter of the appointment for the physical examination, including a digital or paper version of the questionnaire (depending on the preference of the subject). Participants who were unable to complete the questionnaire themselves were offered assistance from a trained ethnically-matched interviewer. The Medical Ethics Committee of the Academic Medical Center (AMC) approved the study protocols. Written informed consent was obtained from all participants involved in the study.
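The overall response rate follows from multiplying the contact rate by the participation rate. The short sketch below reproduces that arithmetic in Python. The per-group products are an illustration derived here from the rounded percentages quoted above; they are not figures reported by the study, and the variable names are hypothetical.

```python
# Contact and participation rates as quoted in the text (rounded percentages).
contact = {"Dutch": 0.55, "Surinamese": 0.62, "Ghanaian": 0.57,
           "Turkish": 0.46, "Moroccan": 0.48}
participation = {"Dutch": 0.60, "Surinamese": 0.51, "Ghanaian": 0.61,
                 "Turkish": 0.41, "Moroccan": 0.43}

# Overall: 55% contacted, of whom 50% agreed to participate.
overall_response = 0.55 * 0.50

# Approximate per-group response rates (assumption: simple product of the
# rounded rates; the source only states that rates varied across groups).
group_response = {g: contact[g] * participation[g] for g in contact}

print(f"overall: {overall_response:.1%}")
for g, r in group_response.items():
    print(f"{g}: {r:.1%}")
```

Because the inputs are rounded, the per-group values should be read as rough indications only.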
Of the 23,942 participants who filled in the HELIUS questionnaire, we excluded 586 respondents who did not belong to the six largest ethnic groups and an additional 174 respondents who did not fill in any of the PHQ-9 items, leaving 23,182 respondents. Respondents excluded because of missing data were most often of Ghanaian origin, and more often had a low or unknown education level, compared to included respondents. All respondents who missed some but not all items were retained in the measurement invariance analyses (n = 463), but some were excluded when computing PHQ-9 sum scores (for details we refer to the Measurements section). The majority of those missed only one item (n = 396), while the mean number of completed items ranged from 7.5 to 8 across ethnic groups.

Measurements
The Dutch version of the PHQ-9 was included in the HELIUS questionnaire [21]. All nine items have four response categories: 0 "not at all", 1 "on several days", 2 "on more than half of the days" and 3 "nearly every day". Total sum scores range from 0 to 27. A participant was considered to have depressed mood when the sum score was greater than 9. A participant was considered to have significant depressed mood when one or both of items 1 and 2 were answered with at least 'more than half of the days', and at least 5 of the 9 items were answered 'more than half of the days' or 'nearly every day'. The final item (suicidal ideation) already counted if answered with 'several days' [22]. Only when calculating the sum score did we replace missing item scores or exclude individuals with more than one missing item. If one of the items was missing, we replaced it by the mean score of the other items and the sum score was calculated as usual. If more than one item was missing, the sum score was not calculated (missing). In subsequent measurement invariance analyses, all participants with missing items were included and missing items were not replaced.
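The scoring rules above can be illustrated with a minimal Python sketch. The function names are hypothetical, and returning `None` from the significant-depressed-mood rule when any item is missing is an assumption made here, since the text does not say how that rule handles missing items.

```python
from typing import List, Optional

Item = Optional[int]  # 0..3, or None when the item was not answered


def phq9_sum_score(items: List[Item]) -> Optional[float]:
    """Sum score; one missing item is replaced by the mean of the other eight."""
    observed = [v for v in items if v is not None]
    n_missing = len(items) - len(observed)
    if n_missing > 1:
        return None  # sum score not calculated
    if n_missing == 1:
        return sum(observed) + sum(observed) / len(observed)
    return float(sum(observed))


def depressed_mood(items: List[Item]) -> Optional[bool]:
    """Depressed mood: sum score greater than 9."""
    score = phq9_sum_score(items)
    return None if score is None else score > 9


def significant_depressed_mood(items: List[Item]) -> Optional[bool]:
    """Item 1 or 2 scored >= 2, and at least 5 items count as endorsed.

    Items 1-8 count when scored >= 2; item 9 already counts when scored >= 1.
    Assumption: undefined (None) if any item is missing.
    """
    if any(v is None for v in items):
        return None
    core = items[0] >= 2 or items[1] >= 2
    endorsed = sum(v >= 2 for v in items[:8]) + (items[8] >= 1)
    return core and endorsed >= 5


example = [2, 2, 2, 2, 0, 0, 0, 0, 1]
print(phq9_sum_score(example), depressed_mood(example),
      significant_depressed_mood(example))
```

The example respondent shows that the two definitions can disagree: the sum score of 9 does not exceed the cut-off, while the symptom-count rule is met.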
Item 8 of the PHQ-9 originally contained two questions combined in a single item ("Moving or speaking so slowly that other people could have noticed? Or the opposite - being so fidgety or restless that you have been moving around a lot more than usual"), which appeared very difficult to answer when we pre-tested the HELIUS questionnaire. Therefore, in the HELIUS questionnaire, item 8 is divided into 2 items. For all analyses, these items were first combined into a single item, to make this item resemble the one from the original instrument. In all 9 items, we collapsed adjacent response categories in case they contained <5% of the sample, to ensure that endorsement rates were high enough for measurement invariance analyses. This resulted in one dichotomous item (item 9), four items with three categories (items 2, 6, 7 and 8) and four items with four categories (items 1, 3, 4 and 5) (Table 1).
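Collapsing sparse response categories can be sketched as follows. The text only states that adjacent categories holding less than 5% of the sample were collapsed. The merge direction used here (a sparse category joins its lower neighbour, or its upper neighbour if it is the lowest) and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def collapse_sparse_categories(responses: List[int],
                               min_share: float = 0.05) -> Dict[int, int]:
    """Return a recoding {original category: collapsed category}."""
    n = len(responses)
    counts = Counter(responses)
    # Each group holds the original categories that have been merged together.
    groups = [[c] for c in sorted(counts)]

    def share(group: List[int]) -> float:
        return sum(counts[c] for c in group) / n

    changed = True
    while changed and len(groups) > 1:
        changed = False
        for i, group in enumerate(groups):
            if share(group) < min_share:
                # Assumed rule: merge downwards, except for the lowest group.
                j = i - 1 if i > 0 else 1
                lo, hi = sorted((i, j))
                groups[lo:hi + 1] = [groups[lo] + groups[hi]]
                changed = True
                break
    return {c: new for new, group in enumerate(groups) for c in group}


# Hypothetical item: the top category holds only 3% of responses.
responses = [0] * 80 + [1] * 11 + [2] * 6 + [3] * 3
mapping = collapse_sparse_categories(responses)
recoded = [mapping[r] for r in responses]
print(mapping)
```

In this made-up example the two highest categories are merged, turning a four-category item into a three-category item.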
Ethnicity was defined according to the country of birth of the participants as well as that of their parents [31]. Specifically, a participant was considered of non-Dutch ethnicity if either of the following criteria was fulfilled: (1) born outside the Netherlands and at least one parent born outside the Netherlands (i.e., first generation); or (2) born in the Netherlands, but both parents born outside the Netherlands (i.e., second generation). In addition, as the Surinamese population consists of different ethnic groups which cannot be distinguished from each other on the basis of country of birth, self-reported ethnicity was used to determine Surinamese subgroups (either African or South-Asian origin). In order to be sure that the respondents reported their geographic origin, rather than the group they feel they belong to, the question on self-identification was phrased in objective terms [31]. Overall, there were three different modes of questionnaire completion: internet (43%), paper version (31%), or paper version with interviewer assistance (26%). Participants in the Dutch and in both Surinamese groups completed the questions in Dutch. Of the Ghanaian and Turkish subsamples, 78% and 32%, respectively, completed the questions in English or Turkish. Of the Moroccan subsample, about 33% were assisted by an interviewer who filled in the questionnaire in Dutch, but who often spoke Moroccan Arabic or Berber with the respondent. Unfortunately, no detailed information was available on the language of the interview of Moroccan respondents. Sensitivity analyses were performed to examine if the PHQ-9 was measurement invariant regarding language (English vs. Dutch and Turkish vs. Dutch) and interview mode (internet, paper, or interview).
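The country-of-birth criteria can be written as a small decision function. This is a sketch of the two stated criteria only: the function name and labels are hypothetical, treating everyone who meets neither criterion as of Dutch origin is an assumption, and the assignment to a specific country of origin or Surinamese subgroup is not covered.

```python
def migration_generation(born_in_nl: bool,
                         mother_born_in_nl: bool,
                         father_born_in_nl: bool) -> str:
    """Classify a participant by own and parental country of birth."""
    parents_born_abroad = (not mother_born_in_nl) + (not father_born_in_nl)
    # Criterion 1: born abroad with at least one parent born abroad.
    if not born_in_nl and parents_born_abroad >= 1:
        return "first generation"
    # Criterion 2: born in the Netherlands with both parents born abroad.
    if born_in_nl and parents_born_abroad == 2:
        return "second generation"
    # Assumption: neither criterion met means Dutch origin.
    return "Dutch origin"


print(migration_generation(False, False, True))   # first generation
print(migration_generation(True, False, False))   # second generation
print(migration_generation(True, False, True))    # Dutch origin
```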

Statistical analysis
Multiple group confirmatory factor analysis (MGCFA)
Multiple group confirmatory factor analysis (MGCFA) was applied to investigate measurement invariance, because it enables the assessment of measurement invariance at different hierarchic levels, and in multiple groups at the same time [20]. In all analyses, ethnic minority groups were compared with the Dutch ethnic origin reference group.
MGCFA is a special case of confirmatory factor analysis (CFA), which requires a prespecified measurement model to be tested. Several studies assume or have shown that the PHQ-9 is unidimensional, indicating that all items measure the symptoms of a single underlying construct (depression) [21,26,32,33]. However, others could not replicate its unidimensional structure [34][35][36][37], and provided evidence for a somatic and a non-somatic component, for instance [37]. To improve model fit, previous researchers have added residual covariances [32,34] or excluded one or more items [34,35]. Since there is inconsistency in the best fitting factor model, we first verified the unidimensionality of the PHQ-9, by comparing the fit of three models: 1) a one-factor model, 2) a two-factor model with items 3, 4, 5, 7 and 8 loading on factor 1 (somatic) and items 1, 2, 6 and 9 loading on factor 2 (non-somatic), and 3) a two-factor model based on exploratory factor analysis (EFA). The best of these three models was used as the baseline model for subsequent measurement invariance tests.

Testing of measurement invariance
Four hierarchic levels of measurement invariance were tested [20]. Each level implies that more constraints are added to the model (i.e. parameters are equally estimated across groups), with the fit of the model with more constraints being compared to the fit of the less constrained model (i.e. for the non-reference group these parameters were set free). If the more constrained model does not fit significantly worse in comparison with the model that has fewer constraints, this indicates measurement invariance at the tested level.
At the least stringent level, configural invariance indicates that the clustering of items and the factors that they represent is similar across groups. This was investigated by evaluating model fit of the baseline model separately for all ethnic groups. Metric invariance entails the similarity of factor loadings, and was tested by comparing a model that constrained all factor loadings to be equal across groups, with a configural model where factor loadings were freely estimated across groups. If metric invariance holds, the items load on the latent construct to the same extent for all groups. Strong (or scalar) invariance additionally entails the equality of item thresholds. If strong invariance holds this is evidence that there is no additive response bias, indicating that item responses are not systematically higher or lower in one group compared with the other group(s). Finally, strict invariance is the most stringent level and reflects that the residual variances, or error terms, of each item are similar across groups.
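The nesting of the four levels can be made explicit with a small lookup. It only lists which sets of parameters are held equal across groups at each level, as described above. The names are hypothetical and this is not the Mplus specification used in the study.

```python
# Levels in order of increasing stringency; each adds one set of constraints.
LEVELS = ["configural", "metric", "strong", "strict"]

ADDED_CONSTRAINT = {
    "configural": "factor structure (item clustering)",
    "metric": "factor loadings",
    "strong": "item thresholds",
    "strict": "residual variances",
}


def constraints_at(level: str) -> list:
    """All equality constraints in force at a given invariance level."""
    idx = LEVELS.index(level)
    return [ADDED_CONSTRAINT[name] for name in LEVELS[:idx + 1]]


for level in LEVELS:
    print(level, "->", constraints_at(level))
```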
For all MGCFA analyses we applied Weighted Least Squares Means and Variance adjusted (WLSMV) estimation with theta parameterization in Mplus version 7.4 for statistical analysis with latent variables [38], in which the items were treated as ordinal variables [39]. For each successive step of MI testing, we applied the parameterization described in the Mplus manual [38].

Assessment of goodness-of-fit
Goodness-of-fit statistics were estimated for each model and standard criteria were used to evaluate them. The χ² statistic indicates the discrepancy between the covariance matrix of the observed data and the one that is predicted by the factor model. This statistic is sensitive to sample size and often rejects a good fitting model [40,41]. Therefore, and because it is recommended to use several indices simultaneously [42], we additionally evaluated RMSEA (Root Mean Square Error of Approximation) and CFI (Comparative Fit Index) values which are less sensitive to sample size [43]. A better model fit is indicated by a low RMSEA value and a high CFI value. RMSEA values lower than 0.08 or 0.05 indicate acceptable and good model fit, respectively. CFI values higher than 0.95 and 0.97 indicate acceptable and good model fit, respectively [44].
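These cut-offs translate directly into a pair of classification helpers. The sketch below applies them to the strict invariance fit values reported in the abstract. The function names and the "poor" label for values outside the stated ranges are assumptions added for illustration.

```python
def classify_rmsea(rmsea: float) -> str:
    """RMSEA below 0.05 is good, below 0.08 acceptable."""
    if rmsea < 0.05:
        return "good"
    if rmsea < 0.08:
        return "acceptable"
    return "poor"  # label assumed; the text defines only good and acceptable


def classify_cfi(cfi: float) -> str:
    """CFI above 0.97 is good, above 0.95 acceptable."""
    if cfi > 0.97:
        return "good"
    if cfi > 0.95:
        return "acceptable"
    return "poor"  # label assumed


# Strict invariance models as reported in the abstract.
reported = {"men": (0.050, 0.985), "women": (0.058, 0.979)}
for group, (rmsea, cfi) in reported.items():
    print(group, classify_rmsea(rmsea), classify_cfi(cfi))
```

Applied to the reported values, RMSEA falls in the acceptable range and CFI in the good range for both men and women.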
Differences between successive measurement invariance models were tested using the DIFFtest procedure in Mplus. Similarly to χ², the DIFFtest is influenced to a large extent by sample size, and thus often rejects good fitting models [41]. Therefore, we also evaluated ΔCFI and ΔRMSEA between the more and less constrained models. Only a few simulation studies have reported cut-offs that indicate significant measurement noninvariance, and none of those examined more than two groups [40,41,45]. We decided to apply the most conservative cut-offs. Declines in CFI larger than 0.01 and increases in RMSEA larger than 0.015 indicated a significant worsening of fit [40,41].
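The ΔCFI and ΔRMSEA rule can be expressed as a comparison between each pair of successive models. In this sketch, flagging a step when either criterion is exceeded is an interpretation made here. Apart from the final (strict) model, which uses the values reported for men, the fit values are invented placeholders, and the function name is hypothetical.

```python
def fit_worsened(less_constrained: dict, more_constrained: dict,
                 max_cfi_drop: float = 0.01,
                 max_rmsea_rise: float = 0.015) -> bool:
    """True if the more constrained model fits relevantly worse."""
    cfi_drop = less_constrained["cfi"] - more_constrained["cfi"]
    rmsea_rise = more_constrained["rmsea"] - less_constrained["rmsea"]
    # Assumption: exceeding either cut-off counts as worsening.
    return cfi_drop > max_cfi_drop or rmsea_rise > max_rmsea_rise


# Placeholder fit values for illustration only (not taken from Table 5),
# except the strict model, which uses the values reported for men.
models = [
    ("configural", {"cfi": 0.990, "rmsea": 0.052}),
    ("metric", {"cfi": 0.989, "rmsea": 0.049}),
    ("strong", {"cfi": 0.987, "rmsea": 0.048}),
    ("strict", {"cfi": 0.985, "rmsea": 0.050}),
]

steps = {}
for (_, prev_fit), (name, fit) in zip(models, models[1:]):
    steps[name] = fit_worsened(prev_fit, fit)
print(steps)
```

A step flagged `True` would indicate noninvariance at that level; in the placeholder sequence no step is flagged.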

Impact of DIF on demographic health inequalities
With MGCFA we tested whether differences in overall factor structure were present. This method may be less powerful to detect DIF of individual items, because all item parameters are constrained across groups at the same time. We therefore performed additional tests which were targeted at individual items to explore more subtle levels of DIF which may remain undetected by the MGCFA approach. In case significant DIF at the item level was found, we examined the impact that adjustment for this DIF had on the magnitude of inequalities in depressive symptoms.
First, we conducted regression analysis to detect significant DIF at the item level. To that end, we first saved individual factor scores from each strict invariance model. With logistic regression, we predicted each dichotomized item score with the corresponding factor score and saved the residuals. The residuals represent the variation in item scores not explained by the underlying factor. Subsequently, we performed linear regression with the residuals as the dependent variable, and ethnicity and ethnicity*factor score as independent variables. This was done to conduct one overall test for uniform DIF (analogous to strong invariance) and nonuniform DIF (analogous to metric MI), respectively [46]. The explained variance (R²) of this model represents the predictive value of ethnicity for the item score, over and above the predictive value of the underlying factor, and was interpreted as indicative of DIF. Items with an R² of 2% or higher and significant regression coefficients for the predictors ethnicity or ethnicity*factor (p-value below 0.05) were selected as items with DIF [46].
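The two-stage regression procedure can be sketched on simulated data. Everything here is an illustrative assumption rather than the study's implementation: the data are simulated for two groups only, simulated scores stand in for the factor scores, the residuals are raw (observed minus predicted probability), and only the R² part of the selection rule is computed, without the significance tests.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of P(y = 1) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def ols_r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X (intercept in X)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for row in range(col + 1, k):
            f = A[row][col] / A[col][col]
            for j in range(col, k):
                A[row][j] -= f * A[col][j]
            c[row] -= f * c[col]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def dif_r_squared(factor, group, item):
    """Stage 1: item ~ factor (logistic). Stage 2: residual ~ group + group*factor."""
    b0, b1 = fit_logistic(factor, item)
    resid = [y - sigmoid(b0 + b1 * f) for f, y in zip(factor, item)]
    X = [[1.0, float(g), g * f] for f, g in zip(factor, group)]
    return ols_r_squared(X, resid)


def simulate(n_per_group, uniform_dif, seed):
    """Two groups with equal factor distributions; group 1 gets a logit shift."""
    rng = random.Random(seed)
    factor, group, item = [], [], []
    for g in (0, 1):
        for _ in range(n_per_group):
            f = rng.gauss(0.0, 1.0)
            p = sigmoid(-1.0 + 1.5 * f + uniform_dif * g)
            factor.append(f)
            group.append(g)
            item.append(1 if rng.random() < p else 0)
    return factor, group, item


r2_no_dif = dif_r_squared(*simulate(1000, 0.0, seed=1))
r2_with_dif = dif_r_squared(*simulate(1000, 2.0, seed=2))
print(f"R2 without DIF: {r2_no_dif:.4f}, with uniform DIF: {r2_with_dif:.4f}")
```

When no group effect is simulated, the R² stays close to zero; a large simulated shift in the item for one group pushes it above the 2% criterion. With more than two ethnic groups, the design matrix would need one dummy and one interaction column per non-reference group.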
Second, if DIF in any of the items was found, we returned to the MGCFA analysis and estimated the impact of adjusting for this DIF. We aimed to compare ethnic inequalities in factor scores, from models that did and did not adjust for DIF. Factor scores from the previously described strict invariance models were regarded as unadjusted for DIF. Adjustment for DIF was done by adapting the strict invariance model so that for items with DIF all threshold constraints across groups were set free. Using means and variances of unadjusted and adjusted factor scores, we estimated two sets of standardized mean differences (Cohen's d) across ethnic groups. We evaluated whether 95% confidence intervals around d's unadjusted for DIF and adjusted for DIF showed overlap, which would indicate that the statistically significant DIF that was observed had low impact on the magnitude of demographic health inequalities. Cohen's d was calculated using the pooled sd as the denominator; conventional thresholds were used to interpret effect sizes as small (d = 0.2), medium (d = 0.5) and large (d = 0.8) [47].
Table 2 shows the demographic characteristics and distribution of the PHQ-9 in each ethnic group, and by gender. In both genders, PHQ-9 sum scores were highest among respondents with Turkish ethnic origin and lowest among the group with Ghanaian ethnic origin. A similar pattern of ethnic differences emerged for the prevalence of (significant) depressed mood.
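The Cohen's d comparison described for the DIF-impact step of the statistical analysis can be sketched as follows. The pooled standard deviation follows the description in the text. The large-sample standard error used for the 95% confidence interval is a common approximation chosen here as an assumption, since the text does not state how the intervals were computed. All input numbers are invented.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d with pooled SD and an approximate 95% confidence interval."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    # Assumed large-sample standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - 1.96 * se, d + 1.96 * se)


def intervals_overlap(ci_a, ci_b):
    """True if two confidence intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Invented factor-score summaries for one minority group vs. the reference.
d_unadjusted, ci_unadjusted = cohens_d(0.50, 1.0, 100, 0.0, 1.0, 100)
d_adjusted, ci_adjusted = cohens_d(0.45, 1.0, 100, 0.0, 1.0, 100)
print(d_unadjusted, ci_unadjusted)
print(d_adjusted, ci_adjusted, intervals_overlap(ci_unadjusted, ci_adjusted))
```

Overlapping intervals for the adjusted and unadjusted estimates would be read, as in the text, as a sign that DIF has little impact on the size of the group difference.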

Measurement invariance analyses
Three different factor models were compared, to obtain an adequate baseline model for further analysis (Table 3): a one-factor model, a two-factor model based on the literature, and a two-factor model based on EFA. The EFA two-factor model was slightly different, and had better fit, compared with the two-factor model that was examined in previous studies. Although the two-factor models generally showed better fit as compared to the one-factor model, we decided to continue with the one-factor model for two reasons. First, in both two-factor models the two factors showed a high correlation, indicating that they reflect two largely overlapping constructs. Second, the one-factor model had good model fit according to CFI, and also adequate model fit according to RMSEA after residual covariances (between items 1 and 2, items 3 and 4 and items 7 and 8) were added to the model. The fit of this one-factor model is shown for each ethnic group and gender in Table 4. Model fit was better in men as compared to women, but in all groups RMSEA and CFI values were indicative of acceptable or good model fit.
Results from the MGCFA are shown in Table 5. Adding constraints for equal factor loadings, item thresholds and residual variances did not lead to significantly reduced model fit, compared to the least constrained (configural) model. The final strict measurement invariance models for both men and women showed adequate model fit (Men: RMSEA = 0.050; CFI = 0.985; Women: RMSEA = 0.058; CFI = 0.979), while ΔRMSEA and ΔCFI for increasingly stringent tests of measurement invariance never exceeded the critical values of 0.015 and 0.01, respectively. Since model fit, according to RMSEA, differed more between ethnic groups among women than among men (Table 4), we examined whether this was due to DIF with respect to gender in some but not in other ethnic groups. However, the results showed that this was not the case: items of the PHQ-9 were measurement invariant for gender in all ethnic groups (Table 6).
The additional regression analyses, targeted at individual items, revealed no items with DIF related to ethnicity (Tables 7, 8 and 9). Furthermore, sensitivity analyses confirmed that the PHQ-9 was measurement invariant with regard to language and interview mode (Tables 10 and 11).

Discussion
Measurement invariance of the PHQ-9 regarding ethnicity was examined in a population-based sample including over 23,000 participants. Our results indicated that the PHQ-9 was measurement invariant across groups with Dutch, South-Asian Surinamese, African Surinamese, Ghanaian, Turkish and Moroccan ethnic origin. As such, the observed ethnic differences in PHQ-9 scores may be attributed to true differences in depressive symptoms, and not to factors related to the measurement of these symptoms.
Our results should be interpreted in view of some limitations. Firstly, non-response to this study may in particular be a concern in those with the poorest mental health, the lowest proficiency in the Dutch language, or in the least acculturated individuals. These factors may influence how the PHQ-9 is responded to, and as such non-response may influence the generalizability of our results. Secondly, this study investigated ethnicity-related DIF for the PHQ-9 and the results can therefore not be generalized to other demographic characteristics or to other depression instruments. For example, Schrier et al. found DIF in five items of the CIDI when comparing respondents with Turkish and Dutch origin in the Netherlands [48]. In addition, in their review Teresi et al. (2008) concluded that several items of depression scales showed DIF with regard to demographic or health characteristics. None of the reviewed studies examined DIF across ethnic groups in Europe, however. Thirdly, our selection of statistical approaches and criteria for significance and relevance may have influenced the conclusions that were drawn. We applied MGCFA, which has been shown to perform well to detect different levels of DIF [49], using model fit parameters that were recommended in previous studies [40,41,44,45]. However, little is known about which criteria should be used when sample sizes are large, or when more than two groups are compared at the same time. We recommend that more research is done in this field, to guide researchers regarding which methods and criteria for significant and relevant DIF should or should not be applied. In the current study the results of both MGCFA analysis and logistic regression analysis pointed in the same direction, which strengthens our conclusion about the absence of ethnicity-related DIF for items of the PHQ-9.
Table 4. Model fit of the baseline one-factor model in each subgroup
Empirical evidence of measurement invariance is essential for making valid health comparisons across demographic groups. Our results imply that the ethnic inequalities in depressive symptoms, that were observed in our study as well as in other studies [5,6], reflect true differences, and are not likely the result of measurement bias. Thus, the PHQ-9 can be used to make comparisons regarding the prevalence of (significant) depressed mood in groups with different ethnic background in the Netherlands. The Dutch had the lowest prevalence of 3% for significant depressed mood, and the Turks had the highest rate (11% in men, 15% in women), with the rates for the other ethnic minority groups lying in between. Interestingly, the GBD 2010 data indicate that these ethnic minority groups (except Ghanaians) have lower MDD prevalence in their countries of origin [1], which may suggest that adverse circumstances in the host societies (e.g., ethnic discrimination, acculturative stress) might be at play here.
The pattern of ethnic inequalities in (significant) depressed mood that we observed is somewhat similar to what was found by de Wit et al. who used the CIDI (Composite International Diagnostic Interview) to assess depression. They reported the 1-month prevalence of depressive disorders (MDD or dysthymia) in respondents with Surinamese (1%), Dutch (4%), Moroccan (7%) and Turkish ethnic origin (15%) [5]. The pattern of inequalities (increased prevalence in ethnic minorities, with the lowest rates in Ghanaians and African Surinamese, and higher rates in Turks, Moroccans and South-Asian Surinamese) suggests that migration-related factors may be ethnic-specific, and that ethnic minority groups should not be combined without taking the differences between these groups into account [50]. Future studies could be designed to investigate to what extent genetic vs. cultural variation contributes to these ethnic differences in the prevalence of depression.
To our knowledge, our study is the first to assess ethnicity-related measurement invariance of the PHQ-9 in a population-based sample. Previous studies on ethnicity-related DIF included people with at least one chronic disease [25], with HIV [32], or with a high risk of depression [26,33,51]. In two studies this was done by administering the full PHQ-9 only if respondents endorsed at least one of the key items, for example anhedonia and depressed mood [33,51]. In particular the inclusion of high-risk patients provides less information on the ethnic diversity among respondents who do not have a high level of depression but nevertheless might respond differently to the questionnaire. This influences the rates of depression in the general population that are found. Moreover, this study assessed measurement invariance in a variety of ethnic groups that are representative of migrant groups in Europe. In a previous study, Baas et al. (2011) compared two ethnic groups in the Netherlands, both including individuals with a high risk of depression. They found that the item on psychomotor problems (item 8) had a higher factor loading and threshold among Surinamese men, compared to Dutch men. This item originally contains two parts (moving or speaking slowly, or being fidgety and restless), which appeared very difficult to answer when we pre-tested the questionnaire. In the HELIUS questionnaire item 8 was therefore divided into 2 items (see Table 1), and it might be that this adaptation has led to the absence of reporting differences between ethnic groups, whereas they were present in the study by Baas et al.
A strong point of this study is that we were able to additionally study possible DIF due to language and interview mode, given the heterogeneity in our sample regarding these factors. We compared Turks who completed the PHQ-9 in Turkish vs. Dutch, and Ghanaians who completed the PHQ-9 in English vs. Dutch. In addition, we compared groups who completed the questionnaire through the internet, on paper, or with the help of an interviewer. We found that the PHQ-9 was measurement invariant regarding language and interview mode. This result is reassuring and confirms the applicability of the PHQ-9 in different samples and settings.

Conclusion
With the growing ethnic diversity in European populations there is a need for evidence on the reliability of instruments to study the mental health of ethnic minority groups. The PHQ-9 is often used to measure depressive symptoms in clinical practice or for research purposes. This study provides evidence for measurement invariance of the PHQ-9 in an ethnically diverse sample in the Netherlands. This implies that items of the PHQ-9 function similarly in people with South-Asian Surinamese, African Surinamese, Ghanaian, Turkish and Moroccan ethnic background, as compared to those with Dutch ethnic origin. Moreover, we showed that language (Turkish vs. Dutch in Turks, and English vs. Dutch in Ghanaians) and interview mode (interview, paper, or internet) did not result in measurement bias, indicating that the PHQ-9 can be used in a variety of settings to compare the level of depressive symptoms across ethnic groups. In conclusion, differences in depression scores and rates of depression across ethnic groups are unlikely to be due to assessment bias suggesting that the contribution of other factors such as migration history and migration status should be explored in future studies.