Psychometric properties of EURO-D, a geriatric depression scale: a cross-cultural validation study

Background: Many of the assessment tools used to study depression among older people are adaptations of instruments developed in other cultural settings. There is a need to validate those instruments in low and middle income countries (LMIC). Methods: A one-phase cross-sectional survey of people aged 65 years and over from LMIC. The psychometric properties of the EURO-D were examined. The scale was calibrated against ICD-10 clinical diagnosis, and the optimal cutpoint was determined. Concurrent validity was assessed by measuring correlations with WHODAS 2.0. Results: 17,852 interviews were completed in 13 sites from nine countries. EURO-D constituted a hierarchical scale in most sites. The most commonly endorsed symptoms were depression in Latin American sites, sleep disturbance and tearfulness in China, irritability and fatigue in India, and loss of enjoyment in Nigeria. A two factor structure (affective suffering and motivation) was demonstrated. Measurement invariance was demonstrated among Latin American and Indian sites, being less evident in China and Nigeria. At the 4/5 cutpoint, sensitivity for ICD-10 depressive episode was 86% or higher in all sites and specificity exceeded 84% in all Latin American and Chinese sites. Concurrent validity was supported, at least for Latin American and Indian sites. Conclusions: There is evidence for the cross-cultural validity of the EURO-D scale in Latin American and Indian settings and for its potential applicability in comparative epidemiological studies.


Background
Depression is a common and burdensome psychiatric disorder in older people [1][2][3]. In Low and Middle Income Countries (LMIC) it is difficult to assess its prevalence because of the lack of culturally adapted and validated assessments.
Clinical diagnostic criteria for depression including DSM-5 [4] and ICD-10 [5] are applied to adults of all ages. These may, however, miss clinically significant episodes among older people who do not meet these specific criteria. Some investigators have suggested a syndrome of depression without sadness, thought to be more common in older adults [6,7], and a depletion syndrome manifested by withdrawal, apathy, and lack of vigour [8,9].
Depression symptom scales have been widely used in population surveys to quantify depression burden as a continuum, or to screen for depression of clinical significance in the first phase of a two phase survey design [10][11][12][13][14][15]. However, only the Geriatric Depression Scale [10,11] and the EURO-D [12] were developed specifically for use in older people, and evidence for their validity comes mainly from high income countries [16][17][18][19][20][21] [12,22].
We set out to assess the construct validity of the EURO-D in large population-based survey samples of older people living in Latin America, India, China and Nigeria, aiming to assess whether this scale measures the same construct in low and middle income countries with diverse cultures and languages. Measurement invariance would be supported by similar measurement properties, and a common 'nomological net' of proximate identifiers of the depression symptom score.

Setting, design and procedures
Comprehensive, one-phase, catchment area population-based surveys were conducted according to the same standardised protocol by the 10/66 Dementia Research Group. The full 10/66 study protocol has been published elsewhere [23]. Surveys were carried out in thirteen sites from nine countries (Cuba, Dominican Republic, Puerto Rico, Peru, Mexico, Venezuela, China, India and Nigeria). Peru, Mexico, China and India included both urban and rural catchment areas; the Nigerian catchment area was predominantly rural, while in the other countries participants were recruited only from urban catchment areas. All assessments were carefully translated and adapted into the relevant local languages. All the EURO-D items are derived from the GMS, which is part of the 10/66 assessment. All aspects of assessment methodology, including translation and adaptation, have been reported in detail in a previous publication [24]. In brief, the GMS was translated and back-translated into Spanish, Mandarin, Hindi, Tamil and Ibo. A meta-analysis of 26 publications reporting exploratory factor analyses of the GDS found 'strong evidence of language differences in the factor structure of the GDS', although language is strongly confounded with other aspects of culture [25]. Acceptability and conceptual equivalence were assessed and reviewed by local informants. Interviews were carried out in participants' own homes and lasted on average two to three hours. Interviewers were fully trained on the 10/66 protocol by the local principal investigator (PI) and the local study coordinator (SC).
The study protocol and the consent procedures, including the witnessed consent procedure, were approved by the King's College London research ethics committee and by local ethics committees in the participating countries: 1-Medical Ethics Committee of Peking University the Sixth Hospital (Institute of Mental Health, China); 2-the Memory, Depression Institute and Risk Diseases (IMEDER) Ethics Committee (Peru); 3-Finlay Albarran Medical Faculty of Havana Medical University Ethical Committee (Cuba); 4-Hospital Universitario de Caracas Ethics Committee (Venezuela); 5-Ethics Committee of Nnamdi Azikiwe University Teaching Hospital (Nigeria); 6-Consejo Nacional de Bioética y Salud (CONABIOS, Dominican Republic); 7-Christian Medical College (Vellore) Research Ethics Committee (India); 8-Instituto Nacional de Neurología y Neurocirugía Ethics Committee (Mexico); 9-Nnamdi Azikiwe University Teaching Hospital Nnewi Anambra State Ethics Committee (Nigeria). Participants were recruited on the basis of informed signed or witnessed consent. The ethics committees approved the witnessed consent procedure. The use of the 10/66 Dementia Research Group dataset was approved by the 10/66 principal investigators.

Depression assessment
Depression was assessed using the Geriatric Mental State (GMS) [26]. Symptoms are ascertained with respect to the last month. Internationally, the GMS is the most widely used comprehensive clinical mental health assessment for older people. A computerised diagnostic algorithm, the AGECAT (Automated Geriatric Examination for Computer Assisted Taxonomy), groups symptoms to form patterns recognised by a psychiatrist as illness, and identifies them as syndrome cases [27]. Items are later added together to generate affective disorder diagnoses according to ICD-10 and DSM-IV criteria [26,28]. The reliability and validity of the GMS have been demonstrated for in-patient, out-patient and community samples, and in various languages and cultures including Spanish and Chinese. The validity of the GMS/AGECAT algorithm has been investigated in several studies [29,30].
The EURO-D symptom scale was originally developed to compare symptoms of late-life depression across 11 European countries in the EURODEP Concerted Action Programme [12]. The 12 EURO-D items (depressed mood, pessimism, wishing death, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment and tearfulness) were all taken from the Geriatric Mental State [31]; each item is scored 0 (symptom not present) or 1 (symptom present), generating a simple ordinal scale with a maximum score of 12. In the EURODEP study, internal consistency of the EURO-D was moderately high, with Cronbach's alpha ranging from 0.61 to 0.75. However, Principal Components Analysis generated two factors common to nearly every centre: an affective suffering factor (depression, tearfulness, pessimism and wishing death) and a motivation factor (interest, concentration and enjoyment) [12]. The optimum cut-point for the identification of DSM-IV major depression and GMS/AGECAT depression was ≥4. Evidence for internal consistency and construct validity of the EURO-D scale was strengthened following its use in the 10 nation European Survey of Health, Ageing, and Retirement in Europe (SHARE) [32]. It was shown to be a hierarchical scale with similar rank ordering of item calibration values across countries. The previously observed two factor structure fitted well in all countries, with similar factor loadings.
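The scoring rule described above can be illustrated with a minimal sketch. The item keys below are hypothetical labels chosen for this example (they are not variable names from the 10/66 or EURODEP datasets), and the default threshold simply restates the ≥4 cut-point reported for EURODEP.

```python
# Hypothetical item keys for the 12 EURO-D symptoms listed in the text.
EURO_D_ITEMS = (
    "depressed_mood", "pessimism", "wishing_death", "guilt",
    "sleep", "interest", "irritability", "appetite",
    "fatigue", "concentration", "enjoyment", "tearfulness",
)


def euro_d_score(responses):
    """Sum the 12 binary items (0 = symptom absent, 1 = symptom present).

    `responses` maps each item key to 0 or 1. Missing or non-binary
    values raise an error rather than being silently imputed.
    """
    total = 0
    for item in EURO_D_ITEMS:
        if item not in responses:
            raise ValueError(f"missing item: {item}")
        value = responses[item]
        if value not in (0, 1):
            raise ValueError(f"item {item} must be 0 or 1, got {value!r}")
        total += value
    return total


def meets_cut_point(score, threshold=4):
    """True when the score reaches the threshold (EURODEP used >= 4)."""
    return score >= threshold
```

Refusing to score incomplete responses is a design choice made for the sketch; the source does not describe how missing items were handled in scoring.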
Clinical diagnoses of depressive episode (mild, moderate or severe) were classified according to the International Classification of Disease-10 (ICD-10) as a mood disorder with symptoms of sadness, negative self-regard, loss of interest in life, and disruptions of sleep, appetite, thinking, and energy level for more than two weeks that interfere with daily living [5]. ICD-10 diagnoses were derived from the GMS interview, through the application of a computerised algorithm.

Concurrent validators
We used three indicators to assess the concurrent validity of the EURO-D: global self-rated health, disability measured with the WHODAS 2.0, and self-reported happiness.

Analyses
We used the 10/66 data archive (release 3.0) for all analyses. EURO-D total scale score distributions were summarised according to their mean, median and interquartile range, after inspecting histograms and box plots. The internal consistency of the scale was assessed in each site using Cronbach's alpha. For each site, the proportion of participants endorsing each of the 12 items ('item difficulties') was reported and ranked from 1 (the most frequently endorsed item) to 12 (the least frequently endorsed item) by site.
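As an illustration of the two descriptive statistics named in this paragraph, the following sketch computes Cronbach's alpha and the rank of item endorsement frequencies from a list of binary response rows. It uses the standard textbook formula for alpha; the function names and data layout are assumptions of this example, not the software used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item-score rows.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def endorsement_ranks(rows):
    """Return (proportions, ranks); rank 1 is the most endorsed item.

    Ties are broken by item position, which is an arbitrary choice.
    """
    n = len(rows)
    k = len(rows[0])
    proportions = [sum(row[j] for row in rows) / n for j in range(k)]
    order = sorted(range(k), key=lambda j: -proportions[j])
    ranks = [0] * k
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return proportions, ranks
```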
Mokken analysis was used to test the extent to which the EURO-D items conformed to hierarchical scaling principles in each site. Mokken scaling involves the application of a non-parametric item response model [35] to measure the hierarchical properties of items in a scale, assessing if the items can be ordered by degree of difficulty, so that any individual who endorses a particular item will also endorse all the items ranked lower in difficulty. Three basic assumptions are required for a monotone homogeneity model (MHM): 1) unidimensionality (one latent variable summarises the variation in the item scores in the questionnaire), 2) local independence (after conditioning on the position on the latent trait, the item scores are statistically independent), and 3) monotonicity (for all items the probability of a positive response increases monotonically with increasing values of the latent trait). These assumptions being met, an individual's position on the latent trait can conveniently be estimated as the rank of the highest item in the hierarchy that they endorse, or their total number of positive responses [36]. Double monotonicity models (DMM) require in addition that for any value of the latent trait, the probability of a positive response decreases with the difficulty of the item. This means that the order of item difficulties remains invariant over all values of the latent trait and thus, that the item response function curves do not intersect [37,38]. To assess single monotonicity, we estimated Loevinger coefficients for each item (Hi) and for the whole scale (H), where values between 0.3 and 0.4 suggest weak scalability, values between 0.4 and 0.5 moderate, and values above 0.5 strong scalability. 
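For binary items, the scale-level Loevinger coefficient H can be written as the sum of observed inter-item covariances divided by the sum of the maximum covariances attainable given the item endorsement proportions. The sketch below implements that definition directly; it is not the Stata routine used in the study, and the treatment of values falling exactly on 0.3, 0.4 or 0.5 is an assumption of this example.

```python
from itertools import combinations


def loevinger_h(rows):
    """Scale-level Loevinger's H for binary (0/1) item responses."""
    n = len(rows)
    k = len(rows[0])
    p = [sum(row[j] for row in rows) / n for j in range(k)]
    observed = 0.0
    maximum = 0.0
    for i, j in combinations(range(k), 2):
        p_both = sum(row[i] * row[j] for row in rows) / n
        observed += p_both - p[i] * p[j]
        # Largest covariance possible with these marginals (no Guttman errors).
        maximum += min(p[i], p[j]) * (1 - max(p[i], p[j]))
    if maximum == 0:
        raise ValueError("H is undefined when items have no variance")
    return observed / maximum


def scalability_label(h):
    """Map H to the verbal labels used in the text (boundaries assumed)."""
    if h < 0.3:
        return "below threshold"
    if h < 0.4:
        return "weak"
    if h < 0.5:
        return "moderate"
    return "strong"
```

A perfect Guttman pattern (everyone who endorses a harder item also endorses every easier one) gives H = 1, while statistically independent items give H = 0.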
We also tested for violations of monotonicity (using the Stata loevH monotonicity command) and non-intersection (using the Stata loevH nipmatrix command) between pairs of items (minimum violation 0.03, alpha = 0.05), using overall criteria values as an indication of the likelihood of assumption violation: ≤40 'satisfactory', 40 to 79 'questionable violation', 80 and over 'strongly suggesting an assumption violation' [39]. Measurement invariance with respect to hierarchical scale properties was assessed according to the Spearman (non-parametric) correlation between item difficulty ranks for all pairs of sites.
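The pairwise comparison of item difficulty ranks can be sketched as follows. Because the inputs are already ranks, Spearman's coefficient is obtained as the Pearson correlation of the two rank vectors. The site names used in any call are placeholders, and the helper names are assumptions of this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_between_sites(ranks_by_site):
    """Spearman correlation of item ranks for every pair of sites.

    `ranks_by_site` maps a site name to its list of item ranks
    (same item order in every list).
    """
    return {
        (a, b): pearson(ranks_by_site[a], ranks_by_site[b])
        for a, b in combinations(sorted(ranks_by_site), 2)
    }
```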
Principal component analysis (PCA) of EURO-D items was carried out using PASW version 18, and confirmatory factor analysis (CFA) using AMOS version 4.0. For PCA, varimax rotation was carried out with an eigenvalue of one as the initial extraction criterion. The cut-off used to assume that an item loaded on a given factor was 0.60, with a threshold of 0.50 signifying borderline loading. Given the a priori hypothesis of an underlying two-factor solution [40], we then tested and compared between sites the goodness-of-fit of the two factor solution identified in the European SHARE survey, using confirmatory factor analysis. CFA models contain parameters that are (a) fixed to a certain value, (b) constrained to be equal to other parameters, and (c) free to take on any unknown value [41]. In testing for psychometric invariance across sites, two models were fitted and then compared for goodness-of-fit: one in which the factor loadings are unconstrained, that is, estimated separately for all countries, and a second in which they are constrained to be equal across countries, the null hypothesis being that items load to a similar extent on the same latent trait or traits across countries. Markedly superior fit of the first model would challenge the hypothesis of measurement invariance. We assessed goodness-of-fit using Akaike's Information Criterion (AIC) [40], the Tucker-Lewis Index (TLI) [42] and the Root Mean Square Error of Approximation (RMSEA). The lower the AIC value, the better the fit of the model [42]; for the TLI, values near 1.0 indicate good fit and those greater than 0.90 are considered satisfactory [43,44]; for the RMSEA, values of less than 0.05 indicate close fit and 0.05 to 0.08 reasonable fit for the model [45]. In the final stage of the analysis, we compared the goodness of fit of the two factor solution derived from the European SHARE study with that of a one factor solution, with loadings constrained across sites.
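For readers unfamiliar with these indices, the sketch below shows common textbook definitions computed from model chi-square statistics, together with the interpretive thresholds quoted above. These single-group formulas are an assumption of this example; the values reported by AMOS, particularly for multi-group models, may be computed differently.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (single-group form)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def tli(chi2, df, chi2_null, df_null):
    """Tucker-Lewis Index relative to the null (independence) model."""
    null_ratio = chi2_null / df_null
    return (null_ratio - chi2 / df) / (null_ratio - 1.0)


def aic(chi2, n_free_parameters):
    """Akaike's Information Criterion in its chi-square based form."""
    return chi2 + 2 * n_free_parameters


def describe_fit(rmsea_value, tli_value):
    """Apply the thresholds cited in the text to a pair of indices."""
    if rmsea_value < 0.05:
        rmsea_label = "close"
    elif rmsea_value <= 0.08:
        rmsea_label = "reasonable"
    else:
        rmsea_label = "poor"
    tli_label = "satisfactory" if tli_value > 0.90 else "below satisfactory"
    return {"rmsea": rmsea_label, "tli": tli_label}
```

Under these thresholds a model can be rated 'close' on RMSEA while falling short on TLI, which is why the indices are usually read together rather than singly.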
We assessed the psychometric properties of the EURO-D scale in each site by running receiver operating characteristic (ROC) curve analyses using ICD-10 depressive episode as the reference criterion, plotting sensitivity against the false positive rate (1 - specificity) and estimating the area under the ROC curve (AUROC) with 95% confidence intervals. To calibrate the EURO-D score against ICD-10 depressive episode diagnosis, we used the maximum Youden's index (sensitivity + specificity - 1) as the criterion for determining the optimal cut-point in each site. The cutpoint that was optimal for most sites was then applied to all sites, and the sensitivity, specificity and Youden's index at that cut-point were reported against ICD-10 depressive episode. It is important to note that the EURO-D scale score and ICD-10 diagnosis were both derived from a single GMS interview, administered by the same research worker, with some overlap in the symptoms ascertained. Therefore, this does not represent an independent validation of the EURO-D scale, but rather an attempt to compare its calibration with ICD-10 clinical diagnosis among sites.
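The cut-point selection can be sketched as a search over candidate thresholds, keeping the one with the largest Youden's index. In this example a threshold t means 'score of t or more is screen-positive', so the 4/5 cutpoint in the paper's notation corresponds to t = 5. The function names are assumptions of this sketch, and ties are resolved in favour of the lowest threshold, which is an arbitrary choice.

```python
def sensitivity_specificity(scores, cases, threshold):
    """Sensitivity and specificity of `score >= threshold` against `cases`.

    `cases` holds 1 (reference diagnosis present) or 0 (absent).
    """
    tp = sum(1 for s, c in zip(scores, cases) if c and s >= threshold)
    fn = sum(1 for s, c in zip(scores, cases) if c and s < threshold)
    tn = sum(1 for s, c in zip(scores, cases) if not c and s < threshold)
    fp = sum(1 for s, c in zip(scores, cases) if not c and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_cut_point(scores, cases, max_score=12):
    """Return (threshold, Youden's index) maximising sens + spec - 1."""
    best_threshold = None
    best_j = float("-inf")
    for threshold in range(1, max_score + 1):
        sens, spec = sensitivity_specificity(scores, cases, threshold)
        j = sens + spec - 1
        if j > best_j:  # strict comparison keeps the lowest tied threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j
```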
The concurrent validity of the EURO-D scale in each site was assessed by measuring Spearman rank correlations with global self-rated health (an inverse correlation hypothesised), WHODAS 2.0 disability (a positive correlation hypothesised) and happiness (an inverse correlation hypothesised).

Sample characteristics
Overall, 17,852 interviews were completed in 13 sites from nine countries. A high response rate was obtained, at least 80% in all sites, and exceeding 90% in several sites. Table 1 summarizes the sample demographic characteristics, by country. Women predominate over men in all sites. Educational levels varied widely between sites, the proportion not completing primary education was higher in sites in India, China and Nigeria in comparison to those in Latin America, and was also generally higher in rural than urban sites.
Histograms of EURO-D score distributions (data not provided) indicated that the modal score in all sites, other than urban India, was zero, indicating no depression symptoms. In all sites the distribution was markedly positively skewed. In rural India, the score distribution was biphasic, with peaks at zero to one and five to seven. Mean scores ranged between 1.7 and 3.2, other than in urban China (0.5) and rural China (0.2). Median scores ranged between 1 and 3, and 75th centiles between 3 and 6, other than in urban China (1) and rural China (0). Relatively high score distributions were seen in the Dominican Republic, and India.
The internal consistency of the EURO-D scale, measured by Cronbach's alpha, ranged from 0.64 to 0.87, and exceeded 0.70 in almost all sites.

EURO-D hierarchical scaling properties
Loevinger's H coefficients indicated a weak hierarchical scale in Cuba, Dominican Republic, Puerto Rico and China, a moderate hierarchical scale in India and a strong hierarchical scale in Nigeria (Table 2). In Peru, Venezuela and Mexico, Loevinger's H coefficient fell just below the threshold to support hierarchality. In none of the countries were any significant violations of monotonicity assumptions noted. There were several statistically significant violations of the more stringent double monotone homogeneity (non-intersection) assumptions, but strong evidence of violation was only seen for a minority of symptoms in certain sites. The pattern of item-specific Loevinger's H coefficients and non-intersection violations did not suggest that any particular items could be omitted to generate a more effective hierarchical scale across countries.
The proportion of participants in each site endorsing each of the EURO-D symptoms is summarized in Table 3. The symptoms are ranked, within each site, in order of frequency of endorsement. The prevalence of individual symptoms and their rank order were similar across Latin American and Indian sites. The prevalence of all symptoms was strikingly lower in Chinese sites, other than tearfulness, which was commonly endorsed in the rural Chinese site. The rank order of symptoms was also somewhat different from that observed in Latin American and Indian sites. The rank order of symptoms in the Nigerian site was strikingly different from those in all other sites. Thus, depressed mood was the most commonly endorsed symptom in all Latin American sites, and the second or third most commonly endorsed symptom in Indian sites. Sleep disturbance and tearfulness were the other commonly endorsed symptoms in those sites. However, in China depressed mood was the fifth most commonly endorsed symptom, while the more commonly endorsed symptoms were sleep disturbance, fatigue and irritability in urban China and tearfulness, loss of concentration and loss of interest in rural China. In Nigeria, depressed mood was the fourth most commonly endorsed item, the most frequently endorsed items being loss of enjoyment, loss of interest and fatigue.

Factor structure
Bartlett's tests of sphericity and Kaiser-Meyer-Olkin Measure of Sampling Adequacy suggested that factor analysis was appropriate and feasible in all countries (Table 5). The principal components factor analysis yielded three factors with eigenvalues over one in most countries, with a two factor solution in Cuba, and a four factor solution in Mexico. The first two factors dominated in all countries (cumulative variance 36.4-45.8%). The third factors contributed between 8.4% and 9.3% of scale variance, with eigenvalues between 1.0 and 1.1. In most countries, the first factor was dominated by loadings of the depression and tearfulness items (seven countries), accompanied by lower level and less consistent loadings from items addressing suicidality (five countries), and sleep, appetite and pessimism (four countries each). The second factor was most commonly dominated by loadings of interest and enjoyment items (eight countries), with occasional lower level loadings of concentration (three countries). In Venezuela the second factor was dominated by depression and tearfulness, and the third by enjoyment and interest, while in Nigeria the pattern was reversed. In both of these countries the first factor was dominated by pessimism and concentration, with guilt and suicidality also loading in Nigeria. In other sites, the third factor was loaded on by a variety of items: guilt, with or without suicidality and irritability (five countries). In China, the third factor was loaded upon by somatic items: sleep, appetite and fatigue. Given that the findings from the PCA were broadly consistent with the two factor (affective suffering and motivation) model previously identified and found to fit well across European SHARE study countries, we formally tested the goodness of fit of this factor structure across 10/66 countries, using confirmatory factor analysis (Table 6).
This two factor model showed a moderately good fit across sites according to RMSEA (<0.05), although less convincingly so according to TLI (0.77, well below the value of 0.90 considered acceptable) (Table 7). The model in which loadings were constrained to be equal across countries and the model in which they were freely estimated in each country differed little in terms of AIC, TLI or RMSEA, suggesting measurement invariance. Variance in factor loadings was reduced for affective suffering items when Nigeria (a clear outlier) was omitted, and the model fit of the two factor solution was clearly improved. When the model fit of the constrained two factor model (omitting Nigeria) was compared with that of a one factor solution (omitting Nigeria), the two factor solution was clearly superior according to all absolute and relative goodness of fit indices.

Calibration against clinical diagnoses
The calibration of the EURO-D depression score against ICD-10 clinical diagnosis is summarized in Table 8. The Area Under the Receiver Operating Characteristic curve (AUROC) ranged from 0.89 to 1.00. The optimal cutpoint for the EURO-D against the reference criterion of ICD-10 depressive episode (using the criterion of maximizing Youden's index) was 4/5 (a score of five or more) in all of the Latin American sites, rural China and Nigeria. While a lower cutpoint (3/4) would have been selected in rural India, and a higher cutpoint in urban China (6/7) and urban India (5/6), there was actually little difference between Youden's index at these cutpoints and at the 4/5 cutpoint that was optimal for other sites. At the 4/5 cutpoint, the sensitivity for ICD-10 depressive episode was 86% or higher in all sites and the specificity exceeded 84% in all Latin American and Chinese sites. However, specificity was lower in urban India (74.1%), rural India (69.5%) and Nigeria (79.3%), indicating a relatively high false positive rate using that cutpoint in those sites.

Concurrent validity
As hypothesized, EURO-D scores were positively correlated with WHODAS 2.0 disability scores in all sites (+0.15 to +0.48, P < 0.001), Table 9. EURO-D depression scores were inversely associated with global self-rated health in all sites, but at a much lower level in urban

Discussion
The results of these analyses extend the evidence for the cross-cultural validity of the EURO-D scale, at least to Hispanic Latin American and Indian settings. We were able to replicate the two factor structure ('affective suffering' and 'motivation') previously demonstrated in two studies in continental Europe [12,32]. Measurement invariance (common factor loadings and rank order of item difficulties) was demonstrated among Latin American and Indian sites, but the evidence for this was less compelling for Chinese sites, and measurement properties were quite different in Nigeria. Concurrent validity (hypothesized positive correlations with disability scores, and negative correlations with subjective health ratings and happiness) was strongly supported for the Latin American and Indian sites. However, correlations with subjective health ratings were weak in China, and the hypothesised negative correlations with happiness were absent in China and Nigeria.
We assessed the construct validity of the EURO-D in large, population-based surveys in diverse low and middle income country settings, including both rural and urban catchment areas. We used advanced psychometric techniques (confirmatory factor analysis and item response models), as well as concurrent validity and calibration with clinical diagnosis, to evaluate cross-cultural construct validity. Findings are directly comparable with similar analyses conducted in continental Europe [32,46]. The main limitations of this study are that we did not carry out a criterion validation using an independent clinical interview, and we did not assess test-retest, inter-interviewer or inter-rater reliability for the EURO-D scale items.
Findings from this study are most directly comparable with those from the SHARE survey [22] and the EURODEP consortium studies [47], in which the EURO-D was administered as part of the GMS (EURODEP, nine sites in eight European countries, older adults aged 65 years and over), or as a free-standing scale (SHARE, 11 European countries, older adults aged 50 years and over) in cross-sectional population-based surveys. In EURODEP, the mean EURO-D score ranged from 1.3 to 3.6 among countries, and in SHARE from 1.8 to 3.1, similar to the range observed in our 10/66 studies of 1.7 to 3.2 (excluding the low outlier of China). Cronbach's alpha ranged from 0.61 to 0.75 in EURODEP, and from 0.62 to 0.78 in SHARE, similar to the range from 0.64 to 0.77 observed in most 10/66 sites. The unusually high internal consistency in rural India and Nigeria (Cronbach's alpha, 0.87) may suggest a problem with response set bias in those sites. The EURO-D demonstrated stronger hierarchical scaling properties in the European countries included in the SHARE survey [32] than in the 10/66 sites in Latin America and India. Nevertheless, the rank of item difficulties was similar, with depression, sleep disturbance and fatigue being among the most commonly endorsed items (low item difficulty), and guilt and wishing death among the least commonly endorsed (high item difficulty). In Nigeria, EURO-D item responses were strongly hierarchical but with a strikingly different rank order of item difficulties than that observed in the other 10/66 sites and in the European SHARE survey countries. Principal Components Analysis generated similar factor structures (affective suffering and motivation) in the current study as in the EURODEP studies [46], the SHARE surveys [32], and in convenience samples of depressed and older people from the general population in the 10/66 Dementia Research Group pilot studies in Latin America, India and China [24].
The two factor solution derived in the European SHARE study fitted moderately well in our current sample, particularly when the Nigerian site was excluded. As in the SHARE study, depression and tearfulness consistently loaded on Affective Suffering. However, in contrast to the SHARE study, interest and enjoyment rather than enjoyment and pessimism dominated the Motivation factor. The clinical diagnosis of ICD-10 depressive episode in the current study was derived from the same GMS interview, using many of the same items that were used to score the EURO-D, the distinction being that particular combinations of symptoms (which needed to be persistent and pervasive) were required to meet the ICD-10 criteria. As such, the favourable validity coefficients cannot be taken as evidence of criterion validity. Such evidence is available from independent clinical assessments in some of the EURODEP studies [12], a clinical validation of the EURO-D scale in Spain [48] and high sensitivity for the detection of severe depression in the 10/66 Dementia Research Group pilot studies in Latin America, India and China [24]. We were, however, able to calibrate the EURO-D scale score against an ICD-10 clinical diagnosis of depressive episode; the optimal cutpoint was 4/5 in most sites, one point higher than the 3/4 cutpoint identified as optimal in the EURODEP consortium studies [12,46]. Concurrent validity of the EURO-D scale has not been assessed in previous studies. Depression among older people has been previously shown to be strongly associated with disability [49][50][51] and inversely associated with self-reported global health [12].
Although happiness is undoubtedly more than the absence of depression, recent analyses of population-based survey data from the United Kingdom, Germany and Australia indicate that mental ill health accounts for by far the largest component of the variance in lack of life satisfaction, dominating the effects of physical health, demographic and socioeconomic factors [52]. As such, the failure to observe the predicted inverse correlation with self-reported happiness in China and Nigeria does not support the construct validity of the EURO-D in those settings.
Several factors may have contributed to the discrepant measurement characteristics of the EURO-D in China and particularly Nigeria. In the Chinese sites the prevalence of nearly all depression symptoms was strikingly low. This may have impeded the elucidation of the factor structure and assessment of hierarchality, as well as limiting the variance to be explained in correlation with concurrent validators. In China the once popular and prevalent diagnosis of shenjing shuairuo, a neurasthenia-like syndrome comprising weakness, fatigue, concentration problems, headache and other somatic symptoms, seems in recent years to have been supplanted as the most common diagnosis in epidemiological surveys and clinical practice by depressive and anxiety disorders [53]. This has led some to allege an inappropriate importation of western nosologies that do not match well with Chinese cultural idioms of expression of psychological distress [53]. An alternative standpoint is that 'mental health literacy', judged by recognition and appropriate attribution of vignettes of depression and anxiety, is low in Chinese populations both inside and outside of China [54]. In this context, it is perhaps noteworthy that in our study depression was not a common symptom in either the urban or rural Chinese sites, and that sleep disturbance, fatigue and irritability were the three commonest symptoms in the urban site, and tearfulness, lack of concentration and loss of interest in the rural site. The EURO-D factor structure derived from the Chinese sample is consistent with previous observations from rural Thailand [55], where a high prevalence of fatigue was also observed, and where, in addition to affective suffering and motivation, sleep and appetite constituted a separate third factor.
Cultural differences in the experience, attribution and communication of psychological distress might also have mediated some of the observed differences in measurement properties in Nigeria. Brain Fag Syndrome, comprising a tetrad of somatic complaints, cognitive impairments, sleep-related complaints, and other somatic impairments, was recognised as a West African culture-bound syndrome in DSM-IV [56]. While originally recognised among students in the early 1960s, it is likely that this reflects enduring and widespread tendencies for the expression of psychological distress, informed by cultural norms and traditional medicine services. In our study, loss of enjoyment and interest, and fatigue were the most commonly endorsed symptoms in Nigeria; however, the rank orders of sleep disturbance and concentration problems were similar to those in other sites. Site-specific factors, some of which may have been culture related, may also have influenced the interaction between the older respondent and the interviewer, impacting on the assessment, ascertainment and recording of symptoms. In Nigeria, interviewers were local school leavers as opposed to graduates (often health professionals) in other sites, and levels of education and literacy among participants were the lowest of any of the 10/66 survey sites. While training for interviewing using the GMS was carried out using standardized and rigorous procedures in all sites, this may have been a particularly challenging task for the young interviewers in Nigeria. Finally, in both Nigeria and China, suboptimal translations and/or cultural adaptations for either the happiness question or the EURO-D may have led to an underestimation of the correlations between these variables.

Conclusions
In conclusion, more work needs to be done to establish the validity of the EURO-D scale, and by extension the GMS interview, when used across cultures as a tool for assessing depression symptom severity, and generating clinical diagnoses. While its cross-cultural measurement properties are for the most part favourable, the case for measurement invariance with respect to its European origins weakens progressively with increasing cultural distance and disparity in levels of human development. Different questions, asked in different ways, may have served better to elicit symptoms of depressed mood in certain cultures. Ethnographically informed qualitative research might help to identify culture-specific idioms of psychological distress (not captured by depression nosologies), among older adults in China and Nigeria. With globalisation, and progressive economic and human development, it may be that cultures will tend to converge around a western consensus of 'mental health literacy'. If so, one might hypothesise that, through a cohort effect, cross-cultural challenges may be most evident in the assessment of the mental health of older adults.