The psychometric properties of the subscales of the GHQ-28 in a multi-ethnic maternal sample: results from the Born in Bradford cohort

Background: Poor maternal mental health can impact on children’s development and wellbeing; however, there is concern about the comparability of screening instruments administered to women of diverse ethnic origin.
Methods: We used confirmatory factor analysis (CFA) and exploratory factor analysis (EFA) to examine the subscale structure of the GHQ-28 in an ethnically diverse community cohort of pregnant women in the UK (N = 5,089). We defined five groups according to ethnicity and language of administration, and also conducted a CFA between four groups of 1,095 women who completed the GHQ-28 both during and after pregnancy.
Results: After item reduction, 17 of the 28 items were considered to relate to the same four underlying concepts in each group; however, there was variation in the response to individual items by women of different ethnic origin and this rendered between group comparisons problematic. The EFA revealed that these measurement difficulties might be related to variation in the underlying concepts being measured by the factors.
Conclusions: We found little evidence to recommend the use of the GHQ-28 subscales in routine clinical or epidemiological assessment of maternal women in populations of diverse ethnicity.


Background
Good maternal mental health is important for a child's future health and wellbeing as depression and other mental health problems can interfere with bonding, attachment, enrichment activities and parenting behaviour [1,2]. Children of mothers who suffer from depression are more likely to experience behavioural problems and have lower school attainment; this can set a child on a pathway of fewer life chances with associated risks of health problems [3][4][5][6][7]. Antenatal distress, particularly anxiety, and postnatal depression are strongly correlated [8,9]; however, screening presents challenges as normal physical and hormonal changes may interfere with the sensitivity and specificity of screening instruments, particularly those containing items relating to somatic symptoms which will naturally be disturbed by both pregnancy and caring for an infant [10,11].
Commonly used population screens for psychological distress include the General Health Questionnaire (GHQ) family of instruments. The 28-item version (GHQ-28) was developed in the 1970s from a factor analysis of the GHQ-60 to distinguish four correlated underlying concepts as factors, each comprising seven items: somatic symptoms (subscale A, items 1-7), anxiety and insomnia (B, items 8-14), social dysfunction (C, items 15-21) and severe depression (D, items 22-28) [12]. The GHQ-28 has been translated into several languages and used internationally. A key concern when applying a screening instrument in a different population is that it might perform unexpectedly; therefore 'emic' measures that have intrinsic meaning in the culture and populations in which they will be used [13,14] are preferable in the development of mental health measures. 'Etic' development of mental health measures, whereby translated and/or transplanted measures are applied to a population under the assumption that concepts are similar across cultures, may not be of particular concern when the health of a single population is assessed; however, potential variation has consequences when assessing differences between populations. If groups differ in the way they interpret the underlying concept being measured, or in the strength of the relationship between a question about a symptom and the concept, and this goes unnoticed or is ignored, it might be difficult to distinguish between true variations (or similarities) in mental health and spurious findings. Johnson [15] highlights the complexities inherent in defining and operationalising cross-cultural equivalence, with interpretive differences of concepts and constructs nested in lexical, semantic and idiomatic variation.
Factors that can affect instrument accuracy include population variation in mental illness prevalence [16], differences in the strength of association between the items and the implied factor being measured, variation in the expression of psychological symptoms, and systematic differences in how the response scales for each question are completed [17].
Several methods are available to explore potential differences and test hypotheses to examine if measures are equivalent across populations. For multi-dimensional instruments the number of factors being measured by the items can be derived from exploratory factor analysis (EFA). The same technique can be employed to determine which items are most strongly (or weakly) related to the factor(s) and which items relate to multiple factors. The instrument's equivalence across different populations can be tested using confirmatory factor analysis (CFA), which can indicate whether a factor is associated with the same item set across groups (configural invariance), whether the strength of the relationship between each item and the factor is the same across groups (metric invariance), and whether the groups respond similarly to an item's response scale (scalar invariance). Such analyses lead to the development of a measurement model in which equivalence of the scale's performance in each group is suggested or rejected either from the observed data or after correction for systematic differences.
Using EFA, the four-factor structure of the GHQ-28 has been found to vary between countries, and across populations there may be less distinction between subscales A (Somatic) and B (Anxiety and Insomnia) than originally found [18]. Fewer studies have explored the performance of the GHQ-28 subscales during or after pregnancy; however, an analysis of a Yoruban translation given to pregnant Nigerian women indicated that subscales A and B and the more cognitive (non-suicidal ideation) items from subscale D represented a single factor [19]. Large scale investigations into the scale's performance in maternal populations and in ethnic minority women are lacking.
The GHQ-28 was used as a measure of maternal psychological distress for the Born in Bradford community birth cohort (www.borninbradford.nhs.uk) which includes roughly equal size populations of White women and those of South Asian descent. Because of the potential for variation in the underlying concepts measured by the GHQ-28 between ethnic groups and languages of administration, and due to the maternal characteristics of the cohort, we examined its psychometric properties to ensure that cohort-wide comparisons were valid between all subpopulations.
We aimed to identify a strategy that could be used to measure and compare symptom subscale scores during and after pregnancy for women of varying cultural backgrounds and for those completing the GHQ-28 in different languages.

Population
Born in Bradford (BiB) is a longitudinal multi-ethnic birth cohort study that aims to examine the impact of environmental, psychological and genetic factors on maternal and child health and wellbeing [20]. Bradford is a city in the North of England with high levels of socioeconomic deprivation and ethnic diversity. Women were recruited prior to a glucose tolerance test offered as a routine procedure to all pregnant women registered at Bradford Royal Infirmary at 26-28 weeks gestation. A baseline questionnaire was administered to women who consented via an interview conducted in a designated room with semi-private booths. Women could choose to have their interview conducted in either English, Mirpuri (a spoken variant of Punjabi) or Urdu. Women not able to converse in any of these three languages were eligible to enrol but did not complete the baseline questionnaire and thus are not included here. The full BiB cohort recruited 12,453 women during 13,776 pregnancies between 2007 and 2010 and the cohort is broadly characteristic of the city's maternal population. Ethical approval for the data collection was granted by Bradford Research Ethics Committee (Ref 07/H1302/112).
Two samples from the BiB cohort were used to explore the properties of the GHQ-28. First we report on data from 5,299 women with singleton births enrolled between November 2007 and March 2009 who completed the phase two version of the three versions of the baseline questionnaire. Second, we used a subset of the cohort, known as BiB1000, to assess the structure of the GHQ-28 in pregnancy and postnatally. BiB1000 participants in our sample were enrolled between August 2008 and March 2009, completed the phase two baseline questionnaire and consented to repeat visits at six, 12, 18, 24 and 36 months postpartum. We report on the antenatal and six-month GHQ-28 data for 1,305 women with singleton births.

GHQ-28
An initial Urdu translation of the GHQ-28 questionnaire was adapted for use as a script in this population by a professional translator through a process of refinement using participatory methods [21,22]. Assessment of understanding was undertaken with groups of bilingual then monolingual Urdu women from local Children's Centres. A Mirpuri version was transliterated from a second draft that used a similar iterative process with bilingual then monolingual Mirpuri speaking women. Scripts were finalised from the third draft version in each language.
The GHQ-28 was administered on paper as part of a self-completion module at the end of the interview for women who chose to complete their baseline questionnaire in English. For the women who chose Mirpuri or Urdu language, the GHQ-28 questions were read aloud and the research assistant coded the response on paper. Verbal administration was necessary because there is no written form of Mirpuri, and not all Urdu speakers are fluent in reading and writing the Urdu language. Some of the women were accompanied; therefore verbal responses may have been audible to the accompanying person. For the women in BiB1000, the six-month GHQ-28 was administered in the women's home by research staff in the language of choice.
The GHQ-28 has a four-point response scale anchored (typically) with 'Not at all', 'No more than usual', 'Rather more than usual', and 'Much more than usual'. Several scoring options are available; we used the Likert method to indicate symptom severity, which scores each item response from 0 to 3 (0-1-2-3, subscale range 0 to 21), as this is the recommended method for assessment of the subscales. We excluded the few cases where the GHQ-28 was either missing in its entirety or did not contain at least one intact subscale.
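The Likert scoring and the exclusion rule described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the function and dictionary names are hypothetical, and it assumes that an "intact" subscale means all seven of its items were answered, which the text does not state explicitly.

```python
from typing import Dict, Optional, Sequence

# Item ranges (1-indexed) for the four GHQ-28 subscales.
SUBSCALES = {
    "A_somatic": range(1, 8),
    "B_anxiety_insomnia": range(8, 15),
    "C_social_dysfunction": range(15, 22),
    "D_severe_depression": range(22, 29),
}


def likert_subscale_scores(
    responses: Sequence[Optional[int]],
) -> Dict[str, Optional[int]]:
    """Sum Likert-coded items (0-3) within each subscale.

    A subscale with any missing item (None) is scored as None.
    ASSUMPTION: "intact" means all seven items are present.
    """
    if len(responses) != 28:
        raise ValueError("expected 28 item responses")
    scores: Dict[str, Optional[int]] = {}
    for name, items in SUBSCALES.items():
        values = [responses[i - 1] for i in items]
        if any(v is None for v in values):
            scores[name] = None  # subscale is not intact
            continue
        if any(v not in (0, 1, 2, 3) for v in values):
            raise ValueError("item responses must be coded 0, 1, 2 or 3")
        scores[name] = sum(values)  # possible range 0 to 21
    return scores


def has_intact_subscale(responses: Sequence[Optional[int]]) -> bool:
    """Inclusion rule: keep a case if at least one subscale is intact."""
    return any(s is not None for s in likert_subscale_scores(responses).values())
```

Under these assumptions a questionnaire with a single missing anxiety item would still contribute its somatic, social dysfunction and depression subscale scores.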

Ethnicity
Questions relating to ethnicity in BiB were based on those used in the UK's 2001 census and comprised one question that asked which ethnic group the mothers considered they belonged to (White, Mixed ethnic group, Black or Black British, Asian or Asian British, Chinese or other), followed by a further question, based on their response, about their cultural background. For example, if a participant selected 'Asian or Asian British' as ethnic group, cultural background could be selected from the following: Indian, Pakistani, Bangladeshi, Indian Caribbean, African-Indian. Self-defined ethnic and cultural group information was taken from the baseline questionnaire and classified into the two most numerous groups of White and Pakistani; all other responses were coded into a separate category (Other). The few cases of women identified as mixed White and Pakistani (N = 18 in the cohort) were classified in the White group. Due to the low number of non-UK born White women (N = 146) we did not further differentiate the cultural background of those who identified as White.

Language of administration
The interviewer recorded the language in which the interview was conducted.

Analysis
We tested for measurement equivalence on the subscales by multi-group confirmatory factor analysis (CFA), using Mplus version 7 with a robust maximum likelihood (MLR) estimator as our data were not normally distributed. MLR is a full information estimator that employs all the available data and thereby calculates unbiased parameter estimates in the presence of data which are missing at random or missing completely at random [23]. Some women completed the instrument on more than one occasion due to multiple pregnancies. This introduces non-independence into the sample, which can lead to incorrect values for standard errors and fit statistics (fit statistics based on chi-square). We accounted for this minor clustering of the full cohort data by utilising a sandwich estimator (the cluster command within Mplus, combined with the complex samples approach). We fitted increasingly restrictive pairwise models in five subpopulations; women who completed the questionnaire in English for the ethnocultural groups of Pakistani, White and Other, women who completed the questionnaire in Mirpuri (Pakistani and Other), and women who completed it in Urdu (Pakistani and Other). As a subscale score is calculated independently from other subscales in practice, we considered the fit of each subscale separately for each subpopulation, with no cross-loading items permitted. If a factor was not associated with the same item sets across groups (i.e. configural invariance was not met) a model generation strategy was used where items within subscales were removed until adequate fit was achieved for each subpopulation for the same items for each factor. We considered model fit adequate if thresholds for three indices were met; comparative fit index, CFI (≥0.95), root mean square error of approximation, RMSEA (≤0.08) and standardised root mean square residual, SRMR (≤0.06). 
We interpreted modification indices to help identify the most problematic items and accepted the solution that retained the largest number of items, for the best fit, across groups. If configural invariance was then indicated, we tested whether the strength of the relationship between each item and the factor were equal across groups by constraining factor loadings to be equal across both groups (metric invariance). If metric invariance was indicated we then tested for scalar invariance by also constraining item intercepts to be equal [24][25][26]. For analysis purposes the latent variable is assigned the scale of the first item. If there is variation in how each group responds to an item response scale, a unit change in a factor score will be associated with an unequal change in the score of an item across groups. The presence of this Differential Item Functioning (DIF) indicates that between group comparisons will be invalid [27].
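The three nested levels of invariance can be written compactly. The notation below is a conventional linear factor model introduced here for illustration; the symbols are assumptions and do not appear in the original analysis. Here $x_{ij}^{(g)}$ is the score of woman $i$ in group $g$ on item $j$, $\xi_i^{(g)}$ is her latent factor score, $\lambda_j^{(g)}$ is the item's factor loading, $\tau_j^{(g)}$ is the item's intercept and $\varepsilon_{ij}^{(g)}$ is the residual.

```latex
\begin{align*}
x_{ij}^{(g)} &= \tau_j^{(g)} + \lambda_j^{(g)}\,\xi_i^{(g)} + \varepsilon_{ij}^{(g)} \\
\text{configural:}\quad & \text{the same items load on the factor in every group } g \\
\text{metric:}\quad & \lambda_j^{(g)} = \lambda_j \quad \text{for all } g \\
\text{scalar:}\quad & \lambda_j^{(g)} = \lambda_j \ \text{ and } \ \tau_j^{(g)} = \tau_j \quad \text{for all } g
\end{align*}
```

Each level adds constraints to the previous one, so a subscale can satisfy metric invariance while failing scalar invariance, in which case women with the same factor score are expected to give different item responses depending on their group.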
We treated the data as continuous for analysis purposes. Likert data can be treated as continuous, or can be considered to be ordered categorical (i.e. an item response theory -IRT-based approach). There is debate in the literature regarding the most appropriate method for analysing such data [28,29] however our aim was to analyse the scales in the same metric in which they are employed. The scales are typically scored by summing (or equivalently averaging) items, not scored using IRT-based methods, hence we analysed the covariance matrix.
We repeated this process (configural, metric, scalar testing) on the subsample of women who completed the measure both during pregnancy and six-months post-partum (BiB1000). We restricted the BiB1000 analysis to those women who completed both questionnaires in the same language. Two women from the 'Other' ethnic groups did not complete the questionnaire in English and only three women completed the GHQ-28 in Mirpuri. Therefore, our analysis compared these data across four ethnic groups; English administration for White women, English (Pakistani), English (Other) and Urdu (Pakistani).
As noted previously, we considered model fit adequate if thresholds for three indices were met; CFI (≥0.95), RMSEA (≤0.08), and SRMR (≤0.06). We did not interpret change in χ² as an indicator of invariance in increasingly restrictive models as it is relatively insensitive to change in large samples. Instead we used a change in CFI of ≤0.01 together with a change in SRMR of ≤0.03 to indicate substantive invariance, setting the SRMR criterion to ≤0.01 when evaluating scalar invariance [30,31].
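The fit and invariance criteria above amount to a simple decision rule, sketched below. The class and function names are hypothetical, the fit values in the usage example are invented for illustration, and the sketch assumes that "change" means a fall in CFI and a rise in SRMR when moving to the more restrictive model.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    """Fit indices reported for one fitted model."""
    cfi: float
    rmsea: float
    srmr: float


def adequate_fit(fit: Fit) -> bool:
    """All three thresholds must be met: CFI >= 0.95, RMSEA <= 0.08, SRMR <= 0.06."""
    return fit.cfi >= 0.95 and fit.rmsea <= 0.08 and fit.srmr <= 0.06


def invariance_holds(less_restricted: Fit, more_restricted: Fit, level: str) -> bool:
    """Compare two nested models.

    level is "metric" (SRMR change <= 0.03) or "scalar" (SRMR change <= 0.01);
    the CFI change criterion is <= 0.01 at both levels.
    ASSUMPTION: change is measured as worsening of fit in the more restricted model.
    """
    srmr_limit = {"metric": 0.03, "scalar": 0.01}[level]
    delta_cfi = less_restricted.cfi - more_restricted.cfi
    delta_srmr = more_restricted.srmr - less_restricted.srmr
    return delta_cfi <= 0.01 and delta_srmr <= srmr_limit


# Hypothetical fit values for one pairwise comparison.
configural = Fit(cfi=0.970, rmsea=0.050, srmr=0.030)
metric = Fit(cfi=0.965, rmsea=0.052, srmr=0.040)
scalar = Fit(cfi=0.940, rmsea=0.060, srmr=0.060)

metric_ok = invariance_holds(configural, metric, "metric")
scalar_ok = invariance_holds(metric, scalar, "scalar")
```

With these invented values the metric step passes and the scalar step fails, which is the pattern the paper later reports for many of the comparisons.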
As the same seven items were not associated with the same factors across groups, i.e. configural invariance was not indicated, we followed up the CFA of the BiB cohort with exploratory factor analysis (EFA). We specified an EFA with between 1 and 8 latent variable solutions as implemented in Mplus. To determine the most parsimonious solution that best fit the data we examined the scree plot [32] for the point of inflexion and used the fit criteria detailed above.

GHQ-28 subscale scores (table excerpt)
A Somatic symptoms | 6 (4 to 9) | 8 (5 to 11) | 7 (4 to 10) | 7 (4 to 9) | 8 (5 to 11) | 7 (4 to 10)
B Anxiety and Insomnia | 7 (3 to 10) | 8 (4 to 11) | 7 (3 to 10) | 4 (1 to 7) | 4 (1 to 8) | 6 (3 to 10)
C Social dysfunction | 7 (7 to 9) | 8 (7 to 10) | 8 (7 to 10) | 7 (7 to 8) | 7 (7 to 9) | 8 (7 to 9)
D Severe depression | 0 (0 to 2) | 1 (0 to 3) | 1 (0 to 3) | 1 (0 to 1) | 1 (0 to 1) | 0 (0 to 2)
* Includes those with at least one intact GHQ-28 subscale and the language of administration, N presented may not total 5089 due to small amounts of missing data, ** total scores have more missing data but are not used in the analysis, SD standard deviation, IQR interquartile range.

Description of sample

BiB cohort
We excluded 176 (3.3%) women without at least one GHQ-28 subscale score, along with a further 34 (<1%) women where the language of administration was not documented. Of the remaining 5,089 cases, 2.3% were missing a minor amount of GHQ-28 data. Nearly all the women who completed the questionnaires in a language other than English were born outside of the UK, and around 10% of the Mirpuri and 7% of the Urdu questionnaires were completed by women of Other ethnic origin (Table 1).

BiB1000
Of the 1,305 women enrolled, 186 (14.3%) were not included as they did not use either Urdu or English at each administration, and a further 24 were missing GHQ-28 data. The characteristics of women recruited to the BiB1000 study did not appear to differ markedly from the main cohort (Table 2).

Confirmatory factor analysis, BiB cohort

Model generation strategy
Generally there was little evidence of good fit of the items to each subscale across groups. To achieve adequate fit across the sample all subscales required item reduction (Table 3). The best fit was not always achieved for the same cluster across subpopulations; this was marked for subscales C (Social Dysfunction) and D (Severe Depression). The retained GHQ-28 questions are provided in Table 4.

GHQ-28 subscale scores, BiB1000 (table excerpt)
B Anxiety and Insomnia | 7 (4 to 10) | 7 (4 to 11) | 5 (2 to 9) | 7 (3 to 11) | 7 (4 to 10)
C Social dysfunction | 7 (7 to 9) | 8 (7 to 10) | 8 (7 to 9) | 8 (7 to 10) | 8 (7 to 9)
Postnatal GHQ-28 scores
D Severe depression | 0 (0 to 1) | 0 (0 to 1) | 0 (0 to 1) | 0 (0 to 1) | 0 (0 to 1)
* Includes those with at least one intact GHQ-28 subscale from each time point and the same language of administration both times, N presented may not total 1095 due to small amounts of missing data, ** total scores have more missing data but are not used in the analysis, SD standard deviation, IQR interquartile range.

Invariance testing
There appeared to be metric invariance between all subpopulations for all reduced item subscales (Table 5).
There was evidence of differential item functioning across many of the group comparisons on all subscales, which indicated that some subpopulations used the item response scales differently under the same state of mental health as measured by the latent factor.

Exploratory factor analysis, BiB cohort
The results from the CFA suggested greater variability between English and non-English groups than for pairwise comparisons between the White British, Pakistani and women of other ethnicities who completed the questionnaire in English. We hypothesised that this was due to differences in the underlying factor structure between linguistic-cultural groups and used EFA to investigate this possibility. A better fit was indicated for a five-factor model over a four-factor model for the sample overall and all English groups, and for six factors over five for the Urdu and Mirpuri groups. However, the individual items making up these factors appeared to differ (Table 6). Across the cohort there appeared to be two concepts being measured with the somatic questions: one cluster of items relating to generalised somatic symptoms (items 1-4), and one relating to the two items concerning physical symptoms in or on the head (items 5 & 6, dubbed Head Somatics in Table 4). The depression concept was split into two factors for the women who responded to the Mirpuri version of the GHQ-28. Several items did not load onto any factor (factor loading <0.3) or loaded only weakly (<0.4), in particular items 7 (hot/cold spells), 15 (busy and occupied) and 21 (enjoy normal activities), indicating little relevance to the observed factors in most of the subpopulations. The amount of variance in the overall model explained by the factors was low, from 41.1% for the Pakistani (English) group to 32.6% for the Urdu responses. The Severe Depression and Anxiety and Insomnia factors accounted for the largest proportion of the variance for most of the groups. The exception was the Urdu sample, where the Anxiety and Insomnia questions did not appear to be a unified concept and accounted for less of the variance.

Confirmatory factor analysis, BiB1000

Model generation strategy
Fit of the seven items to each subscale (data not shown) and reduced item factors for the smaller sample (BiB1000) was broadly similar to the BiB cohort (Table 7), except for some severe model estimation problems on the reduced Severe Depression subscale (items 23-26).

Invariance testing
Although metric invariance held for the antenatal and postnatal analyses, there was evidence of DIF between many of the subpopulations at one or both time points (Table 8). To check that we had not forced items 23-26 into an ill-fitting factor, as this was the best fit for the cohort's Mirpuri sample which was absent in BiB1000, we repeated the analysis for the better fitting cluster 24-27; however, models then became inestimable for the Urdu sample.

Discussion
We conducted an extensive psychometric evaluation of the GHQ-28 subscales in a large community multi-ethnic maternal cohort in the UK. Our results are important because this is the first large scale investigation in both a maternal population and in South Asian women, where there is uncertainty about measurement equivalence of mental health [33][34][35][36]. For each subscale an item reduction strategy was necessary to fit all our defined subpopulations, and there was evidence of differential item functioning in many of the pairwise comparisons. Exploration of the factor structure indicates that this was caused by variation in the concepts being measured, with the most obvious differences visible between women who completed the questionnaire in English and those who completed it in another language. For example, Anxiety and Insomnia in the Urdu respondents and Severe Depression in the Mirpuri respondents did not appear to be related to the same item clusters as in women of any ethnicity completing the questionnaire in English. The implication is that the meaning of the underlying concepts for some items differs according to language of administration and between ethnic groups; this may be related to any number of factors such as acculturation, translation or cultural differences in concept or interpretation. Our goal was to define a measurement model to compare symptom severity in each domain across subgroups; our findings indicate that due to lack of invariance we cannot recommend such comparisons across this cohort.
Research indicates the concept (if not the nomenclature) of postnatal distress has recognition and relevance globally e.g. [37,38]. However, internal construction of causality, symptom experience and illness resolution can vary greatly between cultures [39]. For example, in one UK study, women originating from the Punjab who had 'life troubles' reported symptoms of sadness and grief that tallied with the notion of depression, but conceptualised their problems as an illness manifesting physically as 'heavy in the heart' [40]. Notably, there have been few studies exploring the meaning of depression in pregnant, not postnatal, South Asian women. Given such potential for variation, it is perhaps unsurprising that we found differences between the groups in our sample in the attribution of a specific symptom to a particular construct of mental distress. Our results raised several interesting points about the relationship between symptoms and mental health during the maternal period, and about differences between ethnic groups.

Somatic subscale
Irrespective of cultural background, it is common for people with depression to initially present with somatic symptoms e.g. [14,41]. Somatisation of psychological distress is of interest in maternal populations where new and perhaps unfamiliar bodily changes coincide with any onset of distress. Such simultaneous physical and hormonal changes may complicate self and clinical recognition of potential affective distress. For example, somatic dysfunction might be construed as causative of distress, distress could be overshadowed by physical symptoms that may be considered to have more serious implications for the baby's health, or body symptoms may simply co-exist alongside distress. Neither is the concept of somatisation unidimensional. Simon et al. [41] define three different presentations: patients with psychological distress who initially present with somatic symptoms, those distressed who present with medically unexplained somatic symptoms, and those who present with somatic symptoms and deny psychological distress. Bhui et al. [14] add a fourth: presentation of somatic symptoms made significantly worse by feeling low, stressed or anxious. The topic has generated much theoretical interest for South Asian cultures where somatisation has sometimes [42], but not universally [13,41], been reported to be more frequently endorsed as a symptom of depression. Indeed some data indicate that initial presentation with somatic symptoms might be a function of the patient-doctor interaction rather than a cultural phenomenon [41].
Our data show that broadly, across the maternal population, two concepts related to somatic symptomatology were evident; the first comprised generalised somatic symptoms and the second symptoms related to the head. A principal components evaluation of a non-maternal European sample with rheumatoid arthritis [43] found a similar split in structure, but a study of pregnant Nigerian women [19] reported that all seven somatic items clustered together. Although there are differences in methodology, this indicates that the split between general and specific somatic symptoms may be related to factors other than maternity, or female gender, and in our study these elements appear stable regardless of ethnic background, language of administration or pregnancy/postnatal status. We suggest that this hypothesis is tested in other population samples.

Anxiety and insomnia subscale
Antenatal anxiety commonly co-occurs with depression and is antecedent to postnatal anxiety and depression [9,[44][45][46], and our EFA implicated this factor as the largest symptom cluster for most groups. However, the invariance testing indicated some significant problems with comparisons involving the Urdu group, which the EFA revealed was likely due to a split in the underlying concept.

Social dysfunction subscale
For all groups except the Urdu language groups, the concept of Social Dysfunction was related to all its hypothesised items, confirming the findings in a Nigerian antenatal sample [19]. Excluding comparisons with the Urdu group, this factor also appeared to indicate pairwise invariance. However, the clinical relevance of this subscale is not well researched [47], which limits its relevance in distinguishing psychiatric morbidity from the range of normal changes during pregnancy.

Severe depression subscale
As noted, anxiety and depression are commonly comorbid and these two GHQ-28 factors are unsurprisingly correlated, although the depression subscale has been found to garner some additional information [47].
Here it is noteworthy that this subscale measures severe depression with three questions relating to suicidal ideation; notably absent are enquiries into dysphoric mood. Measurement of such a dimension is of interest interculturally; Bhugra and colleagues have reported that in London, young South Asian women are at higher risk of presenting with attempted suicide than White women [48,49], with cultural and family conflict the actual and perceived causes of such attempts [48,50]. However, the utility of this subscale to measure the concept of suicidality might be limited: although for the antenatal English language and Urdu respondents the questions seemed unified and the factor important, this was not the case in the Mirpuri group, and there was evidence of a lack of invariance between groups. Furthermore, only one of the suicidality questions (item 25) was invariant between groups. Model estimation difficulties that may have been related to low endorsement of these severe items precluded analysis of postnatal data.

Measurement invariance
After reducing items to create factors which appeared to have reasonable fit across all the subpopulations, the iterative process of invariance testing revealed systematic differences in how the different subpopulations rated themselves on the measurement scales. We would be able to solve the problem of systematic differences in scale response if, as in most CFA analyses, there were just two populations to compare; but due to both cultural and language variation we identified five distinct groups, and as the DIF varied within sub-group pairs, systematic correction is unfeasible. While some of the differences are small and would have a negligible impact on mean scores, some differentials are up to half a point (on a four-point scale) which has the potential to lead to spurious conclusions after comparison.
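To see why intercept differences of this size matter for summed subscale scores, consider a worked sketch. It uses conventional factor-model notation introduced here for illustration (not the authors' own) and assumes that the reported differentials are differences in item intercepts $\tau_j^{(g)}$ between two groups whose loadings $\lambda_j$ are equal (metric invariance). Let $S^{(g)} = \sum_j x_j^{(g)}$ be the subscale score in group $g$ and $\xi$ the latent factor score.

```latex
\begin{align*}
\mathbb{E}\!\left[S^{(g)} \mid \xi\right] &= \sum_j \tau_j^{(g)} + \xi \sum_j \lambda_j \\
\mathbb{E}\!\left[S^{(1)} \mid \xi\right] - \mathbb{E}\!\left[S^{(2)} \mid \xi\right]
  &= \sum_j \left(\tau_j^{(1)} - \tau_j^{(2)}\right)
\end{align*}
```

Under these assumptions, a single item whose intercepts differ by 0.5 shifts the expected subscale score by half a point between two women with identical latent distress, and differences on several items accumulate unless they happen to cancel.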

Postnatal scores
Interpretation of the analysis of any systematic differences in structure between antenatal and postnatal administration was limited by difficulties with model estimation, particularly in the Severe Depression subscale.

Strengths and limitations
Our sample is representative of the maternal community in Bradford, and included a large number of South Asian minority women for whom relatively little is known about mental health in pregnancy. Further, we applied a rigorous approach to our analysis; however, our study does have some shortcomings.

Ethnic and cultural classifications
We used limited classifications of ethnicity which may be overly general [14,51] and can only serve as a proxy for more defined distinction of culture and custom [52]. Such is the compromise when epidemiological rather than anthropological methods are used to classify people [53]. Analysing at the level of an arbitrary subgroup may lead to category fallacy [42] with loss of subtle individual effects such as acculturation and financial and social resources; indeed there may be as much variation within groups as there is between. In particular, we combined the group of women of all Other ethnicities into one heterogeneous reference group, which limits decomposition by ethnicity and culture. We split our sample into five (BiB cohort) and four (BiB1000) reference groups by ethno-cultural classification and language of questionnaire, although women within these groups were likely to have different levels of acculturation. Without a specific measure of acculturation it is impossible to assess values, beliefs, expectations, norms and practices of the new culture and the extent of their acquisition, and how much retention of original culture is still present [54]. Acculturation may have affected how women answered the GHQ-28 questions, for example it may have imposed some unmeasured variation in our estimates, or it could have potentially explained some of the differences we found.

Ethno-cultural instrument adaptation
The participatory translation process was rigorous and the translated versions had good semantic, content and conceptual equivalence to the English instrument. An Urdu translation of the GHQ-28 assessed in a bilingual (English and Urdu) population in Pakistan found reasonable semantic, conceptual and scale validity [55]. However, in our study there was no formal assessment of criterion or technical equivalence, necessary to establish whether the GHQ-28 performs similarly across cultures regardless of administration verbally or via paper, or whether the interpretation of measurement of mental health remains the same when compared to norms of both cultures [56]. We did not know which women were bilingually fluent; had we known, we could have used their selection of language as a basis to disentangle any variance associated with the translation from that of cultural differences in interpretation and differential item functioning [57]. Of note, there may have been unmeasured administration bias, as the administration to non-English speakers was verbal and responses that were potentially audible to family members or friends accompanying the women may have affected the way these women answered the questions.

Methodological limitations
As discussed in the analysis section, we treated Likert scale data as continuous for the purposes of analysis. Whilst this has the advantages that we described in that section, it is problematic in that DIF cannot be described in terms of the scoring of the scale [28,29]. However, such an approach may be more appropriate for determining invariance in the underlying psychological constructs. In CFA, one item in a factor must be held constant (mean of 0 and variance of 1), and because this item's variability is not calculated, it can lead to spurious conclusions of invariance if the reference item is the source of DIF [27]. This may be relevant as we held the first item in any one cluster as the reference item. In addition, the lack of a standardised diagnostic interview to confirm or exclude depression limits interpretation of the relevance of the subscales to clinical criteria in this maternal population.

Conclusions
We have conducted a robust analysis of the GHQ-28 subscales in a large, ethnically diverse pregnant population and found problems with measurement equivalence between ethno-language groups. In particular, the concepts of Severe Depression and Anxiety and Insomnia appear to vary between language of administration and ethnic heritage. Our findings are tempered by uncertainty about how much variation is caused by artefact of translation and administration bias, and how much due to cultural differences in interpretation. We recommend that the GHQ-28 subscale scores are not used to conduct between-group comparisons in this cohort, nor in other ethnically diverse pregnant populations either clinically or epidemiologically, although as indicated for some subscales and for some groups they could be used to explore within-group characteristics.