Detecting depression among adolescents in Santiago, Chile: sex differences

Background: Depression among adolescents is common but most cases go undetected. Brief questionnaires offer an opportunity to identify probable cases but properly validated cut-off points are often unavailable, especially in non-western countries. Sex differences in the prevalence of depression become marked in adolescence and this needs to be accounted for when establishing cut-off points.
Method: This study involved adolescents attending state secondary schools in Santiago, Chile. We compared the self-reported Beck Depression Inventory-II (BDI-II) with a psychiatric interview to ascertain diagnosis. General psychometric features were estimated before establishing the criterion validity of the BDI-II.
Results: The BDI-II showed good psychometric properties with good internal consistency, a clear unidimensional factorial structure, and good capacity to discriminate between cases and non-cases of depression. Optimal cut-off points to establish caseness for depression were much higher for girls than boys. Sex discrepancies were primarily explained by differences in scores among those with depression rather than among those without depression.
Conclusions: It is essential to validate scales with the populations in which they are intended to be used. Sex differences are often ignored when applying cut-off points, leading to substantial misclassification. Early detection of depression is essential if early intervention is regarded as a clinically important goal.


Background
Depression is a common condition affecting people of all ages and races [1], with high prevalence among youngsters in Latin America [2][3][4]. Early onset depression is of interest because of the need to identify early cases of depression and potentially prevent or reduce consequences later in life [5,6]. Between 20% and 33% of those who meet criteria for the diagnosis of lifetime major depression report that their first episode occurred before the age of 21 [6][7][8][9], with a mean age of onset in this group estimated as 15 years [10]. Different studies have shown that depression in adolescence (early onset) affects school performance and increases antisocial behavior, self-harm and suicidal risk, as well as impairing overall functioning [9][11][12][13][14][15][16][17][18][19].
Notwithstanding the importance of early identification of this disorder, community surveys consistently show that adolescent depression is under-diagnosed and undertreated [20][21][22]. Screening for depressive symptoms among adolescents may be one way of improving early detection. There are advantages and disadvantages in doing so [23] but identification is a necessary preliminary step if one wishes to intervene early [24] with the aim of potentially ameliorating adverse outcomes later in life.
Brief depression self-rating scales can be especially useful for this purpose [25]. The Beck Depression Inventory (BDI) is one of the best known and most widely used self-rating scales to assess the presence and severity of depressive symptoms [26]. The second version of this scale (BDI-II) was created to establish a clearer link with the DSM-IV classification as well as informing on the severity of depressive symptoms. The studies published, mostly for the English version, show good agreement between this questionnaire and the clinical diagnosis of depression [26][27][28] and good psychometric properties for the scale [26].
The BDI-II when used among adolescents has also shown good psychometric properties [29][30][31][32][33][34][35][36][37][38][39]. However, many of the studies assessing the usefulness of BDI-II with adolescents have been affected by significant methodological limitations. Among these are: small and often only clinical samples, no concomitant assessment with a gold standard and when this is done there are often long delays between the screening and diagnostic interview, and overall poor reporting of methods [24,40]. Needless to say, few studies have been conducted in low and middle income countries where almost 90% of the world's young population lives.
Among the few studies that have explored BDI-II psychometric properties on adolescent non-clinical samples very few have tested criterion validity. More specifically we were unable to find any studies that had validated the BDI-II against a psychiatric interview (criterion) among adolescents in Latin America. More research is needed on the use of the BDI-II with adolescents from other nationalities and ethnic groups before we can confidently support its use as a screening or case identification tool for youngsters across different cultures.
In Chile, the prevalence of depressive symptoms among adolescents is high compared to other countries [41]. A number of studies with different methodologies have reported prevalence rates ranging from 13% [37] to 44% [42]. A recent study using the BDI-II in a representative urban sample of 700 high-school adolescents found that 33% of these youngsters scored 19 or above on the BDI-II [41]. However, the criterion validity of BDI-II among adolescents has never been studied in Chile and there is no empirical evidence to support the validity of any cut-off points used to define caseness with young populations in that setting or indeed in Latin America.
Sex differences in the prevalence of depression have been extensively reported and they become well established in adolescence. By mid-adolescence there is a shift from similar rates of depression in pre-adolescent boys and girls to approximately twice as many females as males with depression [43], and these differences continue until late in life. There is controversy as to whether these are real differences or simply measurement artifacts. Misclassification by questionnaires according to various features has been repeatedly reported [44][45][46]. The possibility that boys and girls may respond differently to psychiatric questionnaires has been relatively untested even though this may have important repercussions for the estimates obtained when using questionnaires.
This study aims to fill this gap and assess the criterion validity of the BDI-II, determining the best cut-off points for male and female adolescents in Santiago, Chile. Of particular interest is to study possible differences between sexes. In addition this study aims to assess other psychometric properties of the BDI-II.

Sampling and procedures
Fifteen state high schools in Santiago, Chile, participated in this study, undertaken in November 2009 and November 2010. Students were being assessed as part of a randomised controlled trial [47], which was concurrently taking place in these schools. The study sample consisted of 592 participants with a mean age of 15.5 years (SD=0.98); just over half (53.6%) were girls, and all were attending 10th grade (approximately 10 years of education) in these schools. Two samples were drawn using different methods. The first sample of 250 students was drawn based on their BDI-II scores collected as part of the baseline assessment in five schools in the active arm of the trial. The first 50 students with BDI-II scores between 0 and 6 (lower tertile), the first 100 students whose scores fell in the middle tertile (7 to 15), and the first 100 students with high scores (>15) were invited for a clinical interview. For the second sample, all the 352 students in the control arm of the trial who scored high (≥15 for girls and ≥10 for boys) on the BDI-II were invited for clinical interviews. Students answered the BDI-II in the classroom and clinical interviews were performed within 72 hours in a private office in the school for both samples. One of three trained clinicians blinded to the student's BDI-II status administered this psychiatric interview. In order to improve the blinding of the assessors, interviewers were rotated between schools, so that no-one who participated in the administration of the BDI-II in a particular school also interviewed in the same school.

Ethics
The study was conducted in accordance with the local Research Governance requirements on ethical matters and in compliance with the Helsinki Declaration. Full ethical approval was obtained from the local Committee (Hospital Clinico Universidad de Chile). At the start of the project a letter was sent to the carers of all eligible young people informing them about the study. The letter also informed carers that they could opt out of the assessments if they did not wish their child to complete the questionnaires or the interview. In addition, written consent was obtained before completing the questionnaire or the interview (dual carer/child consent/assent was required).

The Beck Depression Inventory-II (BDI-II)
This questionnaire has 21 items asking about depression symptoms experienced over the last two weeks [26]. Answers to each item are on a scale from 0 to 3. For example, 'I do not feel sad' (0), 'I feel sad' (1), 'I am sad all the time and I can't snap out of it' (2), and 'I am so sad and unhappy that I can't stand it' (3). The scores for each item are summed to generate a total score with a range between 0 and 63. Cut-off scores are often used to categorize degrees of severity of depression or to decide whether a given score matches the presence of a clinical diagnosis. It is highly desirable that cut-off points are established with a population similar to the one in which those cut-off points will subsequently be applied. Traditional cut-off points used to estimate severity in adults are: 10-16 indicating possible mild depression; 17-29 likely moderate depression; and 30-63 probable severe depression [26]. A Spanish translation of the BDI-II showed good psychometric properties when used with US Spanish-speaking young populations [48,49]. A Chilean adaptation of the Spanish version of the BDI-II for use with adolescents showed good internal consistency and test-retest correlation coefficients, as well as good concurrent validity with other depression scales and an adequate goodness-of-fit in the confirmatory factor analysis for both uni- and bi-factorial solutions [36]. Several other depression scales were tested in the formative phase but the BDI-II performed as well as, if not better than, other scales.
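The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function names `bdi_total` and `adult_band` are hypothetical, the bands are the adult ranges quoted from [26], and the label used for totals below 10 is an assumption because the text does not name that range.

```python
from typing import Sequence

# Adult severity bands quoted in the text [26], as (lower bound, label).
# The label returned for totals below 10 is an assumption.
ADULT_BANDS = [
    (30, "probable severe"),
    (17, "likely moderate"),
    (10, "possible mild"),
]


def bdi_total(items: Sequence[int]) -> int:
    """Sum 21 item responses, each scored 0-3, into a 0-63 total."""
    if len(items) != 21:
        raise ValueError("the BDI-II has 21 items")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be scored 0, 1, 2 or 3")
    return sum(items)


def adult_band(total: int) -> str:
    """Map a total score to the adult severity band it falls in."""
    for lower, label in ADULT_BANDS:
        if total >= lower:
            return label
    return "below mild band"  # assumed label for totals of 0-9
```

These adult bands are shown only to make the arithmetic concrete; they are not cut-off points validated for adolescents.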

The Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KIDS)
The MINI-KIDS [50] is a brief, structured diagnostic interview used to assess the presence of the most common DSM-IV and ICD-10 child and adolescent psychiatric disorders (ages 6 to 16). It follows a similar format to the MINI for adults, which was developed as a simpler and briefer psychiatric interview for clinical or research purposes [51]. It is reported that the MINI-KIDS generates psychiatric diagnoses for children and adolescents in a third of the time required by the K-SADS-PL. It has been translated into Spanish and used extensively in Chile [52,53]. Studies have confirmed good psychometric properties when used among adolescents in different languages, with sensitivity of 0.61-1.00 and specificity of 0.73-1.00 for most DSM-IV disorders [50]. It is desirable that interviewers have clinical experience and previous training in the use of this interview.

The Revised Child Anxiety and Depression Scale (RCADS)
The RCADS [54] is an adaptation of the Spence Child Anxiety Scale (SCAS) [55] and is intended to assess symptoms of DSM-defined anxiety disorders and major depression. The brief version of the RCADS consists of five subscales with five items each, scored from 0 (never) to 3 (always) on a 4-point Likert scale [56]. We only included the Spanish version of the generalized anxiety, social phobia, and panic subscales in this study [57]. We excluded the depression and separation anxiety sub-scales because depression was measured with the BDI-II and separation anxiety was regarded as less important for this age. Although we are unaware whether other researchers have used a similar method, we felt that this was a reasonable approach for approximating levels of anxiety. In the analysis we used a total score obtained by adding all item scores. The internal consistency of total RCADS scores in this study yielded a value of α=0.84 (males α=0.81; females α=0.84).

Data analysis
The analysis plan was first to examine the general psychometric properties of the scale in order to determine how best to treat overall scores. Once this was established, we assessed the criterion validity of the scale with a view to ascertaining the best cut-off points to establish depression, with special emphasis on exploring sex differences.
Firstly, descriptive statistics including means and standard deviations were calculated and sex differences examined. Subsequently we performed psychometric tests to investigate the performance of the BDI-II. Initially we estimated Mardia's coefficients [58] to assess the multivariate normality of the variables. Polychoric correlation is advised for factor analysis when the distributions of ordinal items are asymmetric or show excess kurtosis [59]. Thus, a polychoric correlation matrix of BDI-II items was estimated. Unweighted least squares (ULS) was the method of factor extraction used in our exploratory factor analysis (EFA) in view of its robustness to non-normality and heteroscedasticity of the data. We used parallel analysis [60] to identify the number of factors to include in the factorial solution, replacing the raw data method [61] by the optimal implementation based on minimum rank factor analysis [62], generating 500 random correlation matrices. With this analysis, a factor is considered significant if the associated eigenvalue is bigger than that corresponding to a given percentile, such as the 95th, of the distribution of eigenvalues derived from a random dataset. This method is considered the best available solution to decide the number of factors to retain for a given scale [63,64]. We tested the goodness of fit of the exploratory model using the goodness of fit index (GFI) [65] and root mean square of residuals (RMSR), taking into account Kelley's criterion [66].
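The retention rule described above (keep a factor only if its eigenvalue exceeds a chosen percentile of eigenvalues from random data) can be illustrated with a minimal sketch of classical parallel analysis. This is not the procedure used in the study: it assumes Pearson correlations and eigenvalues of the full correlation matrix rather than polychoric correlations with minimum rank factor analysis, and the function name `parallel_analysis` and its parameters are hypothetical.

```python
import numpy as np


def parallel_analysis(data, n_sims=500, percentile=95, seed=0):
    """Count leading factors whose observed eigenvalue exceeds the
    given percentile of eigenvalues obtained from random normal data.

    Simplified sketch: Pearson correlations, not polychoric ones.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape

    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    # Eigenvalues from uncorrelated random data of the same shape.
    random_eigs = np.empty((n_sims, p))
    for i in range(n_sims):
        noise = rng.standard_normal((n, p))
        corr = np.corrcoef(noise, rowvar=False)
        random_eigs[i] = np.sort(np.linalg.eigvalsh(corr))[::-1]

    threshold = np.percentile(random_eigs, percentile, axis=0)
    exceeds = observed > threshold

    # Retain leading factors only: stop at the first one that fails.
    n_factors = p if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold
```

With data generated from a single strong latent variable, a sketch like this would be expected to retain one factor, which mirrors the logic (though not the exact estimator) behind the one-factor result reported for the BDI-II.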
Subsequently we performed an invariance analysis according to sex, using confirmatory factor analysis (CFA) and applying the generalized least squares (GLS) method. This method is robust and allows estimation of χ² (df), adjusted goodness-of-fit index (AGFI), root mean square error of approximation (RMSEA) (90% CI), standardized root mean square residual (SRMR) and Hoelter 0.05 indices. Given that χ² estimations are highly sensitive to sample size we also used χ²/df, which indicates a good fit when values are <3 [67,68]. GFI and AGFI refer to explained variance and values ≥0.9 are considered acceptable [65,69]. RMSEA is a measurement of the error of approximation to the population and is considered to be acceptable with values <0.06 [65]. SRMR is the standardized difference between the observed and the predicted covariance, indicating a good fit with values <0.08 [68]. The Hoelter index indicates the sample size required to accept the hypothesis of perfect fit and a result of 200 or better indicates a satisfactory fit. In an analysis of multiple groups, it has been suggested that a threshold of 200 times the number of groups is sufficient [70].
We examined the reliability of the scale using congeneric, tau-equivalent, and parallel models, in the total sample and the sample divided by sex. The congeneric model is the least restrictive, and assumes that each individual item measures the same latent variable, with possibly different scales, degrees of precision and magnitude of error. The tau-equivalent model implies that individual items measure the same latent variable, on the same scale, with the same degree of precision, but with possibly different degrees of error. The parallel model is the most restrictive measurement model, and assumes that all items measure the same latent variable, on the same scale, with the same degree of precision, and with the same amount of error [71]. We finally chose the model that fitted the data best, applying the GLS method and comparing models from the least to the most restrictive through Δχ². The reliability value was estimated by squaring the implied correlation between the composite latent true variable and the composite observed variable, to arrive at the percentage of the total observed variance that was accounted for by the "true" variable [72]. Item-total correlation coefficients (excluding the same item from the total score), mean inter-item polychoric correlations, and mean item-total correlations (excluding the same item) were also used to assess internal consistency. Convergent-discriminant validity was assessed by comparing the BDI-II with the RCADS through Spearman's R coefficient.
Criterion validity was assessed by plotting Receiver Operating Characteristic (ROC) curves, comparing the BDI-II with the MINI-KIDS for the whole sample, as well as for males and females separately. Of primary interest here was the area under the curve (with 95% CI) as representing the capacity of the BDI-II to discriminate between cases and non-cases according to diagnoses ascertained with the MINI-KIDS. We plotted curves for both sexes separately and compared these differences using χ² tests. Sensitivity, as an index of case identification, and specificity, as an index of non-case recognition, were estimated for several cut-off points, in order to ascertain the best trade-off between sensitivity and specificity. Positive and negative predictive values were also estimated, to ascertain the capacity of the questionnaire to detect true and false cases. Finally, we included the Youden Index, which is unaffected by prevalence and represents the difference between the proportions of true cases and false cases identified by the questionnaire, with a higher value indicating a better cut-off point.
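The validity coefficients described above can be made concrete with a short sketch. It assumes that a score at or above the cut-off counts as a probable case (as in the 16/17 notation, where ≥17 is a case) and that the interview result is available as a true/false flag per respondent; the function names `cutoff_metrics` and `best_cutoff` are hypothetical, and the study itself used SPSS and Epidat rather than code like this.

```python
def cutoff_metrics(scores, is_case, cutoff):
    """Validity coefficients when scores >= cutoff are called cases.

    Requires at least one interview case and one non-case.
    """
    pairs = [(s >= cutoff, bool(c)) for s, c in zip(scores, is_case)]
    tp = sum(1 for pos, case in pairs if pos and case)
    fp = sum(1 for pos, case in pairs if pos and not case)
    fn = sum(1 for pos, case in pairs if not pos and case)
    tn = sum(1 for pos, case in pairs if not pos and not case)

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Predictive values are undefined if nobody falls on that side.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
        # Youden index: sensitivity + specificity - 1.
        "youden": sensitivity + specificity - 1,
    }


def best_cutoff(scores, is_case, candidates):
    """Return the candidate cut-off with the highest Youden index.

    Ties are resolved in favour of the first candidate listed.
    """
    return max(candidates,
               key=lambda c: cutoff_metrics(scores, is_case, c)["youden"])
```

Running `best_cutoff` separately on boys and girls is one way to see how sex-specific optimal cut-off points can diverge from the one obtained for the pooled sample. Maximising the Youden index weights sensitivity and specificity equally, which is a design choice rather than a requirement.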
Finally we compared the means of the BDI-II and RCADS for cases and non-cases of depression according to the MINI-KIDS in order to explore if sex differences applied to other psychological questionnaires and/or the presence of depression. Given the multiple comparisons in this analysis we used 99% CIs. All analyses were done with SPSS 15.0, Epidat 3.1, Factor 8.02 and Amos 7.

Descriptive statistics
Less than 5% of the selected sample needed to be replaced, either because of unwillingness to participate or not attending the day of the interview. Table 1

Factorial validity
The analysis of Mardia's multivariate asymmetry showed a non-normal multivariate distribution of the data for the total sample (kurtosis coefficient = 555.66; p < 0.001) and for boys and girls separately. The polychoric correlation matrices of the BDI-II (Additional file 1) revealed that 46.7% of correlation coefficients were ≥ 0.30 (38.1% in boys and 38.1% among girls). The determinant of the matrix was 0.01, the KMO test had a value of 0.94, and Bartlett's statistic was 3,672.30 (df = 210; p < 0.001), with similar values for boys and girls. Based on these results an EFA was undertaken for the total sample and according to sex. The parallel analysis based on minimum rank factor analysis (Table 2) identified a clear one-factor structure, with an eigenvalue of λ1 = 7.10, explaining 33.8% of the variance based on eigenvalues (boys λ1 = 6.78, 32.3% of the variance; girls λ1 = 6.55, 31.2% of the variance). The goodness of fit statistics were good for the total sample and sub-samples by sex, with values of 0.99 for GFI and 0.04 for RMSR, in keeping with Kelley's criterion. Table 3 shows the unrotated loading matrix as well as the communality values from EFA for the total sample, and the standardized weights and standard errors for the subsamples from CFA. All the items loaded strongly and positively on a single factor. In general, the weight of the items ranged from 0.34 for 'insomnia' to 0.70 for 'sadness', with important differences between sexes in items such as 'crying', 'insomnia', 'loss of appetite' and 'loss of libido'. Communality values ranged from 0.12 for 'insomnia' to 0.48 for 'sadness' and 'worthlessness' in the total sample. Standard errors were lower among boys than girls, especially for the items 'loss of libido' and 'crying'.

Invariance analysis
Adjusting by sex did not alter our main results (Table 4). Good results were also seen when comparing sexes using models without and with restrictions, such as unconstrained, factorial weights, variances or residuals. An analysis including all restrictions at the same time yielded values of χ²/df = 1.55; GFI = 0.931; AGFI = 0.924; RMSEA = 0.031 (90% CI = 0.026-0.035); SRMR = 0.071 and Hoelter = 426. Notwithstanding these adjustments, χ² values increased significantly when comparing the model without restrictions with the model with restricted residuals (Δχ²=93.13; df=21; p<0.001). Table 5 shows the adjusted reliability models tested. The results fitted best with the congeneric model in all the indices, and the tau-equivalent model showed significant increments in χ² (total sample: Δχ²=91.60; df=20; p<0.001; boys: Δχ²=45.45; df=20; p=0.001; girls: Δχ²=50.39; df=20; p=0.001). Based on the congeneric model, the estimate of reliability obtained for the total sample was 0.90, with 0.86 for boys and 0.90 for girls.

Reliability
The mean inter-item correlation was 0.30 for the total sample (0.28 for boys and 0.27 for girls). The mean item-total correlation was 0.48 for the whole sample (0.41 for boys and 0.48 for girls). All items were positively correlated with the total score, with item-total coefficients (Table 1) ranging from 0.29 ('loss of libido' among girls) to 0.61 ('worthlessness' among girls). In general, boys had lower values in all item-total correlations, with the exception of 'pessimism', 'loss of libido' and 'suicidal ideas'.

Convergent-discriminant validity
The Spearman correlation coefficient between RCADS and BDI-II was 0.46 (p<0.001), with similar coefficients
In other words much of the difference in mean BDI-II values between boys and girls is explained by differences among cases of depression rather than the scores of non-depressed. A similar pattern is seen with mean RCADS scores but there are no differences in mean scores between boys and girls among non-depressed.
Figure 1 shows the discriminating ability of the BDI-II against a criterion (MINI-KIDS) using ROC curves. The area under the curve for the total score reached a value of 0.81
Table 7 shows the discriminating ability and precision of the questionnaire for several cut-off points of the total score for either sex separately and for the total sample. We have only displayed validity coefficients for those cut-off points that seemed to be closest to optimal but all other coefficients are available from the authors. Overall the best cut-off point for the whole sample seems to be reached at 16/17 (≥ 17 represents a case) with a sensitivity of 78.7% and a specificity of 69.6%. However optimal cut-off points seem to differ for both boys and girls. In the latter case, a cut-off point at 19/20 offers a better balance in validity coefficients (sensitivity 74.5% and specificity

Discussion
As far as we are aware this is the first criterion validity study of the Beck Depression Inventory (BDI-II) among adolescents in Latin America. Overall the questionnaire had good psychometric properties with good internal consistency and good capacity to discriminate between cases and non-cases of depression. We think that a single general factor represents the best factorial solution for this questionnaire with this population. We found that the optimal cut-off point differed according to sex, with the optimal cut-off points being much higher for girls than boys. This is an interesting finding because most of the time cut-off points are established for total samples without considering differences across sexes and/or other attributes, something that may result in significant misclassification. These sex discrepancies were primarily explained by differences in scores among those with depression rather than among those without depression.
The main strength of this study is that we tested criterion validity using a standard psychiatric interview administered independently to ascertain caseness. Interviewers were blind to the results of the questionnaires and the interview was conducted less than 72 hours after the administration of the questionnaire. One reason for the absence of criterion validity studies in this field is the practical problems and resources involved in carrying out psychiatric interviews. There are also some limitations. Our sample was of moderate size and stratified according to results on the questionnaire (BDI-II). The sample was also restricted to students of lower socio-economic status and within a limited age range. Finally, we were unable to vary the order of administration of the measures for practical reasons.
One of the most salient findings of this study is the clear difference in BDI-II total scores between boys and girls. The origin of these sex differences can only be speculated upon and certainly deserves more research. Most evidence suggests that there are true differences in the prevalence of depression according to sex [73][74][75][76]. Previous reports had suggested that it may be important to consider why male and female adolescents show different symptom profiles [33,76]. For instance, adolescent girls may be more willing to recognize emotional feelings or they may truly experience more emotional symptoms. In our study girls scored much higher than boys in both the depression and anxiety scales. However we found these sex differences mostly among clinically depressed adolescents and not among non-depressed individuals, suggesting that it is only when adolescents are clinically depressed that these sex differences in reported symptoms become important. One could infer that depression might have a different impact in boys and girls so that the latter would report more symptoms, but it is also possible that a non-depressed population will have fewer symptoms and that this will attenuate any potential differences across sexes. Regardless of the reasons for these differences, the fact remains that if the same cut-off point is used across sexes, misclassification is likely. In the end the decision of which cut-off point to choose will depend on what is more important: improving the capacity to detect cases or to identify normal individuals.
Our overall proposed cut-off point of 16/17 is higher than that suggested in previous studies with diverse populations [26,28,77]. The discriminant capacity of the questionnaire, represented by the area under the ROC curve, was excellent, being better in girls than boys.
Table 3. Factorial weights for each item of the BDI-II according to sex (columns: total sample, boys, girls).
If we had not estimated cut-off points independently for each sex we would be advising the use of this overall cut-off point with this population. However the analysis by sex revealed that there were substantial differences in optimal cut-off points across sexes. If we had used a cut-off point of 16/17 for both boys and girls, the positive predictive value of the questionnaire among boys would be 59.3% and among girls 80.3%. In other words, of all the cases detected by the instrument among boys only 59.3% would be true cases according to the interview (gold standard), whereas in girls 80.3% of those detected by the instrument would be true cases. The capacity to predict cases in boys and girls varies substantially depending on the cut-off point, even in high prevalence situations such as in this study. In previous papers we had identified similar issues related to the socio-economic or cultural status of respondents [44,45].
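To illustrate why the positive predictive value of a single cut-off point can differ so much between groups, the sketch below applies Bayes' rule to a given sensitivity, specificity and prevalence. The sensitivity and specificity in the usage example are the whole-sample values reported for the 16/17 cut-off; the two prevalence figures are hypothetical and are not taken from the study, and the function name `ppv` is likewise an assumption.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Whole-sample coefficients reported for the 16/17 cut-off.
SENSITIVITY, SPECIFICITY = 0.787, 0.696

# Hypothetical prevalences, chosen only to show the direction of the effect.
for prevalence in (0.20, 0.50):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {value:.1%}")
```

With the same sensitivity and specificity, the group with the lower prevalence always has the lower positive predictive value. In practice the sex-specific figures would also reflect any differences in sensitivity and specificity between boys and girls at that cut-off point.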
The BDI-II showed good psychometric qualities. Reliability and internal consistency were high, in keeping with other studies [32,34,36,38], and items were highly correlated. Each item seems to be measuring the same latent variable, but with possibly different degrees of precision and different amounts of error. Based on the analysis of invariance it seems reasonable to conclude that the same construct applies to both boys and girls. However girls seem to have larger standard errors, most notably for the items 'crying' and 'loss of libido'. Responses to both items are probably influenced by social desirability norms, which may differ between boys and girls. Other studies in adolescents have also encountered similar issues [31,33,78], suggesting that certain items may behave differently with different populations. A study that asked 'experts' to rate the relevance of BDI-II items for diagnosing depression among adolescents and asked adolescents themselves about the best questions to report their feelings found that 'loss of libido' was the least useful item [31]. Unsurprisingly given the age of these individuals, the 'loss of libido' item achieved the lowest mean among all items in both sexes. These findings should inform other researchers about the importance of considering the meaning of items and social norms that may influence responses. Certain questions may be more appropriate for inclusion in studies with adult rather than young populations. Besides this, the message that emerges over and over again is the need to validate instruments with the populations where they are intended to be used.
The EFA by parallel analysis showed a clear one-factor solution, although the proportion of the variance explained by this factor can only be regarded as moderate. This one-factor solution was supported by the CFA according to sex. Several other studies have looked at the factor structure of this questionnaire but most of them have not used parallel analysis, which is now regarded as the best approach to ascertain the number of factors to derive from scales. These previous studies have suggested different factor structures with some describing three factors or more [33,79], others suggesting a two-factor structure [26,48,49,80], whilst other studies have suggested that one general factor is the most appropriate solution [32,81]. It is interesting to note that there seems to be marked variability among studies in terms of the specific items that load on different factors. A single general factor is in keeping with the idea of summing all items to generate a total score reflecting severity, as suggested in the manual of the BDI-II and ratified by a panel of experts in another study [31].

Conclusions
Symptom questionnaires are often used to identify potential cases without any prior validation to determine the best cut-off points. This practice can lead to substantial misclassification. Although the Beck Depression Inventory (BDI-II) has been frequently used among adolescents in Latin America, this seems to be the first criterion validity study. The questionnaire seemed to be good at discriminating cases from non-cases of depression. The data support a single general factor as the best factorial solution with this population. There were substantial sex differences in symptom profiles and, most importantly, in the optimal cut-off points for girls and boys. If the BDI-II is to be used as a binary instrument through established cut-off points, we recommend that these are calculated independently for each sex. Studies using questionnaires with the same cut-off points for boys and girls may be providing inaccurate estimates and misleading support for the existence of sex differences in depression. Although it is essential that brief self-reported questionnaires are validated with the populations in which they will be used, this is unfortunately still the exception rather than the rule. Further replication of these results in other settings and cultures would be important to determine if these findings are specific to this setting or applicable to other cultures.