The present paper investigated the influence of age, gender, family status, informant, and recruiting centre on behavioural measures and intelligence in the International Multi-centre ADHD Genetics (IMAGE) project. The issue of homogeneity was of particular interest, because the power to detect susceptibility genes depends not only on the sample size but also on the homogeneity of the sample. Beyond genetic studies, our findings may be of more general interest for ADHD research, at least for study designs comparing clinical indicators of ADHD with other measures, e.g. for the investigation of neuropsychological or neurophysiological markers. Not least, some of our findings are of relevance for clinical practice in ADHD.
Our expectations about the influence of the different factors on the dependent measures differed according to the category of data to which each measure belongs. For example, because the IQ scores were based on language-specific normative samples, we expected effects for gender and family status but not for age and study centre. In the case of the raw scores of the SDQ and the SCQ scales, we expected effects of age, status, gender, and informant [24, 26, 47], and probably also effects of study centre. For the CTRS and CPRS scores, which are based on normative samples reflecting age, gender, and informant, we expected no age, gender, or informant effects, but status effects and probably centre effects.
Our analyses revealed numerous effects of independent factors on behavioural measures and intelligence which were not expected or exceeded the expected range. Many of these effects were present as interactions in addition to or instead of the main effects. To summarize the large number of effects and results, the following discussion focuses particularly on unexpected results or findings that were related to sample heterogeneity. The discussion emphasises the following three main factors that affected the distribution of behavioural measures: 1) the diagnostic procedures, 2) the multi-centre design, and 3) the source of information.
The formal diagnosis of 'ADHD-CT' for each proband required the presence of both six symptoms of inattention and six symptoms of hyperactivity/impulsivity. A symptom was counted as present if it was recorded either in the teacher questionnaire or in the parent interview. This diagnostic criterion was not applied to the siblings as an inclusion criterion but, rather, in cases of suspected ADHD, as a potential exclusion criterion for the sibling sample in further analyses. Thus, structural differences between proband and sibling samples (effects of family status), such as the gender differences in mean behavioural scores, could reflect differing criteria for inclusion.
A second important issue concerns the effects of pooling behavioural data from different recruiting centres and different countries. A multi-centre design is usually chosen in order to increase the power of statistical inference. We were interested in the amount of heterogeneity, i.e. of additional variance stemming from differences between centres. Heterogeneity could represent the other side of the coin by decreasing statistical power in subsequent analyses. Informant effects have already been investigated in the first paper. There we showed how diagnostic symptoms were perceived by different informants and instruments. In the present paper we were mainly interested in heterogeneity of the behavioural data stemming from informant effects and interactions with other factors.
As expected, probands had higher scores than siblings on all rating scales from both parent and teacher questionnaires. Similarly, the IQ measures were lower in probands than in siblings, at least in the measures with significant differences. In contrast to the questionnaire measures, the IQ differences had only small effect sizes. In the questionnaire scores with age and gender norms (CTRS and CPRS), status effects interacted with gender effects: female probands deviated to a greater extent (about one SD more) from the population norm than male probands, particularly on the scales including hyperactive symptoms. In contrast, the differences in the deviation from the normative mean were in the opposite direction in the siblings: male siblings deviated on the relevant ratings by about half an SD more from the normative means than the female siblings.
We interpret this gender by status interaction as a bias which is attributable to the recruitment strategy: the DSM-IV inclusion criterion for ADHD-CT, which requires the presence of six inattentive and six hyperactive/impulsive symptoms independent of gender, led to the higher T-scores in female probands. Moreover, we also found evidence for this recruiting bias in the hyperactivity scale of the SDQ. This scale is a raw score and therefore reflects the perceived symptoms without relating them to population distributions. The male siblings had higher scores than the female siblings, reflecting known population differences. In contrast, the scores of male and female probands did not differ from each other, reflecting the symptom-based diagnostic strategy.
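The mechanism can be illustrated with a minimal sketch. The normative means and standard deviations below are hypothetical values chosen only for illustration; they are not taken from the Conners' norms. The sketch assumes the usual T-score convention (mean 50, SD 10 in the normative group) and shows that an identical raw score maps to a higher T-score in the group with the lower normative mean.

```python
# Hypothetical gender-specific norms for a hyperactivity raw score
# (illustrative values only, not taken from any published norm table).
HYPOTHETICAL_NORMS = {
    "male": {"mean": 6.0, "sd": 4.0},
    "female": {"mean": 3.0, "sd": 3.0},
}


def t_score(raw, gender, norms=HYPOTHETICAL_NORMS):
    """Convert a raw score to a T-score (mean 50, SD 10 in the norm group)."""
    ref = norms[gender]
    return 50.0 + 10.0 * (raw - ref["mean"]) / ref["sd"]


# The same raw symptom score, as produced by a gender-independent criterion.
same_raw_score = 15.0
t_male = t_score(same_raw_score, "male")
t_female = t_score(same_raw_score, "female")
print(f"male T = {t_male:.1f}, female T = {t_female:.1f}")
```

Under these assumed norms, a gender-independent symptom criterion selects girls who lie further from their normative mean than the selected boys, which is the pattern described above for the normative scores.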
The stronger deviation from normality in girls with externalizing, particularly hyperactive, symptoms compared to boys with identical symptoms is reflected in the normative samples of the questionnaires and consistent with empirical evidence. Consequently, the male to female ratio in our proband sample was about 7:1 whereas girls and boys were equally frequent in the sibling sample.
Technically, this gender by status interaction effect on questionnaire scores in our sample introduced a gender bias in comparisons between probands and siblings. This bias may affect not only genetic analyses but also categorical or quantitative analyses of neurobiological or neuropsychological markers. Even in a purely clinical context, one may question the validity of a diagnosis which is based mainly on symptom numbers, independent of epidemiological considerations of gender-specific distributions. In contrast, diagnostic models which define gender-specific liability thresholds dependent on epidemiological distributions of a trait would lead to almost identical numbers of affected subjects for each gender. It would go beyond the scope of the present contribution to decide which of the two fundamentally different approaches is of greater benefit for research and for clinical practice. Nevertheless, our finding may contribute to further discussions about the diagnosis of ADHD and future revisions of diagnostic systems.
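The difference between the two approaches can be sketched numerically. The example assumes, purely for illustration, that the trait is normally distributed within each gender with hypothetical parameters; neither the distributions nor the cut-offs are estimates from our data.

```python
from statistics import NormalDist

# Hypothetical trait distributions per gender (illustrative values only).
male = NormalDist(mu=6.0, sigma=4.0)
female = NormalDist(mu=3.0, sigma=3.0)

# Approach 1: one symptom-based cut-off applied to everyone.
common_cutoff = 12.0
frac_male_common = 1.0 - male.cdf(common_cutoff)
frac_female_common = 1.0 - female.cdf(common_cutoff)

# Approach 2: gender-specific cut-offs placed at the same percentile,
# here the top 5 % of each gender's own distribution.
percentile = 0.95
cutoff_male = male.inv_cdf(percentile)
cutoff_female = female.inv_cdf(percentile)
frac_male_specific = 1.0 - male.cdf(cutoff_male)
frac_female_specific = 1.0 - female.cdf(cutoff_female)

print(f"common cut-off, male/female ratio: {frac_male_common / frac_female_common:.1f}")
print(f"gender-specific cut-offs: {cutoff_male:.1f} (male), {cutoff_female:.1f} (female)")
print(f"fractions affected: {frac_male_specific:.3f} vs {frac_female_specific:.3f}")
```

With a common cut-off, the fraction of affected subjects differs strongly between the genders; with percentile-based cut-offs, it is equal by construction, at the price of different raw symptom thresholds for boys and girls.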
The effects of family status also interacted with study centre. In both raw and normative scores, we found centre main effects. These effects were found either exclusively in the teacher ratings or, on some scales, were stronger in the teacher ratings than in the parent ratings.
In contrast, centre by status interaction effects were present only in the parent ratings. These interactions were expressed in a greater number of pairwise centre differences in probands than in siblings, e.g. in the ADHD DSM-IV scores of the Conners' questionnaires. To put it the other way around: proband-sibling differences varied markedly across sites (e.g. about 0.8 SD for the centre ISR_P, but about 2.7 SD for IRL_D). It is not possible to provide a clear explanation for this phenomenon. Because we also found similar effects in the raw scores of the SDQ, the use of a single (US) normative sample can probably be excluded as a confounding factor in the Conners' questionnaires. Furthermore, sociocultural normative backgrounds attributable to countries can explain only a part of the variance, because centre differences did not cluster in national categories.
Furthermore, status effects also interacted with informant effects, independent of the influence of centres, although there were no main effects of informants in the hyperactivity scores. The status by informant interaction was evident mainly in larger proband-sibling differences in the parent ratings compared to the teacher ratings. These interactions were considerable in raw and in normative scores and mostly concerned ADHD symptom scales or social behaviour ratings. In general, the siblings were perceived similarly by parents and teachers, both in raw scores and normative scores. In contrast, the probands had higher scores in the parent than in the teacher ratings. We conclude that the contrast effects were due more to symptom aggravation in the parents' perception of the probands' behaviour than to suppression in their perception of the siblings' behaviour. Again, this interaction between informant and family status, which results in higher contrasts in the parent ratings than in the teacher ratings, introduces further heterogeneity to the sample. If not taken into account, this interaction may reduce statistical power in subsequent analyses, even if average scores are used.
Effects of the study centre have already been discussed in the context of their interaction with family status. A statistical main effect of centre was present mainly in the teacher ratings and weak to absent in the parent ratings. Because the data structure did not allow statistical testing of the interaction between centre and informant, this differential main effect can be interpreted as a centre by informant interaction only tentatively, without statistical evidence. A definite interpretation of this effect is difficult. National or centre-specific factors may have played a role. However, a simple pattern was not recognisable, because significant differences between centres were not consistent across the variables analysed (these were the DSM-IV ADHD scores and the oppositional score of the Conners' questionnaire).
In contrast to these rather weak effects, IQ differed to a greater extent between centres. Unlike the questionnaire scores, IQ data were collected by trained clinicians. The remarkable mean differences across centres (e.g. a difference of 17 IQ points between IRL_D and ESP_V) do not seem to reflect sociocultural differences between regions or countries, because the use of language-specific normative samples should have accounted for them. The greatest difference between the three German-speaking centres (GER_G, GER_E, and SWI_Z), all using the same normative sample, was 8.5 IQ points. We speculate that differences in sampling strategies (existing patient register, outpatient or inpatient clinic, self-help organisations, resident doctors, newspaper advertisements etc.) may have played a role. Additionally, different test settings may have influenced the results: some assessments were included in a neuropsychological test battery, others were not, and in some cases pre-existing recent IQ assessments were used.
Finally, informant effects were present in various forms. Significant informant effects were recorded mainly in scales to which ADHD symptoms contributed at most marginally, namely, in two scales of the SDQ (Conduct Problems and Emotional Problems) and in two normative scales of the Conners' questionnaires (Cognitive Problems and Social Problems). In contrast, informant effects were absent in the Hyperactivity scale of the SDQ and in the ADHD scales of the Conners' questionnaires. Although the ADHD scores did not differ between the raters in terms of an informant main effect, the ratings of the two informants differed depending on the family status. This informant by status interaction (contrast effects) has already been discussed above.
Compared to the informant by status interaction effects, the informant by gender interactions were weaker and, in combination with three-way informant by status by gender effects, more difficult to interpret. Mean score comparisons indicated larger differences between the parent and teacher ratings in boys, but not in girls. However, these differences should be interpreted cautiously because the male to female ratio was strongly unbalanced among the probands but not among the siblings. In addition, it should be noted that significant effects were found mainly in the normative scales. Thus, the reported differences did not necessarily reflect differences in the perceived behaviour but rather in the deviation from the normative mean. Given the rather small effects and the complexity of interacting factors, we refrain from further interpretation of gender interactions.
In summary, we first found remarkable main effects of the study centre and interactions of centres with questionnaire scores and IQ, even though a standardised recruiting procedure was employed. We assume that an interplay between local and national factors, and between recruiting strategies and sociocultural aspects, may explain these effects. Our data give reason to question, at least to some extent, the benefit of multi-centre designs. The statistical power gained by enlarging the sample size may be offset by the additional heterogeneity introduced by the use of different centres.
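This trade-off can be made concrete with a rough sketch. It rests on a strong simplifying assumption that is ours for illustration only: unmodelled centre differences act as additional noise, so that the within-centre variance and the between-centre variance simply add. All numbers are hypothetical, and the normal approximation to a two-sample test is used.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(mean_diff, sd_within, sd_between_centres, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Simplifying assumption: unmodelled centre differences act as extra
    noise, so the total SD is sqrt(sd_within**2 + sd_between_centres**2).
    The negligible contribution of the opposite tail is ignored.
    """
    sd_total = sqrt(sd_within**2 + sd_between_centres**2)
    effect_size = mean_diff / sd_total
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2.0) - z_crit)


# Hypothetical scenario: one homogeneous centre versus a pooled sample
# three times as large, without and with between-centre variance.
power_single = approx_power(3.0, 15.0, 0.0, n_per_group=200)
power_pooled_homogeneous = approx_power(3.0, 15.0, 0.0, n_per_group=600)
power_pooled_heterogeneous = approx_power(3.0, 15.0, 12.0, n_per_group=600)

print(f"single centre:         {power_single:.2f}")
print(f"pooled, homogeneous:   {power_pooled_homogeneous:.2f}")
print(f"pooled, heterogeneous: {power_pooled_heterogeneous:.2f}")
```

In this hypothetical scenario, pooling still increases power, but a noticeable part of the gain is lost to the added variance. How much is lost in a real study depends on the size of the centre effects and on whether they are modelled in the analysis.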
Secondly, our data provide evidence for a remarkable heterogeneity in the behavioural data as a result of the use of symptom-based diagnostic criteria, which reflect the current state of the art. Boys and girls differed from normality to a considerably different extent despite the similar profile of their symptoms. In addition, the probands and siblings differed on several features that could be attributable to the diagnostic procedure, such as the gender differences shown on the questionnaire ratings.