The PubMed search yielded 2644 results and the PsycINFO search added 370 unique studies (Fig. 1). After excluding studies that were not in English and search results that were irrelevant, we assessed 2839 studies for eligibility (Fig. 1). Applying the first inclusion criterion (the assessment was conducted using online self-report instruments) left 1159 studies. Of these, 194 investigated and reported psychometric data (step 2). Next, we retained 62 studies that investigated instruments for assessing common mental health disorders (step 3). Finally, we excluded 6 studies that did not report psychometric data relevant for our overview and synthesis, leaving 56 studies in our review. See Fig. 1 for a flow chart.
The details of the 56 included studies and their results are presented in Additional file 3. Combined, these studies described psychometric data for 62 different instruments. The data are summarised in Tables 1, 2, 3 and 4. The samples of most studies (48 of 56) contained a larger percentage of women (range 0 % to 100 %; Additional file 3). Seven studies included a sample with an average age below 20. Most studies recruited their samples from the general population using advertisements or links on websites (i.e. self-referral). Studies among university students were also common. Patient populations were less common: only 14 of the 62 instruments were investigated among patient populations. See Tables 1, 2, 3 and 4 and Additional file 3. All 56 included studies investigated internet-based instruments that were completed on a desktop, laptop or tablet computer; none reported that their instruments were completed on cellular phones or smartphones.
We found instruments for all of the included mental health disorders. An average of 2.5 psychometric characteristics was reported per instrument. None of the studies reported measurement error or responsiveness of instruments. We left the empty columns for these two outcomes in Additional file 3, but omitted them from Tables 1, 2, 3 and 4. Of the 62 investigated instruments, 29 assessed depressive symptoms. Of these, the CES-D and the Montgomery–Åsberg Depression Rating Scale Self Report (MADRS-S) were studied most frequently (6 studies each). Studied least were instruments measuring suicidal ideation (1 study on 2 single items), self-harm (1 study) and stress (1 study).
Transdiagnostic online instruments
Seven instruments assessed both depressive and anxiety symptoms, or anxiety symptoms that apply to several disorders, such as the Beck Anxiety Inventory (BAI). These can be roughly divided into short instruments that screen for disorders, e.g. the Web Screening Questionnaire (WSQ) and the Web-Based Depression and Anxiety Test (WB-DAT), and scales that assess symptom severity, e.g. the Hospital Anxiety and Depression Scale (HADS) and the Depression Anxiety Stress Scales (DASS). The short screening questionnaires had poor to adequate criterion validity for screening individual disorders [8, 18, 21]. Of the symptom severity scales, the HADS was investigated in 5 studies [19, 22–25], which showed fair to good internal consistency. The online HADS is the only instrument we found that was investigated among several patient populations [19, 23, 24]. Although the factor structure may differ from how the measure was designed [19, 23], there is mounting evidence that supports adequate validity of the online HADS.
Online assessment of depression
Our review includes 29 instruments that measure depressive symptoms: 22 instruments that measure depression alone and 7 transdiagnostic instruments. The studies on instruments for depression generally reported recruiting their samples from the general population. Five studies investigated instruments for depression among patient populations [3, 6, 26–28], each investigating a different instrument.
The full version of the CES-D has been evaluated in 6 studies [5, 29–33], and 5 characteristics were each reported by at least 2 studies (Table 2). Moreover, all 6 studies recruited their samples among non-patients, so the results can be considered complementary. The internal consistency was investigated in 5 of these studies, which reported Cronbach’s alphas of .89 to .93. Factor analysis showed that the CES-D consists of 2, 3 or 4 factors [32, 33]: the 2-factor solution was found among an English-speaking population, the 3-factor solution among a Spanish-speaking population and the 4-factor solution among a Chinese-speaking population [32, 33]. Adequate psychometric characteristics were found for the CES-D regarding equivalence of mean scores with the paper version [31, 33], convergent validity [5, 30] and criterion validity [5, 30]. One study conducted a full measurement invariance analysis using confirmatory factor analysis, comparing paper and online formats, and found only a negligible difference in the latent mean score of one factor. Overall, it can be concluded that the online CES-D has good psychometric characteristics among non-patient populations, and that a start has been made on investigating its intercultural validity.
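The internal consistency figures reported throughout this review are Cronbach’s alpha values. For readers less familiar with the statistic, the following is a minimal sketch of its computation in pure Python; the item scores are hypothetical and are not data from any of the reviewed studies.

```python
def cronbach_alpha(items):
    """Cronbach's alpha: `items` is a list of columns, one list of scores per item."""
    k = len(items)       # number of items
    n = len(items[0])    # number of respondents

    def var(xs):         # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Total score per respondent across all items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

# Hypothetical 0-3 ratings of 5 respondents on a 4-item scale.
items = [
    [0, 1, 2, 3, 3],
    [1, 1, 2, 2, 3],
    [0, 2, 2, 3, 3],
    [1, 1, 3, 2, 3],
]
print(round(cronbach_alpha(items), 2))
```

Alpha approaches 1 when the items covary strongly relative to their individual variances, which is why values near .9, as found for the online CES-D, are read as excellent internal consistency.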
Another commonly investigated instrument was the MADRS-S [3, 4, 34–37]. Five of these studies reported Cronbach’s alpha, which was adequate to excellent (.73 to .90, Table 2) [3, 4, 34–36]. Thorndike and colleagues found that the scale consists of 3 factors. Four studies found that the mean score of the MADRS-S does not differ significantly between the online and the paper version [3, 4, 35, 36].
Online assessment of GAD
The GAD-7 and two shorter versions were studied among a sample recruited from the general population. The scale showed promising internal consistency, convergent validity and predictive validity. The psychometrics of the GAD-7 were similar among a population of people with hearing loss.
Online assessment of panic disorder and agoraphobia
Internet interventions for PD/A, such as self-help courses, have been researched relatively extensively. Therefore, Austin and colleagues and Carlbring and colleagues studied the online questionnaires usually employed for such research. They focussed on equivalence of mean scores with paper versions of the same instruments. This equivalence could generally be assumed due to high correlations, but the study by Carlbring found that online versions yield significantly lower mean scores for the Body Sensations Questionnaire (BSQ) and Agoraphobic Cognitions Questionnaire (ACQ) and higher scores for the Mobility Inventory (MI) subscale Alone. Finally, an agoraphobia screening item augmented with images was found to have adequate criterion validity (AUC .73). All these studies recruited their samples from the general population.
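Cross-format equivalence of this kind is typically examined by correlating paired online and paper administrations and inspecting the mean paired difference, since a high correlation alone does not rule out a systematic score shift of the kind Carlbring found. A minimal sketch in pure Python, using hypothetical paired total scores rather than data from the reviewed studies:

```python
def pearson_r(x, y):
    """Pearson correlation between paired score lists x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def mean_difference(x, y):
    """Mean paired difference (first format minus second); a value near
    zero suggests the formats yield equivalent mean scores."""
    return sum(a - b for a, b in zip(x, y)) / len(x)

# Hypothetical total scores of the same six respondents in both formats.
online = [12, 18, 25, 30, 22, 15]
paper = [13, 19, 24, 31, 23, 16]
print(pearson_r(online, paper))        # high, so formats rank people alike
print(mean_difference(online, paper))  # non-zero, hinting at a format offset
```

In this illustration the two formats correlate almost perfectly yet still differ slightly in mean score, mirroring how online versions of the BSQ and ACQ could correlate highly with their paper versions while producing significantly lower means.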
Online assessment of social phobia
Two studies [3, 40] independently investigated the equivalence between the online and paper versions of the Social Interaction Anxiety Scale (SIAS) and Social Phobia Scale (SPS). Neither found a difference in mean scores between formats, but the factor structure did differ between formats, indicating that scores cannot be compared across formats. Adequate to good internal consistency of these scales has also been found in three studies [3, 40, 41], and adequate convergent validity of the SIAS in two [40, 41]. Lindner and colleagues revised item 14 of the SIAS, because the original item only applied to heterosexual people. This change did not alter the internal consistency or convergent validity of the scale. The study by Hedman and colleagues recruited people classified with social phobia, but more research among patient groups is recommended.
Online assessment of specific phobia
Two of the transdiagnostic screening measures [18, 21] included specific phobia, and both showed poor criterion validity for it. One instrument, the Flight Anxiety Situations Questionnaire (FAS), has been studied for aviophobia. This study showed near-perfect criterion validity (AUC .99). Considering that aviophobia is only one of many specific phobias, much more development is needed in this area.
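The criterion validity values cited here are areas under the ROC curve (AUC): an AUC of .99 means a randomly chosen respondent with the disorder almost always outscores a randomly chosen respondent without it. The following is an illustrative pure-Python sketch of the tie-aware Mann–Whitney formulation of AUC, using hypothetical screening scores rather than data from the reviewed studies.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve: the probability that a randomly chosen
    case outscores a randomly chosen control (ties count as 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical screening scores for people with and without the disorder.
cases = [8, 9, 7, 9, 10]
controls = [2, 4, 3, 7, 5]
print(auc(cases, controls))
```

An AUC of .5 would mean the screening item separates cases from controls no better than chance, which puts the .73 found for the agoraphobia item and the .99 found for the FAS on a common scale.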
Online assessment of OCD
Four instruments for OCD have been studied, all in the US and among the general population [43–45]. Each instrument was studied only once. Williams and colleagues investigated differential item functioning between black and white Americans, finding significant differences for the Padua Inventory (PI). In addition to these 4 instruments, the WSQ and the CIDI-SF also screen for OCD.
Online assessment of PTSD
Like instruments for OCD, 4 instruments for PTSD have been studied, all in the US and among the general population [31, 46–48]. The transdiagnostic WSQ also screens for PTSD. One additional study investigated an instrument for perinatal PTSD. Miller and colleagues checked the factor structure of their measure for PTSD (National Stressful Events Survey) using item response theory. The factor structure was confirmed, but the items of the instrument may cover too narrow a range of the latent factors.
Online assessment of worry and stress
The PSWQ, assessing worry, was studied twice [20, 50]. These studies found slightly differing values for internal consistency (.73 and .88). We found one study on an instrument that assesses stress.
Online assessment of suicidal ideation and self-harm
We found one study on an instrument that assesses self-harm. This study used Rasch analysis to further confirm the factors of the Inventory of Statements About Self-injury (ISAS), obtained by factor analysis, and their unidimensionality. Furthermore, we found two single-item measures for suicidal ideation: item 9 of the BDI-II and item 9 of the MADRS-S. Item 9 of the online BDI-II yielded lower scores than item 9 of the paper version of the BDI-II. The WSQ also contains an item that screens for suicidal ideation, but the validity of this item was not investigated.
Generalisability and risk of bias
The sample sizes of the included studies were generally adequate for analysing psychometric properties. Nine studies contained over 1000 participants. The other studies in the tables (n = 46) had an average sample size of 261 participants. A sample size below 100, which generally gives too little statistical power for psychometric analyses, was found in 10 studies. It should be noted that the required sample size differs per number of items and type of analysis. Most results could be biased by selectively missing data. Only two studies explicitly reported the amount of missing data. In 33 studies, the amount of missing data was not specifically reported, but could be deduced or estimated. Missing data were not reported by, and could not be deduced for, 21 studies (see Additional file 3). Overall, COSMIN quality ratings of ‘Excellent’ were rare, and ‘Poor’, ‘Fair’ and ‘Good’ ratings were equally common. Instead of adding the COSMIN ratings to the tables and Additional file 3, we decided to report the characteristics the ratings are based on, because the ratings do not always do justice to a study’s quality. The study characteristics give an objective and interpretable indication of the robustness and generalisability of a study’s findings. Lastly, 47 of the 62 instruments were investigated in only one study (Tables 1, 2, 3 and 4), so the robustness of the psychometric properties of these instruments relies heavily on the characteristics of the individual studies and cannot easily be generalised to other populations or settings.