There is a pressing need for accurate and efficient instruments to screen for depression and to measure its severity, for several reasons. First, the U.S. Preventive Services Task Force [1] recommended that adults be screened for depression, based on findings that feeding the results of depression screening back to clinicians increased the recognition of depressive illness. Moreover, a large proportion of depression care is in the hands of clinicians who lack specialized mental health training [2, 3]. These clinicians may benefit from methods to detect cases and evaluate the outcomes of care. For example, it has been recognized for some time that depression is an important source of morbidity in primary care [4], and that improvement is needed in the recognition and management of depression in that setting [5, 6]. However, clinician time and attention are highly constrained in many health care settings [7].
Second, efficient yet accurate instruments to measure depression would also be of considerable value in monitoring the progress of treatment by mental health specialists and other treating clinicians. Systematic case management activities involving regular outcome measurement are part of an emerging paradigm of high-quality care for chronic illnesses [8–11]. However, such programs will fail if patients are unwilling to adhere to measurement protocols. Therefore, follow-up measurement protocols cannot burden patients with repeated exposure to long questionnaires. Finally, researchers frequently assess the severity of depression using standard instruments such as the Beck Depression Inventory (BDI) [12]. Efficiency is important here as well, because researchers often assess many constructs, straining study participants' endurance.
Given that many instruments permit assignment of diagnoses, measurement of symptoms, or assessment of functioning, one might ask why routine depression screening and treatment monitoring are not already common. Although these instruments are commonly used in research, they have had less effect on clinical care. One important reason is that the time and other costs of mental health assessments may outweigh their benefits in busy patient-care settings. These costs may also make screening or treatment monitoring suboptimal from a societal perspective. For example, a recent simulation study by Valenstein and her colleagues [13] suggested that screening for depression would not meet a reasonable criterion for cost-utility if the cost of administering a single test rose substantially above $5, where that cost comprised a fee for the instrument, six minutes of staff time, and one minute of physician time.
Computerized adaptive measurement of depression
Two technical advances could substantially improve the tradeoff between efficiency and accuracy in the measurement of mental health problems such as depression. First, the Internet is reducing the cost of, and other barriers to, delivering computerized testing services to clinical offices. Wireless Internet connectivity is becoming widely available, and powerful mobile tablet and handheld computers are now available at commodity prices. These technologies should substantially reduce the cost of putting computer-administered tests in the hands of patients and clinicians in front-line clinical settings. Computerized tests, particularly those that can be self-administered by patients, can reduce the staff and clinician time required to administer and score an instrument. Several computerized mental health instruments already exist, including, for example, a computerized version of the Composite International Diagnostic Interview (CIDI) [14, 15].
The second technical advance, Computerized Adaptive Testing (CAT) [16], has been widely used by educational and vocational testers, but has seen surprisingly little application in physical or mental health settings [17, 18]. CAT is a technology for the interactive administration of tests that tailors the test to the examinee (or, in our application, to the patient). These tests are 'adaptive' in the sense that the testing is driven by an algorithm that selects questions in real time, in response to the patient's ongoing answers. We believe that computerized, adaptive mental health assessment services, delivered on stand-alone computers or over the web, could make significant contributions to both mental health research and clinical care. In this article, we discuss how CAT can be used to screen for, and measure the severity of, depression.
The need to achieve both accuracy and efficiency poses a difficult tradeoff for an instrument developer, for two reasons. First, classical test theory [19] teaches that, everything else being equal, the way to make a test more accurate is to increase its length, so that random errors in the responses to individual items cancel each other out.
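This relationship can be made precise with the Spearman-Brown prophecy formula, a standard classical test theory result (stated here for illustration; it is not specific to this study). Lengthening a test by a factor of k with parallel items changes its reliability from ρ to

ρ_k = kρ / (1 + (k − 1)ρ)

For example, doubling a test whose reliability is .60 yields a predicted reliability of (2 × .60) / (1 + .60) = .75: the true-score signal accumulates across items while the random errors partially cancel.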
Second, the need to accurately measure patients with varying levels of severity lengthens tests. Failing to include items about symptoms that reflect a wide range of severity will result in an instrument with a floor or ceiling effect [17]. Thus, an accurate and wide-ranging instrument should include several questions at each relevant level of severity.
Unfortunately, a fixed instrument that has multiple questions for each of several ranges of severity is an inefficient instrument for any individual patient. Any individual patient's severity falls into only one of those ranges, so questions about much more or much less severe symptoms are often irrelevant to that patient. In summary, until recently, the goals of a brief, efficient instrument and an accurate, wide-ranging instrument have seemed mutually incompatible.
Computerized adaptive testing
CAT can improve the terms on which accuracy and efficiency are traded off. It has two components. First, one administers the instrument via computer, using a device such as a touch screen, or through a computer-administered telephone interview [20, 21]. Research on computerized tests [22] has shown that the medium has few negative effects on how subjects respond. To the contrary, computerized data collection directly from patients appears to reduce social desirability bias in the reporting of alcohol and drug use, sexual activity, and medication noncompliance [23]. Of particular interest is the suggestion that people seem to prefer revealing some types of very personal information (e.g., gynecological details [24], sexual abuse [25], or suicidal ideation [26]) to a computer rather than to a person. Similarly, alcoholics seeking treatment disclosed greater levels of alcohol consumption to a computer than to a person [27].
CAT, however, goes further. 'Adaptive' means that the computer follows an algorithm that administers a test (for example, the BDI) to a patient one question at a time. At each step, the patient's prior responses determine (a) whether to ask another question and (b) which question to ask [16, 28]. The test stops when the patient's score has been estimated to a prescribed level of precision. Hence, the computer adapts the test to use the fewest items required to assess that particular patient accurately. By comparison, an instrument using a fixed list of items may have too few items to accurately measure some patients while posing unnecessary questions to others.
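The overall control flow can be sketched in a few lines of code. The sketch below is illustrative only: the names (run_cat, ask, estimate, select), the standard-error threshold, and the item cap are our assumptions, not the implementation used in this study. Concrete sketches of the estimation/decision step and the item-selection step follow the next two paragraphs.

```python
# A minimal sketch of the adaptive testing loop described above.
# All names, thresholds, and the item cap are illustrative
# assumptions, not the implementation used in this study.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Item:
    text: str
    severity: float        # level of symptom severity the item targets
    discrimination: float  # how sharply the item separates severity levels

Response = Tuple[Item, int]

def run_cat(items: List[Item],
            ask: Callable[[Item], int],
            estimate: Callable[[List[Response]], Tuple[float, float]],
            select: Callable[[List[Item], List[Response]], Item],
            max_se: float = 0.3,
            max_items: int = 10) -> Tuple[float, float]:
    """Administer items one at a time, stopping once the score
    estimate reaches the prescribed precision (standard error)."""
    remaining = list(items)
    answered: List[Response] = []
    theta, se = 0.0, float("inf")
    while remaining and len(answered) < max_items:
        item = select(remaining, answered)   # most informative next item
        remaining.remove(item)
        answered.append((item, ask(item)))   # pose the question
        theta, se = estimate(answered)       # update the score estimate
        if se < max_se:                      # precise enough: stop early
            break
    return theta, se
```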
To that end, at each step, the program uses the responses collected so far to estimate the patient's score on a latent trait, in this case depression, as well as a confidence interval (CI) around that estimate [29]. The latent trait is conventionally denoted θ and expressed in standardized units; however, θ could be rescaled to the same units as the BDI to aid clinicians familiar with that instrument. The CI around θ is then compared to the 'cut' or criterion score on the latent trait that defines a positive screening result. If the upper bound of the CI falls below the cut score, the program declares the screening result negative and stops testing. Conversely, if the lower bound of the CI falls above the cut score, testing stops with a positive result. Otherwise, the CI includes the cut score, and testing continues.
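This three-way decision rule translates directly into code. A minimal sketch follows, assuming a normal approximation for the CI; the cut score, estimates, and 95% confidence level in the example are illustrative values, not the calibrated ones used in the study.

```python
# A sketch of the screening stopping rule described above. The cut
# score and the example numbers are illustrative, not the study's
# calibrated values.

def screening_decision(theta_hat: float, se: float,
                       cut: float, z: float = 1.96) -> str:
    """Compare the CI around the estimated severity, theta_hat +/- z*se,
    to the cut score that defines a positive screen."""
    lower, upper = theta_hat - z * se, theta_hat + z * se
    if upper < cut:
        return "negative"   # even the CI's upper bound is below the cut
    if lower > cut:
        return "positive"   # even the CI's lower bound is above the cut
    return "continue"       # the CI still straddles the cut: keep testing

# Example: estimate 0.9 against a cut score of 0.5
print(screening_decision(0.9, 0.25, 0.5))  # CI (0.41, 1.39) -> "continue"
print(screening_decision(0.9, 0.15, 0.5))  # CI (0.61, 1.19) -> "positive"
```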
Suppose now that we are in mid-test, and the adaptive algorithm has to choose another question to pose to a respondent. Using the data already collected about the respondent, the program calculates an information statistic for each of the test items that have not yet been posed. The information statistic for an item is larger when the response to that item is expected to produce a greater reduction in our uncertainty about the patient's true score on the latent depression dimension. The computer then presents the maximally informative item to the respondent. Everything else being equal, a question will be more informative if the severity of the symptoms it concerns is similar to our current estimate of the severity of the patient's depression. For example, if we already have substantial evidence of depression based on the responses thus far, the computer will discount the value of items that primarily ask about minor symptoms and focus on those that ask about severe symptoms. Note that the adaptive algorithm we describe here differs from the branching logic used in many computerized tests to skip questions based on earlier patient responses. Programs that use branching identify questions to be skipped because those questions are irrelevant given a patient's previous answers. Adaptive tests choose questions to be asked because those questions maximize the precision of the patient's estimated score on a latent dimension of interest.
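To make the selection step concrete, the sketch below computes Fisher information under a dichotomous two-parameter logistic (2PL) IRT model and picks the most informative remaining item. The item names and parameters are invented for illustration, and the 2PL is a simplification: the BDI's 0–3 response options would call for a polytomous model in practice.

```python
# A sketch of maximum-information item selection under a dichotomous
# two-parameter logistic (2PL) IRT model. Item names and parameters
# are invented for illustration.
import math

def endorse_prob(theta: float, a: float, b: float) -> float:
    """2PL probability of endorsing an item with discrimination a and
    severity b, given latent depression severity theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item: a^2 * P * (1 - P).
    It peaks when the item's severity b is near theta."""
    p = endorse_prob(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical items spanning mild (b = -1) to severe (b = +2) symptoms,
# each as (discrimination a, severity b)
items = {
    "sadness": (1.2, -1.0),
    "guilt": (1.5, 0.0),
    "hopelessness": (1.8, 1.0),
    "suicidal ideation": (2.0, 2.0),
}

theta_hat = 1.1  # current estimate: substantial evidence of depression
best = max(items, key=lambda name: item_information(theta_hat, *items[name]))
print(best)  # -> "hopelessness": its severity is closest to theta_hat
```

Note how the item whose severity parameter lies closest to the current estimate wins even over items that discriminate more sharply, which is exactly the behavior described above.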
We reasoned that adaptive technology could substantially improve the efficiency of psychometric measurement in clinical settings, with little or no cost in the accuracy of measurement. We sought to test this by developing an adaptive version of the BDI. We chose the BDI because it is a well-validated instrument for depression and representative of the many screening instruments available for this common condition. It is brief, has been very widely used, and is already in a self-report format.
The goals of this study were (a) to test whether an adaptive version of the BDI would predict a Structured Clinical Interview for DSM-IV Axis I Disorders (SCID) [30] diagnosis as accurately as the full BDI, (b) to estimate how many fewer questions the adaptive BDI would ask, and (c) to determine whether the adaptive BDI would measure the severity of depression as well as the full BDI. The statistical methodology underlying adaptive testing is well established and there is considerable experience in using it in other domains of measurement [16, 31]. In a previous study [32], we showed that the screening decisions made by an adaptive version of the Pediatric Symptom Checklist (PSC) [33] agreed nearly perfectly with the screening decisions made by the full PSC (κ = .97). The adaptive PSC achieved that agreement by asking an average of only 10.5 questions per patient, compared to the 35 items required by the full PSC. However, that study did not examine whether adaptive testing affected the PSC's accuracy, which would have required comparing screening decisions based on adaptive data to independent psychometric criteria. To our knowledge, there have been no studies of how an adaptive implementation of a screen for mental health problems affects the agreement between the screen and criterion measures. In this study, we evaluated the performance of an adaptive version of the BDI against an independent SCID diagnosis and Hamilton depression measure.