Delirium diagnosis defined by cluster analysis of symptoms versus diagnosis by DSM and ICD criteria: diagnostic accuracy study

Background: Information on validity and reliability of delirium criteria is necessary for clinicians, researchers, and further developments of DSM or ICD. We compare four DSM and ICD delirium diagnostic criteria versions, which were developed by consensus of experts, with a phenomenology-based natural diagnosis delineated using cluster analysis of delirium features in a sample with a high prevalence of dementia. We also measured inter-rater reliability of each system when applied by two evaluators from distinct disciplines.
Methods: Cross-sectional analysis of 200 consecutive patients admitted to a skilled nursing facility, independently assessed within 24–48 h after admission with the Delirium Rating Scale-Revised-98 (DRS-R98) and for DSM-III-R, DSM-IV, DSM-5, and ICD-10 criteria for delirium. Cluster analysis (CA) delineated natural delirium and nondelirium reference groups using DRS-R98 items, and the diagnostic systems’ performance was then evaluated against the CA-defined groups using logistic regression and crosstabs for discriminant analysis (we report sensitivity, specificity, percentage of subjects correctly classified by each diagnostic system and their individual criteria, and performance for each system when excluding each individual criterion). Kappa Index (K) was used to report inter-rater reliability for delirium diagnostic systems and their individual criteria.
Results: 117 (58.5 %) patients had preexisting dementia according to the Informant Questionnaire on Cognitive Decline in the Elderly. CA delineated 49 delirium subjects and 151 nondelirium subjects. Against these CA groups, delirium diagnosis accuracy was highest using DSM-III-R (87.5 %), followed closely by DSM-IV (86.0 %), ICD-10 (85.5 %) and DSM-5 (84.5 %). ICD-10 had the highest specificity (96.0 %) but lowest sensitivity (53.1 %). DSM-III-R had the best sensitivity (81.6 %) and the best sensitivity-specificity balance. DSM-5 had the highest inter-rater reliability (K = 0.73) while DSM-III-R criteria were the least reliable.
Conclusions: Using our CA-defined, phenomenologically-based delirium designations as the reference standard, we found performance discordance among four diagnostic systems when tested in subjects where comorbid dementia was prevalent. The most complex diagnostic systems have higher accuracy and the newer DSM-5 has higher reliability. Our novel phenomenological approach to designing a delirium reference standard may be preferred to guide revisions of diagnostic systems in the future.
Electronic supplementary material: The online version of this article (doi:10.1186/s12888-016-0878-6) contains supplementary material, which is available to authorized users.


Keywords: Delirium, Dementia, Delirium rating scale-revised-98, Sensitivity and specificity, Reliability, Diagnostic and statistical manual of mental disorders, International classification of diseases, Cluster analysis, Discriminant analysis

Background
Valid and reliable diagnostic criteria that correctly classify delirium are fundamental to guide identification, management and prognosis [1]. Validity of a test or set of criteria involves accuracy, determined in part through sensitivity and specificity, and is usually measured against a "gold standard" that is considered valid.
Without an easily measured biological marker for delirium, its diagnostic criteria are the only gold standard for clinical diagnosis. Criteria have been evolving through iterations since the 1960s. However, the use of criteria that rely largely on experts' consensus and epidemiological research can be circular [2][3][4]. Further, iterations of diagnostic classification systems may result in different delirium diagnosis status in the same patient population.
Cole et al. [5] reported diagnostic accuracies for DSM-III, DSM-III-R, DSM-IV, and ICD-10 delirium criteria using latent class analysis (a latent variable model that delineates latent discrete variables from observed discrete criteria and so allows accuracy to be described among them). They found a relatively low sensitivity for ICD-10, low specificity for DSM-IV and high sensitivity and specificity for the DSM-III-R criteria. Those subjects were assessed with DSM-III-R delirium criteria, the Confusion Assessment Method (CAM), and the Delirium Index, without mention of how the other diagnostic criteria were evaluated or whether they were imputed from the data obtained with the study instruments. Meagher et al. [6] compared performance of DSM-5 criteria, imputed using symptom ratings from the Delirium Rating Scale-Revised-98 (DRS-R98) items, against DSM-IV criteria as directly assessed in patients in their pooled database. They reported 30.0 % sensitivity and 99.0 % specificity for DSM-5 criteria using a "strict" approach, while a "relaxed" interpretation performed more similarly to DSM-IV with 89.0 % sensitivity and 96.0 % specificity. Concordance was only 53.0 % for these approaches, where "strict" DSM-5 appeared to delineate only full syndromal delirium whereas DSM-IV detected milder cases as well. Therefore, it remains unclear which is the most useful diagnostic system.
An alternative method is to use an "agnostic" approach to categorizing delirium based on its features. Cluster analysis is a multivariate statistical method that identifies groups of cases according to similarity on certain well-accepted characteristics (phenotype) of a specific disorder [7] without the constraint of an a priori diagnostic system. Cluster analysis should be performed in populations with a wide range of diagnostic severity and complexity. The complexity of delirium detection increases when it occurs in the context of other neuropsychiatric disorders, especially dementia [8,9].
Conversely, studies of inter-rater reliability for delirium diagnostic criteria show more variable levels of agreement. Cameron et al. [32] reported a Kappa Index (K) of 0.62 for test-retest reliability of DSM-III in acute medical inpatients. Silver et al. [33] found excellent inter-rater reliability for DSM-IV in critically ill pediatric patients (K = 0.9). Malt et al. [34] evaluated ICD-10 in a general hospital, using written case histories rated by diverse clinicians, and reported a K for delirium diagnosis of about 50.0 %.
According to Kendler [35], a defining feature of mature sciences is their cumulative nature and their capacity to build on what has gone before. In this sense, the evolution of diverse psychiatric criteria could be understood as an iterative process that should eventually increase accuracy and reliability of clinical diagnosis. However, measuring the components of a condition requires an independent method, in order to avoid presuming the truth of any one classification system. We aimed to assess the accuracy of several diagnostic systems for delirium when tested against delirium and nondelirium reference groups defined in an "agnostic" fashion through cluster analysis of DRS-R98 items. To increase complexity, our population had high dementia prevalence. We also measured inter-rater reliability of each system when applied by two evaluators from distinct disciplines.

Subjects
This is a cross-sectional prospective study of 200 consecutive patients admitted to a skilled nursing facility (Centro Sociosanitario Monterols, Tarragona, Spain). Patients were admitted from home, general hospital, assisted living or senior community for convalescence of medical-surgical conditions or control of geriatric conditions. Exclusion criteria were refusal to participate, coma/sedation, severe language disorder, or inability to speak Spanish.

Ethics, consent and permissions
This study was performed in accordance with the Declaration of Helsinki and approved by the Hospital Universitari de Sant Joan Ethics Committee (our corresponding evaluation center). All patients, or their proxies when the Mini Mental State Examination (MMSE) score (obtained as part of the initial evaluation at admission) was <24, gave written consent to participate.

Measures and instruments
Demographic and clinical data, including age, sex, marital and occupational status and years of education, were collected. We also reviewed medical records for a recent diagnosis of delirium.

Charlson Comorbidity Index (Short form; CCI-SF)
Developed from the CCI with similar prognostic value [36], this version is based on history of 8 medical conditions: cerebrovascular accident, diabetes mellitus, chronic obstructive pulmonary disease, congestive heart failure, dementia, peripheral arterial disease, chronic renal failure and cancer, scored so that the first six receive 1 point and the last two receive 2 points. A CCI-SF score of 0 or 1 indicates no comorbidity, 2 low comorbidity, and ≥3 high comorbidity.
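The scoring rule described above can be sketched in a few lines. This is a minimal illustration only: the dictionary keys and function names are hypothetical labels chosen for this example, and the weights and bands are taken from the description in the text rather than from the original instrument.

```python
# Weights as described in the text: the first six conditions score 1 point,
# the last two score 2 points. Key names are illustrative, not official labels.
CCI_SF_WEIGHTS = {
    "cerebrovascular_accident": 1,
    "diabetes_mellitus": 1,
    "copd": 1,
    "congestive_heart_failure": 1,
    "dementia": 1,
    "peripheral_arterial_disease": 1,
    "chronic_renal_failure": 2,
    "cancer": 2,
}


def cci_sf_score(conditions):
    """Sum the weights of the conditions present in the patient's history."""
    unknown = set(conditions) - set(CCI_SF_WEIGHTS)
    if unknown:
        raise ValueError(f"Unknown condition(s): {sorted(unknown)}")
    # set() ensures a condition listed twice is only counted once.
    return sum(CCI_SF_WEIGHTS[c] for c in set(conditions))


def cci_sf_category(score):
    """Map a CCI-SF score to the comorbidity band given in the text."""
    if score <= 1:
        return "no comorbidity"
    if score == 2:
        return "low comorbidity"
    return "high comorbidity"
```

For instance, a history of dementia plus cancer would score 1 + 2 = 3 and fall in the high comorbidity band.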

Spanish-Informant Questionnaire on Cognitive Decline in the Elderly (S-IQCODE)
Structured interview composed of 26 questions about cognitive and functional aspects of the patient during the last 5 years [37]. It is a valid approach to detecting probable dementia. Scores range from 26 to 130. We used the validated Spanish version with the recommended cut-off >85 for possible dementia [38].
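As a rough sketch of how the total and cut-off fit together: with 26 questions and a total range of 26 to 130, each item is assumed here to be rated from 1 to 5. That per-item range is an inference from the stated totals, and the function names are hypothetical.

```python
def s_iqcode_total(item_ratings):
    """Sum 26 item ratings, each assumed (not stated in the text) to lie in 1..5."""
    if len(item_ratings) != 26:
        raise ValueError("expected 26 item ratings")
    if any(r < 1 or r > 5 for r in item_ratings):
        raise ValueError("each rating is assumed to lie between 1 and 5")
    return sum(item_ratings)


def possible_dementia(total, cutoff=85):
    """Apply the cut-off quoted in the text: totals above 85 suggest possible dementia."""
    return total > cutoff
```

Under this assumption, an informant who rates every item at the midpoint (3) gives a total of 78, which is below the cut-off.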

Delirium Rating Scale Revised-98 (DRS-R98)
The DRS-R98 has descriptive anchors for rating the severity levels for each of its items (0 is normal to a maximum of 3) with a maximum scale score of 46 points. It measures severity of many delirium symptoms using phenomenologically anchored descriptions for item ratings and can also diagnose delirium. Its 16 items form the DRS-R98 Total scale: 13 of them constitute the DRS-R98 Severity scale and the remaining 3 are diagnostic items. The DRS-R98 measures core symptoms representing the 3 core domains of delirium (cognitive, circadian, higher order thinking) and noncore symptoms (psychotic and affective). It was originally validated using raters blinded to the diagnoses in five diagnostic groups of inpatients [10]. It has subsequently been translated and revalidated in countries outside of the U.S. The appropriate Spanish version was used [11], and the expert rater had ample experience in using the scale in delirium phenomenology studies. The Spanish DRS-R98 had very high inter-rater reliability (intraclass correlation coefficient >0.9 in both Colombian and Spanish samples) [11,14], and excellent validity as shown by an area under the curve >0.9 (Receiver-Operator Characteristic analyses) when discriminating DSM-III-R, DSM-IV, DSM-5 or ICD-10 delirium in a sample of patients from the same facility as this study [31]. The DRS-R98 has been assessed against other neuropsychiatric disorders, making it an ideal instrument to assess phenomenology [8,10].

Clinical diagnostic criteria
We used four classification systems: the DSM-5, DSM-IV and DSM-III-R editions [39][40][41] and the ICD-10 for research [42]. We designed a diagnostic criteria checklist to systematically rate each item for all diagnostic criteria as present or not in order to ensure their complete evaluation.

Procedures
After running a pilot test with 10 patients (not included in the study sample) to evaluate logistic difficulties and possible problems in using research instruments, all patients admitted to the facility were rated by three researchers from 24 to 48 h after admission (all evaluations were done within the same 24-h period). Researchers #1 (psychiatrist trained and experienced in delirium and dementia clinical and research evaluations) and #2 (neuropsychologist experienced in evaluation of delirium and dementia for research purposes) evaluated symptoms for the delirium diagnostic criteria checklist. Researcher #3, a psychiatrist experienced in delirium and dementia research, teaching, clinical assessment, and specifically trained on the DRS-R98, administered the Spanish DRS-R98. Evaluations were made independently by each researcher. Ratings were based on the previous 24 h period. Researcher #3 also compiled demographic and clinical information for this report and researchers #1 and #2 contacted the family or caregiver to obtain the S-IQCODE score. All of them had unlimited access to medical/nursing records or reports of any kind and to interview caregivers, and were blinded to information from each other.

Statistical analysis and delineation of study groups
Data were analyzed using SPSS Statistics 17.0 and a spreadsheet.
Continuous variables are expressed as means ± standard deviation (SD). Chi-square test was used to compare categorical variables (continuity correction was used when appropriate) and t test for continuous ones. Statistical significance was set at p < 0.05.

Delineation of study groups without a priori criteria using cluster analysis of the DRS-R98
We analyzed the DRS-R98 Severity Scale (items 1 to 13) using two-step cluster analysis with log-likelihood as the measure of "distance" between item scores. This is an exploratory technique that reveals natural groupings within a set of data. It allowed us to automatically calculate the number of natural clusters within the dataset without any a priori specification of what that number should be. Schwarz's Bayesian Criterion method was used for clustering (to avoid overfitting of the obtained clusters due to the high number of items). Before cluster analysis, we excluded possible collinearity issues by means of a principal components analysis of the items, where any eigenvalue (i.e., the part of the total variance accounted for by a factor) close to zero suggests a collinearity problem. We used the Belsley criterion to define "close to zero": values between 30 and 100 for the square root of the ratio between the highest and the lowest eigenvalue indicate moderate to strong collinearity problems. We did not find concerning collinearity because the highest eigenvalue was 6.045 and the lowest was 0.195 (square root of the ratio = 5.567).
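The collinearity check described above reduces to a short calculation. The sketch below uses only the two extreme eigenvalues quoted in the text (the intermediate eigenvalues are not reported); the function names and the default threshold of 30, taken from the lower bound of the range mentioned above, are choices made for this illustration.

```python
import math


def condition_index(eigenvalues):
    """Square root of the ratio between the largest and smallest eigenvalue."""
    largest, smallest = max(eigenvalues), min(eigenvalues)
    if smallest <= 0:
        raise ValueError("eigenvalues must be positive for this ratio")
    return math.sqrt(largest / smallest)


def collinearity_flag(index, threshold=30.0):
    """True when the index reaches the range described as moderate to strong."""
    return index >= threshold


# Only the highest and lowest eigenvalues are given in the text.
ci = condition_index([6.045, 0.195])
print(f"condition index = {ci:.2f}, concerning = {collinearity_flag(ci)}")
```

The ratio 6.045 / 0.195 is 31, and its square root is about 5.57, far below the threshold of 30.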

Discriminant analysis of DSM and ICD criteria for delirium over study groups
Logistic regressions and crosstabs were used to assess sensitivity, specificity, and percentage of subjects correctly classified by each diagnostic system and their individual criteria, and the corresponding 95 % confidence intervals (95 % CI) are reported. Values are also given for diagnostic systems when each of their individual criteria was excluded. The Wald test p value was used to determine whether classification performance percentages against the reference groups were significant. All discriminant analyses are for the performance of all diagnostic criteria assessed by Researcher #1 (psychiatrist) against the DRS-R98 evaluation from Researcher #3 (psychiatrist). Frequency (percentage) of subjects positive for delirium according to each diagnostic system and for presence of their individual criteria was also assessed.
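The crosstab part of this analysis, including the "exclude one criterion at a time" step, can be illustrated as follows. This is a sketch under stated assumptions, not the study's SPSS procedure: it covers only the crosstab measures (not the logistic regression or Wald test), it assumes a system is positive when every retained criterion is rated present, and the four patients and criterion names A, B, C are entirely hypothetical.

```python
def classification_performance(test_positive, reference_positive):
    """Sensitivity, specificity and accuracy from paired boolean lists."""
    pairs = list(zip(test_positive, reference_positive))
    tp = sum(1 for t, r in pairs if t and r)
    tn = sum(1 for t, r in pairs if not t and not r)
    fp = sum(1 for t, r in pairs if t and not r)
    fn = sum(1 for t, r in pairs if not t and r)
    return {
        "sensitivity": tp / (tp + fn),  # needs at least one reference positive
        "specificity": tn / (tn + fp),  # needs at least one reference negative
        "accuracy": (tp + tn) / len(pairs),
    }


def system_positive(ratings, excluded=None):
    """Assumption: the system is positive when all retained criteria are present."""
    return all(present for name, present in ratings.items() if name != excluded)


def leave_one_out(patients, reference_positive):
    """Performance of the full system (key None) and with each criterion excluded."""
    results = {}
    for excluded in [None] + list(patients[0]):
        diagnoses = [system_positive(p, excluded) for p in patients]
        results[excluded] = classification_performance(diagnoses, reference_positive)
    return results


# Hypothetical checklist ratings for four patients and a hypothetical reference grouping.
patients = [
    {"A": True, "B": True, "C": True},
    {"A": True, "B": True, "C": False},
    {"A": False, "B": True, "C": True},
    {"A": False, "B": False, "C": False},
]
reference = [True, True, False, False]
results = leave_one_out(patients, reference)
```

In this toy data set, dropping criterion C raises accuracy while dropping criterion A lowers specificity, which is the kind of contrast the bracketed values in the performance table are meant to expose.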

Inter-rater reliability of DSM and ICD criteria for delirium
We report the Kappa Index (K) with its 95 % CI and Standard Error (SE) as a measure of reliability of all diagnostic criteria and items (for all diagnostic criteria assessed by Researcher #1 vs. Researcher #2). K for diagnostic systems when each of their individual criteria (items) was excluded is also reported. Every K was interpreted according to the following ranges: <0.20 = unacceptable, 0.20-0.39 = questionable, 0.40-0.59 = acceptable, 0.60-0.79 = good, and 0.80-1 = excellent.

Results
Figure 1 shows patient flow throughout the study. A total of 224 patients were admitted during the 14 months of patient collection. Reasons for exclusion were denied consent (n = 7), severe language disorder (n = 9), coma/sedation (n = 6), and inability to speak Spanish (n = 2), leaving 200 who were included for analyses. Of these, the mean age was 78.3 ± 9.9 years and 51.5 % were women.

Groups defined according to cluster analysis
Cluster analysis of DRS-R98 item scores resulted in a two-natural-cluster (or group) solution (nondelirium n = 151, delirium n = 49) (Fig. 2 boxplots). In nondelirium, the mean score for DRS-R98 Total was 6.67 ± 5.00 (range 0-19) and DRS-R98 Severity was 5.60 ± 3.82 (range 0-13). In delirium, the mean score for DRS-R98 Total was 25.59 ± 4.90 (range 17-38) and DRS-R98 Severity 21.29 ± 4.50 (range 12-33). There was minimal overlap between clusters except for small portions of their tails. Medians were also significantly different (median test p < 0.001).
Table 1 shows characteristics of the sample, divided into delirium and nondelirium groups using cluster analysis-defined groupings. The delirium group was older, had greater frequency of systemic infection as main diagnosis and a higher frequency of dementia as an antecedent. In both the whole sample and subsample of 117 with dementia (58.5 %), delirium subjects were more likely to have a comorbid diagnosis of dementia, and were more often on treatment with atypical antipsychotics. A past history of delirium was also more common in those with delirium.

Criteria systems accuracy
Delirium classification performance characteristics for each diagnostic system and their individual criteria are shown in Table 2. All diagnostic systems correctly classified subjects similarly enough to the cluster-defined groups to be significant (Wald statistic p < 0.05). In the whole sample all diagnostic systems had very good accuracy, where the highest percentage of correctly classified cases was obtained by DSM-III-R criteria (87.5 %), followed closely by DSM-IV (86.0 %), ICD-10 (85.5 %) and DSM-5 (84.5 %). All systems had lower sensitivity than specificity, a pattern especially evident for ICD-10, with specificity of 96.0 % and the lowest sensitivity of 53.1 %. In contrast, DSM-III-R had the best sensitivity (81.6 %) and the most balanced sensitivity-specificity values.
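The reported percentages can be checked for internal consistency against the cluster sizes (49 delirium, 151 nondelirium). The sketch below back-calculates the nearest whole-number counts from the sensitivity and specificity quoted in this article (the DSM-III-R specificity of 89.4 % is the value given in the Discussion). These counts are an inference made here for illustration and are not reported in the text.

```python
N_DELIRIUM, N_NONDELIRIUM = 49, 151  # cluster-defined group sizes from the text

# system: (sensitivity %, specificity %, reported accuracy %)
REPORTED = {
    "DSM-III-R": (81.6, 89.4, 87.5),
    "ICD-10": (53.1, 96.0, 85.5),
}


def implied_accuracy(sens_pct, spec_pct):
    """Back-calculate whole-number counts (an inference) and the accuracy they imply."""
    true_pos = round(sens_pct / 100 * N_DELIRIUM)
    true_neg = round(spec_pct / 100 * N_NONDELIRIUM)
    accuracy = round(100 * (true_pos + true_neg) / (N_DELIRIUM + N_NONDELIRIUM), 1)
    return true_pos, true_neg, accuracy


for system, (sens, spec, reported_acc) in REPORTED.items():
    tp, tn, acc = implied_accuracy(sens, spec)
    print(f"{system}: TP={tp}, TN={tn}, implied accuracy={acc} % (reported {reported_acc} %)")
```

Because accuracy is a weighted average of sensitivity and specificity, and the nondelirium group is about three times larger, a system with low sensitivity such as ICD-10 can still reach an overall accuracy close to the others.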
All diagnostic systems were relatively robust and, in general terms, maintained their classification performance when each individual criterion was excluded. Each of the individual criteria correctly classified subjects (p < 0.05), except for criterion C of DSM-III-R (57.5 %) and for criterion C of DSM-5 (43.6 %) in the demented subsample. DSM-5 criterion C had significant but low accuracy (51.5 %) in the whole sample. These two individual criteria were each compound (listing more than one type of symptom).
The cardinal criterion A from all diagnostic systems (attention) had high accuracies and reasonably well-balanced sensitivity and specificity. Evaluation of other cognitive Only DSM-III-R includes a criterion for disorganized thinking which performed well (89.8 % sensitivity, 79.5 % specificity). ICD-10 had criteria for psychomotor disturbance and sleep-wake cycle disturbance which performed moderately well.
As expected, individual criteria with high sensitivity, as reported in Table 2, had the highest percentage of positivity for delirium within their corresponding whole sample or dementia subsample (see Additional file 1: Table S1).
The results for the dementia subsample were similar to the whole sample except that accuracy, sensitivity and specificity were all slightly lower. The largest decrease in accuracy between the whole sample and the dementia subsample was for ICD-10 (from 85.5 % to 77.8 %). And when excluding an individual criterion, the largest reduction was for ICD-10 criterion evaluating memory and orientation (from 61.0 % to 48.7 %).
In the whole sample, the acute onset criteria (86.0-87.0 %) and the criteria including attentional disturbance (84.5-88.0 %) had the highest classification accuracy within each system. The highest individual criterion accuracy (88.0 %) was in ICD-10 for "clouding of consciousness and attention alteration." This same pattern occurred in the dementia subsample, though the values were slightly lower (82.9-84.6 % and 80.3-84.6 %, respectively), with DSM-III-R performing the worst on each criterion.

Reliability
Reliability of the four diagnostic systems is shown in Table 3. DSM-IV, DSM-III-R, and ICD-10 showed K values in the range of acceptable to good in the whole sample. DSM-5 did the best, with the highest K value, and its individual criteria also had most values in the good range irrespective of which sample was tested. In contrast, DSM-III-R performed the most poorly, with the highest number of questionable range K values in the dementia subsample. The reliability performance of both systems would remain almost the same if any of their individual criteria were excluded. No criterion performed in the unacceptable or excellent range.
Standard errors for each system and their individual criteria were all ≤0.1 with exception of the compound criterion C of DSM-III-R (SE 0.129) and the criterion C of DSM-5 for additional cognitive change/perception (SE 0.140) in the subset with dementia.
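For readers less familiar with the Kappa Index, the sketch below shows how agreement between two raters on a present/absent judgment is converted into K and then mapped to the interpretation bands given in the Methods. It assumes the Kappa Index is Cohen's kappa for two raters, which the text does not state explicitly, and the counts in the example are hypothetical rather than study data.

```python
def cohen_kappa(both_pos, only_rater1, only_rater2, both_neg):
    """Cohen's kappa from a 2x2 agreement table of two raters."""
    n = both_pos + only_rater1 + only_rater2 + both_neg
    observed = (both_pos + both_neg) / n
    rater1_pos = (both_pos + only_rater1) / n
    rater2_pos = (both_pos + only_rater2) / n
    # Agreement expected by chance if the two raters were independent.
    expected = rater1_pos * rater2_pos + (1 - rater1_pos) * (1 - rater2_pos)
    return (observed - expected) / (1 - expected)


def interpret_kappa(k):
    """Interpretation bands as listed in the Methods."""
    if k < 0.20:
        return "unacceptable"
    if k < 0.40:
        return "questionable"
    if k < 0.60:
        return "acceptable"
    if k < 0.80:
        return "good"
    return "excellent"


# Hypothetical counts: 30 rated positive by both, 10 only by rater 1,
# 5 only by rater 2, 55 rated negative by both.
k = cohen_kappa(30, 10, 5, 55)
print(f"K = {k:.2f} ({interpret_kappa(k)})")
```

Here the raters agree on 85 of 100 cases, but 53 % agreement would be expected by chance, so K is well below the raw agreement rate.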

Discussion
We describe a novel approach to evaluate how different delirium diagnostic systems perform in their ability to separate delirium and nondelirium groups, given that reliance on any particular diagnostic system a priori makes an assumption of superior validity if it is to be used as a reference standard. Instead, we applied cluster analysis of DRS-R98 items to a sample of 200 subjects to discern natural groups as the reference standard and then measured performance of four classification systems to diagnose delirium. The DRS-R98 uses phenomenological descriptive anchors for many delirium characteristics that were assessed in a standardized way, independently and without regard for a particular classification system ("agnostic"). Our DRS-R98 cluster analysis yielded two clearly differentiated groups, which indicates very good performance to serve as a reference standard. Additionally, dementia patients with or without delirium were included to increase diagnostic complexity.

Fig. 2 Study groups. Boxplots of DRS-R98 to illustrate the two study groups obtained using two-step cluster analysis. Part a shows distribution of DRS-R98 Total score for the delirium cluster (n = 49) and for the nondelirium cluster (n = 151). Part b shows DRS-R98 Severity score distribution for the same groups. Solid lines within boxes are median scores; boxes correspond to the middle 50.0 % of scores; tails indicate 25th percentiles
Accuracy was very good for all diagnostic systems with DSM-III-R the highest (87.5 %) and DSM-5 the lowest (84.5 %). Overall, the classification performance in the dementia subsample was similar to but somewhat lower than in the whole sample, with ICD-10 performing the least well (77.8 %) and DSM-III-R somewhat better (83.8 %) than the other DSM versions. Values for sensitivity and specificity varied more than did accuracy in the whole sample, where the pattern for all was lower sensitivity than specificity. The most extreme was ICD-10 (53.1 %, 96.0 %), suggesting a better capacity for delirium confirmation, while the most balanced values were for DSM-III-R (81.6 %, 89.4 %). Each individual criterion, except one, significantly distinguished delirium and nondelirium groups in both the whole sample and dementia subsample. Accuracies of diagnostic criteria remained robust even after each individual criterion was excluded, such that they perform as an integrated whole. Exclusion of most of the individual criteria resulted in only small increases in classification accuracy of the remaining criteria. However, several individual criteria reduced overall classification accuracy before they were excluded and the most prominent of these had a compound construction (more than one type of symptom listed together). Inter-rater reliability for diagnostic systems was "good" except for ICD-10, which was "acceptable", but none were excellent. ICD-10 had the lowest and DSM-5 had the highest inter-rater reliability.

Table note: Cluster analysis-defined groups were identified using DRS-R98 items. Performance characteristics and 95 % confidence intervals (95 % CI) are given for each classification system. Performance values for the diagnostic criteria after each individual criterion was excluded are noted within brackets. Bolded values denote when the percentage of correctly classified cases (accuracy) as compared to the reference standard is significant at p < 0.05 according to the Wald test
The individual criteria across all classification systems with the highest accuracies were those for attentional disturbance and acute onset of symptoms, consistent with inattention being a cardinal feature and the syndrome being a noticeable change in consciousness. Together these might comprise the simplest screening approach for busy clinicians, but this has not been studied. Meagher et al. [8] reported that digit span forwards differentiated delirium from dementia subjects because simple inattention occurs in delirium more than in dementia, whereas both groups performed poorly on the more challenging backwards span test. A commonly used brief tool, the CAM [43], includes both inattention and acute onset among its four items; however, it does not have consistent concordance with DSM versions and the DRS-R98 [6,44].
These diagnostic systems varied greatly as to how many of the other cognitive, perceptual, thinking and circadian symptoms of delirium are represented. Interestingly, the disorganized thinking criterion of DSM-III-R performed well. Disorganized thinking was dropped as a criterion after DSM-III-R in order to improve the reliability of delirium diagnosis when assessed by non-psychiatrists [4]; however, because it is a core domain symptom, our data suggest it should be included again in diagnostic criteria. Two other core domain symptoms, which describe circadian activity, have separate criteria in ICD-10 but performed only moderately well in accuracy, although they performed better than the "other cognitive" criterion in ICD-10.
None of these four diagnostic systems has individual criteria representing all three core domains of delirium (cognitive, circadian, and higher order thinking) [39][40][41][42]. DSM-III-R has disorganized thinking and ICD-10 has two circadian criteria. DSM-III-R includes more core domain symptoms than do the other DSM versions, though they are collapsed with "consciousness" into one compound criterion (i.e., consciousness, perception, sleep-wake cycle, motor activity, orientation and memory). This particular compound criterion was the only criterion from among all the systems whose accuracy was not significantly different between delirium and nondelirium groups. It would be worth studying new criteria that individually capture all three core domains.
Further, the compound criteria from DSM-III-R (C), DSM-IV (B), and DSM-5 (C) each lowered the accuracy of their system: accuracy was higher when they were deleted. Because compound criteria, which comprise more than one type of symptom, had lower accuracies, we recommend they be avoided in future diagnostic system revisions.
Accuracies were highest for the A criteria in each system, consistent with their being cardinal for the syndrome of delirium. Criteria for symptoms other than inattention, such as those evaluating other cognitive aspects, had lower accuracies but showed high sensitivity despite low specificity. As such, they may be useful for delirium screening.
The wording of the cardinal A criterion varies across these systems, where DSM-IV and ICD-10 mention "consciousness" along with inattention. Though contributing much to accuracy, inter-rater reliability was less strong when inattention was combined with consciousness as compared to cardinal criteria that only included the components of consciousness (i.e., attention and awareness). However, "clouding of consciousness" has no precise or common definition. Note that the DRS-R98 does not include vague items like "consciousness" or "clouding of consciousness." Rather, the symptoms of delirium taken together should represent the components of an impairment of consciousness, where cerebral cortical arousal is intact (i.e., level of consciousness is not coma or stupor). Intact consciousness means being alert/attentive (and having other cognitive domains intact), awake (with an intact sleep-wake cycle), and aware (comprehending one's inner self and one's surroundings). So including the term consciousness within the criteria does not help to delineate the particular features of delirium that would establish it as an impaired state of consciousness by its overall definition [44]. Thus, the raters would be influenced by their overall impression of the patient's presentation during the interview to rate consciousness, similar to a clinical global impressions scale (CGI). DRS-R98 items do not include "consciousness" terms and can more cleanly establish the components of delirium when cluster analysis determined the groups. Because we found the highest accuracy (88.0 %) for the ICD-10 "clouding of consciousness and attention alteration" cardinal A criterion, it suggests that such wording functioned like a CGI rating and could be a candidate for a single screening question for use by clinicians in hospital settings.
Cognitive alterations are core for both dementia and delirium, and symptoms of the latter overshadow those of the former when they are comorbid [8,21,22], which may explain the decreased accuracy of diagnostic systems within the dementia subsample. Classification performance for all diagnostic systems in that subsample was slightly lower than in the whole sample, but accuracy remained over 80.0 % for all except ICD-10, which suffered the largest decline (7.7 percentage points). The ICD-10 criterion evaluating memory and orientation also had the largest accuracy drop within ICD-10 and among all individual criteria (12.3 percentage points), suggesting ICD-10 may not be as suitable for use in comorbid dementia cases, though this needs confirmation in other studies.
Inter-rater reliability was highest for DSM-5 and, in the dementia subsample, the lowest for DSM-III-R when considering individual criteria reliabilities. Similar to a previous report of low ICD-10 reliability in general hospital inpatients, we found ICD-10 criteria had the worst reliability values [34]. Reliability values were somewhat lower in the dementia subsample overall as compared with the whole sample. As suggested by Regier et al. [1], comorbidity is usually associated with lower reliability values, especially when concurrent entities have shared symptoms, as happens with dementia and delirium. It could explain why although all diagnostic systems and individual criteria were very precise (95 % CI <0.5 and SE <0.1) in the whole sample, criteria that included cognitive aspects of delirium (criterion C in DSM-III-R and DSM-5) had SE a little over the desired 0.1 value in the subsample with dementia.
Though DSM-5 criteria had the best reliability, their accuracy in our sample was a little lower than that of the other systems, whereas DSM-III-R had the highest accuracy of 87.5 %. A previous report using latent class analysis found that DSM-III-R had higher accuracy than DSM-IV [5]. These findings, taken together, may be a consequence of the trend toward simplification of criteria over newer DSM editions, which improves reliability at the expense of lowering accuracy. An alternative to oversimplification to enhance reliability for nonspecialists is to include operational descriptions for each criterion in future DSM versions, similar to what is available in the DRS-R98 Administration Guide (pdf available from Dr. Trzepacz at pttrzepacz@outlook.com).
Limitations include our use of only the DRS-R98 to capture characteristics of delirium. Designed for broad and detailed phenomenological descriptions of delirium features, it is ideal for this study's purpose with advantages over other existing assessment tools that are not so structured.
A reliable yet-to-be-determined biological marker, perhaps electroencephalography or fMRI, would be an important addition to phenotype criteria validity assessment, which we did not include.

Conclusions
All diagnostic systems correctly classified (>80.0 %) delirium versus nondelirium cases as compared to an agnostic cluster-analysis reference standard, though all performed less well in the comorbid dementia subsample. The two best performing individual criteria across all classification systems were the attentional disturbance and acute onset features. Compound criteria (i.e., those with more than one type of symptom) tended to have lower accuracies and should be avoided in future diagnostic system revisions. None of the four diagnostic systems includes separate criteria that represent all three core domains of delirium (cognitive, circadian, higher order thinking).
In summary, ours is the first evaluation of four classification systems for delirium diagnosis that utilized comparisons of accuracy to an "agnostic" rating of symptoms using the DRS-R98 by an independent rater, and assessed classification performance characteristics of each system. This approach lends itself to discernment of how criteria are written in order to develop an even better set of diagnostic criteria that could truly serve as a reference standard.

Additional file
Additional file 1: Table S1.