Accuracy of telepsychiatric assessment of new routine outpatient referrals

Background: Studies on the feasibility of telepsychiatry tend to concentrate on only a subset of clinical parameters. In contrast, this study utilises data from a comprehensive assessment. Its main objective is to compare the accuracy of findings from telepsychiatry with those from face-to-face interviews.
Method: This is a primary, cross-sectional, single-cluster, balanced-crossover, blind study involving new routine psychiatric referrals. Thirty-seven of the forty cases fulfilling the selection criteria went through a complete set of independent face-to-face and video assessments by researchers who were blind to each other's findings.
Results: The accuracy ratios of the pooled results for DSM-IV diagnoses, risk assessment, and non-drug and drug interventions were all above 0.76, and the combined overall accuracy ratio was 0.81. Cohen's kappa showed substantial intermethod agreement on all the major components of evaluation except the Risk Assessment Schedule, where agreement was only weak.
Conclusion: Telepsychiatric assessment is a dependable method with a high degree of accuracy and substantial overall intermethod agreement when compared with the standard face-to-face interview for new routine outpatient psychiatric referrals.


Background
Verbal information and visual cues are the primary ingredients of psychiatric assessment, and the sounds and images transmitted through video-conferencing correspond to these two parameters respectively. Other factors such as empathy and rapport are also crucial; their influence on the outcome of assessment is well understood but not well quantified. The assumption that video-conferencing would provide results equivalent to those of a face-to-face psychiatric interview rests on these correspondences and requires testing and quantification. Trust and confidence in using this technology can be greatly enhanced if the assumption is proved true. In view of rapid developments in hardware, wireless technology and data transmission, psychiatric intervention through video-conferencing (telepsychiatry) can be an effective mode of service delivery, especially for remotely located population clusters.
Meeting the mental health needs of remote and sparsely populated communities has been a challenge to service providers owing to various factors, including resource constraints and difficulty in recruiting mental health staff. Attempts have been made to address these concerns through the use of new and emerging technologies. When such new methods are used for clinical assessment, there are bound to be inherent uncertainties as to whether they are as reliable, sensitive and accurate as existing methods of clinical assessment.
Although several studies [1][2][3][4] can be found on pilot projects and the feasibility of telepsychiatry, it seems that to date none have attempted to test a predetermined hypothesis for a complete set of clinical parameters in adult psychiatry. Earlier reviews [5][6][7] were complemented by evaluation of psychiatric assessment using the telephony system [8], video recording [9] and then video-conferencing [10,11]. A recent study [12] demonstrated usefulness of telepsychiatry as a valuable clinical and research tool. However, most of these studies focussed narrowly on the diagnostic aspect of psychiatric assessment. Other studies have attempted to deal with psychopathology [13,14], cost and feasibility [15,16], user satisfaction [3,17], acceptability [18] and psychological intervention [19,20]. The authors of this study failed to identify studies of reasonable quality on complete and comprehensive assessments of new psychiatric referrals in a general adult outpatient clinic.
This study attempts to detect the level of intermethod agreement between telepsychiatric assessment and face-to-face interview for routine new outpatient referrals to a general adult psychiatric unit. It is anticipated that there is a high level of agreement between conclusions drawn from psychiatric interviews through video-conferencing (V) and the standard method of face-to-face psychiatric assessment (S) for diagnosis, risk assessment and clinical intervention. This study aims to test this assumption.

Setting and participants
The study was conducted at Hawkes Bay Health Care, which provides a national health service to the Hawkes Bay and Chatham Island areas of New Zealand (NZ). The study was approved by the Hawkes Bay Ethics Committee, New Zealand. The sample consisted of consecutive new adult psychiatric referrals to the Napier Community Mental Health Team (NCMHT). They were aged 19 to 65 years, were not under the care of the NCMHT, and had not received care for any mental health issue from this unit for at least 6 months at the time of referral. Cases requiring urgent assessment or a home visit were excluded.
In clinical practice, the outcome of the standard method of face-to-face assessment (S) is conventionally assumed to be accurate. Accordingly, using method S as the gold standard, a result from V can be classified as 'accurate' if the outcome is identical to that from S for a given attribute, or as 'inaccurate' if there is disagreement between methods S and V. For the purpose of this study, the accuracy ratio (AR) is defined as the risk ratio (RR) between the accurate outcomes of video assessment (V) and the results of face-to-face assessment (S). Assuming an AR of 0.95 or above, a significance level of 0.05 and a power of 0.8, a sample size of 34 per method would suffice to detect a difference of 15% or more between the two methods of assessment [21]. Accordingly, a sample of 40 participants based on single-stage cluster sampling was considered adequate for this two-way, within-subjects, crossed balanced design. The data derived from this study were also used for the calculation of Cohen's kappa (CK) and its bootstrap confidence interval.
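As a minimal illustration of this definition (not the authors' code; the data and function name below are hypothetical), with S taken as the gold standard the AR for a single attribute reduces to the proportion of V outcomes that match the paired S outcomes:

```python
def accuracy_ratio(s_outcomes, v_outcomes):
    """Accuracy ratio (AR): proportion of video-assessment (V) outcomes
    that agree with the face-to-face (S) outcomes, with S assumed to be
    100% accurate (the gold standard)."""
    if len(s_outcomes) != len(v_outcomes):
        raise ValueError("paired assessments are required")
    agree = sum(1 for s, v in zip(s_outcomes, v_outcomes) if s == v)
    return agree / len(s_outcomes)

# Hypothetical paired DSM-IV Axis-1 codes for six cases (illustrative only)
s = ["296.32", "300.02", "295.30", "296.32", "309.81", "300.4"]
v = ["296.32", "300.02", "295.30", "300.02", "309.81", "300.4"]
print(accuracy_ratio(s, v))  # five of six outcomes agree
```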
Of the 40 consecutive new psychiatric referrals fulfilling the above criteria, two declined to participate and one could not be located. Written informed consent was obtained from the remaining 37 cases, all of whom completed their intended assessments.

Assessment procedure
The assessment order for each method and each psychiatrist was predetermined by random allocation. The whole list was randomly divided into two sub-lists, and participants were then randomly allocated to have their first assessment by either researcher R1 or R2. A randomly selected half of each researcher's cases had their first assessment by method S and the remaining half by method V. The second assessment (S or V as appropriate) of each case was subsequently completed by the other researcher. The details of the randomisation and assessment procedures are displayed in Figure 1.
Neither researcher had prior experience of conducting formal telepsychiatric interviews for clinical care. Before initiating the research assessments, the researchers spent one session familiarising themselves with the equipment, and two sessions practising on known cases to evaluate and compare their findings in order to enhance their interrater agreement.
All face-to-face interviews were conducted at Hastings, while video assessment was carried out from Wairoa, 140 km away. All participants underwent both methods of assessment; each participant having one assessment on video by one psychiatrist and one face-to-face interview by the other psychiatrist. The interviewers utilised their own usual practice of clinical interview to resemble a standard outpatient setting.
The main confounding variables likely to influence the level of agreement are: bias between the researchers doing the assessments, duration of interview, use of an interpreter, order effect (the effect on the second interview of practice or residual memory from the first interview), and the time interval between the two interviews. The influence of such biases was minimised by adopting a crossover design and assigning an equal number of cases to each interviewing psychiatrist, each interview method, and each order of assessment (S followed by V, and V followed by S). The researchers were not aware of each other's findings while assessing an individual participant. Both assessments for each participant were completed on the same date and each assessment lasted up to 60 minutes. If an interpreter was involved, he/she had to attend both sessions for that case; the interpreter was not used on the interviewee side, to prevent unnecessary distraction during the interview.
Figure 1: The sample randomisation. The numbers in the boxes are the serial numbers of the sample cases; those crossed out are dropouts. The sample was randomly divided into two equal groups, randomly assigned between R1 and R2, and each group was further split at random into two: first interview by R1 using S (second by R2 using V); first interview by R1 using V (second by R2 using S); first interview by R2 using S (second by R1 using V); and first interview by R2 using V (second by R1 using S).

Diagnostic tools, scales and data
Diagnoses on the DSM-IV axes were based on the method described in the Decision Trees for Differential Diagnosis of the Handbook [22], with assistance from the manual when required. A Risk Assessment Schedule (RAS) was adapted from the guidelines for assessment of risk factors identified by the NZ Ministry of Health [23]. This scale has not been tested for reliability and validity and is included for information [Appendix-I]. A List of Psychiatric Intervention (LIPI, Appendix-II) was developed to record options for admission/discharge/follow-up, investigations, psychological intervention and community support; primarily, this is a list of clinical decisions to be selected where applicable. The details of any pharmacological intervention were also recorded in a structured format.
The full diagnostic code for DSM-IV Axis-1, the presence or absence of diagnoses on Axis-2 and Axis-3, the applicability or non-applicability of the DSM-IV Axis-4 questions, and the score on the DSM-IV Axis-5 Global Assessment of Functioning (GAF) were recorded for each assessment.
To confer uniformity, the numerical GAF score was converted to an ordinal scale ranging from low to high categories (A to E), based on a class-interval method fulfilling the transformation criteria [24]. The original RAS scoring options of 'NIL' and 'LOW' were merged into 'low', and 'HIGH' and 'VERY HIGH' into 'serious'. This produced three distinct risk categories ('low', 'medium' and 'serious') for the purpose of statistical analysis. Possible responses for items of the LIPI scale were dichotomous, excepting drug-related outcomes. Clinical decisions for investigations, psychological intervention and community support were summarised on a group-wise basis. All medications were classified into nine types for eleven indications, resulting in five drug-related initiatives.
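These recodings can be sketched as follows. This is an illustrative sketch only: the 20-point GAF class width is an assumption, since the paper does not state its exact class boundaries, and the function and variable names are hypothetical.

```python
def gaf_category(score, width=20):
    """Map a numerical GAF score (1-100) onto an ordinal category A-E
    using equal class intervals. The 20-point width is an illustrative
    assumption, not the paper's stated boundaries."""
    if not 1 <= score <= 100:
        raise ValueError("GAF scores range from 1 to 100")
    return "ABCDE"[min((score - 1) // width, 4)]

# RAS recoding: the original options collapse into three risk categories
RAS_RECODE = {
    "NIL": "low", "LOW": "low",
    "MEDIUM": "medium",
    "HIGH": "serious", "VERY HIGH": "serious",
}

print(gaf_category(55), RAS_RECODE["VERY HIGH"])  # mid-range GAF, merged risk
```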
Minor adjustments were made to present the data table in an n-by-n format as a pre-requisite for kappa calculation. One DSM-IV Axis-1 diagnosis of disorganised schizophrenia (V, 295.10) was changed to paranoid schizophrenia (295.30) giving a concordant entry; one case of cyclothymic disorder, (S, 301.13) was changed to bipolar disorder (296.56) giving a discordant entry; and one case of factitious disorder (V, 300.16) was changed to somatoform disorder (300.81) giving a concordant pair.
If the number of total diagnoses for a given case differed between methods of assessment, a category of 'NIL' was introduced to reflect the lack of an equivalent diagnosis by the corresponding method. This led to the introduction of two concordant pairs and one discordant pair by the first method of adjustment, and three discordant pairs arising from 'NIL' categories by the second. The resulting preponderance of discordant over concordant pairs is likely to bias the interpretation against the research hypothesis rather than in favour of it.

Statistical evaluation
The test statistics of AR, risk difference (RD) and CK were calculated and summarised in accordance with methods described in standard texts [24][25][26]. For the purpose of comparisons using AR and RD, an assumption is made that all outcomes from face-to-face assessment are 100% accurate. With the asymptotic method of computation, some of the upper confidence limits of CK may exceed the permitted maximum value of 1. Techniques such as the bias-corrected accelerated bootstrap confidence interval (BCaCI) [27] and the exact p estimate [28] have been advocated to resolve this paradox. Accordingly, this study applied nonparametric BCaCI methodology using 50,000 bootstrap samples with replacement.
The re-sampling was performed in a manner that retained the structural consistency of each subgroup. The techniques of bootstrapping and re-sampling are well-established statistical methods, yet remain relatively little known in the medical literature. The required software code was developed for these models by the principal author (SPS) using R (version 2.4.1) [29] and was tested against other packages (SPSS, SAS and S-Plus) before data analysis. R is an open-source statistical language and environment from the R Foundation for Statistical Computing, Vienna (ISBN 3-900051-07-0, 2006).
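The study's analysis was implemented in R; as an illustrative sketch only (not the authors' code), the core calculation can be expressed in Python. The percentile interval below is a deliberate simplification of the BCa interval actually used, and the paired diagnosis codes are hypothetical.

```python
import random
from collections import Counter

def cohens_kappa(r1, r2):
    """Unweighted Cohen's kappa for two paired categorical ratings."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2.get(k, 0) for k in c1) / (n * n)  # chance agreement
    if pe == 1:  # degenerate resample: both raters used one shared category
        return 1.0
    return (po - pe) / (1 - pe)

def bootstrap_ci(r1, r2, reps=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for kappa, resampling case *pairs* with
    replacement so each rater's ratings stay linked. (A simplification:
    the study used the BCa method with 50,000 replicates.)"""
    rng = random.Random(seed)
    n = len(r1)
    stats = sorted(
        cohens_kappa([r1[i] for i in idx], [r2[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(reps))
    )
    return stats[int(alpha / 2 * reps)], stats[int((1 - alpha / 2) * reps) - 1]

# Hypothetical paired DSM-IV Axis-1 codes from methods S and V
s = ["296.32", "300.02", "295.30", "296.32", "309.81", "300.4", "296.32", "300.02"]
v = ["296.32", "300.02", "295.30", "300.02", "309.81", "300.4", "296.32", "300.02"]
print(cohens_kappa(s, v), bootstrap_ci(s, v))
```

Resampling whole cases (index pairs) rather than individual ratings mirrors the paper's point about preserving the structure of each subgroup during re-sampling.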

Results
Of the 37 participants, 20 were female and 17 male. Ethnically, 27 participants were of European descent, 8 of Maori origin and 2 from other groups. Their ages ranged from 19.21 to 63.29 years, with a mean of 35.40 years (standard deviation 12.46 years).
Statistical significance for the AR results in Table 1 is based on a two-sided p value of 0.05 or less. The primary data from rows 1 to 30 of Table 1 and rows 1 to 28 of Table 2 have been re-used for the summaries in the remaining rows of their respective tables. This invariably led to multiple comparisons, and interpretation of the results should reflect this limitation.
The ARs (Table 1) with nil variance (rows 6, 8, 10) were excluded from comparison owing to the constant nature of the observation data. Results with an upper 95% confidence limit of AR > 1 (rows 2, 3, 7, 11, 19, 24) were not treated as statistically significant, since the accuracy ratio cannot meaningfully exceed 1. The same criteria were applied to the RDs (Table 1).
Notes to Tables 1 and 2: The column 'Size' gives the total number of sample points used for the assessments. The column 'Group' gives the total number of categories of results; for example, all diagnoses of paranoid schizophrenia form one group. CK denotes Cohen's kappa values derived from the 'n by n' table formed from the Groups; significant values (lower CI greater than 0) are printed in bold. Weighted kappa using squared-error weights was computed for ordinal observations where applicable. The 95% lower and upper confidence intervals (LCI and UCI) in rows 1 to 28 are derived from the nonparametric bias-corrected accelerated (BCa) variance of 50,000 bootstrap samples with replacement. The summary CKs, LCIs and UCIs in rows 29 to 47 are based on the inverse-variance method, using the CKs and original variances from the preceding rows, displayed in brackets for the respective assessments. The definitions of Drug Type, Drug Action and Drug Indication are the same as in Table 1.
The interrater agreement between the two interviewing psychiatrists was very close to the intermethod agreement: the respective values differed by no more than 0.01, with a considerable degree of overlap between their confidence intervals. Accordingly, there is no significant difference between intermethod and interrater agreement.

Discussion
This study aims to establish whether conclusions drawn from telepsychiatric assessments are in agreement with those from the standard method of face-to-face assessment for new referrals in an outpatient clinic. Most new and old referrals to the Community Mental Health Teams in the UK and NZ are discussed in a multidisciplinary team setting. Application of DSM diagnostic criteria in NZ has gradually evolved into standard clinical practice and is becoming popular in the UK. The application of Decision Tree for diagnosis on Axis-1 of DSM-IV can be easily adopted without significant additional resources and training. The methodology adopted in this study emulates the real clinic situation; hence its findings are both applicable and relevant to day-to-day clinical practice.
The authors took all necessary precautions in dealing with anticipated results, some of which might be paradoxical or erroneous, e.g. CK and AR exceeding the maximum permitted value of 1. Results with AR > 1 or with nil variance were excluded from further statistical interpretation. Although the number of studied cases was only 37, the number of sample points (as in the 'Size' column of Table 1) was large enough for meaningful interpretation. The issue of sample size in handling a large number of diagnostic categories for CK was dealt with using the bootstrapping technique and the nonparametric bias-corrected accelerated bootstrap confidence interval. This approach enhances the statistical quality of the data analysis.
A previous study [5] evaluating twelve telepsychiatric and face-to-face assessments on multiple scales found a mean weighted kappa coefficient of 0.85. The interrater reliability for diagnosis between two psychiatrists in three different experimental conditions on 63 patients has been found to vary between 0.69 and 0.85 [6]. A review of telepsychiatry services in Australia concluded that this technology could be reliably used for treatment recommendations and diagnostic assessments [7]. A Canadian study involving child psychiatrists found that in 96% of cases, the diagnosis and treatment recommendations made via video-conferencing were identical to those made in face-to-face interviews [10]. Another study [8] using telecommunication and audiovisual technology found interrater diagnostic agreement of 0.70. Despite methodological differences, the results from the present study are consistent with the findings quoted above in this paragraph.
Interrater agreement among clinicians for videotaped face-to-face interviews has been noted to have a relatively low CK value of 0.55 [9]. It is possible that the flexibility conferred by the ability to question the patient in real time, whether face-to-face or by video-conferencing, is an advantage over videotape assessment and may account for the improved intermethod and interrater agreement.
In a field trial of DSM-III, the interrater reliability of face-to-face interviews for the major disorders varied between … [31], and another study, on ICD-10, also yielded fair to good kappa values for the four-character diagnostic codes [32]. In comparison, the outcome of telepsychiatric assessment in the current study is at least similar to the interrater reliability of face-to-face interviews in these large field trials.
None of the primary studies quoted above tested intermethod agreement for a complete set of clinical parameters, and their sample sizes and statistical methodology also limit the conclusions that can be drawn. In contrast, the present study employs suitable statistical methods for a comprehensive outpatient assessment covering multiaxial DSM-IV diagnosis, risk assessment, investigations and treatment. Compared with the standard method of face-to-face interview, telepsychiatric assessments in this study have a high accuracy ratio (AR 0.81) and substantial intermethod agreement (CK 0.60).
Although the kappa value of intermethod agreement for risk assessment is low, it would be premature to ascribe this to telepsychiatric assessment itself. A study of videotaped interviews of 30 patients attending an emergency psychiatry service revealed interrater correlation coefficients of 0.32 and 0.44 for risk to self and to others respectively [33]. Another study conducted in a comparable setting found similar results and observed that in some circumstances the level of disagreement was high enough to warrant concern [34]. In a prospective study of risk assessment in 161 inmates of a high-risk forensic unit [35], agreement among psychiatrists for face-to-face interviews in the absence of operational criteria was very poor (CK -0.006). The same study reported that this agreement can be greatly enhanced (CK 0.742) by the application of operational criteria. The findings of the present study for risk assessment are comparable to these results. In addition, the lower base level of risk in routine outpatient clinics, compared with emergency and forensic psychiatric units, is likely to cause a further decrement in the kappa level.
There is a paucity of research-based knowledge concerning levels of agreement for risk factors. Most of the scales currently used for risk assessment have yet to have their reliability, validity and predictive value ascertained. Establishing agreement levels in statistical terms for uncommon risk events (suicide, homicide, etc.) would require an enormous sample, which may not be feasible. Lack of a valid and reliable tool for risk assessment may produce erroneous results, while uncertainty over the time frame (immediate, short-term or long-term) for risk anticipation may lead to inconsistencies in reporting and recording. Short-term serious risk will probably be dealt with through the emergency system rather than routine outpatient referral, and long-term risks are likely to have minimal influence on the decision-making process for routine outpatient referrals.
The ability to reach an accurate DSM-IV Axis-1 diagnosis through telepsychiatric assessment shows almost perfect agreement (CK 0.90) and high accuracy (AR 0.91). Arriving at a reasonable diagnostic impression is a prerequisite of the medical recommendation for assessment or treatment under the Mental Health Act, and this objective can be well achieved through telepsychiatry. Another prerequisite under the Act is to evaluate potential risks using input and information from various sources; identification of risk-related concerns directly from the interviewee constitutes only one component of this process. The referring agency usually indicates its concerns about risk elements, and additional information is generally obtained by telephone from other sources such as clinicians and family members. With this in mind, low concordance on risk assessment may not necessarily be a limiting factor in the use of telepsychiatry for Mental Health Act assessments.

Conclusion
Telepsychiatry is a dependable mode of service delivery for diagnostic assessment and psychiatric intervention in routine new referrals. Its accuracy varies between 79% and 83% in comparison with face-to-face interview, and there is substantial overall agreement between the two methods of psychiatric evaluation. Although there is potential for the use of telepsychiatry in Mental Health Act assessments, this requires further research using more refined operational tools to address the low accuracy and agreement scores for risk assessment found in the present study. The accuracy of conclusions drawn from telepsychiatric assessment is likely to improve with further advances in technology [36].

Clinical implications
1. Allows telepsychiatric services to be made available to a geographically distant and inaccessible population where it is difficult and expensive to recruit mental health professionals.
2. Enhances confidence in use of telepsychiatry as an alternative mode of service delivery.
3. Increases scope of international research and collaboration in the practice of clinical psychiatry in different parts of the world.
Limitations and solutions
1. Although the outcome of risk assessment was similar to other studies, the level of agreement for this parameter is significantly low. There is scope to overcome this deficit through the use of operational criteria [35]. Some scientifically unexplored topics, such as tools for risk assessment and their reliability and predictive value, require further research.
2. The study assumed 100% concordance between the clinical decisions of psychiatrists conducting face-to-face interviews. This is seldom the case. Further studies with an added component to detect overall interrater agreement for face-to-face assessment would help eliminate the need for this hypothetical 100% concordance rate.
3. There is an inherent problem in determining the sample size for CK when the number of categories that may be encountered during prospective research is unknown. This requires the application of alternative statistical approaches.
The current study has attempted to address some of these concerns through the use of resampling methods and bootstrap confidence intervals.

Competing interests
The author(s) declare that they have no competing interests.

Surendra P Singh
Planning, design, resource arrangement, coordination, template design, clinical assessment, data collection, data entry, statistical consideration, data analysis, review, writing and submission.

Dinesh Arya
Consultation and suggestion; especially about risk assessment and elements of statistical consideration, clinical assessment and data collection.

Trish Peters
Randomisation, liaison, consenting and coordination with other researchers and participants.
All authors have read and approved the submitted manuscript.
Appendix-I: Table 3. Appendix-II: Table 4.