Improving risk prediction accuracy for new soldiers in the U.S. Army by adding self-report survey data to administrative data

Background High rates of mental disorders, suicidality, and interpersonal violence early in the military career have raised interest in implementing preventive interventions with high-risk new enlistees. The Army Study to Assess Risk and Resilience in Servicemembers (STARRS) developed risk-targeting systems for these outcomes based on machine learning methods using administrative data predictors. However, administrative data omit many risk factors, raising the question whether risk targeting could be improved by adding self-report survey data to prediction models. If so, the Army may gain from routinely administering surveys that assess additional risk factors. Methods The STARRS New Soldier Survey was administered to 21,790 Regular Army soldiers who agreed to have survey data linked to administrative records. As reported previously, machine learning models using administrative data as predictors found that small proportions of high-risk soldiers accounted for high proportions of negative outcomes. Other machine learning models using self-report survey data as predictors were developed previously for three of these outcomes: major physical violence and sexual violence perpetration among men and sexual violence victimization among women. Here we examined the extent to which this survey information increases prediction accuracy, over models based solely on administrative data, for those three outcomes. We used discrete-time survival analysis to estimate a series of models predicting first occurrence, assessing how model fit improved and concentration of risk increased when adding the predicted risk score based on survey data to the predicted risk score based on administrative data. Results The addition of survey data improved prediction significantly for all outcomes. In the most extreme case, the percentage of reported sexual violence victimization among the 5% of female soldiers with highest predicted risk increased from 17.5% using only administrative predictors to 29.4% adding survey predictors, a 67.9% proportional increase in prediction accuracy. Other proportional increases in concentration of risk ranged from 4.8% to 49.5% (median = 26.0%). Conclusions Data from an ongoing New Soldier Survey could substantially improve accuracy of risk models compared to models based exclusively on administrative predictors. Depending upon the characteristics of interventions used, the increase in targeting accuracy from survey data might offset survey administration costs. Electronic supplementary material The online version of this article (10.1186/s12888-018-1656-4) contains supplementary material, which is available to authorized users.


Background
Concerns exist about high rates of interpersonal violence, mental disorders, and suicidality among U.S. Army soldiers [1][2][3][4]. Although intensive preventive interventions have been developed in the civilian population and shown to reduce risk of some of these outcomes, including those involving physical and sexual violence (e.g., [5,6]), cost-effective implementation of these interventions would require that they be delivered only to soldiers judged to be high-risk. It has been shown that useful risk targeting systems can be developed for these outcomes based on administrative data available for all U.S. Army soldiers using machine learning methods, with the small proportions of soldiers predicted to be at high risk by these systems accounting for substantial proportions of subsequently observed instances of the outcomes [7][8][9][10][11][12][13]. However, many known risk factors for these outcomes are not assessed in Army administrative records, raising the possibility that risk targeting could be improved by expanding the predictor sets to include information from such additional data sources as self-report surveys [13] and social media postings [14].
Given that the risk of many negative outcomes, including involvement in physical and sexual violence [4,15,16], is especially high in the early years of Army service, a survey carried out at the beginning of service might be especially useful in providing information that would help increase the accuracy of risk-targeting beyond the accuracy achieved by exclusively using administrative data as predictors. A New Soldier Survey (NSS) of this sort was administered as part of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS) [17]. As reported previously [13], prediction models derived from NSS data found that the small proportions of new soldiers judged to be at high risk based on NSS predictors accounted for relatively high proportions of attempted suicides, psychiatric hospitalizations, positive drug screens, and several types of violent crime perpetration and victimization. For example, the 10% of new male soldiers estimated in cross-validated models to have highest risk of major physical violence perpetration in the early years of service accounted for 45.8% of actual acts of major physical violence in the sample.
To date, no results have been reported about the extent to which information obtained in the NSS could be used to increase the accuracy of predictions based exclusively on administrative data. Such increases might be especially large early in the Army career, when administrative data are sparse, with the predictive power of NSS data decreasing relative to that of administrative predictors over time. The current report presents the results of the first attempt to add data from the NSS survey to previously-developed predicted risk scores based on administrative data. We focus on predicting physical and sexual violence perpetration among males and sexual violence victimization among females during the early years of Army service because these are the three outcomes for which separate risk models based on NSS and administrative data were previously developed.

Sample
The NSS was administered to representative samples of new U.S. Army soldiers beginning Basic Combat Training (BCT) at Fort Benning, GA, Fort Jackson, SC, and Fort Leonard Wood, MO between April 2011 and November 2012. Recruitment began by selecting weekly samples of 200-300 new soldiers at each BCT installation to attend an informed consent presentation within 48 h of reporting for duty. The presenter explained study purposes, confidentiality, and voluntary participation, then answered all attendee questions before seeking written informed consent to give a self-administered computerized questionnaire (SAQ) and neurocognitive tests and to link these data prospectively to the soldier's administrative records. These study recruitment and consent procedures were approved by the Human Subjects Committees of all Army STARRS collaborating organizations. The 21,790 NSS respondents considered here represent all Regular Army soldiers who completed the SAQ and agreed to administrative data linkage (77.1% response rate). Data were doubly-weighted to adjust for differences in survey responses among the respondents who did versus did not agree to administrative record linkage and differences in administrative data profiles between the latter subsample and the population of all new soldiers. More details on NSS weighting are reported elsewhere [18]. The sample size decreased with duration both because of attrition and because of variation in time between survey and end of the follow-up period. The sample included 18,838 men (decreasing to 16,479 by 12 months, 15,306 by 24 months, and 3729 by 36 months) and 2952 women (decreasing to 2300 by 12 months, 2094 by 24 months, and 687 by 36 months).

Measures Outcomes
Outcome data were abstracted from Department of Defense criminal justice databases through December 2014 (25-44 follow-up months after NSS completion). Dependent variables were defined as first occurrences of each of the three outcomes for which predictive models had previously been developed from both administrative data and NSS data: major physical violence (i.e., murdermanslaughter, kidnapping, aggravated arson, aggravated violence, or robbery) perpetration by men, sexual violence perpetration by men, and sexual violence victimization of women, each coded according to the Bureau of Justice Statistics National Corrections Reporting Program classification system [19]. The perpetration outcomes were defined from records of "founded" offenses (i.e., where the Army found sufficient evidence to warrant full investigation). The victimization outcome was defined using any officially reported victimization regardless of evidence.

Predictors
As reported in previous publications, separate composite risk scores for each outcome were developed based on models from either the STARRS Historical Administrative Data System (HADS) [8,9,12] or the NSS [13]. The details of building the models that generated these scores are reported in the original papers and will not be repeated here other than to say that they involved the use of iterative machine learning methods [20] with internal cross-validation to predict the outcomes over a one-month risk horizon in a discrete-time personmonth data array [21]. The HADS models were developed using all the nearly 1 million soldiers on active duty during the years 2004-2009 and were estimated for all years of service rather than only for the first few years of service, whereas the NSS models were developed using the NSS sample. We then applied the coefficients from these models to the data from the soldiers in the present samples to generate composite prediction scores. Thus, each person-month had a single score from each model representing the predicted log odds of the outcome occurring (note that this score changed each month for the HADS models, but remained the same within each person for the NSS models because the NSS was administered only once). Each score was then standardized by a mean of 0 and variance of 1 in the total sample. These composite prediction scores were used as the input in the current analysis. In other words, for each of the models reported here, there were two possible two independent variables (plus their transformations and interactions): the standardized log odds of the event occurring according to the HADS model and the standardized log odds of the event occurring according to the NSS model.
The potential predictors selected for inclusion in the iterative model-building process for the HADS and NSS models operationalized 8 classes of variables found in prior studies to predict the outcomes: sociodemographics (e.g., age, sex, race-ethnicity), mental disorders (self-reported Diagnostic and Statistical Manual of Mental Disorders, 4th edition [DSM-IV] disorders in the NSS and medically recorded International Classification of Diseases [ICD] disorders in the HADS models), suicidality/non-suicidal self-injury (self-reported in the NSS and medically recorded in the HADS models), exposure to stressors (assessed in detail in the NSS models with questions about childhood adversities, other lifetime traumatic stressors, and past-year stressful life events and difficulties; assessed in the HADS models with a small number of available markers of financial, legal, and marital problems, information about deployment and stressful career experiences, and military criminal justice records of prior experiences with crime perpetration and victimization), military career information (for new soldiers, Armed Forces Qualification Test [AFQT] scores; physical profile system [PULHES] scores used to indicate medical, physical, or psychiatric limitations; enlistment military occupational specialty classifications; and a series of indicators of enlistment waivers; and for the HADS models, increasing information over the follow-up period about promotions, demotions, deployments, and other career experiences), personality (only in the NSS models), and social networks (only in the NSS models). Results of performance-based neurocognitive tests administered in conjunction with the NSS were also included in the NSS models [22]. More detailed descriptions of the HADS and NSS predictors, the final form of each model (i.e., the variables that were ultimately selected for inclusion by the algorithms), and predictive performance are presented in the original reports [8,9,12,13].

Analysis methods
Analysis was carried out remotely by Harvard Medical School analysts on the secure University of Michigan Army STARRS Data Coordination Center server. Given that respondents differed in number of months of follow-up, we began by inspecting observed outcome distributions by calculating survival curves using the actuarial method [23] implemented in SAS PROC LIFETEST [24]. We projected morbid risk to 36 months even though some new soldiers were followed for as long as 44 months because the number followed beyond 36 months was too small for stable projection. Discretetime survival analysis with person-month the unit of analysis and a logistic link function [21] was then used to estimate a series of nested prediction models for first occurrence of each outcome. Models were estimated using SAS PROC LOGISTIC [24].
The model-testing process involved two steps: first, determining the best model using the HADS risk score only, and then finding the optimal strategy for combining NSS data with the best model from the first step. Specifically, we began with a model including only the composite predicted risk score based on the HADS (expressed as a predicted log odds standardized to have a mean of 0 and a variance of 1), controlling (as in all subsequent models) for time in service; we then estimated models including a quadratic effect of HADS risk score, an interaction of the risk score with time, and their combination. In the second step, we tested the effect of adding the NSS composite predicted risk score to the best HADS model, followed by combinations of a quadratic NSS term, an interaction of NSS score with HADS score, and an interaction of NSS risk score with historical time. Importantly, whereas the values of the NSS composite risk score did not change with time in service because the NSS was administered only once, the values of the HADS composite risk score did change due to the addition of new administrative data each month. We tested the significance of interactions between the composite risk scores and time in service to evaluate the assumption that the HADS composite risk score might become more important over time and the NSS composite risk score less important. Design-based Wald χ 2 tests based on the Taylor series method [25] were used to select the best-fitting model for each outcome. This method took into consideration the weighting and clustering of the NSS data in calculating significance tests.
Once the best-fitting model for each outcome was selected, we exponentiated the logistic regression coefficients and their design-based standard errors for that model to create odds-ratios (ORs) and 95% confidence intervals (95% CIs). We then divided the sample into 20 separate groups (ventiles), each representing 5% of respondents ranked in terms of their risk scores in the best-fitting models, and calculated concentration of risk for each ventile: the proportions of observed cases of the outcome in each ventile. If the models were strong predictors, we would expect high concentration of risk in the upper ventiles. Concentration of risk was calculated and compared not only for the best-fitting models but also for the HADS-only models to determine the improvement in prediction strength achieved by adding information from the NSS rather than relying exclusively on HADS risk scores. We also calculated concentration of risk for the NSS-only models for comparative purposes. Finally, we calculated positive predictive value: the proportion of soldiers in each ventile that had the outcome over the follow-up period. As with morbid risk, positive predictive value was projected to 36 months using the actuarial method to adjust for the fact that the follow-up period varied across soldiers.

Correlations between predictions based on the separate NSS and HADS models
The correlations between HADS and NSS composite risk scores varied over time because of monthly changes in the administrative variables used to generate the HADS predictions. Median (interquartile range) withinmonth Pearson correlations between the two scores were .36 (.34-.38) for major physical violence perpetration among men, .06 (.05-.07) for sexual violence perpetration among men, and .26 (.24-.27) for sexual violence victimization among women ( Table 1). The magnitudes of the associations between the two composite risk scores decreased over time for physical violence perpetration and sexual violence victimization, with Pearson correlations of −.78 and − .84 between number of months in service and magnitude of the withinmonth association between the two scores. The associations increased over time, in comparison, for sexual violence perpetration, r = .84 (Fig. 2).

Relative fit of the base models and extensions
In the first analytic step, none of the expansions of the base HADS models for nonlinearities or interactions improved model fit in predicting either physical violence perpetration among men or sexual violence victimization among women. However, the addition of the NSS risk score improved model fit in both cases. We consequently focused on the additive model (i.e., HADS plus NSS composite risk score) for these outcomes. For sexual violence perpetration among men, however, model fit was improved by inclusion of the interaction between the NSS composite risk score and time since survey administration (χ 2 2 = 6.8, p = .034) relative to the additive model (Table 2). (See Additional file 1: Table S1, for odds ratios and chi-square values for all models tested.)

Coefficients in the best-fitting models
Inspection of the odds ratios (ORs) of univariate models with either the NSS or HADS composite risk scores as the only predictors shows that each score is associated with significantly increased odds of each outcome, with ORs relatively comparable in magnitude for NSS (OR = 1.9-2.1) and HADS (OR = 1.5-2.5). Due to their collinearity, individual predictors' ORs are lower but remain significant in the two additive models that include both composite risk scores as predictors (HADS OR = 2.1 for physical violence perpetration and 1.3 for sexual violence victimization; NSS OR = 1.6 for physical violence perpetration and 1.8 for sexual violence victimization). In the model for sexual violence perpetration, the HADS composite risk score is significant (OR = 1.4) and stable over the follow-up period, whereas the NSS composite risk score is a significant predictor in the first 12 months of service (OR = 2.3), decreases but remains significant during the second year of service (months 13-24; OR = 1.7), and becomes nonsignificant beyond the second year of service (months 25+; OR = 1.3) ( Table 3).

Concentration of risk and positive predictive value in the best-fitting models
Concentration of risk was strongly elevated compared to the 5% of observed cases expected by chance in the top 3 predicted risk ventiles of all three best-fitting models (Fig. 3). 39.5% of observed physical violence perpetration, 26.1% of sexual violence perpetration, and 29.4% of sexual violence victimization occurred among the 5% of soldiers in the top risk ventiles for those outcomes (Table 4). Between 49.8% (sexual violence victimization) and 56.3% (physical violence perpetration) of observed cases of the outcomes occurred among the 15% of soldiers in the top three risk ventiles.
These proportions were for the most part meaningfully higher than those achieved by using only the HADS predicted risk score (Table 4). For example, the 39.5% concentration of risk of physical violence perpetration among soldiers in the top risk ventile of the best-fitting model was proportionally 16.6% greater than the 33.9% concentration of risk among soldiers in the top risk ventile of the model based only on the HADS predicted risk score (i.e., 39.5/33.9). Three of these 9 proportional improvements (i.e., the top 3 ventiles for each of 3 outcomes) were less than 15% (4.8-11.2%). Four others were 15-30% (16.6-29.6%). The largest two were 45.9% and 67.9%.
Despite the high concentrations of risk in the top predicted risk ventiles, positive predictive value was low even in the highest risk ventiles due to the rarity of the outcomes. In any given month, 3.4/1000 male soldiers in the highest predicted risk ventile of physical violence perpetration were accused of that outcome, 1.5/1000 male soldiers in the highest predicted risk ventile of sexual violence perpetration were accused of that outcome, and 11.5/1000 female soldiers in the highest predicted risk ventile of sexual violence victimization experienced that outcome. However, cumulative positive predictive value projected over the first 36 months of service was considerably higher, between 34.6 and 241.7/1000 soldiers in the highest risk ventile across the outcomes.

Discussion
Prediction of all three outcomes considered here was improved, in some cases substantially so, by adding information from the NSS predicted risk score to information from the HADS predicted risk score. One would expect this improvement to shrink somewhat in out-ofsample performance due to the fact that the NSS predicted risk score was developed in the same sample as it was applied. However, a counter-balancing consideration is that incremental prediction accuracy might increase beyond the level found here if an NSS survey became a routine part of Army accession, as the sample available for analysis would then be large enough for disaggregated analyses of individual predictors from both administrative and survey data rather than requiring the use of the composite predicted risk scores we were forced to use here because of the small NSS sample size.
We found unexpectedly that the strength of the NSS predicted risk scores remained stable over the time period of the study for physical violence perpetration and sexual violence victimization. This suggests that the NSS tapped into relatively stable individual differences in risk factors for these two outcomes rather than situational risk factors that became less relevant over time. A review of the most important predictors making up the NSS predicted risk scores showed, consistent with this interpretation, that these variables are dominated by measures of personality, history of pre-enlistment lifetime psychopathology, and history of pre-enlistment lifetime trauma exposures, most notably prior sexual violence victimization among women and prior involvement in violence among men [13]. For sexual violence perpetration, however, the NSS risk score was no longer a significant predictor after the end of the second year in service (i.e., in months 25 and beyond). This could simply be a function of greater uncertainty in the model as time progresses (as the confidence interval for the odds ratio at months 25+ still contains relatively large values), or it could reflect a true decrease in the predictive strength over time. Regardless, the NSS data were valuable for predicting the majority of sexual violence perpetration outcomes, because 83.2% of reported assaults occurred within the first two years of service.
It is less clear why the HADS predicted risk scores did not increase in strength over time, as administrative information about soldiers became richer over time. One plausible interpretation is that the early months of service, when administrative records are sparse, are also characterized by lower prevalence of the outcomes considered here, as new soldiers are more closely supervised during Basic Combat Training (BCT; the first 10-16 weeks in service) and Advanced Individual Training (2-12 months after completion of BCT) so that opportunities for violence are lower. Administrative records become richer after the end of training, at which time prevalence of violence outcomes also increases. The extent to which the temporal consistencies in prediction accuracy continue beyond the early years of service studied here is unclear, although it seems unlikely that the prediction accuracy of survey reports obtained from 18year-old new soldiers will maintain constant strength over many years in service. This question will be the focus of ongoing analyses of the STARRS data as the NSS cohort ages. Even though the addition of NSS data improved prediction of all outcomes beyond the HADS predictions, the question remains whether the magnitudes of these improvements are large enough to justify implementing an ongoing NSS for all new soldiers. The answer depends on the number of interventions the Army might want to implement (which could involve many more than the three outcomes considered in this report), the proportional increases in concentration of risk of a composite risk score using the NSS as well as the HADS compared to the HADS alone at the thresholds selected by the Army for intervention implementation, and the costs, benefits, and competing risks of those interventions in relation to the costs of implementing an ongoing NSS. Uncertainties about these values make it impossible to calculate these cost-benefit ratios here, but these are the calculations the Army needs to make if it is interested in using evidence-based standards for targeting preventive interventions for high-risk new soldiers. If so, future research might also consider other  Although the same sample of soldiers was used for both male outcomes, the number of person-months differed because we predicted first occurrence of each outcome, and each soldier was censored after the month when the outcome first occurred, termination of service, or December 2014, whichever came first. Number of person-months was 543,603 for male physical assault perpetration, 543,636 for male sexual assault perpetration, and 75,772 for female sexual assault victimization. c Out of M1-M5, M2 was the best model for each outcome; M6-M11 add NSS predicted log odds to the best model (Ba) from HADS data alone. The final best models were M6 for physical violence perpetration and sexual violence victimization and M7 for sexual violence perpetration data sources that could be added beyond an ongoing NSS to improve prediction accuracy even further over the accuracy of models based exclusively on administrative predictors. The performance of these models is on par with, or better than, other attempts to use machine learning or more traditional methods to predict risk of crime (e.g., [26,27]), but the accuracy is nonetheless intermediate in strength. Consequently, using these predictions as a basis for decision-making (whether with or without NSS predictor) requires that the benefits of the action taken for those accurately classified as high-risk outweigh the costs to those misclassified as high-risk. For instance, it would certainly be beneficial to deliver a reasonably lowcost intervention that does no harm to those to whom it is administered, but has some effect on reducing interpersonal violence, to a group of soldiers identified as high risk using these models. However, classification might not be accurate enough to deliver an intervention with high per-person expense, or one that causes some kind of harm (e.g., stigma, limiting career advancement) to those who are misclassified.
In addition to their implications for informing U.S. Army decision-making regarding data collection and  Fig. 3 Concentration of risk by ventiles for best model of each outcome use, these findings may be relevant to other researchers using machine learning methods to predict various outcomes for individual humans. In this study, even an extremely rich passively-collected administrative data set was no substitute for querying individuals directly about psychologically relevant variables. Administrative/institutional data are often abundant and incur relatively low additional cost to collect, so they are have formed the typical feature sets used in machine learning algorithms. However, prediction may be considerably improved through the addition of self-report data, especially (1) when an outcome is partly psychologically driven, and consequently, subjective information may be a powerful predictor, and/or (2) when important predictors in administrative data sets may be noisy or inaccurate because they are not fully captured by institutional systems (e.g., health events when no medical care was sought, covert antisocial behaviors). The control conferred when administering self-report questionnaires is an additional advantage; researchers can select the variables they consider to be most essential, and questionnaires can be designed in such a way that data require little processing prior to use in algorithms.

Conclusions
Self-report information can substantially improve prediction of risk for interpersonal violence beyond the information available in administrative databases for Regular Army soldiers early in their careers. The U. S. Army may benefit from ongoing administration of selfadministered questionnaires to new soldiers. Other researchers may want to consider collecting self-report data to augment administrative/institutional data sets when developing machine learning algorithms, especially to predict psychologically driven outcomes.

Additional file
Additional file 1: Table S1. Odds ratios and chi-square values for all models (n = 18,838 men and 2952 women). Results of all models tested and indices of fit. (DOCX 23 kb)