Identifying long-term and imminent suicide predictors in a general population and a clinical sample with machine learning

Background: Machine learning (ML) is increasingly used to predict suicide deaths, but its value for suicide prevention has not been established. Our first objective was to identify risk and protective factors in a general population. Our second objective was to identify factors indicating imminent suicide risk.
Methods: We used survival and ML models to identify lifetime predictors using the Cohort of Norway (n=173,275) and hospital diagnoses in a Saskatoon clinical sample (n=12,614). The mean follow-up times were 17 years for the Cohort of Norway and 3 years for Saskatoon. People in the clinical sample had a longitudinal record of hospital visits grouped into six-month intervals. We developed models in a training set, and these models predicted survival probabilities in held-out test data.
Results: In the general population, we found that a higher proportion of low-income residents in a county, mood symptoms, and daily smoking increased the risk of dying from suicide in both genders. In the clinical sample, the only predictors identified were male gender and older age.
Conclusion: Suicide prevention probably requires individual actions with governmental incentives. The prediction of imminent suicide remains highly challenging, but machine learning can identify early prevention targets.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12888-022-03702-y.


ANALYTICAL DETAILS IN THE COHORT OF NORWAY
Steps 1 and 2: Partitioning and Balancing
We first partitioned the Cohort of Norway data into training and testing subsets, assigning 85 percent of the suicide cases to training (n=270) and 15 percent to testing (n=49). Imbalanced data, in which one class predominates over the other, makes it difficult for machine learning algorithms to predict the minority outcome [1]. Accordingly, we randomly sampled 270 individuals who did not die from suicide (i.e. alive or died from other causes) and added them to the training set, giving a balanced set of suicides and non-suicides. The testing subset consisted of the remaining suicide cases (n=49) plus the rest of the Cohort of Norway participants. Partitioning the data in this manner resulted in a small training set (n=540) and a large test set (n=172,684). In the sex-segregated models, the training set comprised 305 males and 235 females, and the testing set comprised 83,845 males and 88,839 females.
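A minimal R sketch of this partition-and-balance step is shown below; the data frame and variable names (conor, suicide, id) are illustrative placeholders, not the actual analysis code.

library(dplyr)

set.seed(1)                                              # arbitrary seed for reproducibility
cases    <- filter(conor, suicide == 1)                  # all suicide deaths (n = 319)
controls <- filter(conor, suicide == 0)                  # alive or died from other causes

train_cases    <- slice_sample(cases, prop = 0.85)                  # ~270 suicides for training
test_cases     <- anti_join(cases, train_cases, by = "id")          # remaining ~49 suicides
train_controls <- slice_sample(controls, n = nrow(train_cases))     # 1:1 sample of non-suicides

train <- bind_rows(train_cases, train_controls)          # balanced training set (n = 540)
test  <- bind_rows(test_cases,
                   anti_join(controls, train_controls, by = "id"))  # rest of the cohort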

Step 3: Imputation
We examined the missing data patterns and discarded nine variables with a missing proportion exceeding 20 percent. According to Liao and colleagues [2], a 20 percent cut-off is typical practice, and researchers should not be overconfident that highly missing variables are imputable. The discarded variables (and their missing proportions) were: living with others aged 18 and below (52%), having children of kindergarten age (78%), using vitamin/mineral supplements (69%), taking lipid-lowering drugs (65%), number of good friends (24%), number of months taking antidepressants in the previous year (70%), lack of sleep has affected work (33%), frequency of attendance at organized activities (28%), and frequency of sleeplessness (32%). The remaining variables had missing rates below the 20 percent threshold (or no missing values at all), except for having an injury requiring hospitalization (21%), which slightly exceeded it. This variable was retained, in part, because some of these injuries may have been self-harm and therefore potentially informative about suicide.
We adhered to two principles in carrying out imputation: (1) preventing data leakage between training and testing sets, and (2) preventing the outcome (suicide death and survival time) from influencing the imputation of predictors. Since our primary objective was to identify predictors of suicide, using suicide status and follow-up time to impute missing predictors would be circular. Accordingly, we imputed the training and test sets separately and excluded the outcome variables from the imputation models. Because we had continuous, ordinal, and dichotomous variables, we chose two methods shown to be appropriate for datasets with mixed variable types: missForest [3] and Factor Analysis for Mixed Data (FAMD) [4], an adaptation of principal components analysis to mixed data types. We ran a bake-off between FAMD and missForest, comparing the accuracy of each method for each candidate variable at four different missing proportions: 5, 10, 15, and 20 percent. We then used the method with greater overall accuracy to impute a particular variable. This follows the strategy of Liao and colleagues [2], in which the better-performing (i.e. more accurate) method is selected for each variable.
Accuracy was assessed in the following manner. Two identical copies of the testing dataset were created, one for imputation using missForest and the other using FAMD. We used the testing data for the assessment of accuracy (instead of the training data) because it was large, which enabled us to mask (or hide) known data. Accuracy was assessed by comparing the imputed values of the masked data with the original data, using appropriate metrics: RMSE for continuous and ordinal variables [5] and AUC for dichotomous variables. Figure S1 is a schematic diagram of the masking and accuracy assessment procedure.
Fig. S1. Two identical copies of the original data (left) were created and allocated for imputation by missForest and FAMD (middle). Since the original data also has missing values, assessment of accuracy is restricted to values that were not missing in the original. We masked these values (pink highlight) to serve as ground truth. Accuracy is defined by the imputed value (gray or yellow highlight) that has a smaller error with respect to the true values (green highlight). Dashed arrows point to examples of imputed values with greater accuracy. The method with greater accuracy is then used to impute the missing values in the original data, resulting in a completed dataset. Highlights in the completed data (right) indicate the source of a particular value, i.e. missForest (yellow), FAMD (pink), or original data (green).
We found that imputation accuracy remained stable at various missing proportions in the 5 to 20 percent range (Table S1). Generally, missForest had greater accuracy for dichotomous variables and FAMD for ordinal and continuous ones. Once we found which method worked better for a particular variable in the testing subset, we used that same imputation method in the training subset.
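A simplified R sketch of the bake-off for one masking proportion and one continuous variable is shown below; conor_test and the variable age are placeholders, and only the RMSE comparison is illustrated.

library(missForest)   # missForest(), prodNA()
library(missMDA)      # imputeFAMD()

masked <- prodNA(conor_test, noNA = 0.10)                  # mask 10 percent of values at random

imp_mf   <- missForest(masked)$ximp                        # random-forest imputation
imp_famd <- imputeFAMD(masked, ncp = 3)$completeObs        # FAMD imputation

# Restrict the comparison to cells that were observed in the original data but masked here
idx  <- which(is.na(masked$age) & !is.na(conor_test$age))
rmse <- function(truth, imputed) sqrt(mean((truth - imputed)^2))
rmse(conor_test$age[idx], imp_mf$age[idx])                 # missForest accuracy for this variable
rmse(conor_test$age[idx], imp_famd$age[idx])               # FAMD accuracy for the same variable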
Step 4: Fitting survival and ML models to the training data
We first fitted Cox models to the training data, entering each of 13 predictors in a univariate model. Predictors significant at the p = .2 level were entered into multivariable Cox models, one for males and one for females, consistent with previous work [6,7]. We tested the proportional hazards assumption in these Cox models and found no violation.
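A minimal sketch of this screening step with R's survival package follows; the predictor names (daily_smoking, mood_symptoms, county_low_income) and the data frame train_males are illustrative placeholders, not the actual analysis code.

library(survival)

# Univariate screening: retain predictors significant at p < .20
fit_uni <- coxph(Surv(follow_up_months, suicide) ~ daily_smoking, data = train_males)
summary(fit_uni)$coefficients

# Multivariable model with the retained predictors, fitted separately by sex
fit_multi <- coxph(Surv(follow_up_months, suicide) ~ daily_smoking + mood_symptoms +
                     county_low_income, data = train_males)

# Test of the proportional hazards assumption
cox.zph(fit_multi)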
We then fitted random survival forests to the same training data using the randomForestSRC package [8]. The predictive ability of random forests depends on so-called hyper-parameters (i.e. number of variables to try, number of observations per terminal node, number of split points per variable) specified by the user. Accordingly, we selected the optimal hyper-parameters in a tuning process that resulted in the smallest out-of-bag error.
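A sketch of this tuning-and-fitting step with randomForestSRC, using the same placeholder names, might look like the following; the ntree and nsplit values are arbitrary choices for illustration.

library(randomForestSRC)

# Tune mtry and nodesize by minimizing out-of-bag error
tuned <- tune(Surv(follow_up_months, suicide) ~ ., data = train_males)
tuned$optimal      # selected nodesize and mtry from the package's tune() helper

# Fit the random survival forest with the selected hyper-parameters;
# nsplit controls the number of candidate split points per variable
rsf <- rfsrc(Surv(follow_up_months, suicide) ~ ., data = train_males,
             ntree = 1000,
             mtry = tuned$optimal["mtry"],
             nodesize = tuned$optimal["nodesize"],
             nsplit = 10)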

Step 5: Top Predictor Variables
We examined the important predictor variables, separately by gender, in both the Cox and random survival forest models. In the Cox models, we focused on variables that were statistically significant at p = .05. For the random survival forest, we used Altmann's permutation method [9], selecting a p value threshold of .05. With this method, the importance of a variable is based on the decrease in prediction accuracy when the outcome is permuted vis-à-vis the true values [9]. This method is resilient to the influence of variables with many categories. We implemented Altmann's permutation method using the ranger package [10].
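A sketch of this step with ranger, again using placeholder names, is given below.

library(ranger)
library(survival)

rf <- ranger(Surv(follow_up_months, suicide) ~ ., data = train_males,
             importance = "permutation")

# Altmann's method refits the forest on permuted outcomes to obtain a p value per predictor
imp <- importance_pvalues(rf, method = "altmann", num.permutations = 100,
                          formula = Surv(follow_up_months, suicide) ~ .,
                          data = train_males)
imp[imp[, "pvalue"] < 0.05, ]    # predictors retained at the .05 threshold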
Steps 6-7: Predicting the Test Set Outcomes and Assessment of Accuracy
Using the models developed in Step 4, we predicted the survival probabilities of the observations in the test set at the first, second, third, and fourth quartiles of follow-up time. These values were 181, 212, 243, and 267 months. The models' predictive accuracy was assessed using Heagerty's cumulative cases/dynamic controls method [11]. This method is intended for the assessment of predictors that were measured at participants' entry. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at each quartile were calculated based on an optimal threshold, selected using the criterion proposed by Perkins and Schisterman [12]. This threshold is the value of the marker that gives the point closest to the upper-left corner of the ROC curve (i.e. (0,1)). We also calculated the area under the ROC curve (i.e. the C-index) for the last follow-up period. The calculation of accuracy measures was performed in Stata using the package stroccurve [13]. The accuracy measures for the Cohort of Norway test set are in Table S4.
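A rough R analogue of this assessment for a single quartile, using the survivalROC package (cumulative cases/dynamic controls) and the Perkins-Schisterman cut-point, is sketched below; predicted_risk stands in for the model's predicted risk score and is not part of the original analysis, which used Stata.

library(survivalROC)

# Cumulative cases / dynamic controls ROC at the first follow-up quartile (181 months)
roc <- survivalROC(Stime = test$follow_up_months, status = test$suicide,
                   marker = test$predicted_risk, predict.time = 181, method = "KM")
roc$AUC

# Perkins-Schisterman criterion: the cut-point closest to the upper-left corner (0, 1)
dist_to_corner <- sqrt(roc$FP^2 + (1 - roc$TP)^2)
roc$cut.values[which.min(dist_to_corner)]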

ANALYTICAL DETAILS IN THE SASKATOON CLINICAL SAMPLE
Steps 1 and 2: Partitioning and Balancing
The format of the person-period data for Saskatoon is shown in Table S5. The example individual has four rows, corresponding to the six-month intervals (in days) in which the person visited the hospital. Along with the fixed variables Age and Male, the mental health conditions associated with each visit are recorded. Person 1 was seen for anxiety in the first six months and for depression in the three succeeding intervals. The person died of suicide sometime between the third and fourth six-month intervals.
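As a hypothetical illustration of this layout (the values below are invented, not taken from Table S5), such a record could be represented in R as a small person-period data frame.

person1 <- data.frame(
  id         = 1,
  interval   = 1:4,                 # consecutive six-month intervals
  age        = 34,                  # fixed covariates are repeated on every row
  male       = 1,
  anxiety    = c(1, 0, 0, 0),       # seen for anxiety in the first interval
  depression = c(0, 1, 1, 1),       # depression in the three succeeding intervals
  suicide    = c(0, 0, 0, 1)        # death recorded in the final interval
)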
As we did in the Cohort of Norway analysis, the Saskatoon data was partitioned into training and testing subsets. The training data consisted of 67 suicide deaths and 67 living people, giving 134 unique people with a mean follow-up time of 2.5 years (SD: 1.4). Because of multiple records per person, the training data had 777 rows. The testing data had 13 suicide deaths and 12,473 controls, with a mean follow-up time of 3.1 years (SD: 1.2). The testing data had 72,991 rows because of multiple records per person.

Step 3: Imputation
In Saskatoon, all hospital visits are captured electronically, so there were no missing values with respect to the diagnoses and the RESH score. The person-level characteristics had high rates of missingness, so these were not imputed.

Step 4: Fitting survival and ML models to the training data
We fitted two types of models to the Saskatoon training data: a discrete-time survival model and a historical random forest, which uses both historical and current-interval predictors. The historical random forest is similar to a survival model with time-varying predictors, but uses these variables to "grow" decision trees for predicting suicide risk. The historical random forest was implemented in the htree package [14]. The discrete-time survival model was implemented as a logistic regression with the intervals treated as factors. We added the fixed and time-varying covariates one at a time to a base model containing only the intervals as predictors. Variables that were significant at p = .20 were entered into a multivariable logistic model. For the historical random forest, we entered the following as historical predictors: self-harm, anxiety, depression, substance abuse, schizophrenia, mania, ADHD episodes, the RESH score, and previous community mental health visits. For concurrent predictors, we entered only gender and age at index. We set the terminal node size to 13 (i.e. 10 percent of the training sample) based on the recommendation by Dankowski [15].
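A sketch of the discrete-time survival model, fitted as a pooled logistic regression on the person-period data, is shown below; the variable names and data frame train_pp are assumptions, and the historical random forest fitted with htree is not sketched here.

# Base model with the six-month intervals entered as a factor
fit_base <- glm(suicide ~ factor(interval), family = binomial, data = train_pp)

# Candidate covariates are then added one at a time, e.g.
fit_dep <- glm(suicide ~ factor(interval) + depression, family = binomial, data = train_pp)

# Covariates significant at p < .20 enter the multivariable discrete-time model
fit_multi <- glm(suicide ~ factor(interval) + age + male + self_harm + depression,
                 family = binomial, data = train_pp)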

Step 5: Top Predictor Variables
For the discrete survival model, only age was a significant predictor, so it was not necessary to create a multivariable model. For the historical random forest, importance was determined by marginalizing (integrating out) predictors one at a time. The prediction errors of the original model and the marginalized model were then compared, and the z-score and p value of the difference were calculated. This procedure was implemented using the variable importance function in htree.

Steps 6-7: Predicting the Test Set Outcomes and Assessment of Accuracy
Using the models developed in Step 4, the survival probabilities for each person in the test set were predicted at quartiles of follow-up time. We restricted the assessment of accuracy to intervals in which at least one suicide death occurred; this avoids zero PPV values resulting from zero positive cases. In contrast to the Cohort of Norway, we followed the incident cases/dynamic controls method because the Saskatoon data had updated measurements at each interval [16]. Sensitivity, specificity, PPV, and NPV at each of the intervals with a suicide death were also calculated using stroccurve. Unlike the Cohort of Norway analysis, we did not create separate models by sex because we were limited by the number of suicide deaths. The accuracy measures for the Saskatoon clinical sample are summarized in Table S6.

R code excerpt: FAMD imputation of the masked Cohort of Norway data at 20, 15, and 10 percent missing proportions (see Step 3 above).

library(missForest)   # prodNA()
library(missMDA)      # imputeFAMD()

##################### 20 pct
conor_l_miss_20 <- prodNA(conor_l_anon, noNA = .20)   # call assumed; mirrors the 15 and 10 pct blocks
conor_imp20_FAMD <- imputeFAMD(conor_l_miss_20, ncp = 3)
conor_imp20_FAMD <- conor_imp20_FAMD$completeObs
##################### 15 pct
conor_l_miss_15 <- prodNA(conor_l_anon, noNA = .15)
conor_imp15_FAMD <- imputeFAMD(conor_l_miss_15, ncp = 3)
conor_imp15_FAMD <- conor_imp15_FAMD$completeObs
##################### 10 pct
conor_l_miss_10 <- prodNA(conor_l_anon, noNA = .10)