G-estimation of causal pathways in vocational rehabilitation for adults with psychotic disorders – a secondary analysis of a randomized trial

Background Vocational rehabilitation (VR) has increasingly become an important intervention targeting poor occupational functioning in schizophrenia. The Norwegian Job Management Program (JUMP), sought to enhance occupational outcomes by augmenting VR with either cognitive behavioral therapy (CBT) techniques aiming to improve psychotic symptoms or cognitive remediation (CR) aiming to improve cognition. CBT is standard treatment in schizophrenia, but recent meta-analyses question the effect of CBT on negative psychotic symptoms. It is of interest to study the causal role of psychotic symptoms and cognitive functioning on occupational functioning. Methods Data from the JUMP VR – program, was reanalyzed with a causal inference method to assess the causal effects of reduced symptoms / improved neurocognitive functioning on occupational functioning measured by number of working hours per week. Participants (N = 131) had been randomized to either VR + CBT (N = 68) or VR + CR (N = 63). Large improvements in number of working hours were demonstrated in both intervention groups (nonsignificant group difference). G-estimation was used to assess the strength and nature of the causal effects, adjusted for time-varying confounding and selection – bias from loss to follow-up. Results Significant causal effects of reduction in each of four dimensions of symptoms and improved neurocognition respectively, on number of working hours were found (separate models). The effect of negative symptoms was the strongest and increased in magnitude during the whole observation period, while the effect of two other symptoms and neurocognition was constant. Adjusted for confounding (including potential feedback), the causal effect of a hypothetical change in negative symptoms equal to the average improvement in the CBT group corresponded to an increase in working hours of 3.2 h per week (95% CI: 1.11, 5.35). Conclusion High performance of g-estimation in a small psychiatric data set with few repeated measures and time-varying confounding and effects, was demonstrated. Augmented vocational rehabilitation showed causal effects of intervention targets with the strongest and increasing effect from negative symptoms on number of working hours. Combination of therapy and activation (indirect and direct approach) might explain improvement in both cognition and negative symptoms, and shed some light on effective ingredients for improved treatment of negative symptoms. Supplementary Information The online version contains supplementary material available at 10.1186/s12888-021-03349-1.

increasing effect from negative symptoms on number of working hours. Combination of therapy and activation (indirect and direct approach) might explain improvement in both cognition and negative symptoms, and shed some light on effective ingredients for improved treatment of negative symptoms.
Keywords: G-estimation, Causal inference, Vocational rehabilitation, Psychotic disorders Background Schizophrenia is associated with positive and negative symptoms, neurocognitive impairment and poor occupational functioning [1,2]. While positive symptoms (hallucinations, delusions, thought disturbances and impaired reality testing) tend to diminish over time and respond well to medication, this is not the case for negative symptoms (social withdrawal, apathy, avolition and fatigue) [3,4]. There is a growing consensus that targeting negative symptoms is essential to improve long-term functioning (employment, education, friendships) [5]. However, they are not easy to improve [6]. Recent reviews show less than convincing effects of cognitive behavioral therapy (CBT), one of the most frequently used strategies [5,7]. In a systematic search of the literature from 2015, the authors conclude that it is necessary to disentangle effective treatment ingredients to make further improvements [4].
Neurocognitive deficits have large impact on functional outcomes like occupational attainment, with no effective pharmacological treatment [8]. In a metaanalysis, cognitive remediation (CR), targeting the patient's attention, memory and executive functioning, showed stronger effects (small to moderate) on neurocognition when combined with vocational rehabilitation [9]. Cognitive remediation has also been found to improve negative symptoms [10] although this symptom complex is not traditionally a target for CR interventions.
Vocational rehabilitation (VR) in schizophrenia is an approach aimed at helping individuals attain and maintain work. Evidence-based approaches in VR exist [11], but challenges remain causing discontinuation of employment. Thus, in an attempt to strengthen effects, VR programs are being augmented with other therapeutic approaches. Since both symptoms and neurocognitive impairment strongly influence occupational functioning, these factors were targeted in the Job Management Program (JUMP). The JUMP Study was a randomized, multisite hybrid VR program for adults with psychotic disorder in Norway [12][13][14]. Significant improvements in occupational outcomes were found in both groups (VR + CR versus VR + CBT) across an observation period of two years, with a non-significant group difference [15][16][17]. Also, both neurocognition and symptoms improved, so far best documented with respect to neurocognitive measures [12,14]. An association was found between positive change in neurocognition and subsequent occupational functioning, but quantifying a potential "causal" effect of improved neurocognition (or improved symptoms) on occupational functioning has thus far not been explored.
Causal inference -estimating causal effects has become a large field in statistics, with applications in most applied sciences. Causal interpretation of results of a statistical analysis relies on strong assumptions, which are explicitly stated in causal inference methods, in contrast to analysis of associations, with potential misleading interpretation. Causal inference methods are useful, both in observational studies and randomized trials. In an observational study with interest in the effect of a time-varying exposure on an outcome, the proper causal method successfully adjusts for time-varying confounding in cases where traditional methods fail. The same method applies to a randomized trial where the causal effect of interest does not involve the randomized groups, or there is potential selection-bias from differential loss-to-follow-up. G-computation [18], g-estimation for Structural Nested Models [19,20] and inverse probability weighting (IPW) for Marginal Structural Models (MSMs) [21][22][23] were proposed by James Robins to overcome difficulties with standard regression methods. The use of IPW estimation for MSMs is widespread, popularized in epidemiology two decades ago, with a considerable amount of citations [23], although applications in psychological / psychiatric research are scarce [24][25][26]. The less known method of g-estimation for structural nested models [19,20,27] is older, but is far less cited than IPW. Even though g-estimation seems complex, it out-performs IPW methods in several ways; higher efficiency, more robust for some forms of bias, more suited for the analysis of continuous exposures and, in contrast to IPW methods it can accommodate effect modification by time-varying covariates [28]. Gcomputation is even more efficient, but is highly computer intensive with many parametric assumptions (distributional assumptions on confounders etc) which makes it less robust, and can also not accommodate effect modification by time-varying covariates [29]. In an attempt to remedy it's "underuse", methodological experts promoted g-estimation for structural nested models in 2014 with a call for more applications [30], and again in a more popularized form in 2016 [28].
This study is a re-analysis of the data from the JUMP program in Norway, with the aim of estimating causal effects of reduced symptoms and improved neurocognition on number of working hours per week using gestimation. Assessment of how much of the increase in working hours per week can be attributed to improvement of symptoms / neurocognition, can possibly contribute in the design of better VR programs and also to the debate on how to target different core symptoms of schizophrenia.

Sample
The present sample is based on the sample from JUMP, a multisite VR program for adults with psychotic disorders in Norway (N = 131). It was a joint venture between health and welfare services with the goal of enhancing occupational outcomes for persons with schizophrenia spectrum disorders. All counties in Norway were invited to participate and six were included in the study. Each county was randomized to one of two interventions VR + CBT or VR + CR. A control group without the augmentation proved impossible to include. More details of design and recruitment of participants are described elsewhere [13]. The intervention consisted of an 10month extensive VR program with competitive or sheltered work through collaboration between mental health and vocational services, employers and employment specialists in addition to either CR or CBT components. CR and CBT were carried out by employment specialists based in sheltered workshops. Participants had CBT or CR sessions with the employment specialist two hours weekly, and the employment specialists themselves received training (40 h) in CBT or CR, followed by weekly supervision by an experienced health professional [13,14]. Each employment specialist served around 10 participants, allowing for close collaboration with all involved parties. Participants were assessed at baseline, at the end of the intervention (10 months after baseline) and at approximately 2 years after baseline. The groups are denoted CBT and CR from here onwards. The JUMP study was approved by the Regional Committee of Medical Research Ethics and the Norwegian Data Protection Authority. All participants provided written informed consent.

Clinical measures
Clinical assessment was carried out by trained clinicians. Diagnostic evaluation was done with M.I.N.I PLUS [31]. Current level of psychotic symptoms at three timepoints, were rated using the Structural Clinical Interview of Positive and Negative Syndrome Scale (SCI-PANSS) [32], measuring three separate dimensions, and one total score: negative symptoms (denoted by PANSSneg t ), positive symptoms (PANSSpos t ), general symptoms (PANS Sgen t ), and total score which is a sum of the other three (PANSSsum t ). Each dimension was of interest separately (in addition to the total) because of difference in timecourse, and specific focus on the challenge with negative symptoms from the literature.

Neurocognitive measure
Assessment of neurocognition at three time-points was based on "The Measurement and Treatment Research to Improve Cognition in Schizophrenia" (MATRICS) Consensus Cognitive Battery (MCCB) [33]. Nine out of ten of the MCCB subtests were included in the JUMP protocol (measure of social cognition was excluded), measuring six domains: Speed of processing, Attention/ Vigilance, Working memory, Verbal learning, Visual learning and Reasoning and Problem solving [15]. A modified MCCB neurocognitive composite score was calculated as the mean of the nine demographically corrected domain T-scores (denoted by COGN t ) [15,34]. The choice to limit the neurocognitive measure to the mean composite score, was pragmatic. The different domains all had a similar time-course and were found sufficiently represented by the mean, to describe a crude neurocognitive measure.

Other covariates
Time-constant variables recorded at baseline consisted of: age, gender, indicators for low education, living alone, marital status, group (CR/CBT) and history of unemployment, and log-dose of daily medication at baseline. In addition to SCI-PANSS symptoms and neurocognitive MCCB composite score, measure of psychosocial functioning -Global Assessment of Functioningboth function and symptoms were assessed at all three timepoints (GAFF t , GAFS t ).

Outcome
The outcome of interest in the present application was number of working hours per week, as recorded by the employment specialists at three time-points (denoted by WH t ). Both competitive and sheltered work were considered. The latter category is financed partly by e.g. disability benefits, but with similar work demands as in competitive employment. Competitive employment was the goal [13].

Causal effects in JUMP
Does improvement in psychotic symptoms influence occupational functioning, and if so, how much? This question can be answered by causal inference, which is commonly formalized by means of counterfactuals. In general, with an interest in the effect of an exposure on an outcome, the counterfactual outcome Y(a) is the potentially unobserved outcome for the hypothetical exposure level a. This formalism helps to identify the probability distribution (or functions) of the counterfactuals from observed data. In the following, random variables and their realizations are represented by upperand lower-case letters respectively.
A conditional average causal effect of a hypothetical intervention (e.g. exposure) setting exposure to a versus 0 (0 can represent exposure-free or any reference value) among subjects with covariate value l, can be formulated by the linear structural mean model (SMM) [20,28,30] as is the conditional population mean, z is a covariate vector possibly depending on l, and ψ is the vector of causal parameters of interest.
For example, with l representing gender and z = 1, the additive exposure-effect on an outcome from exposure a relative to no exposure, is equal between genders. On the other hand, z = l would describe different exposure effects between genders.
With time-varying exposure, covariates and outcome, the exposure and vector of covariates, A t and L t , are thought to be assessed at time t = 0, 1, 2 with the history up until (and including) t, denoted by A t and L t , for example A 2 ¼ fA 0 ; A 1 ; A 2 g. L t is thought to precede A t and both precedes the outcomes they affect. Y s ða t ; 0Þ is the counterfactual outcome at time s = 1, 2, 3 for an exposure history equal to a t up until time t and zero thereafter. This construct facilitates assessments of a causal effect of a time-varying exposure on the following outcome as well as on all subsequent ones, formulated by the structural nested mean model (SNMM) [20]: where s > t. The treatment effect posited in (1) represents the difference in outcome when exposure at time t is set equal to a t and zero thereafter, relative to when exposure at time t and onward is set equal to zero, conditional on exposure and covariate history. In other words, the contribution to the exposure effect from a specific time-point (so called "blip") when later contributions are "removed".
The causal effects of interest in the present application, is the effect of SCI-PANSS symptoms at time t, or the neurocognitive MCCB composite score at time t, on number of working hours per week WH s , s > t. With one dimension of SCI-PANSS symptoms (for example PANS Sneg t ) considered as "exposure", then COGN t is assessed as a potential effect-modifier, and vice versa. The structural nested mean model (SNMMs) with g-estimation is particularly well suited for estimating an effect of a time-varying continuous exposure, and to assess effectmodification by time-varying covariates [28].
Fitting the SNMM for the causal effect of the timevarying PANSSneg t (or COGN t ) on WH s , s > t is done in three steps (Appendix in supplementary file): 1. Regressing the exposure PANSSneg t (COGN t ) on its history, baseline and preceding time-varying covariates and outcome for each t (e.g linear regression), including non-linearities and interactions. The fitted value from this regression is called "propensity score", denoted P t 2. Regressing the observed outcome WH t on confounders as well as the terms with previous exposure z t, t − 1 a t − 1 and propensity score z t, t − 1 p t − 1 in a repeated measures regression model (GEE with independence working correlation). The z t, t − 1 a t − 1 coefficient is the preliminary estimateψ 3. Predicting Y s ðA t ; 0Þ for all t, s with s > t by means ofψ ð0Þ . The updated and improvedψ ð1Þ is found by a second independence GEE with these predictions as outcomes and similar covariates as above.

Standard errors are estimated by bootstrap
Different causal hypotheses can easily be assessed with different choices of z st in (1) (Appendix in supplementary file). To differentiate between short-and long-term effects, and how a long-term effect may decrease over time may be of interest. Different short-term effects allowing for group difference in the time-varying effect or e.g. effect-modification from COGN t when PANSSneg t is exposure, are others. However, lack of power is a limitation with increased model complexity. A simple model with effect-modification from a non-linear function of time proved to have a good fit in the present application, and is given by: for s=1, 2, 3 s > t, with the following f(t): where λ > 0 (<0) yields a transformation that increases more (less) rapidly than log(t). λ = 1 yields a linear function of time. In the outcome model (step 2 and 3), f(t) from (3) was included as one component in l t with λ = 0.2 giving a good fit of WH t as a function of weeks (f(weeks), λ = 0.2 in Fig. 1) in the area of observations (baseline -0 weeks, 30 weeks, and 100 weeks). A causal graph (DAG) of the JUMP design is shown in Fig. 2.
Three repeated measures of time-varying covariates A t and L t and outcome Y t are shown, and potential direct causal effects between them are indicated with arrows. Both symptoms and cognition are considered fairly stable over time [15] and therefore assumed to precede the outcome (WH t ), even though they are assessed at the same time. All the models rely on this unverifiable assumption. With no more than three repeated measures, limited sample size with non-complete data, such a compromise was found necessary. To simplify notation, the index of the outcome is set one higher than the index of the exposure. L t is made to precede A t by  Causal graph (DAG) of design in the JUMP study. A t : PANSS symptoms / neurocognitive composite score, L t : baseline and time-varying covariates, Y t : working hours per week (WH t ), 131 adults with broad schizophrenia spectrum disorders in Norway. Indicators C t , for censoring from loss-to-follow up (missing outcome) could be included with arrows entering from preceding exposure, covariates and outcome including only previous confounders (as well as various time-fixed baseline covariates like gender or daily dose of medication). Previous outcome is also included in L t in the outcome models (step 2 and 3), and allowed to influence both later confounders and exposures. In the DAG (Fig. 2), Y t is represented separately for illustration.
Measurement error in the exposure variable is not accounted for in the present application. Concerns about reliability in the symptom scores, motivated the attempt to adjust for measurement error in a comparison between the total JUMP sample and a constructed control group (TAUtreatment as usual) [13]. With PANSS measurements as an important confounder for the JUMP / TAU comparison, measurement error would be expected to have an influence. On the other hand, the effect of the symptom scores themselves (as in the present application) on an outcome, would not be influenced in the same way by measurement error (both magnitude and standard error in the coefficient would be expected to increase, leaving the p-value approximately unchanged) [36]. Also, it is not obvious how such a measurement error correction should be included in the fitting of the SNMM.

G-estimation with multiple imputation in JUMP
Incomplete data were of concern, both for the outcome (WH t ) and for different covariates, particularly at the two-year follow-up. Missing values in the outcome was considered to be loss-to-follow-up and the person not allowed to re-enter. Potential selection bias from loss-tofollow-up was adjusted for by inverse probability of censoring weights (Appendix in supplementary file).
To address the issue of incomplete data in covariates, multiple imputation (MI) (Appendix in supplementary file) was performed under the assumption of missing at random (MAR), with the R-package mice [37]. The outcome (WH t ) served as covariate (both in exposure and outcome models, steps 1, 2), and was imputed as missing covariate, but not as missing outcome. Out of 34 variables in the data-set, the covariates with missing values were, sorted by decreasing number of missing values (%): A useful tool for determining potential efficiency gain from MI is the fraction of missing information (FMI), approximated by FMI = r/(1 + r) (for a high number of imputations), where r is the relative increase in variance due to the missingness [39,40] (Appendix in supplementary file). The FMI is a parameter-specific measure that quantifies the loss of information due to missingness, while accounting for the amount of information retained by other variables [39]. A low number of imputations in MI is usually sufficient, by an argument based on relative efficiency (variance compared to the case with a very large number of imputations). For example, with an FMI of 20%, 10 imputations correspond to a relative efficiency above 98% [40].
With respect to standard errors in the g-estimation algorithm, the so-called sandwich estimator (in standard regression software) is no longer guaranteed to be conservative, because it ignores the uncertainty in the initial estimateψ ð0Þ , and therefore the bootstrap is recommended [28]. To account for uncertainty from incomplete data and to achieve unbiased standard errors from the g-estimation, the bootstrap was performed for each imputed dataset. Ten complete datasets were generated from the imputation algorithm, and for each complete dataset a bootstrap parameter estimate with standard error was generated (500 resamples), and finally combined according to Rubin's rules [41] (Fig. 3 and Table 3).
In accordance with basic bootstrap assumptions, model selection was carried out for every bootstrap sample, both in the censoring weights and propensity score models, to identify the best model. This was solved by the automatic lasso regression algorithm (Appendix in supplementary file) in the R-package glmnet [35]. A large set of covariates was entered for each bootstrap sample, and the lasso algorithm returned the best crossvalidated model for that particular sample, in terms of minimum prediction error, while avoiding overfitting.

Results
The randomization resulted in similar groups (CBT, CR) at baseline. Selected baseline covariates in the total sample and stratified by group are shown in Table 1.
The total sample (and both groups) consisted of more than 60% men, mostly with lower levels of education, living alone and single. The most pronounced group difference was found for gender (significant), with the highest proportion of women in the CBT group.
To illustrate the time-course of the key variables, mean PANSSneg t (to represent symptoms), COGN t and WH t over time are plotted (Fig. 1). The individual plots of WH t show large variation, both cross-sectionally and over time (Fig. 1, top left), but with an increasing trend for group-means (Fig. 1, top right) where the mean intervention length of 30 weeks and mean follow-up time of 100 weeks were chosen as time-points. In both groups, there was a large increase in WH t during the intervention, followed by a smaller increase in the post intervention period. The CBT group had the largest increase, with 3.3 h more on average than the CR group (non-significant group difference). Also plotted is the function f(t) from (3), used in the outcome model (step 2,3) as one component of l t , representing a good fit of WH t over time in the area of observations (poor fit below 20 weeks has no impact due to lack of observations). The PANSSneg t decreased markedly in the CBT group, but not in the CR group (significant group difference in favor of the CBT group -data not shown). The CR group, on the other hand, had a slightly better improvement with respect to COGN t (non-significant group difference). Different causal models (SNMMs) were fitted to assess long-term effects, group-differences in short-term effects (interactions with exposure or other covariates) and whether COGN t could play a role as effect-modifier for PANSSneg t 's (or other symptom dimensions) influence on WH t (or vice versa) (Appendix in supplementary file). No such significant effects were found, but few repeated measures with limited observation time, and model complexity probably resulted in a lack of power for one or more of these tests.
In contrast, the simpler SNMM from eq. (2), also modelling a short-term effect with a constant and a time-varying part, showed significant causal parameters for all dimensions of symptoms and for neurocognition.
The neurocognitive composite score and the total symptom score had both significant effects, but of different nature. The effect of COGN t was constant over time with ψ 0 found to be positive and significant (ψ 0 ¼ 0:22; p ¼ 0:005 ) and with no significant time-varying part (ψ 1 ). In contrast, PANSSsum t had no significant constant part (ψ 0 ¼ 1:55; p ¼ 0:6 ) and a negative and significant time-varying part (ψ 1 ¼ −0:31; p ¼ 0:018) Final combination estimates from imputations are shown in Table 2.
The specific form of f(t) had a large impact on the fit (variation in λ), with best fit for λ = 0.2 in the outcome model (Fig. 1), and λ = 0.8 in the SNMM (larger increase) with symptoms as exposure. Of specific interest in this application was PANSSneg t , which turned out to be dominating the time-varying part of the total symptom score. This dimension changed the most during the observation period and had the strongest causal effect on WH t . The parameter estimate for the constant effect of PANSSneg t , was positive and near significant (ψ 0 ¼ 3: 59; p ¼ 0:077). The parameter estimate for change over time, ψ 1 was negative and significant (ψ 1 ¼ −0:29; p ¼ 0: 003 ) ( Table 2). PANSSpos t changed less over time and had a constant effect on WH t with a negative and significant ψ 0 (ψ 0 ¼ −3:46; p ¼ 0:011), and with no significant time-varying part (ψ 1 ). The same was found for PANS Sgen t with a negative and significant ψ 0 (ψ 0 ¼ −4:2; p ¼ 0:026) and no significant time-varying part.
Chances are that the estimates for the four symptom scores have been biased towards zero by measurement error. The interpretation of the results is that the mean neurocognitive composite and four dimensions of symptoms, all seemed to have separate significant causal effects on number of working hours. The strongest effect was found for PANSSneg t . A positive ψ 0 seemed to indicate that at baseline, higher level of symptoms yielded higher WH t , but this changed during the intervention with a negative ψ 1 . Formulated with counterfactuals, a hypothetical intervention that could lower a person's PANSSneg t level at baseline corresponding to 0.23 points on the log-scale (equal to the average change in the CBT group over the whole observation period) would lead to a gain of around 3.23 h per week (95% CI: 1.11, 5.35) by the end of follow-up. The causal effect seemed to be strongest during the intervention period, and slightly attenuated in the post-intervention period yet still increasing, in terms of change in WH t per unit time.
For PANSSsum t the effect was similar to PANSSneg t 's, but with no constant part, and with an increasing effect across the whole observation period, most during the intervention. A hypothetical intervention that could lower a person's PANSSsum t level at baseline  Table 2 Causal effects of composite cognitive measure (COGN t ), PANSS symptomssum (log(PANSSsum t )) and negative (log(PANS Sneg t )) in separate models, on number of workinghours per week (WH t ), by g-estimation in JUMP. Parameters refer to SNMM = (ψ 0 + ψ 1 f(t))a t (2), where a t is COGN t , log(PANSSsum t ), or log(PANSSneg t ), and f(t) is the function in eq. (3)  For PANSSgen t , COGN t and PANSSpos t the effects were found to be constant across the whole observation period, with no time-varying part. A hypothetical intervention that could lower a person's PANSSgen t level at baseline corresponding to 0.13 points on the log-scale (equal to the average change in the CBT group) would lead to a gain of around 0.54 h per week (95% CI: 0.06, 1.01) by the end of follow-up. In other words, changes achieved during intervention seemed to be upheld during follow-up.
A hypothetical intervention that could increase the level in COGN t with 2.4 points (equal to the average change in the CR group) would yield a gain of approximately 0.54 h per week (95% CI: 0.16, 0.92) by the end of follow-up.
A hypothetical intervention that could lower a person's PANSSpos t level at baseline corresponding to 0.09 points on the log-scale (equal to the average change in the CBT group) would lead to a gain of around 0.3 h per week (95% CI: 0.07, 0.53) by the end of follow-up.
In the MI routine, both prior and subsequent measurements (if available) were allowed in the imputation model to make use of all available information. Also, interactions discovered in preliminary complete-case analysis were included. Estimated FMI s were evaluated to be between 3 and 19%, all considerably smaller than the proportion of missing values in the exposure at followup. In PANSSneg t , the proportions of missing values were 2.3, 17.6, and 29.8%, and since a time-varying short-term effect was found, all three time-points contributed in estimation of both ψ 0 and ψ 1 . Estimated FMI s for ψ 0 and ψ 1 were 9 and 18.4% respectively (Table 3), reflecting that other variables succeeded in retaining information for the missing PANSSneg t values (FMIs less than the proportion of missing values), with more challenge for the time-varying part (ψ 1 ) than the constant part (ψ 0 ), due to more complexity (with extra shapeparameter). The exposures with constant effect, all had estimated FMIs lower than 15.1%. For example, for COGN t the proportions of missing values were 3, 17.6, and 29.8% (similar to PANSSneg t ), and had an estimated FMI of 5.5%. This reflects better prediction of the missing COGN t with considerable amount of information restored. Also, a constant effect (ψ 0 ) means a simpler imputation model, estimated from all three time-points.
Selection bias from loss-to-follow-up seemed to be negligible, results with and without censoring weights (Appendix in supplementary file) were nearly identical (data not shown).

Discussion
The augmented VR program JUMP has previously documented cost-effectiveness, improved occupational outcomes and improved levels of apathy and neurocognition [14,16,42]. In this study, causal pathways were assessed, by means of g-estimation, to quantify the magnitude and nature of specific effects that were targets in the CBT and CR groups. Significant causal effects from improvement in different symptoms and a neurocognitive composite on number of working hours per week were found. Table 3 Bootstrap estimates for each imputed dataset (combination estimate in Table 2 is mean value across imputations, standard error follows from Rubins's rules  The effects were all short-term and characterized as either constant across the observation period or increasing in strength, most during the intervention and slightly less post intervention. PANSS negative symptoms had the strongest effect, followed by PANSS total which both had time-varying effects of magnitude 3.2 h per week (95% CI: 1.11, 5.35) and 2.2 h per week (95% CI: 0.37, 4.06) respectively, for a hypothetical increase in the exposure equal to the mean change in the CBT group. The remaining effects were all found to be weaker and constant over time, with the neurocognitive composite representing an effect of magnitude 0.54 h per week (95% CI: 0.16, 0.92), the same magnitude as PANSS general with 0.54 h per week (95% CI: 0.06, 1.01), and lastly PANSS positive symptoms with a magnitude of 0.3 h per week (95% CI: 0.07, 0.53). In sum, of the five effects under study, PANSS negative symptoms represented the strongest, and PANSS positive symptoms the weakest. The difference in nature between effect of PANSS negative (increasing) and positive (constant) symptoms might seem reasonable in light of their time-trajectories and content. Even though the effects may seem small relative to the total increase in working hours per week (10 h), the adjustment for potential feedback means that their unadjusted contribution might be larger.
Some caution with interpretation is necessary. The short-term causal effects are based on the assumption that the exposure can at least partly be viewed as preceding the outcome, even though they have been assessed at the same time (PANSS negative symptoms / neurocognitive composite score are viewed fairly stable [15]). With incomplete data in many variables (including the key exposure variables), efficiency gain was achieved by multiple imputation. However, unbiasedness relies on assumptions of MAR and no model misspecification in both imputation and analysis models, even though gestimation provides better protection against misspecification than alternative methods. Missing values in the outcome were few, and selection bias from loss-tofollow-up was found to be negligible.
Within the limitations of only three repeated measures, including an intervention period and a limited sample size, g-estimation captured causal information. Few alternative methods are available, particularly with respect to assessment of potential effect-modification by time-varying covariates. Neither PANSS symptoms nor the neurocognitive composite score seemed to play the role of such an effect-modifier, but a non-linear function of time did, which revealed differences in the nature of the causal effects. The results indicated no long-term effect (e.g. a decreasing effect of A t on Y s for s > t + 1). In the causal DAG (Fig. 2) this indicate no arrows from A 0 → Y 2 , A 0 → Y 3 or A 1 → Y 3 . With more repeated measures, a potential long-term effect would be easier to detect. With respect to the time-varying or constant short-term causal effects, these refer to A 0 → Y 1 , A 1 → Y 2 and A 2 → Y 3 in the causal DAG. The PANSS negative effect started out with a near significant positive causal effect at baseline (more symptoms associated with more working hours). Participants working at baseline were either trying to transit from sheltered to competitive employment, or were not coping with current job demands and were in need of support in order to maintain or change current employment [15]. The stratification "low education = yes / no", showed slopes of opposite signs in the association between PANSS negative symptoms and working hours at baseline (Fig. 4).
This might be explained by reverse causation, the fact that the low education group is over-represented in the sample and that employment perhaps created more negative symptoms for them. In any case, this initial effect changed over time by the intervention (significant interaction with the non-linear function of time). The function of time that gave the best fit indicated that the influence of PANSS negative symptoms was strongest during the intervention, and slightly weaker, but still increasing during the post intervention period. Continuous improvement in PANSS negative symptoms combined with increase in working hours is consistent with a positive feedback mechanism, and interesting with respect to potential long-term impact on occupational functioning.
In terms of magnitude, the effect of PANSS symptoms may have been underestimated due to measurement error. Compromised reliability in the JUMP study in the PANSS measurements was a concern previously, and an attempt to adjust for it was made [13]. In a simple linear regression, the corrected regression coefficient for a covariate with a reliability coefficient of 50%, would be expected to increase two-fold. It's not clear how to adjust for potential measurement error in the present application.
The causal effects support the choice of targets in the CBT and CR groups. Time-course of PANSS symptoms and neurocognitive composite score stratified on group, shows that each group succeeded best in having an impact on their target measure (Fig. 1, bottom panels). The causal effects of PANSS symptoms or neurocognitive composite are to be interpreted as conditional on each other as well as other covariates. Separate causal effects indicate independent causal pathways through PANSS symptoms and the neurocognitive composite to working hours.
Limitations in the design of JUMP prevents assessment of the causal effect of the CBT or CR intervention. A comparison group, with only the VR part and without CBT or CR, would facilitate this, but was not feasible [43]. How can an improvement in occupational outcome be attributed to for example the CBT intervention, and not simply the "Hawthorne effect"? The causal pathways in the present model can shed some light on this question. The CBT group had a significantly larger improvement on PANSS negative symptoms than the CR group at post intervention. Also, PANSS negative symptoms at post intervention predicted working hours at followup. This indicates an indirect effect of the CBT intervention (relative to CR) on working hours (mediation) through PANSS negative symptoms (with some additional assumptions) [44]. Likewise, the CR intervention can have had an indirect effect (relative to CBT) through some other mediator. Both groups had an increase in working hours, but with a difference of 3.3 h in favor of the CBT group (non-significant difference). Three time-points is necessary for mediation analysis to control the timesequence. Interestingly, with multiple mediators, pathspecific effects can be assessed even when the correct sequence of the mediators is unknown [45]. However, using only one time-point for each variable, power to detect different pathways is limited. Lack of power has also probably played a role in the non-significant group × exposure interactions in the present causal models.
This study is the first attempt to assess causal effects in JUMP, where time-varying confounding and potential bias from loss-to-follow-up have been adjusted for and effect modification by time-varying covariates assessed. The large improvement in PANSS negative symptoms in the CBT group is in contrast to the findings in a recent meta-analysis of effects of CBT on negative symptoms in schizophrenia [4]. The authors reviewed the literature and found that beneficial effects of conventional CBT on negative symptoms in schizophrenia from older studies were associated with lower study quality and not supported by more recent studies [46,47]. The CBT augmentation in JUMP was not conventional CBT, which in fact may provide some explanation, both for the change in negative symptoms and for the improved occupational functioning. The part of negative symptoms that refers to social isolation was effectively reduced by job placement in a social environment. Other negative symptoms accounted for by dysfunctional expectancies [4], could be alleviated by successful matching of the job with the participant's preference and ongoing support with financial security. In this way, the negative symptoms have been approached both directly and indirectly, distant from the conventional therapy in a therapists' office. Another recent example of positive effect of CBT on negative symptoms in an indirect way (not included in the above meta-analysis) focused primarily on measures to improve functional outcome (GAF) [48].
Even though a rich body of research on vocational rehabilitation the last decades has contributed to evidence based programs, challenges remain [49]. There is a need for more studies of integrated treatment and vocational rehabilitation [13], to disentangle treatment components to target e.g. negative symptoms [4] and to scale up services in rehabilitation [49].

Conclusion
High performance of g-estimation in a small psychiatric data set with few repeated measures and time-varying confounding and effects, was demonstrated.
Augmented vocational rehabilitation showed causal effects of intervention targets with the strongest and increasing effect from negative symptoms, on number of working hours.
Combination of therapy and activation (indirect and direct approach) might explain improvement in both cognition and negative symptoms, and shed some light on effective ingredients for improved treatment of negative symptoms.