A demonstration of a multi-method variable selection approach for treatment selection: Recommending cognitive–behavioral versus psychodynamic therapy for mild to moderate adult depression

ABSTRACT Objective: We use a new variable selection procedure for treatment selection which generates treatment recommendations based on pre-treatment characteristics for adults with mild-to-moderate depression deciding between cognitive behavioral (CBT) versus psychodynamic therapy (PDT). Method: Data are drawn from a randomized comparison of CBT versus PDT for depression (N = 167, 71% female, mean-age = 39.6). The approach combines four different statistical techniques to identify patient characteristics associated consistently with differential treatment response. Variables are combined to generate predictions indicating each individual’s optimal-treatment. The average outcomes for patients who received their indicated treatment versus those who did not were compared retrospectively to estimate model utility. Results: Of 49 predictors examined, depression severity, anxiety sensitivity, extraversion, and psychological treatment-needs were included in the final model. The average post-treatment Hamilton-Depression-Rating-Scale score was 1.6 points lower (95%CI = [0.5:2.8]; d = 0.21) for those who received their indicated-treatment compared to non-indicated. Among the 60% of patients with the strongest treatment recommendations, that advantage grew to 2.6 (95%CI = [1.4:3.7]; d = 0.37). Conclusions: Variable selection procedures differ in their characterization of the importance of predictive variables. Attending to consistently-indicated predictors may be sensible when constructing treatment selection models. The small N and lack of separate validation sample indicate a need for prospective tests before this model is used.


Introduction
Major depressive disorder is a highly prevalent, debilitating mental disorder that is currently ranked as the single largest contributor to global disability (World Health Organization, 2017). Among the most frequently utilized psychotherapies for depression are cognitive behavioral therapy (CBT) and psychodynamic therapy (PDT). Two randomized clinical trials have found PDT noninferior to CBT in the outpatient treatment of depression (Driessen et al., 2013;Gibbons et al., 2016). These results are in line with meta-analytic findings reporting no significant differences between CBT and PDT for depression (Barth et al., 2013;Driessen et al., 2015). These minimal efficacy differences, along with differential therapeutic theories used in CBT and PDT (Hoffart & Johnson, 2017), raise the question whether individual patients can be identified that might benefit more from one of these treatments than the other. If so, treatment selection could improve outcomes in depression by helping individuals select the specific intervention that is most likely to be successful . DeRubeis et al. (2014) developed a treatment selection approach that can be used to identify each individual's optimal treatment based on multiple patient characteristics, using data from a randomized clinical trial. This approach is called the Personalized Advantage Index (PAI). The core concept behind the PAI approach is to identify pre-treatment patient characteristics that are associated with differential response to treatment (so-called moderators) and, using these variables, to build a statistical model that can generate predictions for an individual in two (or more) treatments. This approach shares many common features with other approaches to treatment selection . For each individual, the treatment with the best predicted outcome is defined as the indicated treatment. In the case of a two-treatment comparison, an individual's PAI is a single number derived through the subtraction of their predictions in one treatment from the other. The PAI provides a directional indication of which treatment the individual should receive, as well as information about the strength of the recommendation, represented by the size (in absolute value) of the PAI.
In their initial demonstration, DeRubeis et al. (2014) analyzed data from a randomized comparison of antidepressant medication and CBT and found that, for patients with large predicted advantages in one treatment over the other (60% of sample), those who received their PAI-indicated treatment had superior outcomes relative to patients who received the non-indicated treatment, with an effect size (Cohen's d = 0.58) larger than that reported in a recent systematic review of drug-placebo differences (Turner, Matthews, Linardatos, Tell, & Rosenthal, 2008). Huibers et al. (2015) published similar findings applying the PAI approach to a comparison of cognitive therapy versus interpersonal therapy for adult outpatient depression. Related analyses based on the PAI approach have generated models aimed at differentiating placebo and antidepressants responders (Webb et al., 2018), minimizing risk of dropout (Zilcha-Mano et al., 2016) and relapse (Schweizer et al., submitted;Vittengl, Clark, Thase, & Jarrett, 2017). The principles on which the PAI is based have also been applied to treatment selection in post-traumatic stress disorder (Deisenhofer et al., 2018;Keefe et al., 2018). These studies represent but one strand of research on treatment selection. Other approaches include the M * approach (Niles, Loerinc, et al., 2017;Niles, Wolitzky-Taylor, Arch, & Craske, 2017;Smagula et al., 2016;Wallace, Frank, & Kraemer, 2013) initially introduced by Kramer (Kraemer, 2013), the use of nearest neighbor modeling by Lutz and colleagues (Lutz et al., 2006), and a series of papers by Uher, Iniesta and colleagues (Iniesta, Hodgson et al., 2018;Iniesta, Malki et al., 2016;Uher et al., 2012). For a comprehensive review of this literature, we suggest recent reviews by Cohen and DeRubeis (2018), Gillan and Whelan (2017), and Kessler (2018).
Different statistical methods can be applied to select patient characteristics to be used in generating treatment recommendations. For instance, the initial applications (DeRubeis et al., 2014;Huibers et al., 2015) relied on a domain-based backwards stepwise-regression (Fournier et al., 2009) for variable selection, in which potential predictors were first grouped by domains (e.g., history of illness, demographic and life circumstances, cognitive dysfunction etc.) in each of which backwards step-wise variable selection was performed, and then the retained predictors from each domain were combined and a final backwards step-wise variable selection was performed to generate the final model. A recent PAIbased treatment selection paper by Vittengl et al. (2017) used a series of single-variable models to establish the statistical significance of independent moderators, and then used backwards and forwards step-wise variable selection procedures to reduce the set. Another recent PAI analysis relied upon a machine-learning approach called Random Forest (RF) for variable selection (Zilcha-Mano et al., 2016). Building on this work, Keefe et al. (2018) used a two-stage variable selection approach in which they used RF followed by a stepwise AICpenalized bootstrapped method. This variability is discussed in Cohen and DeRubeis' (2018) review of treatment prediction reports in depression. They noted that there is very little consistency in the variable selection approaches that have been employed in this area. This heterogeneity is problematic because different variable selection approaches applied to the same dataset can lead to different conclusions about variable importance (Bleich, Kapelner, George, & Jensen, 2014), and treatment recommendations can vary based on which variables are included in the model. In their review, Cohen and DeRubeis (2018) identified 43 reviews (and meta-analyses) of predictors of treatment response in depression but, as many of these reviews noted, no coherent picture of which predictors are most important has emerged. The variability and lack of replicability in efforts to identify predictors of response in depression may be partially explained by the heterogeneity of the statistical approaches that are used to identify predictive variables.
This methodological heterogeneity also makes it difficult to determine which variable selection procedure should be used in the context of treatment selection. The strengths and weaknesses of different approaches might make them more or less attractive for a given purpose. For example, variable selection with elastic net regularization (ENR) can handle high numbers of potential predictors and can overcome issues of high correlations between baseline variables (Friedman, Hastie, & Tibshirani, 2010;Zou & Hastie, 2005). However, it does not have the capability to account for unspecified non-linear relationships in the way that is possible when using RF (Garge, Bobashev, & Eggleston, 2013). It is unlikely that any single variable selection procedure will be optimal for all situations. It will also be difficult to identify which single approach one should use without a sufficiently large dataset to allow for a training sample (in which one could try every approach and see which one appears to work best) and a held-out test sample (in which to show that the results hold, and are not due to chance findings). Unfortunately, RCT samples with relevant treatment comparisons are rarely large enough to support these efforts (Kessler, 2018), and the use of large non-randomized datasets risks potential confounds (e.g., selection effects) that could bias treatment selection efforts c.f. Kessler, 2018).
Potentially useful predictors could be identified by observing which variables are consistently selected across multiple studies. However, due to the variability of statistical approaches employed across different papers  and the lack of relevant studies in which the collection of potential moderators in this study have been assessed and compared, it would be unwise to rely solely upon the consistency of findings from the existing literature. Another type of consistency that could inform variable selection is the consistency with which variables are selected across multiple methods within the same sample. Constructing models using variables that are selected consistently across multiple different feature selection approaches within the same data can result in improved model performance (Kuhn & Johnson, 2013). Knowledge of the different methodologies could be used to understand whether those variables that are identified in some approaches but not in others are inconsistently identified due to weak or noisy effects, and thus should be considered poor predictors, or whether this pattern can be attributed to shortcomings of specific approaches (Kuhn & Johnson, 2013). For example, a variable might be selected by an ensemble-of-trees approach (e.g., RF) but excluded by another approach based on classic regression (e.g., ENR) because it involves a three-way interaction or non-linear interaction that was not considered in the latter classic approach (Kuhn & Johnson, 2013). Using these principles, we aimed to demonstrate an improved PAI approach by generating individual treatment recommendations for adult outpatients with depression deciding between CBT versus PDT. We introduce a novel selection process that synthesizes the results of four different variable selection techniques (RF, ENR, Bayesian Additive Regression Trees and the AIC-penalized bootstrapped approach) by selecting the patient characteristics that are consistently identified as associated with differential response.

Design and Participants
This paper draws on data from a randomized clinical trial comparing CBT and PDT in the outpatient treatment of depression (Driessen et al., 2013), which included 341 patients who met DSM-IV criteria for a major depressive episode and scored 14 or higher on the Hamilton Rating Scale for Depression (HAM-D; Hamilton, 1960). The Dutch Union of Medical-Ethic Trial Committees for mental health organizations approved the study design and the study protocol was published (Driessen et al., 2007). Efficacy results of this study are reported elsewhere, with no significant treatment differences found on any of the outcome measures (Driessen et al., 2013(Driessen et al., , 2015(Driessen et al., , 2017. Two prior papers examined which subgroups of patients in this trial might benefit more from one of the treatments than the other. One (Kikkert et al., 2016) was a replication study examining obsessive-compulsive and avoidant personality disorder traits as potential moderators of treatment efficacy that failed to replicate previous findings in that regard (Barber & Muenz, 1996), while the other applied model-based recursive partitioning to 23 potential moderators to identify subgroups of patients that might benefit Psychotherapy Research 139 specifically from one of the two treatments . However, these studies used different patient subsamples, examined only a subset of the potential predictors, and were not designed to produce a model that could be used to generate treatment recommendations for individual patients.
As part of the trial protocol, severely depressed (HAM-D >24) patients at baseline were offered adjunctive antidepressant medication (N = 129). As the treatment effects observed in these individuals could have been a result of the psychotherapy, the medication, or both, they were excluded from the current analyses. Thus, this report relates to the patients with moderately severe depressive symptoms (baseline HAM-D = 14 to 24) who were treated with psychotherapy only (N = 212). Of these 212 individuals, 17 were removed for having too much missing baseline data (≥50% missing baseline predictors). Finally, an additional 28 individuals who dropped out before attending at least 4 sessions were excluded. As our goal was to build a model to answer how individuals who received a meaningful course of CBT or PDT fared, we felt that individuals who dropped out very early in treatment would not be informative for our models. In the extreme, the "outcome" for patients who dropped out prior to attending a single session does not reflect response to CBT or PDT. Additionally, we were less confident in our ability to impute valid week-16 outcomes for these early dropouts. We decided to remove from our analyses patients who attended 3 or fewer therapy sessions. This reduced our sample from 195 to 167. Thus, the final sample comprised 167 patients: 75 in the CBT and 92 in the PDT condition (see Supplemental Figure S1 for a Patient Flow Chart). Baseline sample demographic characteristics for the final sample are presented in Supplemental  Table S1.

Interventions
Both PDT and CBT encompassed 16 individual 45minute sessions within 22 weeks and were conducted according to a published treatment manual (de Jonghe, 2005; Molenaar, Don, van den Bout, Sterk, & Dekker, 2009). CBT was based on the principles described by Beck, Rush, Shaw, and Emery (1979) and included behavioral activation and cognitive restructuring according to a session-by-session protocol with homework assignments. Short psychodynamic supportive psychotherapy (de Jonghe et al., 2013) represented the psychodynamic intervention. This modality involved an open patient-therapist dialogue that used supportive and insight-facilitating techniques to address the emotional background of the depressive symptoms by discussing current relationships, internalized past relationships, and interpersonal patterns.

Measures
HAM-D scores were used as the outcome measure for this study. Trained research assistants (masterlevel graduate students in clinical psychology) assessed the HAM-D according to the Dutch scoring manual (de Jonghe, 1994) at baseline, week-5, week-10, and week-22 (post-treatment). Assessors were not blind to treatment condition. Assessors engaged in one-hour peer supervision sessions bi-weekly, in which audiotaped interviews were discussed. The average intraclass correlation coefficient over 46 audiotaped assessments scored by multiple assessors was .97. Supplemental Table  S2 lists the 49 patient characteristics considered during variable selection, all of which were assessed at pre-treatment.

Building the Personalized Advantage Index Model
All analyses were performed in R (Team, 2000). Preprocessing and random forest-based imputation of missing data (Stekhoven & Buhlmann, 2012) was performed on baseline and outcome data prior to variable selection (details provided in Supplemental). Post-treatment HAM-D scores were missing for 20 of the 167 participants (12.0%). Future analyses in larger samples should perform these steps (especially imputation) separately for the training and test samples to avoid this form of double-dipping. Categorical variables (e.g., relationship status) were turned into binary variables and binary variables without sufficient variability (variables whose smallest category made up < 20% of the sample) were excluded (Kuhn & Johnson, 2013). Outliers for continuous variables were winsorized, and some variables with skewed distributions were log-transformed (see Supplemental Table S2 for more details).

Variable Selection
To select the patient characteristics associated with treatment outcome, we applied a multi-phase selection procedure that combines four different variable selection methods, each of which has been used in recently published treatment selection analyses (Bleich et al., 2014;Iniesta et al., 2016;Keefe et al., 2018;Zilcha-Mano et al., 2016). The first step was to apply three approaches to identify predictors of (differential) treatment response: 1. mobForest (Garge et al., 2013; random forest type approach for model-based recursive partitioning; referred to simply as random forest or RF), 2. glmnet (Friedman et al., 2010;Elastic Net Regularized Regression or ENR), and 3. bartMachine (Kapelner & Bleich, 2016; Bayesian Additive Regression Trees or BART). The second step was to reduce the variables identified by at least two of these three approaches using bootStepAIC (Rizopoulos, 2009; bootstrapped backwards stepwise AIC-penalized model selection; Austin & Tu, 2004). We will now describe each of these methods in more detail, and discuss their relative strengths and limitations. Supplemental Table  S3 provides additional details of each method and contrasts different features.
Random Forest is a recursive partitioning approach that can accommodate large numbers of predictor variables as well as complex relationships including non-linear and higher order interactions (Kapelner & Bleich, 2016). RF builds upon recursive partitioning approaches like classification and regression trees and model-based recursive partitioning. It addresses model instability by randomly selecting features and creating many "tree models", the predictions of which are aggregated to generate stable predictions (Austin & Tu, 2004). RF also allows for information from weaker predictors to be incorporated in situations where they might otherwise be dominated by stronger predictors, such as in bagging (Garge et al., 2013). The model function of RF can be specified as "y ∼ tx," which forces the approach to select splits that maximize the difference in the treatment condition coefficient between subgroups, thus focusing on identifying moderators of the treatment effect. When RF is used for variable selection, the permuted variable importance is generated by comparing the mean square error (MSE) of the predictions in the held-out (out of bag) samples when the real values are used to the MSE when permuted values for a given predictor are used. The extent to which the MSE increases when permuted values are used indicated how "important" that variable is (Garge et al., 2013). Variables that surpass the recommended threshold (which is set based on the largest observed "noise" variable, for which the permuted data improves the MSE, relative to the real data) are selected.
Elastic Net Regularization can provide a hybrid of the Lasso and Ridge regression approaches, combining the L1 and L2 penalizations to allow for the selection of a parsimonious set of variables that predict outcome (Hastie, Tibshirani, & Friedman, 2009). We used the R package glmnet (Friedman et al., 2010) to implement ENR variable selection, and used Zou and Hastie's (2005) recommended default value for the alpha parameter (alpha = 0.5).
Uses of ENR in the literature of variable selection and treatment selection have only investigated prognostic models in which a single treatment is modeled (Chekroud, Gueorguieva et al., 2017;Chekroud, Zotti et al., 2016;Iniesta et al., 2016). Current implementations of ENR in R do not accommodate variable selection for models in which moderators are of primary interest. In order to adapt ENR for the purpose of identifying moderators, we split the sample into each of the two treatment groups, and then constructed prognostic models within each group. Variables that were retained as predictors in only one condition, or that were selected in both but specified with differing coefficient values, were identified as potential moderators of treatment effects. We refer the reader to Cohen and DeRubeis' (2018) review (specifically, their Figure 1) for a more in-depth discussion of why variables with these relationships are candidate moderators. As one example, consider a variable that is selected by ENR in one treatment condition and specified with a positive coefficient, and that is selected in the other treatment condition but specified with a negative coefficient. This information could suggest that a disordinal relationship exists between that variable and treatment, such that individuals with higher levels on that variable do worse relative to individuals with lower levels in one treatment, whereas individuals with lower levels on that variable do worse relative to individuals with higher levels in the other treatment.
Bayesian Additive Regression Trees builds on ensemble-of-tree methods such as RF by incorporating an underlying Bayesian probability model (Chipman, George, & McCulloch, 2010). BART and RF have similar strengths insofar as they both can handle large numbers of predictors, and can accommodate non-linear and higher order interactions. The inclusion of the Bayesian prior improves upon other tree-ensemble approaches by introducing regularization, which reduces the likelihood that the ensemble will become dominated by any single tree (Genuer, Poggi, & Tuleau-Malot, 2010). Kapelner and Bleich adapted the bart Machine R-package (2016) to help focus model building on moderators. To achieve this aim, they introduced a parameter that forces the search for variable splits to focus more on treatment than other variables, thus introducing more interactions between treatment and other variables. This is conceptually similar to when researchers only consider interactions between treatment and baseline variables (and not interactions between baseline variables themselves), or to how RF can specify the splitting criteria to evaluate the difference in the treatment coefficient for the model y ∼ tx. Bleich et al. (2014) adapted BART to Psychotherapy Research 141 extract informed prior information about variable importance, and provided an interaction plot feature that can be used to identify potential 3-way interactions. The ICEbox package in R (Goldstein, Kapelner, Bleich, & Pitkin, 2015) allows for the visualization of predictive relationships in BART models, including non-linear and higher-order interactions between variables and treatment. The N most important interactions identified by BART are retained, where N was decided based on the number of variables selected by Random Forest (which uses a permutation test to determine an importance threshold cutoff). 1 See the Supplemental Variable Selection, as well as Garge et al. (2013) for more details on how the threshold is determined.
We decided to reduce the variables consistently selected by the above three approaches using a specific fourth approach for the following reason: If the model that was used to generate predictions relied on linear or logistic regression, then the variables selected by RF or BART could lead to a model with poor fit, if, for example, these variables relied on non-linear relationships or higher order interactions. The Boot-StepAIC package (Rizopoulos, 2009) performs variable selection using a stepwise AIC-penalized bootstrapped approach (Austin & Tu, 2004). By only including the moderator relationships identified in the other three approaches (and their corresponding main effects), this search generated a model emphasizing the prediction of differential treatment response, while reducing the chance that predictors that require unspecified linear or higher-order interactions were included. 10,000 bootstrapped training samples were drawn, and within each training sample backwards elimination was used to select variables that independently contributed to predicting outcome. Austin and Tu (2004) recommend selecting variables that are retained in at least 60% of bootstrapped samples, but this recommendation is specific to prognostic variables (main effects only). As we were Figure 1. Visualization of the moderator relationships. Conditional plots with confidence bands for the conditional mean generated using R package visreg from the final model estimated in the complete sample. Conditioning for each plotted variable uses the mean value for all other variables. The X-axes represent the standardized/centered scores that were used during analysis. interested in interactions, we relied on the consistency of the direction of the coefficients across the 10,000 bootstrapped samples. By using a threshold of 95% consistency in sign of the moderator coefficient, variables with smaller effects that were consistent in the direction with which they predict differential response across treatments could be included. The primary goal of this step was to ensure that the variables selected will function properly and consistently, and increase the likelihood that the final model will replicate in future samples drawn from the same population.

Generating PAIs
Based on the set of variables selected, outcome predictions were generated for each study participant in both of the treatments. To avoid the risk of overconfidence that could occur when evaluating model-performance on individuals whose data were used to set modelweights, these predictions were generated using tenfold cross validation (CV). 10-fold CV is recommended based on its good bias and variance properties in small samples (Kuhn & Johnson, 2013). For each of the 10 folds, individuals in that fold were held out, and data from the patients in the other 9 folds were used to generate a linear regression model in which end-of-treatment HAM-D score was predicted by the set of selected predictor variables (main effects for each variable and terms representing their interactions with treatment). The data from the patients in the held-out fold were then used to generate predictions for those patients in each treatment. For each individual, the difference in the predicted HAM-D score in CBT and PDT is their PAI. Individuals with a lower predicted HAM-D score in CBT (and thus a better predicted outcome in CBT) were then classified as "CBT-indicated" and individuals who had a better predicted outcome in PDT were labeled "PDT-indicated". The size of the PAI is taken to be an indication of the strength of the treatment recommendation (DeRubeis et al., 2014).

Evaluating PAIs
To characterize the expected utility of the PAIs for guiding treatment selection, we compared the average end-of-treatment HAM-D scores of individuals who got their indicated treatment (based on their PAI scores) against that of participants who received their non-indicated treatment. Next, we looked within the subgroup indicated to need CBT, and compared HAM-D scores for those who received their indicated treatment (CBT) to those who received their non-indicated treatment (PDT). We then performed the analogous comparison for those identified as "PDT-indicated." In order to investigate the importance of the strength of these recommendations following earlier PAI analyses (DeRubeis et al., 2014;Huibers et al., 2015), we then evaluated the above comparisons within a subset of the sample using an a priori determined cutoff: the strongest 60% of PAIs (the 60% of the largest absolute value PAIs). The entire 10-fold cross-validation procedure and evaluation was repeated 1000 times to account for the influence of the selection of the 10 folds on the results (Kuhn & Johnson, 2013). Accuracy statistics for the predictions were assessed within each of the 1000 10-fold CVs. Findings, representing averaged results across all 1000 runs, are presented below.  Reiss, Peterson, Gursky, & McNally, 1986), NEO Five Factor Inventory (NEO-FFI; Hoekstra, Ormel, & De Fruyt, 2003) Extraversion subscale, and Patient Request Form Psychological Needs subscale (Veeninga & Hafkenscheid, 2004). We note that although both the depressed mood subscale of the BSI and the HAM-D measure the construct of depression, the correlation between the two scales was low (r = .23). One explanation for this might be that the HAM-D measures a more broad set of symptoms (e.g., sleep, libido, appetite, psychomotor retardation, etc.) than the 6 items that are captured by the BSI's depressed mood subscale: suicidal thoughts, loneliness, sad/depressed mood, lack of interest, hopelessness, and worthlessness. There were no significant differences between the treatment groups for any of the variables selected for the final model (see Supplemental Table S4). Table II presents the final model with weights set using the full sample.

Psychotherapy Research 143
Baseline HAM-D score was included as both a main effect and as an interaction with treatment, but it did not appear to have a significant moderator relationship in the context of the final model. The moderator relationships included in the final model are visualized in Figure 1. We refer the reader to a recent review on treatment selection by Cohen and DeRubeis (2018) for a detailed discussion of how to approach interpretation of moderator relationships in treatment selection.

PAI Results
The mean absolute error of the predictions was 5.5 (SD = 7.1), and the associated R-squared was 0.18. Individuals who received their model-indicated treatment had better outcomes than those who received their non-indicated treatment (see Figure 2). The mean end of treatment HAM-D scores for individuals who received their PAI-indicated treatment (all results averaged across the 1,000 CVs), was 12.3 (SD = 7.6, N = 82.8); the average for those receiving their non-indicated treatment was 13.9 (SD = 7.9,   Table I. Summary of variable selection results for all variables selected by at least one approach. Three different variable selection approaches based on Random Forest, Elastic Net Regularization, and Bayesian Additive Regression Trees (BART) were applied to the full set of 49 potential baseline predictors. The 16 potential moderators that were selected by at least two of these three approaches were then submitted, along with one three-way interaction identified by BART (NEO Neuroticism × Married × Treatment), to a final variable selection stage with BootStepAIC. Bold text indicates the variables selected by BootStepAIC based on a criteria of at least 95% consistency of the coefficient sign for the interaction with treatment across 10,000 bootstrapped samples. * although BSI 8 was selected by BootStepAIC, its p-value in the final model built in the full sample was .43, and so, following the recommendation of Kuhn and Johnson (2013) to favor simpler models, it was not included in the final model. N = 84.2). This reflected, on average, a 1.6 points advantage for those receiving their indicated treatment (95% CI = 0.5 to 2.8; Cohen's d = 0.21, 95% CI = 0.07 to 0.37). When we restricted our evaluation to the largest (absolute value) 60% of PAIs (indicated mean = 12.1, SD = 6.4, N = 50.8; nonindicated mean = 14.7, SD = 7.6, N = 49.2), we found that the effect of treatment selection grew to 2.6 points (95% CI = 1.4 to 3.7; average Cohen's d = 0.37, 95% CI = 0.19 to 0.54).

Discussion
Helping service-users and clinicians make betterinformed treatment decisions is one of the core goals of precision medicine in mental health. In depression, treatment selection models can improve the ability to identify the best treatment among available options . Here, we have described a treatment selection model based on patient characteristics that could be used to decide between cognitive-behavioral and psychodynamic therapy for those with mild-to-moderate depression not taking antidepressants. The differential prediction described here relied on four factors: anxiety sensitivity, depression symptom severity, extraversion, and psychological treatment needs. Although in this investigation the aim was to develop and provide a first test of a multivariable model that could inform treatment selection, it nonetheless is important to attempt an understanding of the basis for each variable's contribution to the model. In the following we provide tentative, speculative interpretations of the findings, taking into account the directions of the relationships observed.
The ASI reflects a person's beliefs that anxiety experiences have negative somatic, psychological or social consequences. Higher scores on this measure were associated with superior outcomes in PDT relative to CBT, and vice versa for lower scores. Patients with higher baseline depressed mood, as measured by the BSI, tended to improve more in CBT, whereas those with lower BSI scores tended to respond better to PDT. Those with higher scores on a measure of extraversion (on the NEO) fared better in CBT than in PDT. The reverse was true for those low on extraversion. The Patient Request Form Psychological Needs scale assesses a patient's needs for a psychological treatment. Higher scores on this measure predicted better response to PDT, relative to CBT, and the reverse prediction was obtained for those with lower scores on this measure. Thus, in contrast to anxious, introverted patients, patients who were relatively more extraverted and who had low psychological treatments needs were better matched to CBT. We would speculate that these patients typically express themselves more and have already talked about their feelings and problems with others without much hesitation. They might be more in need of the structured approach of CBT, directed strongly at adapting behavior and changing cognitions through practical exercises. It may be that PDT is more efficacious for patients who search for a psychological solution to their depressive symptoms. Anxious and introverted patients, who have a tendency to avoid focusing on their problems despite a need to do so, may find that the supportive milieu of PDT fostered explorations of their feelings and problems in a way that was appropriate for their individual needs and capacities.

Limitations
This study has a number of limitations. Treatment selection is likely to be most effective when the interventions under consideration differ substantively and substantially in their mechanisms and targets . Relative to comparisons between, for example, medications and psychotherapy, the similarity of these two psychotherapies likely resulted in a decreased potential to identify individuals who are strongly indicated to need one treatment over another. Nevertheless, among the 60% of patients with the strongest PAIs, a d-type effect size of 0.37 was observed for receiving the indicated versus contraindicated treatment.
Patients seeking treatment for depression have many options, including other psychotherapies and medication; this model cannot inform the decision of whether or not to pursue treatments other than CBT and PDT. This model would at best be valid for use in similar populations. It cannot be known how the model would perform if it were applied to patients whose values on predictors were outside the observed range on the predictor measures, or if used in the context of a population of those with severe depression (defined in this trial as HAM-D >24) or in patients who are also taking antidepressant medications.
We know that therapist effects can account for significant variability in patient outcomes in psychotherapy (Lutz, Leon, Martinovich, Lyons, & Stiles, 2007). These analyses ignored potential therapist effects, but future work in larger samples could attempt to incorporate variables that predict which patients should be seen by which therapists. The generalizability of nomothetic data (upon which these analyses relied) to individuals has recently been called into question (Fisher, Medaglia, & Jeronimus, 2018), but recent work has demonstrated how ideographic approaches could be used to guide precision selection and ordering of therapeutic techniques or modules (Rubel, Fisher, Husen, & Lutz, 2018).
Despite our use of cross-validation during the weight-setting stage, our use of the full sample during variable selection and imputation could lead to model overfitting and inflated relationships (Fiedler, 2011), and as noted by Hastie et al. (2009), represents a form of double-dipping that can increase risk of overconfidence. Kessler et al. (2017) leveled a valid criticism of the early publications on treatment selection in depression (DeRubeis et al., 2014;Huibers et al., 2015), arguing that, because variable selection was performed within the sample on which the model was evaluated, "any attempt to use the coefficients in these models to predict differential treatment response in a new sample of patients would almost certainly yield less positive effects than those suggested by the results of studies" (p. 6). Kriegeskorte, Simmons, Bellgowan, and Baker (2009) also note that this approach can lead to distorted descriptive statistics and invalid statistical inference. The size of the sample used in these analyses, although drawn from the largest RCT comparing these two treatments to date, did not allow for a true hold-out. As described by Hastie et al. (2009), the CV scheme applied here has given an unfair advantage to the predictors as they were chosen on the basis of the full sample; therefore, we do not approximate an evaluation of our models in a completely independent test set. Thus, we stress that this model would need to be validated in an independent sample before being considered for use as a clinical decision tool.
Finally, we must acknowledge that the sample size in which these analyses (and the vast majority of the existing literature on treatment select) were performed was likely insufficient for this work. Detecting group differences associated with small effect sizes (∼d = .20 -.30) using a multivariable regression model with just 6 predictors (with an alpha level of p < .05 and 80% power) requires a sample of 686 per group (Cohen, 1992). The sample sizes needed for detecting moderator effects are significantly larger (Aguinis, Beaty, Boik, & Pierce, 2005). A recent simulation by Luedtke, Sadikova, and Kessler (2018) suggested that at least 300 patients per treatment arm are needed for adequate statistical power to detect clinically significant improvements in treatment response due to treatment selection. Very few RCTs meet these sample size requirements, leading Kessler to propose the use of large observational studies, pragmatic trial designs, or pooled datasets (Kessler, 2018).
Despite these shortcomings, we believe this work represents an important step towards the types of studies that would address many of these criticisms. These analyses propose potential moderators and present a specific model that could be investigated in future studies and identify an important issue that merits increased attention by those interested in precision medicine: methodological heterogeneity in variable selection.

Future Directions
The treatment selection subfield of precision medicine in mental health is still in its developmental stage, and the statistical methods described in this and other similar papers are constantly evolving (Zilcha-Mano, 2018). Although replication and external validation are essential steps that should precede the implementation of any specific treatment selection model, the publication and discussion of candidate predictors, models and statistical approaches are equally important, as they set the foundation for future efforts.
A wide variety of feature selection techniques have been employed in recent efforts to construct treatment selection models in mental health, and no clear guidance exists as to which approach is best. We have presented an example of a new variable selection approach that incorporates several of the leading techniques in order to identify reliable predictors of differential response to treatment. We propose this specific combination as a starting point and suggest that future efforts should explore different permutations, such as adding other methods (e.g., Support Vector Machines), reducing the number of approaches used, using different combinations, or adjusting the settings within each of these techniques. Examples of the latter include adjusting the thresholds for the inclusion of variables and specifying different tuning parameters (Kuhn & Johnson, 2013).

Conclusion
We refined the Personalized Advantage Index approach to generate individual treatment recommendations for adults with mild to moderate depression not taking antidepressants who are deciding between CBT versus PDT. Our novel approach synthesized the results of four different variable selection techniques by selecting the patient characteristics that are consistently identified as associated with (differential) treatment outcome. Although no significant efficacy differences were found between CBT and PDT across the total sample, the resulting treatment recommendations suggested that for the majority of the individual patients, one of the treatments could be predicted to be more efficacious than the other based on a model including four pretreatment patient characteristics (anxiety sensitivity, depression symptom severity, extraversion, and psychological treatment needs). The small sample and lack of a separate validation sample indicate the need for prospective tests (e.g., Lutz, Zimmermann, Müller, Deisenhofer, & Rubel, 2017) before using this model for treatment selection, but these findings add to a growing literature on the potential for modelguided treatment recommendations to improve patient outcomes for depression.

Acknowledgments
We thank Robert DeRubeis for helpful comments on the manuscript and support of this work. We also wish the acknowledge the reviewers, whose feedback greatly improved this article and made us feel like we had gained three insightful coauthors. This work was supported by a grant from MQ: Transforming mental health MQ14PM_27.

Funding
Financial support for this work was provided by MQ Foundation to ZC [MQ: Transforming mental health MQ14PM_27]. MQ had no role in the study design, Psychotherapy Research 147 collection, analysis, or interpretation of the data, or in writing the manuscript or the decision to submit the article for publication. The randomized clinical trial from which data were drawn for this study was financed by an unrestricted research grant by Wyeth Pharmaceuticals, The Netherlands. Arkin Mental Health Care, The Netherlands, financially supported research logistics and the contributions of ED, HLV, and JJMD. Vrije Universiteit Amsterdam, Faculty of Behavioral and Movement Sciences, Section Clinical Psychology, The Netherlands, financially supported ED's contributions to the study. None of the sponsors had a role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; nor in the preparation, review, or approval of the manuscript. The opinions and assertions contained in this article should not be construed as reflecting the views of the sponsors.

Supplemental data
Supplemental data for this article can be accessed doi:10.1080/10503307.2018.1563312.