Supplemental materials accompanying "Correcting for bias in psychology: A comparison of meta-analytic methods"

Publication bias and questionable research practices in primary research can lead to badly overestimated effects in meta-analysis. Methodologists have proposed a variety of statistical approaches to correct for such overestimation. However, it is not clear which methods work best for data typically seen in psychology. Here, we present a comprehensive simulation study in which we examined how some of the most promising meta-analytic methods perform on data that might realistically be produced by research in psychology. We simulated several levels of questionable research practices, publication bias, and heterogeneity, and used study sample sizes empirically derived from the literature. Our results clearly indicated that no single meta-analytic method consistently outperformed all the others. Therefore, we recommend that meta-analysts in psychology focus on sensitivity analyses—that is, report on a variety of methods, consider the conditions under which these methods fail (as indicated by simulation studies such as ours), and then report how conclusions might change depending on which conditions are most plausible. Moreover, given the dependence of meta-analytic methods on untestable assumptions, we strongly recommend that researchers in psychology continue their efforts to improve the primary literature and conduct large-scale, preregistered replications. We provide detailed results and simulation code at https://osf.io/rf3ys and interactive figures at http://www.shinyapps.org/apps/metaExplorer/ .

where $n$ and $v$ are the sample sizes and variances of the groups. For Cohen's $d$, the variance can be calculated as
$$
v_d = \left( \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2 - 2)} \right) \cdot \frac{n_1 + n_2}{n_1 + n_2 - 2}.
$$
We calculated effect size estimates and their variances at the study level using the above formulas.
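To make the calculation concrete, the following R sketch computes Cohen's d from group summary statistics using the standard pooled-SD definition and then applies the variance formula above. This is an illustration, not the simulation code; the function name and example values are ours.

```r
# Illustrative sketch: Cohen's d and its variance for a two-group design.
cohens_d <- function(m1, m2, sd1, sd2, n1, n2) {
  # Pooled standard deviation (standard definition; not stated explicitly above)
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  d <- (m1 - m2) / sd_pooled
  # Variance of d, following the formula given above
  v <- ((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2 - 2))) *
       ((n1 + n2) / (n1 + n2 - 2))
  data.frame(d = d, v = v, se = sqrt(v))
}

# Example: two groups of 30 with a half-standard-deviation mean difference
cohens_d(m1 = 0.5, m2 = 0, sd1 = 1, sd2 = 1, n1 = 30, n2 = 30)
```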

Funnel plots
The influence of bias in meta-analysis can sometimes be seen by comparing the effect size estimates to the standard errors of those estimates (or some other indicator of sample size) with a funnel plot (Light & Pillemer, 1984). In a typical funnel plot, the reported effect size is plotted on the x-axis and the standard error is plotted on the inverted y-axis. The most precise estimates (i.e., those with the smallest standard errors and largest samples) will tend to converge on the true effect size, whereas the more imprecise estimates will spread evenly on either side of the true effect, with studies as likely to overestimate as to underestimate it. That is, the amount of deviation from the true effect increases as estimates become more imprecise, leading to a funnel-like pattern (Figure 1A). In the presence of bias, fewer studies will appear in the lower corner of the funnel where results would be non-significant or of the wrong sign (Figure 1B). In this case, the funnel plot will appear asymmetrical, with imprecise studies finding larger effects than precise studies. The blue triangle marks the region of non-significant studies; under complete publication bias (i.e., only significant studies enter the published literature), no studies fall in this region. In this way, a funnel plot can reveal patterns that may indicate bias.

Figure 1A shows funnel plots of simulated meta-analytic data sets. These data sets vary in the true value of the underlying effect, δ, and in heterogeneity, given as the standard deviation of the distribution of true effects, τ. None of the meta-analyses in this panel have been affected by bias, and the difference between the random-effects model estimate (marked as a solid vertical line and an "X" along the horizontal axis) and the true value (marked as a dashed vertical line and a filled dot) is very close to zero. Figure 1B, in contrast, shows newly generated samples from the same conditions but under complete publication bias. Note the clear rightward asymmetry of the funnel plots in Figure 1B compared with Figure 1A, as well as the resulting overestimation in the random-effects model estimates: Along the horizontal axis, each X has been shifted to the right of the true value.
As is illustrated in Figure 1B, publication bias induces a relationship (in this case, a positive correlation) between effect size estimates and their standard errors. However, such a correlation can also have benign causes. It may be, for example, that expensive, small-sample manipulations have stronger effects than inexpensive, large-sample manipulations. Similarly, when a literature contains both large and small effects, and researchers use power analyses to plan their sample sizes accordingly, the large effects will be studied with smaller, less precise samples. Sequential designs can also induce this correlation (Lakens, 2014; Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2015): Studies measuring a large effect can stop early for efficacy, whereas studies measuring a small effect continue data collection and stop at later stages of the sequential design. Finally, a relationship between effect size and standard error is sometimes built into the calculation of an effect size's precision (e.g., see the equation for the variance of Cohen's d above, which includes the observed d as well as the sample sizes; see also Macaskill, Walter, & Irwig, 2001; Peters, Sutton, Jones, Abrams, & Rushton, 2006). In these scenarios, effect size and standard error would also be correlated, but not because of bias.
Because this correlation between sample size and effect size can have several causes, some bias-inducing and others benign, it is typically referred to simply as a "small-study effect" (Sterne, Gavaghan, & Egger, 2000). Small-study effects therefore do not necessarily indicate publication bias or QRPs. Because several of the methods we examine here adjust for small-study effects as though such effects always represent publication bias, these methods may overadjust under certain conditions.
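As an illustration of how such a small-study pattern arises, the following R sketch simulates studies under complete publication bias and draws a funnel plot with the metafor package. The settings (true effect, heterogeneity, sample size range, number of "published" studies) are arbitrary choices for illustration and are not the conditions used in our study.

```r
# Illustrative sketch: simulate a meta-analysis under complete publication bias
# and inspect the resulting funnel asymmetry.
library(metafor)

set.seed(1)
delta <- 0.2   # true mean effect (illustrative)
tau   <- 0.2   # heterogeneity: SD of true effects (illustrative)

sim_study <- function() {
  n  <- sample(20:200, 1)                 # per-group sample size
  th <- rnorm(1, delta, tau)              # study-specific true effect
  d  <- rnorm(1, th, sqrt(2 / n))         # observed d (approximate sampling error)
  v  <- ((2 * n) / (n * n) + d^2 / (2 * (2 * n - 2))) * ((2 * n) / (2 * n - 2))
  c(d = d, v = v, sig = (d / sqrt(v)) > qnorm(.975))  # significant and positive?
}

# Draw many studies and keep only the first 50 significant ("published") ones;
# 5,000 draws is assumed to yield at least 50 significant results here.
studies   <- t(replicate(5000, sim_study()))
published <- as.data.frame(studies[studies[, "sig"] == 1, ])[1:50, ]

res <- rma(yi = d, vi = v, data = published)  # random-effects estimate (overestimates delta)
funnel(res, refline = delta)                  # asymmetric funnel under publication bias
```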

Estimation of the Monte Carlo simulation error
To check whether 1,000 replications were enough to obtain sufficiently stable estimates, we ran simulations with 10,000 replications for a random selection of 30 conditions. These 10,000 runs were then divided into 10 batches of 1,000 replications each. The Monte Carlo simulation error, which is the standard deviation of the Monte Carlo estimator taken across repetitions of the simulation (Koehler, Brown, & Haneuse, 2009), was on average 0.002 for the effect size estimate (95% quantile: 0.008) and on average 0.003 (95% quantile: 0.008) for the RMSE estimate.
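A minimal sketch of this batch-based computation follows; est stands in for 10,000 simulated effect size estimates from one condition and delta for that condition's true effect, both illustrative placeholders rather than our actual objects.

```r
# Sketch: Monte Carlo simulation error via 10 batches of 1,000 replications.
set.seed(42)
delta <- 0.4
est   <- rnorm(10000, mean = delta, sd = 0.15)   # placeholder estimates

batch <- rep(1:10, each = 1000)                  # split into 10 batches of 1,000
mc_error_mean <- sd(tapply(est, batch, mean))    # SD of the batch means = MC error of the estimate
mc_error_rmse <- sd(tapply((est - delta)^2, batch,
                           function(x) sqrt(mean(x))))  # MC error of the RMSE estimate
```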

Percentages of Valid Estimates
Some methods do not always return a valid estimate, either because of inherent limitations (e.g., p-curve and p-uniform require at least one significant study in the set) or because the estimation routine failed to converge. All reported results should therefore be read as conditional: They describe how a method performed in those cases in which it produced an estimate.
Random-effects meta-analysis, PET-PEESE, and WAAP-WLS always returned an estimate; the other methods did not. Table 1 shows the rate of valid estimates in each condition.
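For readers who wish to compute such rates from simulation output, a small sketch follows; the data frame, condition labels, and method column names are hypothetical and chosen only for illustration.

```r
# Hypothetical sketch: proportion of replications with a valid (non-missing)
# estimate, per method and simulation condition.
sim_results <- data.frame(
  condition    = rep(c("A", "B"), each = 3),
  pcurve_est   = c(0.3, NA, 0.5, NA, NA, 0.2),
  puniform_est = c(0.4, 0.1, NA, 0.6, 0.2, 0.3)
)

valid_rates <- aggregate(
  cbind(pcurve_est, puniform_est) ~ condition,
  data      = sim_results,
  FUN       = function(x) mean(!is.na(x)),  # share of non-missing estimates
  na.action = na.pass
)
valid_rates
```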

Technical notes: PET, PEESE, PET-PEESE
The typical effect of PET and PEESE in the presence of publication bias is a downward correction of the meta-analytic estimate. When no publication bias is present, however, random variation in the sample can induce a negative correlation between effect size estimates and their standard errors (i.e., a positive correlation between effect size and sample size), which leads to an upward correction. In the current simulations we retained these upward corrections. In an applied setting, however, we recommend that analysts be very skeptical when the PET or PEESE slope has this reversed sign.
Notably, both PET and PEESE are examples of weighted least squares meta-regression and are therefore distinct in some ways from the fixed- and random-effects meta-analysis models described above. The specifics of this difference are discussed in detail elsewhere (Thompson & Sharp, 1999; Stanley & Doucouliagos, 2015); in practice, however, the result is that estimates from weighted least squares meta-regression models have relatively larger standard errors, and thus relatively wider confidence intervals, than standard meta-analysis models. This is not necessarily a drawback in the face of heterogeneity and publication bias, and authors have argued for the use of both types of models (Thompson & Sharp, 1999; Stanley & Doucouliagos, 2015; Moreno et al., 2012).
In PET, the observed effect sizes are regressed on their standard errors, sqrt(v), in a weighted least squares regression with weights 1/v, and the intercept serves as the bias-corrected estimate. The same holds true for PEESE, except that sqrt(v) is replaced by the variance, v. A sketch of both regressions is given below.
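The following R sketch illustrates these regressions and the conditional PET-PEESE rule as commonly described (inverse-variance weighted least squares; PEESE is used only when the PET intercept is significantly greater than zero in a one-tailed test). It is a simplified illustration with made-up data, not our simulation code.

```r
# Illustrative sketch of PET, PEESE, and conditional PET-PEESE.
d <- c(0.61, 0.35, 0.48, 0.12, 0.55, 0.28)   # example effect sizes
v <- c(0.08, 0.02, 0.05, 0.01, 0.06, 0.02)   # example sampling variances

pet   <- lm(d ~ sqrt(v), weights = 1 / v)    # PET: predictor is the standard error
peese <- lm(d ~ v,       weights = 1 / v)    # PEESE: predictor is the variance

# Conditional rule: use PEESE's intercept only if PET's intercept is
# significantly greater than zero (one-tailed); otherwise use PET's intercept.
pet_p_one_tailed <- summary(pet)$coefficients["(Intercept)", "Pr(>|t|)"] / 2
use_peese <- coef(pet)[1] > 0 && pet_p_one_tailed < .05
estimate  <- if (use_peese) coef(peese)[1] else coef(pet)[1]
estimate
```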
It is worth noting that the downward bias shown by PET and PEESE when δ was large may be due to the fact that the standard error of d is a function of both the sample size and the observed d (see the variance equation above). Because larger values of d lead to larger standard errors, this relationship creates a small-study effect that mimics the effect of publication bias and leads to an overadjustment.

Technical notes: p-uniform
We used the default "P" method from the puniform package (van Aert, 2017) for R, which relies on the Irwin-Hall distribution. The package returns a one-tailed p-value by default. As all other methods use two-tailed p-values, we doubled the resulting p-value to achieve comparable results.
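A sketch of this adjustment is shown below. The call to puniform() and, in particular, the name of the returned one-tailed p-value element (pval.0) are assumptions that may differ across versions of the puniform package; the effect sizes and variances are made up.

```r
# Sketch: two-tailed p-value from p-uniform's one-tailed output.
library(puniform)

d <- c(0.61, 0.35, 0.48, 0.55)   # effect sizes of the observed studies (illustrative)
v <- c(0.08, 0.02, 0.05, 0.06)   # their sampling variances (illustrative)

res   <- puniform(yi = d, vi = v, side = "right", method = "P")  # default "P" (Irwin-Hall) method
p_one <- res$pval.0              # assumed element name for the one-tailed p-value
p_two <- min(1, 2 * p_one)       # doubled for comparability with two-tailed methods
p_two
```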

Technical notes: p-curve
We used the test for right skew with the Stouffer method as the hypothesis test of evidential value. A non-significant right-skew test means that H0: δ = 0 is not rejected. p-curve also provides a test of whether the observed p-curve is flatter than the curve expected under 33% power, which can be taken as an indicator of a lack of evidential value and thus of non-rejection of H0. Typically the two tests agree in the sense that when one is significant, the other is not. In a small fraction of all simulations, however, both tests were significant, indicating that the p-curve was both significantly right-skewed and significantly flatter than the 33%-power reference curve. Because we considered only the right-skew test, these cases were treated as rejections of H0.
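For the right-skew test, the Stouffer computation amounts to converting each significant (two-tailed) p-value into a pp-value, combining the corresponding z-scores, and evaluating the result against the standard normal distribution. The sketch below shows this full-curve computation only; the p-curve app additionally uses half p-curves and the 33%-power flatness test, which require noncentral distributions and are omitted here. The example p-values are made up.

```r
# Sketch: p-curve's full-curve right-skew test with the Stouffer method.
p  <- c(.001, .012, .031, .044, .020)   # two-tailed p-values of the significant studies
pp <- p / .05                           # pp-values: p conditional on significance (uniform under H0)
z  <- sum(qnorm(pp)) / sqrt(length(pp)) # Stouffer combination of the pp-values
p_right_skew <- pnorm(z)                # small value -> right skew -> evidential value (reject H0: delta = 0)
p_right_skew
```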

Assessing p-curve's performance in recovering the average effect size of all included studies
The standard goal of a random-effects meta-analysis is to estimate the mean and the variance of the true effects of all conducted studies. Simonsohn, Nelson, and Simmons (2014), in contrast, put forward a different interpretation of the p-curve estimate: "It is the average effect size one expects to get if one were to rerun all studies included in the p-curve" (p. 667). That is, p-curve does not attempt to recover the mean of all conducted studies, but rather the true mean of the significant studies (i.e., the studies that are entered into the p-curve analysis), corrected for publication bias.
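For reference, the following sketch outlines the loss-function logic behind the p-curve estimate for two-group t-tests with equal group sizes: find the candidate effect size under which the significant p-values are closest to uniformly distributed. The data and search interval are illustrative assumptions, and the p-curve app's actual implementation may differ in its details.

```r
# Sketch of the loss-function idea behind the p-curve effect size estimate
# (after Simonsohn, Nelson, & Simmons, 2014); illustrative data only.
t_obs  <- c(2.21, 2.50, 2.08, 3.10)   # t-values of the significant studies
n_per  <- c(25, 40, 30, 50)           # per-group sample sizes
df     <- 2 * n_per - 2
t_crit <- qt(.975, df)                # two-tailed significance threshold

pcurve_loss <- function(d_try) {
  ncp <- d_try * sqrt(n_per / 2)      # noncentrality parameter for equal groups
  # Probability of a t at least as large as observed, conditional on significance
  pp <- pt(t_obs, df, ncp, lower.tail = FALSE) / pt(t_crit, df, ncp, lower.tail = FALSE)
  ks.test(pp, "punif")$statistic      # distance of the pp-values from uniformity
}

est <- optimize(pcurve_loss, interval = c(0, 2))$minimum  # p-curve style estimate
est
```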
In our main analyses, we evaluated all methods with regard to their ability to recover the mean of all studies. In this supplemental analysis, we report how well the p-curve estimate recovers the mean of only those studies entered into the p-curve analysis (i.e., the significant studies).

References

Sterne, J. A., Gavaghan, D., & Egger, M. (2000). Publication and related bias in meta-analysis: Power of statistical tests and prevalence in the literature. Journal of Clinical Epidemiology, 53(11), 1119-1129.

Thompson, S. G., & Sharp, S. J. (1999). Explaining heterogeneity in meta-analysis: A comparison of methods. Statistics in Medicine, 18(20), 2693-2708.

van Aert, R. C. (2017). puniform: Meta-analysis methods correcting for publication bias.

Figure Captions

Figure 1. Funnel plots of simulated meta-analytic data sets. (A) In the absence of publication bias, data points form a symmetrical funnel, conforming closely to the true effect size when the standard error is small and spreading evenly when the standard error is large. Heterogeneity in the true effect size leads to greater spread. (B) Publication bias selectively removes non-statistically-significant effect size estimates. This censorship of small effect size estimates leads to an asymmetrical funnel and overestimation of the true effect size.

Figure 18. Null hypothesis rejection rates (H0RR) for all methods with strong publication bias. Color coding is as follows: darkest = H0RR < .50; medium = .50 ≤ H0RR < .80; lightest = .80 ≤ H0RR. Note: When δ > 0, H0RR is statistical power; when δ = 0, H0RR is the Type I error (false positive) rate.