Perceived morality of direct versus indirect harm: Replications of the preference for indirect harm effect

https://doi.org/10.15626/MP.2019.2134
Article type: Replication Report
Published under the CC-BY4.0 license
Open data: Yes
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: Yes
Edited by: Rickard Carlsson
Reviewed by: Daniël Lakens, Arvid Erlandsson
Analysis reproduced by: André Kalmendal
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/OSF.IO/3SCAF

Judgments of morality depend not only on the result of an action but also on the way it was performed. For instance, acts of omission are considered more moral than acts of commission despite leading to the same result (omission bias; Spranca, Minsk, & Baron, 1991). In their 2002 article, Royzman and Baron found that people preferred indirect harm to direct harm and considered indirect harm more moral (Studies 1 and 2). In addition, omission bias (Jamison, Yay, & Feldman, 2020) was found to be weaker for indirect compared to direct harm (Study 3).
What is the difference between direct and indirect harm? Consider two actors, Ann and Bob, with Ann inflicting harm on Bob. An example of direct harm would be for Ann to push Bob off a swing. An example of indirect harm would be for Ann to saw down the tree branch to which the swing is attached, which would in turn lead to Bob falling and getting hurt. Both actions lead to the same harmful outcome (Bob getting hurt), yet they differ in how directly the action is linked to the harm in the causal chain of events. In principle, the indirect action could be performed without Bob ever being involved and, in such a case, would result in no harm to Bob. Royzman and Baron (2002) hypothesized and found that even when a negative outcome is the same, people judge the morality of the actions leading to that outcome as dependent on whether the link between action and outcome was direct or indirect. This in turn resulted in a strategic preference for indirect harm: to minimize accountability when inflicting harm, people prefer inflicting indirect over direct harm.
Impact of "The preference for indirect harm"
Preference for indirect harm is central to the understanding of moral judgment. In his seminal study, Milgram (1974) observed that people were more likely to commit harm if they did not have physical contact with the victim, i.e., when the harm they had to inflict on the experimenter's confederates was less direct. In general, dislike for physical contact with the victim may be caused by an overall preference for indirect harm. Cushman, Young, and Hauser (2006) summarized and tested three principles of harm: action, intention, and contact. The second principle, which they termed the 'intention principle', is an extension of the preference for indirect harm: people prefer harm as a byproduct rather than the main goal of an action. They found corroborating evidence for indirect harm as an intuitive guide to moral judgment, building on work by Haidt and Hersh (2001), showing that participants were unable to explain why they would prefer indirect to direct harm. Hauser, Cushman, Young, Kang-Xing Jin, and Mikhail (2007) found further support for the preference for indirect harm across cultures, including that participants were unable to readily provide explanations for it. In line with these results, more recent research found further support for the intuitive nature of the preference for indirect harm, as evaluation mode (joint vs. separate) moderated the effect (Paharia, Kassam, Greene, & Bazerman, 2009).
Preference for indirect harm can be linked to various practices observed in everyday life. For example, Bennett (1966) compared direct and indirect action leading to the same outcome: the death of a fetus. Some Catholic hospitals, opposed to abortion on principle, would consent to performing a hysterectomy on pregnant women whose lives were in danger, while they would not consent to performing an abortion. The hysterectomy would not only kill the fetus, but also make the woman sterile. In these cases, Catholic hospitals would prefer an indirect action that leads to a worse harm (the death of the fetus and lifetime infertility for the woman) over a direct action leading to less harm (the death of the fetus), on the religious grounds that indirect harm to the fetus is acceptable, while direct harm is not.

Choice of study for replication
We chose the Royzman and Baron (2002) study based on two factors: absence of direct replications and impact. To the best of our knowledge there are no published direct replications of this study thus far. The article has had significant impact on scholarly research in the area of moral psychology. At the time of writing, there were 173 Google Scholar citations of the article and many important follow-up theoretical and empirical articles, such as the Cushman et al. (2006) three principles of harm, and the investigation of Hauser et al. (2007) on the dissociation between the conscious nature of moral judgments (such as preference for indirect harm) and the intuitive nature of moral justifications (such as the intuition principle).
The original article consisted of three scenario-based studies using university (Study 1: N = 176) and online samples (Study 2: N = 54; Study 3: N = 69). In Studies 1 and 2, Royzman and Baron (2002) asked participants to directly compare actions that lead to the same amount of harm and the same amount of a beneficial outcome. In the first scenario of Study 2, for example, participants compared the morality of two actions (action A and action B) leading to the same harmful outcome, preventing an alcoholic patient from receiving a liver transplant, either by lowering his priority on an organ transplant list (direct option) or by increasing everyone else's priorities (indirect option). Participants indicated whether they perceived action A or action B to be more wrong. In the present investigation, we conducted two replication attempts of the two scenarios fully detailed in Study 2 of Royzman and Baron (2002).

Original findings in target article
A summary of the findings in the target article is provided in Table 1. The preference for indirect harm effect was d = 0.70, 95% CI [0.40; 0.99], a medium to strong effect. The original authors examined considerations and found support for all proposed mechanisms, with statistically significant correlations (r = .47 to r = .70) when participants deemed the consideration a reason for moral judgment ('predicted' column in Table 1). They found weaker and sometimes statistically non-significant correlations (r = .01 to r = .16) when participants did not deem the consideration a reason underscoring a preference for indirect harm ('opposite' column in Table 1).
Note to Table 1. Correlations with the morality question, original study, according to whether or not it was cited as a reason for a moral judgment, from Royzman and Baron (2002), p. 174. 'Predicted' indicates the share of responders indicating that the direct action was more wrong; 'Opposite' indicates the share of responders indicating that the indirect action was more wrong. All correlations above .092 are significant at α = .05.

Pre-registration
In each of the replication studies, we pre-registered the experiment on the Open Science Framework and data collection was launched soon after. Pre-registrations, power analyses, disclosures, and all materials used in the experiments are available in the supplementary materials. These together with data and code were shared on the Open Science Framework (project: osf.io/ewq8g; pre-registration Hong Kong undergraduate sample: osf.io/qdn2m; pre-registration online American sample: osf.io/hwsdc).

Power analyses and deviations from power analysis preregistration
Power analyses indicated that 24 participants would be sufficient to have 95% power of detecting the original effect (d = 0.70) with a one-tailed alpha of .05, using a one-sample t-test as in the original article. The preregistration for the first data collection planned to sample 70 participants among Hong Kong University students, a decision based on convenience, as these participants were students in a Psychology course. Of that sample, we were able to collect 49 participants, given that participation was voluntary. After excluding the students who designed this very replication, 46 participants remained. Sensitivity analyses indicate that this sample size provides approximately 99.8% power to detect the original effect with a one-tailed alpha of .05.
The second data collection, conducted online on Amazon Mechanical Turk (MTurk), was part of a larger project of replications of psychology findings, and this study was combined with other replications presented in random order. The final sample size (N = 314) is due to power analyses related to the other replications running in the same data collection. Sensitivity analyses indicate 99.9%+ power to detect the original effect with a one-tailed alpha of .05.
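The power figures reported above can be approximated with the noncentral t distribution. The following is a minimal sketch, assuming a one-sided, one-sample t-test as described in the text; the function names are illustrative and are not taken from the authors' analysis code.

```python
import math
from scipy import stats


def one_sample_power(d, n, alpha=0.05):
    """Power of a one-sided, one-sample t-test for a true standardized effect d."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha, df)   # one-tailed critical value under H0
    ncp = d * math.sqrt(n)                # noncentrality parameter under H1
    return stats.nct.sf(t_crit, df, ncp)  # P(T > t_crit) when the true effect is d


def min_n_for_power(d, target=0.95, alpha=0.05, n_max=1000):
    """Smallest sample size whose power reaches the target (simple linear search)."""
    for n in range(2, n_max + 1):
        if one_sample_power(d, n, alpha) >= target:
            return n
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    d_original = 0.70  # effect size reported for the target article
    print("minimum n for 95% power:", min_n_for_power(d_original))
    for n in (24, 46, 314):  # planned minimum and the two achieved samples
        print(n, round(one_sample_power(d_original, n), 4))
```

Dedicated power software may differ slightly in rounding, so the printed values should be read as approximate checks rather than exact reproductions of the reported numbers.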

Procedure
The first replication was considered a pre-test and conducted in an undergraduate course at a university in Hong Kong. Students worked in teams of 3 to 6 to design and run a series of replications, and one of the replications was Royzman and Baron's (2002) study. The students then served as the target sample for the experiments designed by their classmates, which they had not designed themselves and of which they had no knowledge prior to participation. The course materials covered classic judgement and decision-making literature, which means that the students were made aware of a wide array of heuristics and biases, and the experiment therefore should be considered a very conservative test of the effect in a non-naive sample.
Students were randomly assigned to replication teams with different target studies for replication. Student groups designed the experiment survey, conducted effect size calculations, ran power analyses, and wrote pre-registrations. Pre-registrations on the OSF and data collections were managed by the course instructor. All the students registered in the course were invited to take part as respondents in the study. To ensure anonymity, students were only asked to indicate which replication group they belonged to, and they were later excluded from the data analysis of the study they designed. The final sample included the students who were not involved in planning the study, totaling 46 participants (15 males, 31 females; M_age = 20.2, SD_age = 0.99).
For the second replication, two advanced course undergraduate students unrelated to the first replication worked independently to analyze the target article. They conducted effect-size calculations and power analyses, and each separately wrote a pre-registration plan. They then reviewed each other's work and made final revisions, which were reviewed by the teaching assistant and the course coordinator. Both plans were pre-registered on the OSF prior to data collection by the corresponding author, who was the course instructor of the first replication and the advanced course. The final sample included 314 American MTurk workers, recruited using TurkPrime.com (Litman, Robinson, & Abberbock, 2017) (173 males, 141 females; M_age = 36.8, SD_age = 11.3).
We note that the pre-registration plans included different references to possible exclusion criteria addressing generalized factors such as seriousness, English proficiency, etc. We conducted our analyses both with and without exclusions and found that exclusions had little effect on the results. For the sake of brevity, the findings reported below are without any exclusions. A comparison of the target article sample and the replication samples is provided in Table 3. In both replication attempts, participants evaluated the two scenarios described in detail (out of the eight total scenarios; six were not described) in Royzman and Baron's (2002) Study 2, assessing participants' preference for indirect harm.
The following was the organ transplant scenario (Scenario 1 in the target article): "X is in charge of a computer database controlling the distribution of available organ transplants. The first person in line for a difficult-to-get liver transplant is Mr. Y. Mr. Y was an alcoholic and his drinking ruined his liver. Y no longer drinks. The rules say that past alcohol use should not be considered, but X still thinks that Y should not get priority, so he decides to break the rules and prevent Y from getting the next liver. He can do this in two ways:
•[Direct] X can lower Y's priority score by 20 points.
•[Indirect] X can raise everyone else's priority score by 20 points."
The following was the zoo scenario (Scenario 2 in the target article): "A zoo has been created to conserve 200 species of wild animal that have become extinct elsewhere. The zoo is now threatened with a parasitic disease that infects the animals. X, the zookeeper, has two options:
•[Direct] Painlessly poison the animals in which the parasite reproduces, thus saving the other animals. Five species will become extinct.
•[Indirect] Poison the parasites. The same poison will cause five animal species to become extinct.
In both cases, X is sure that he will save most of the species and lose five. The five lost species are of equal value in both cases."

Measures
Morality. After each of the two scenarios, participants were asked which of the two options was morally worse (1 = A is much more wrong; 2 = A is a little more wrong; 3 = Equal; 4 = B is a little more wrong; 5 = B is much more wrong; note that higher scores indicate higher morality for the indirect option).
Reasons: Considerations for morality evaluations. Participants compared the two options in each scenario on five factors. Four of them (directness, intentionality, appearance, and action-omission) were rated on a five-point scale (1 = factor is more applicable to the direct harm option, thus the option is more immoral; 2 = direct harm option is not more immoral even though the factor is more applicable to it; 3 = factor is equally applicable to both options [equal morality]; 4 = factor is more applicable to the indirect harm option, thus the option is more immoral; 5 = indirect harm option is not more immoral even though the factor is more applicable to it). The fifth, probability of harm, was rated on a three-point scale (1 = More likely to cause harm in A than in B; 2 = Equally likely to cause harm in A and B; 3 = More likely to cause harm in B than in A). Measures are reported in full in the supplementary materials.

Replications evaluation
We aimed to compare the replication effects with the original effects in the target article (d = 0.70, 95% CI [0.40; 0.99]) using two methods: (1) we categorized the comparison of effects using the criteria set by LeBel, Vanpaemel, Cheung, and Campbell (2019), and (2) we conducted equivalence testing using the TOSTER module (Lakens, Scheel, & Isager, 2018).
Figure 1. Plots for the morality ratings prior to categorizing. The two plots on the first row are for the Hong Kong sample and the two plots on the second row are for the American sample. The first of every plot pair is for the organ scenario, and the second is for the zoo scenario. The scale is from 1 to 5, with 3 representing the mid-point. Higher values indicate higher morality ratings for the indirect option.

Preference for indirect harm
Violin plots for the raw morality ratings (on a scale from 1 to 5) are provided in Figure 1. Across the two scenarios in two experiments the ratings were higher than the midpoint neutrality rating of 3.
For the analyses, we followed the method set by Royzman and Baron (2002) and recoded the morality ratings as 0 for the indifference point (3 = Equal), 1 for the direct action being more wrong (1 = A is much more wrong; 2 = A is a little more wrong), and -1 for the indirect action being more wrong (4 = B is a little more wrong; 5 = B is much more wrong). We then ran a series of one-sample, one-sided t-tests comparing to the converted midpoint of 0, followed by dependent t-tests comparing the organ and zoo scenarios in each sample (two-sided), and finally equivalence testing comparing to the effects of the target article's original findings. Note that this strategy, albeit not ideal given the low number of response categories (three) and the grouping of responses, was the one used by the original authors. We therefore complemented these analyses with non-parametric testing (Wilcoxon's signed-rank test) for the one-sample tests against the scale midpoint and for the comparison between scenarios (alpha = .05). The effect size comparisons should, however, be interpreted with caution, since the original effect size was obtained from data across eight scenarios (we did not have access to the remaining six scenarios, to the original data, or to the effect sizes per scenario). The findings are summarized in Table 4. The findings were consistent across the two replication attempts, with similar point estimates and overlap in 95% confidence intervals. The effects in both replications were in the same direction and supported the original study's findings, but with weaker effects.
Table 4 note: ... (Lakens et al., 2018). "W" indicates the W statistic from Wilcoxon's signed-rank non-parametric test.
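The recoding and one-sample tests described above can be sketched as follows. This is an illustration only: it assumes that "direct action more wrong" is coded as +1, so that a positive mean indicates a preference for indirect harm, and the response vectors are made-up placeholders, not the study data. The function names are hypothetical.

```python
import numpy as np
from scipy import stats


def recode(raw):
    """Collapse the 1-5 rating into three categories.

    Assumed mapping: 1-2 (A, direct, more wrong) -> +1; 3 (equal) -> 0;
    4-5 (B, indirect, more wrong) -> -1.
    """
    raw = np.asarray(raw)
    return np.where(raw < 3, 1, np.where(raw > 3, -1, 0))


def one_sample_summary(raw):
    """One-sided tests of the recoded ratings against the midpoint of 0."""
    x = recode(raw).astype(float)
    t_res = stats.ttest_1samp(x, popmean=0.0, alternative="greater")
    # Wilcoxon signed-rank test; zero differences are dropped by default.
    w_res = stats.wilcoxon(x, alternative="greater")
    return {
        "mean": x.mean(),
        "d": x.mean() / x.std(ddof=1),  # Cohen's d for a one-sample design
        "t": t_res.statistic,
        "p_t": t_res.pvalue,
        "W": w_res.statistic,
        "p_w": w_res.pvalue,
    }


# Placeholder responses for 40 hypothetical participants (NOT the real data).
organ = np.array([1] * 10 + [2] * 8 + [3] * 15 + [4] * 4 + [5] * 3)
rng = np.random.default_rng(0)
zoo = rng.permutation(np.array([1] * 6 + [2] * 9 + [3] * 17 + [4] * 5 + [5] * 3))

if __name__ == "__main__":
    print(one_sample_summary(organ))
    # Two-sided dependent t-test comparing the two scenarios within participants.
    print(stats.ttest_rel(recode(organ), recode(zoo)))
```

Because the recoded variable has only three levels, the t-test treats a coarse ordinal measure as continuous, which is why the signed-rank test is a useful complement.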

Reasons: Considerations for morality evaluations
We followed the procedure in the target article to test reasons for morality evaluations and the preference for indirect harm effect by examining correlations between ratings of morality and considerations -directness, appearance, omission, and intent. Ratings were coded as either being more applicable to the direct, indirect, or neither option, and then as either being a reason or not for the morality ratings. The findings are summarized in Table 5.
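One plausible reading of this coding step is sketched below. It is an assumption, not the authors' documented procedure: each five-point consideration rating is split into a "reason" score (+1 for code 1, -1 for code 4, 0 otherwise) and a "not a reason" score (+1 for code 2, -1 for code 5, 0 otherwise), and each score is correlated with the recoded morality rating. The function names and example values are hypothetical.

```python
import numpy as np


def split_consideration(codes):
    """Split 1-5 consideration codes into 'reason' and 'not a reason' scores.

    Assumed coding (illustrative only):
      reason:      1 -> +1 (applies to direct, makes it worse), 4 -> -1, else 0
      not_reason:  2 -> +1 (applies to direct, not a reason),   5 -> -1, else 0
    """
    codes = np.asarray(codes)
    reason = np.where(codes == 1, 1, np.where(codes == 4, -1, 0))
    not_reason = np.where(codes == 2, 1, np.where(codes == 5, -1, 0))
    return reason, not_reason


def consideration_correlations(morality, codes):
    """Pearson correlations of recoded morality (+1/0/-1) with each score."""
    reason, not_reason = split_consideration(codes)
    r_reason = np.corrcoef(morality, reason)[0, 1]
    r_not_reason = np.corrcoef(morality, not_reason)[0, 1]
    return r_reason, r_not_reason


# Hypothetical example for ten participants (NOT the study data).
codes = [1, 1, 2, 3, 4, 5, 1, 3, 2, 4]
morality = [1, 1, 0, 0, -1, 0, 1, 0, 1, -1]

if __name__ == "__main__":
    print(consideration_correlations(morality, codes))
```

Under this coding, a consideration that participants cite as a reason should track the morality judgment closely, while one they say is not a reason should track it only weakly.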
We found support for the original study's findings, with medium to strong correlations (Hong Kong organ: r = .29 to .71; Hong Kong zoo: r = .32 to .90; MTurk organ: r = .36 to .56; MTurk zoo: r = .49 to .63) between each factor and morality ratings when the factor was indicated as a reason, and much weaker correlations, of which half were negative, contrary to predictions (Hong Kong organ: r = -.11 to .26; Hong Kong zoo: r = -.20 to .22; MTurk organ: r = .10 to .29; MTurk zoo: r = .13 to .29) when the factor was not indicated as a reason. Correlations for probability ratings were all positive and ranged from r = .14 to .50 across the samples and scenarios.
Royzman and Baron (2002) further added an indicator to better contextualize the psychological mechanisms underlying the preference for indirect harm. They classified answers to the considerations into two categories, 'predicted' and 'opposite'. 'Predicted' represented the share of responders indicating that the direct action was more wrong (thus indicating preference for indirect harm, in line with predictions); 'Opposite' represented the share of responders indicating that the indirect action was more wrong (thus indicating preference for direct harm, contrary to predictions). Royzman and Baron (2002) further classified these answers based on whether participants found that the specific consideration was a reason for moral judgment (indicated in Table 5 as 'Reason') or not (indicated in Table 5 as 'Not a Reason', except for Probability). The researchers found that, in general, when indicating that the specific consideration was a reason for moral judgment, more participants showed a preference for indirect harm and indicated the direct action as more wrong (ranging from 15% to 16.9%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 2.8% to 3.9%).
Similarly, when indicating that the specific consideration was not a reason for moral judgment, more participants showed a preference for indirect harm and indicated the direct action as more wrong (ranging from 17.4% to 32.4%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 5.1% to 6.5%).
In the replications we conducted, findings were broadly in line with the results of Royzman and Baron (2002). When indicating that the specific consideration was a reason for moral judgment, more participants showed a preference for indirect harm and indicated that the direct action was more wrong (ranging from 17.6% to 66.7%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 0% to 12.3%). Similarly, when indicating that the specific consideration was not a reason for moral judgment, more participants showed a preference for indirect harm and indicated that the direct action was more wrong (ranging from 13.4% to 51.6%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 6.5% to 29%). In all cases, the proportion of participants who indicated that the direct action was more wrong (thus indicating preference for indirect harm) was larger than the proportion indicating that the indirect action was more wrong, irrespective of whether they considered the specific consideration to be a reason for moral judgment.
Overall, in both replication attempts, we successfully replicated the correlational evidence that Royzman and Baron (2002) presented when investigating potential factors underlying the preference for indirect harm (probability of harm, intent, appearance, omission, and directness). This suggests that all of these factors likely play a part as psychological underpinnings of the preference for indirect harm. However, the evidence presented is correlational and only shows a statistical association rather than a clear cause-effect path. Further research may experimentally investigate the causality of these associations by, for example, manipulating intent or appearance within direct and indirect harm. This is especially interesting in light of the literature showing that moral judgment in general, and preference for indirect harm in particular, are intuitive processes for which people are unable to quickly provide a justification (Cushman et al., 2006; Paharia et al., 2009) or explain why people prefer indirect harm to direct harm.

Mini meta-analysis effect summary
We summarized the findings of the two replication studies together with the target article's original findings using mini meta-analyses for each of the scenarios to assess the overall effect size (Goh, Hall, & Rosenthal, 2016; Lakens & Etz, 2017).
Table 5. Reasons for morality and the preference for indirect harm effect: frequencies and correlations. Note. Correlation (Pearson's r) between morality and considerations, according to whether or not it was cited as a reason for a moral judgment. 'Predicted' indicates the share of responders indicating that the direct action was more wrong; 'Opposite' indicates the share of responders indicating that the indirect action was more wrong. * p < .05; ** p < .01; *** p < .001.
We successfully replicated findings from Royzman and Baron (2002) Study 2 with a non-naive undergraduate sample from Hong Kong and an American MTurk sample. These results provide empirical support for the preference for indirect harm phenomenon: people tend to prefer indirect harm over direct harm. We summarize the replications as "signal and consistent" according to LeBel et al.'s (2019) replication success criteria, yet we note that equivalence tests indicated overall weaker effects compared to the target article findings. Mini meta-analyses of the replications and original findings indicated weak to medium effects that are different from null.
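A mini meta-analysis of this kind can be sketched as a fixed-effect, inverse-variance weighted average of standardized effects. The sketch below is an assumption about the general approach, not the authors' actual computation: it uses a common large-sample approximation for the variance of a one-sample Cohen's d, and the two replication effect sizes are placeholder values, not the values reported in Table 4.

```python
import math


def d_variance(d, n):
    """Approximate sampling variance of a one-sample Cohen's d."""
    return 1.0 / n + d ** 2 / (2.0 * n)


def fixed_effect_meta(studies):
    """Inverse-variance weighted mean of d with a 95% confidence interval.

    `studies` is a list of (label, d, n) tuples.
    """
    weights = [1.0 / d_variance(d, n) for _, d, n in studies]
    pooled = sum(w * d for w, (_, d, _) in zip(weights, studies)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


studies = [
    ("Original (Study 2)", 0.70, 54),        # d and N as reported in the text
    ("Hong Kong (placeholder d)", 0.40, 46),  # hypothetical d, real N
    ("MTurk (placeholder d)", 0.30, 314),     # hypothetical d, real N
]

if __name__ == "__main__":
    pooled, lo, hi = fixed_effect_meta(studies)
    print(f"pooled d = {pooled:.2f}, 95% CI [{lo:.2f}; {hi:.2f}]")
```

Because the weights are driven largely by sample size, the large online sample dominates the pooled estimate, which is one reason a pooled effect can sit well below the original estimate.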

What may explain the weaker effects? Sample and time are the typical suspects. Royzman and Baron's (2002) study was conducted using an Internet sample, resembling the MTurk sample in the replications, although MTurk workers are likely more experienced in participating in online studies (Chandler, Mueller, & Paolacci, 2014). Compared to the original sample, the Hong Kong sample was of a different cultural and linguistic background and had a much higher familiarity with heuristics and biases. We believe, however, that both sample and the passing of time are limited explanations, given that our other judgment and decision-making replications with similar samples showed high consistency between these two samples and the original findings (e.g., Chandrashekar et al., 2020; Chen et al., 2020). We cannot, however, rule out any possibility with confidence, and the many differences between the original study and our replications make it difficult to determine the cause. A possible future direction is to conduct a meta-analysis of the literature that tests for moderators.
Our findings suggest that the classic phenomenon is replicable, yet that we may need to update our expectations regarding effect size. Replications are especially useful in this regard. Researchers can now use the replications' effect-size as an updated and more conservative estimate of the effect when designing their follow-up studies.
Gilad Feldman (corresponding author; GF from now on and in the table below) was the course instructor for fundamentals and advanced social psychology courses (PSYC2020/3052) and led the two reported replication efforts in those courses. GF supervised each step in the project, conducted the pre-registrations, and ran data collection. Ignazio Ziano (joint first author; IZ from now on and in the table below) integrated the two replication efforts into a manuscript with validation and further extensions of the statistical analyses. GF and IZ jointly finalized the manuscript for submission.
Yu Jie Wang and Sydney Susanto Sany (joint first authors) conducted the US replication as part of the advanced social psychology course (identified as Students PSYC3052 in the table below).
Long Ho Ngai, Yuk Kwan Lau, Iban Kaur Bhattal, Pui Sin Keung, Yan To Wong, and Wing Zhang Tong conducted the Hong Kong replication as part of the fundamentals of social psychology course (joint fourth authors; identified as Students PSYC2020 in the table below).