Local Fit Evaluation of Structural Equation Models Using Graphical Criteria

Evaluation of model fit is critically important for every structural equation model (SEM), and sophisticated methods have been developed for this task. Among them are the χ² goodness-of-fit test, decomposition of the χ², derived measures like the popular root mean square error of approximation (RMSEA) or comparative fit index (CFI), and inspection of residuals or modification indices. Many of these methods provide a global approach to model fit evaluation: A single index is computed that quantifies the fit of the entire SEM to the data. In contrast, graphical criteria like d-separation or trek-separation allow derivation of implications that can be used for local fit evaluation, an approach that is hardly ever applied. We provide an overview of local fit evaluation from the viewpoint of SEM practitioners. In the presence of model misfit, local fit evaluation can potentially help in pinpointing where the problem with the model lies. For models that do fit the data, local tests can identify the parts of the model that are corroborated by the data. Local tests can also be conducted before a model is fitted at all, and they can be used even for models that are globally underidentified. We discuss appropriate statistical local tests, and provide applied examples. We also present novel software in R that automates this type of local fit evaluation.

Translational Abstract

Models with many variables are routinely tested against observed data to determine whether the model does a good job describing the actual data. These tests are often performed in a global fashion, meaning that a single number is computed to describe the overall fit between model and data. This makes it difficult to determine whether the whole model or only a small portion of the model may be wrong. We present an alternative approach in which the global test is broken down into local tests.
These local tests focus on single implications of the model, and inform the researcher whether these implications are likely true or false. Through several examples, we discuss local tests that can be derived from models in which all variables are observed, and local tests that can be derived from models in which some variables are assumed to be unobserved (latent). Then we explain what statistical tests can be performed, and present an R package that automates the task of finding and evaluating local tests in large models.

Evaluation of model fit is an integral part of any research project that involves structural equation models (SEMs). Researchers who formulate SEMs are interested in whether the proposed model has adequate fit to the observed data, and they often spend a large amount of time on this testing process. There are a variety of fit measures that are routinely reported, including the global χ² statistic of the model, decompositions of the χ², and various derived fit indices. In addition, researchers often examine standardized residuals between the observed and model-implied covariance matrix and may also consult modification indices and expected parameter change values. The χ² statistics are often reported with a significance test, whereas the derived fit measures are typically evaluated based on cut-off values and are interpreted more like effect sizes.
Researchers almost always report the global χ² statistic and its associated p value (Jöreskog, 1969). In addition, the χ² is often decomposed into portions that are attributable to the measurement model or the structural part of the model. This decomposition is sometimes referred to as a two-step procedure in which first the fit of the measurement model is evaluated, and only after that, the fit of the full model that includes both measurement and structural portions (Anderson & Gerbing, 1988). Going even further in this decomposition, James, Mulaik, and Brett (1982) suggested that portions of the model that posit the existence or the absence of an effect should be tested separately. Such a decomposition, along with derivation of fit indices, is given in Lance, Beck, Fan, and Carter (2016).
The global χ² test is not universally endorsed. An often-levied criticism is that the test yields rejection of the model as sample sizes increase, even in the presence of only small misspecification. Additional fit measures are often reported instead. A popular one is the root mean square error of approximation (RMSEA; Steiger, 1990), which is derived from the model χ², the degrees of freedom, and the sample size. Another one is the comparative fit index (CFI; Bentler, 1990), which evaluates the relative distance between a null model (in which all variables are assumed to be independent) and the actual model. Both the RMSEA and the CFI rely on cut-off values that are based on approximate rules as to what constitutes adequate fit. The RMSEA is usually also supplemented with a confidence interval and a test of close fit (Steiger, 2004), which is a significance test of the observed RMSEA against a minimum threshold, usually .05.
A form of local fit assessment is provided by so-called modification indices and expected parameter change values. Modification indices are single-degree-of-freedom χ² tests that show what would happen to the overall global χ² if an additional arrow were added to the model. Expected parameter change values quantify the magnitude of potentially added paths. This data-driven approach to model modification has occasionally been criticized as being atheoretical, capitalizing on chance, and leading to models that often cannot be replicated in new samples (MacCallum, 1986; MacCallum, Roznowski, & Necowitz, 1992). 1 The reliance on χ²-based measures is so prevalent that alternatives are hardly ever considered. However, it is possible (and, we argue, fruitful) to perform local testing beyond the modification index and expected parameter change. In our experience, a majority of applied researchers using SEM are unaware that such tests even exist, and local tests are currently not featured in any of the leading SEM programs. The goal of the present article is the following: We will present two graphical criteria, d-separation and trek-separation, that yield two types of local tests, conditional independence tests and tetrad tests. We will explain how both graphical criteria can be applied to enumerate local tests, and how the derived statistical tests can be performed on data. In particular, we will show that such tests can be performed either by classic significance testing or by interval estimation of the associated effect sizes. We then will present a series of examples that explain how local fit testing can lead to insights about model misspecification.
Some of these local tests are already implemented in existing software (Bauldry & Bollen, 2016; Hipp, Bauer, & Bollen, 2005; Scheines, Spirtes, Glymour, Meek, & Richardson, 1998; Textor, Hardt, & Knüppel, 2011), but we will also present a novel software package, "dagitty" (Textor, van der Zander, Gilthorpe, Liśkiewicz, & Ellison, 2016), written in R, that automates all of these tasks and can be used in conjunction with existing SEM software, notably lavaan (Rosseel, 2012). Importantly, our article does not aim to exhaustively compare the differences between existing fit measures and the local tests we propose. Instead, we simply provide a tutorial on what kinds of local tests exist and how they can be applied. We occasionally draw comparisons to other types of model testing strategies, but we do not claim that local tests will be superior to all other types of testing in all situations.
We should also point out that local tests have a long history in SEM research. In a classic textbook by Saris and Stronkhorst (1984), the authors briefly discussed some of the methods that we cover later. They ultimately dismissed local fit evaluation as obsolete and concluded that local fit evaluation has the primary disadvantage of potentially yielding contradicting results. Some tests could suggest support for the model; other tests could indicate rejection of the model. This was problematic, because one would have to make a decision about the model as a whole. A reasonable counterargument is that this is an advantage of local fit evaluation over global testing. Local tests should be able to tell which parts of a model are supported by the data, and which ones are not, and we should seek out this information. A further concern by Saris and Stronkhorst (1984) was efficiency. This has been partly addressed by the rapid advance in computing power over the past three decades, and importantly, efficient graphical criteria are now available that allow us to rapidly derive local tests from the graphical model structure.
Our work also draws on several earlier publications (Bollen & Ting, 1993; Pearl, 1988, 2000; Shipley, 2000) that discussed various approaches to local fit evaluation. Our own contribution is to introduce readers to local fit evaluation, by providing them a comprehensive review of existing tests based on graphical criteria, by explaining how our novel software can automate the process of local fit evaluation, and by introducing some novel local fit evaluation ideas, including the development of local equivalents for global fit indices like the RMSEA or CFI.

Types of Local Tests
We will describe two types of local tests that are based on graphical criteria. These local tests can be derived before any data are observed, because they rely purely on the structure of the graphical model. The statistical tests that are associated with these graphical criteria can be performed once data are available.
Before describing the two criteria and tests, we first define the following terms: A graphical model consists of variables and paths. A path can be a one- or two-headed (bidirected) arrow between two variables. A one-headed arrow denotes a direct causal relationship between two variables. A double-headed arrow denotes an association between two variables that is caused by an unobserved latent variable. We refer to models without latent variables as path models. We further define a route as any sequence of paths that can be obtained by moving through the model along paths (where moving against the direction of an arrow is allowed). Paths can occur multiple times in a given route. 2 With these definitions in hand, we can now turn to the two graphical criteria.

d-Separation
The d-separation criterion was first developed by Pearl (1995). Introductions to d-separation in the social sciences are provided by Hayduk and Glaser (2000), and in the context of missing data by Thoemmes and Rose (2014) and Thoemmes and Mohan (2015). We will now review the d-separation criterion using the concepts of routes, as defined previously.
The d-separation criterion informs the researcher what kinds of conditional independencies are encoded in a graphical model. Importantly, these conditional independencies hold under all parameterizations of the model. That means that if d-separation implies a particular conditional independence, then this conditional independence must hold regardless of the functional form a particular arrow takes on. For example, certain arrows between two variables in a graphical model could signify linear relationships or complex nonlinear relationships, and in both instances a conditional independence would be implied if the d-separation criterion holds.
For didactic reasons, we will first explain d-separation using four special cases; namely, four trivially small models with three variables and two paths. These four models form the building blocks for the more general definition. These models use the three variables X, M, and Y and contain a path between X and M, as well as another path between M and Y. In the graphical models literature, these models are typically named chain (X → M → Y), inverse chain (X ← M ← Y), fork (X ← M → Y), and collider (X → M ← Y); they are displayed in Table 1. Each of these models can be expressed as a small set of regression equations. The collider corresponds to the single equation M ~ X + Y, whereas the other three models each correspond to two different equations. Each of these models implies a certain (conditional) independence statement.
Assume that we have four data sets that were generated by the models in Table 1. In the collider case, X and Y must be independent variables; otherwise, their error terms would be correlated. Therefore, the collider model has the testable implication Cov(X, Y) = 0. The collider model is the only model out of the four models in which this unconditional independence holds. In the other three cases, X and Y are in general unconditionally dependent (e.g., correlated), which is not a testable implication because the dependence can be arbitrarily weak. This illustrates the first type of testable implication that we can derive from a graphical model, namely, vanishing covariances.
The second type of implication we can derive is a vanishing conditional (or partial) covariance. Consider the fork model, which corresponds to the following two equations: X = β1·M + ε1 and Y = β2·M + ε2. Based on these equations, Cov(X, Y) can be written as Cov(β1·M + ε1, β2·M + ε2). However, if we now hold M constant, then this expression reduces to Cov(ε1, ε2), which is zero because the two error terms are independent. It follows that Cov(X, Y | M) = 0. A similar argument could be made for the chain and the inverse chain. In the collider case, however, conditioning on M renders X and Y dependent if both associated regression coefficients are nonzero, and we get no testable implication on the conditional covariance Cov(X, Y | M).
In other words, the implied unconditional and conditional independencies are reversed by conditioning on M (Table 1). So far, we have only included paths with a single arrowhead (directed paths), but triplets with bidirected arrowheads would work the exact same way. For example, the model obtained by taking the chain model and replacing the path X → M with a bidirected arrow, yielding X ↔ M → Y, would still be a chain, because the variable in the middle (from now on referred to as the "midpoint") has one arrowhead going in and one arrowhead going out. Replacing the second path with a bidirected arrow as well, yielding X ↔ M ↔ Y, would result in a collider model, because the midpoint has two arrowheads pointing to it. As an alternative, one could consider that every path with bidirected arrowheads is simply an expression of an unobserved variable with two directed paths; for example, X ↔ Y would become X ← L → Y. The same criterion as before is then applied to the expanded model that includes the latent variable L. Results with respect to the observed variables from the model with bidirected paths and the model with latent variables would be identical.
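These four implications can be checked numerically. The following sketch (shown in Python for self-containedness; the path coefficients are arbitrary illustration values, and the helper function is our own) simulates a fork and a collider and computes the relevant marginal and partial correlations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def partial_corr(a, b, c):
    """Correlation of a and b after linearly partialing out c."""
    res_a = a - np.polyval(np.polyfit(c, a, 1), c)
    res_b = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(res_a, res_b)[0, 1]

# Fork (X <- M -> Y): M is a common cause of X and Y
M = rng.normal(size=n)
X = 0.8 * M + rng.normal(size=n)
Y = 0.5 * M + rng.normal(size=n)
fork_marginal = np.corrcoef(X, Y)[0, 1]    # clearly nonzero
fork_partial = partial_corr(X, Y, M)       # near zero: Cov(X, Y | M) = 0

# Collider (X -> M <- Y): X and Y jointly cause M
X2 = rng.normal(size=n)
Y2 = rng.normal(size=n)
M2 = 0.8 * X2 + 0.5 * Y2 + rng.normal(size=n)
coll_marginal = np.corrcoef(X2, Y2)[0, 1]  # near zero: Cov(X, Y) = 0
coll_partial = partial_corr(X2, Y2, M2)    # nonzero: conditioning opens the route
```

The marginal and conditional patterns reverse between the two models, matching the implications summarized in Table 1.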
We can generalize these principles to arbitrarily large models, using the notation defined previously. First, if there exists no route at all between two variables X and Y, then the graphical model implies that X and Y are independent. However, this can only be the case if the model consists of at least two independent parts that are not connected by any paths. Such models rarely occur because they would only be used if the data to be modeled consisted of two independent subsets. In the previous examples, we have considered all possible routes of length 2 (i.e., routes that consist of two paths) between X and Y. We have seen that X and Y are guaranteed to be uncorrelated only if the route between them contains a collider. This translates to arbitrary models as follows. We call a given route between two variables X and Y open if it does not contain any colliders. Likewise, if it does contain a collider, we may call it closed. The implication Cov(X, Y) = 0 holds if and only if there exists no open route between X and Y. Note that there can be many routes between X and Y, but we need to require that all of these routes are closed. Using this simple rule, we can identify the variable pairs that are guaranteed to be uncorrelated simply by tracing the paths in a model.
This previous rule only identifies unconditional independence implications and does not consider that we may condition on a set of variables Z. In our previous examples, we saw that conditioning on the midpoint potentially induces a correlation for colliders and removes a correlation for chains, inverse chains, and forks. We can generalize this as follows. A route between X and Y is open with respect to Z if, for every triplet of variables in this route that forms the collider model, the midpoint of this collider triplet is in Z, and for every triplet of variables that forms a chain, inverse chain, or fork, the midpoint of this triplet is not in Z. Equivalently, we may say that a route between X and Y is closed with respect to Z if, for at least one triplet of variables in this route that forms the collider model, the midpoint of this collider triplet is not in Z, or, for at least one triplet of variables that forms a chain, inverse chain, or fork, the midpoint of this triplet is in Z. For two variables X and Y and a given set Z of other variables, the implication Cov(X, Y | Z) = 0 must hold if and only if there exists no route between X and Y that is open with respect to Z. Now, for any given variable pair (X, Y) in the model, we can have one of the following cases.
1. X and Y are connected by a path. Then no unconditional independence statement is implied, and it is impossible to find a set Z that leads to a conditional independence, because the direct path between X and Y always remains an open route. We then say that X and Y are d-connected.
2. X and Y are not connected by a path, and there exists no open route between them. This implies Cov(X, Y) = 0. We then say that X and Y are d-separated.
3. X and Y are not connected by a path, but there exist open routes between them. However, there exists a set Z such that all routes between X and Y are closed by Z. This implies Cov(X, Y | Z) = 0. We then say that X and Y are d-connected, and are d-separated given Z. Note that there may be more than one set of variables that closes all routes.
4. X and Y are not connected by a path, but there exist open routes between them that cannot be closed by any set Z. This yields no implication, and can only occur in models with bidirected arrows and/or cycles. We then say that X and Y are d-connected and cannot be d-separated.
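The route-tracing rules behind these four cases can be automated. As a minimal illustration (a Python sketch of the standard reachability algorithm, not the implementation used by dagitty; the dict-based graph encoding and function name are our own), the following decides d-separation in models with one-headed arrows; bidirected arrows can first be replaced by explicit latent variables, as described above:

```python
from collections import deque

def d_separated(parents, x, y, z):
    """Check whether x and y are d-separated given the set z in a DAG.

    parents: dict mapping each variable to the list of its parents.
    """
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)

    # A = z plus all ancestors of z; a collider on an open route must be in A
    a, stack = set(z), list(z)
    while stack:
        for p in parents[stack.pop()]:
            if p not in a:
                a.add(p)
                stack.append(p)

    # Breadth-first search over states (node, direction of entry):
    # "in"  = entered through an arrow pointing at the node,
    # "out" = entered through an arrow pointing away from the node.
    start = [(c, "in") for c in children[x]] + [(p, "out") for p in parents[x]]
    seen, queue = set(start), deque(start)
    while queue:
        v, entered = queue.popleft()
        if v == y:
            return False  # found an open route: d-connected given z
        moves = []
        if entered == "in":
            if v not in z:                      # chain midpoint: -> v ->
                moves += [(c, "in") for c in children[v]]
            if v in a:                          # collider midpoint: -> v <-
                moves += [(p, "out") for p in parents[v]]
        else:
            if v not in z:                      # fork or chain: <- v -> / <- v <-
                moves += [(c, "in") for c in children[v]]
                moves += [(p, "out") for p in parents[v]]
        for m in moves:
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return True

# Example: collider X -> M <- Y, with D a descendant of the collider M
model = {"X": [], "Y": [], "M": ["X", "Y"], "D": ["M"]}
# d_separated(model, "X", "Y", set())  -> True  (the route is closed by the collider)
# d_separated(model, "X", "Y", {"D"})  -> False (conditioning on D opens the route)
```

The search visits each (variable, direction) state at most once, so the check runs in time linear in the size of the model.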
To illustrate these cases, examine the simple model presented in Figure 1.

In summary, d-separation is a graphical criterion that allows us to enumerate the unconditional and conditional vanishing covariances between variables in an SEM. All of these d-separation constraints imply certain (conditional) independencies, and thus each of these constraints provides a local test of the model. Before we describe how these independencies are tested on actual data, we now present the second graphical criterion for local fit evaluation.

Trek Separation
The zero conditional covariances implied by d-separation are only directly testable if the set Z on which we need to condition consists entirely of observed variables. In typical SEMs, where the only observed variables are the manifest indicators of latent variables, we will therefore not get any conditional covariance implications at all. However, models with latent variables imply additional constraints that cannot be identified using d-separation. A different graphical criterion, called "trek separation" or t-separation, can be used to apply local fit evaluation to linear models in such cases as well. One important limitation of t-separation, however, is that it does not apply to models that contain cycles. 4 This second type of local test was first considered by Spearman (1904) in his analysis of vanishing tetrad constraints. Vanishing tetrad constraints apply to foursomes of variables from the model, say X, Y, Z, and W. The tetrad τ_XYZW is defined as a difference between two products involving four covariances: τ_XYZW = σ_XY σ_ZW − σ_XZ σ_YW. A tetrad τ_XYZW is said to vanish if τ_XYZW = 0 holds in the population. Although in general the set of vanishing tetrads of a particular SEM depends on the model parameters, there are a
number of tetrads that must always vanish under any set of models that share the same graphical structure. Therefore, just like the d-separation constraints considered above, these vanishing tetrad constraints can be derived purely from the model structure, without any reference to data. The t-separation criterion is not commonly applied in the field; most researchers instead follow the recommendation by Bollen and Ting (2000) to determine vanishing tetrads empirically from the implied covariance matrix of a model instance with randomly chosen parameters (Johnson & Bodner, 2014). There is no inherent advantage of this simulation-based approach over a graphical criterion, and we believe that the widespread use of the simulation approach is partly because of the somewhat intricate form of the original t-separation argument. We provide here an alternative, yet equivalent, definition that we believe is more accessible.
First, we define a directed route as a route that consists only of forward-pointing directed arrows, for example, X → Y → Z but not X → Y ← Z. For two sets of variables I and J, we define C(I, J) as the set of those variables from which there is a directed route to at least one variable in I and a directed route to at least one variable in J. In a graphical model, C(I, J) is referred to as the set of "common ancestors" of I and J. The tetrad representation theorem (Spirtes et al., 2000) says that the tetrad τ_I1J1J2I2 vanishes if and only if one of the following two conditions is met:

1. There exists a variable M_I that lies on every directed route from C(I, J) to I = {I1, I2} (the "outer pair" of the tetrad variables).
2. There exists a variable M_J that lies on every directed route from C(I, J) to J = {J1, J2} (the "inner pair" of the tetrad variables).
Note that these two conditions are not mutually exclusive; a single variable M could fulfill them both, that is, M = M_I = M_J. The variable M_I or M_J in the previous conditions is sometimes called a "bottleneck" or "choke point." To see how this rule can be applied, we provide a small example in Figure 2. In this example, X, Y, and Z are indicators of a latent variable U1, and W is the single indicator of a latent variable U2. 5 We are interested in whether there is a vanishing tetrad constraint. Any tetrad will vanish if one of the conditions above holds; namely, if there is a variable M_I or M_J that serves as a bottleneck for all effects of the set of common ancestors. In this example, there are only four observed variables. For any two pairs I and J that we can form from these observed variables, the set C(I, J) consists of a single variable, U1, because U2 is an ancestor only of W, but not of any other variable. Now let us consider the tetrad τ_XYZW, where I = {X, W} and J = {Y, Z}. The directed routes from C(I, J) = {U1} to X and W are U1 → X and U1 → U2 → W. U1 lies on every directed route and is thus a bottleneck, satisfying the first tetrad condition noted previously, and thus the tetrad τ_XYZW will vanish. Of course, this implies that tetrads that are formed from the same difference of covariances (but with reversed order of variables) will also vanish; for example, the tetrad τ_YXWZ, or τ_XZYW. In fact, all three unique tetrads that can be formed in this model will vanish.
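The vanishing of all three tetrads in this example can be verified numerically from the model-implied covariances. In the following Python sketch, the loading and path values are arbitrary, and standardized latent variables with independent errors are assumed:

```python
# Model of Figure 2: U1 -> X, Y, Z and U1 -> U2 -> W
a, b, c = 0.9, 0.7, 0.6  # loadings of X, Y, Z on U1 (hypothetical values)
g = 0.5                  # structural path U1 -> U2
d = 0.8                  # loading of W on U2

# Model-implied covariances (standardized latents, independent errors)
s = {
    ("X", "Y"): a * b, ("X", "Z"): a * c, ("Y", "Z"): b * c,
    ("X", "W"): a * g * d, ("Y", "W"): b * g * d, ("Z", "W"): c * g * d,
}

def cov(p, q):
    return s.get((p, q), s.get((q, p)))

def tetrad(p, q, r, t):
    """tau_pqrt = cov(p,q) * cov(r,t) - cov(p,r) * cov(q,t)"""
    return cov(p, q) * cov(r, t) - cov(p, r) * cov(q, t)

# All three distinct tetrads among X, Y, Z, W vanish (up to floating-point error),
# as t-separation predicts: U1 is a bottleneck in each case.
t1 = tetrad("X", "Y", "Z", "W")
t2 = tetrad("X", "Z", "Y", "W")
t3 = tetrad("X", "W", "Y", "Z")
```

Because every covariance involving W carries the common factor g·d, the two products in each tetrad cancel exactly, regardless of the particular loading values chosen.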
Our general definition above implies all vanishing tetrads that can be read off a graphical model. In fact, this definition subsumes the types of vanishing tetrads from a previous typology for measurement models by Kenny (1979). In this typology, X, Y, Z, and W are assumed to be indicators, each connected to a single latent variable. Kenny (1979) lists the following conditions as implying a vanishing tetrad τ_XYZW. 3. Consistency of epistemic correlations: X, W, and Y are indicators of U1 and Z is an indicator of U2. Again, U1 and U2 are correlated. In this case, U1 can act as M_I in our definition (but note that U2 cannot act as M_J).
Thus, one can verify that t-separation implies all tetrads from the typology by Kenny (1979). The typology is very useful, because it allows one to classify the potentially large number of tetrads that are yielded by a single model. However, the typology is only well defined for measurement models in which indicators load on one single construct. As soon as this is violated, the distinction between the different types of tetrads becomes blurry, because tetrads may then belong to more than one category. In contrast, the t-separation criterion allows one to identify all vanishing tetrads implied by an arbitrarily complex model, including models in which indicators load on more than one latent variable.
We conclude this explanation with a short mathematical argument for why our graphical criterion causes the tetrads to vanish. This argument is a simplified version of the tetrad representation theorem (Spirtes et al., 2000). The tetrad τ_XYZW is in fact the determinant of the 2 × 2 matrix

[ σ_XY σ_XZ ]
[ σ_WY σ_WZ ]

whose rows correspond to the outer pair {X, W} and whose columns correspond to the inner pair {Y, Z}. 7 This determinant is zero if and only if there exists a constant a such that σ_XZ = a·σ_XY and σ_WZ = a·σ_WY.

5 We assume that the reliability of the measurement of W as an indicator of the latent U2 is known, and thus a constraint on the variance of the latent U2 can be used to allow for a single indicator item and still have a globally identified model. 6 In this case, the common ancestor of X, W, Y, and Z is the implicit latent variable represented by the bidirected arrow U1 ↔ U2. It is helpful to replace bidirected paths by explicit latent variables, as explained above, before evaluating the graphical criterion for vanishing tetrads. 7 Larger submatrices can be considered; those yield pentad and even higher order constraints. Sullivant, Talaska, and Draisma (2010) give a detailed account of such constraints and show that both t-separation and d-separation can be derived from their general definitions.
Assume that there exists a single variable M_I that serves as a bottleneck and therefore transmits all effects from C(I, J) to the outer pair of variables I = {X, W}. In a linear model, every trek that contributes to the covariance between a variable in the outer pair and a variable in the inner pair must then pass through M_I. Writing M for the bottleneck M_I, we can decompose σ_XY = c_X σ_MY, σ_XZ = c_X σ_MZ, σ_WY = c_W σ_MY, and σ_WZ = c_W σ_MZ, where c_X and c_W collect the directed-route contributions from M to X and to W, respectively. Substituting these expressions into the determinant yields τ_XYZW = c_X c_W (σ_MY σ_MZ − σ_MZ σ_MY) = 0.

Effect Sizes and Statistical Tests for Local Fit Evaluation
We now discuss how local tests would be performed once data are available. In the online supplemental information, we provide R code that performs the tests on simulated data without using any external packages.

Effect Size and Tests for d-Separation Constraints
Every d-separation statement of the form "X and Y are d-separated by Z" leads to a statistically testable constraint on the probability distributions that are compatible with the assumed model. In the simplest case, when the set Z is empty, d-separation implications take the form of unconditional independence. The most basic way to test unconditional independence between two variables is to compute their correlation coefficient and apply a significance test. If the two variables are statistically independent, their correlation coefficient is expected to be zero, and tests of it should (under repeated sampling) only yield significant results with a frequency equal to the Type I error rate of the test. 8 Things become more complicated when the set Z is not empty, which implies a conditional independence between X and Y. A general strategy for testing conditional independence is to regress both X and Y on Z, and then test for independence between the residuals of these regressions. If X and Y are indeed conditionally independent given Z, then these residuals should be statistically independent as well. Performing this analysis using linear regression leads to the partial correlation coefficient r_XY.Z, where the variables behind the period in the subscript are those being partialed out. This partial correlation coefficient is a natural effect size measure for d-separation constraints, since correlation coefficients are very familiar to applied researchers.
An important caveat is that conditional independence only implies zero partial correlation if the relationships between X and Z and between Y and Z are indeed linear. A nonzero correlation between regression residuals does not immediately mean that the tested variables are truly conditionally dependent. Instead, the regression may also have failed to capture the form of dependence between X or Y and Z, and therefore have generated incorrect residuals. Fortunately, the basic approach of examining residuals generalizes to many kinds of regression, which enables semiparametric conditional independence testing (Shipley, 2002). This means that instead of using residuals from linear regression models, we could estimate flexible semiparametric models that approximate the true relationships more closely than linear trends, and rely on residuals from these models instead. We shall illustrate this later in a worked example.
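This caveat can be illustrated with a small simulation (a Python sketch; the quadratic relationship is an artificial example, and in practice one would use a flexible smoother rather than a polynomial of known degree). Here the common cause acts nonlinearly, so linear residuals suggest dependence while more flexible residuals recover the implied conditional independence:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical fork with a nonlinear parent: Z**2 drives both X and Y,
# so X and Y are conditionally independent given Z.
Z = rng.normal(size=n)
X = Z**2 + rng.normal(size=n)
Y = Z**2 + rng.normal(size=n)

def resid(outcome, predictor, degree):
    """Residuals from a polynomial regression of outcome on predictor."""
    coefs = np.polyfit(predictor, outcome, degree)
    return outcome - np.polyval(coefs, predictor)

# A linear regression misses the quadratic trend, so the residuals of X and Y
# both still contain Z**2 and correlate strongly:
r_linear = np.corrcoef(resid(X, Z, 1), resid(Y, Z, 1))[0, 1]

# A more flexible (here: quadratic) regression recovers the implied independence:
r_flexible = np.corrcoef(resid(X, Z, 2), resid(Y, Z, 2))[0, 1]
```

The spurious linear-residual correlation is not evidence against the model; it reflects a misspecified regression, which is why flexible residualization matters for these tests.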
Significance testing in general, and in the case of testing conditional independence using correlation coefficients, is not without shortcomings. One possible concern is that rejection of the null hypothesis does not inform us about the amount of misspecification, especially in large samples. In addition to the p value, we may of course also inspect the confidence interval, which puts the focus more on the magnitude of the effect and the uncertainty of the estimate. By examining the midpoint of the interval (the point estimate), researchers can directly examine the magnitude of the correlation coefficient. Correlation coefficients are one possible measure of the size of the effect, and being on a metric that is readily interpretable by most researchers should facilitate judgment about the importance of a rejected significance test. As an example, in a large sample of several thousand participants, a correlation coefficient of .01 may indeed yield a significant result, but researchers may feel that any attempt to "repair" such a small violation would result in overfitting of the model to the data set at hand. A difficulty in this approach is the reliance on cut-off values, as there is some uncertainty about what cut-off should be chosen, and whether a cut-off can be used universally in all circumstances.
Besides judging the absolute value of the correlation coefficient, we may also conduct tests of close fit, where observed correlation coefficients are not tested against zero, but some other value that is chosen to be of sufficiently small magnitude. For example, one may test whether the observed correlation coefficient is significantly more extreme than ±.05 or some other reasonably small value. The resulting p value of this test then indicates whether an observed correlation coefficient deviates significantly from a minimally acceptable amount of misfit. This approach is widely used; for example, to construct a test of close fit based on the RMSEA fit index.
Significance tests of correlation coefficients against nonzero values are not routinely implemented in standard software. However, there are several ways to obtain such tests. The way that we propose to conduct these tests is to first apply Fisher's Z transformation to both the observed correlation coefficient and the minimum tolerable correlation that one wants to test against. Correlation coefficients that have been transformed using Fisher's Z transformation have an approximately normal sampling distribution with a standard error of 1/√(N − 3). 9 To test for deviations that could go in either direction, we scale the Fisher's Z-transformed value by √(N − 3), square it, and then perform a one-sided significance test using a noncentral χ² distribution with one degree of freedom, whose noncentrality parameter is the equally scaled and squared Fisher's Z-transformation of the nonzero value that one wants to test against. 8 Such tests come with all advantages and disadvantages of significance testing (Nickerson, 2000; Wagenmakers, 2007). Some practitioners might prefer to perform Bayesian statistics, for example, in the form of a Bayes factor, or a posterior distribution with a Bayesian credible interval (Kruschke, 2010). 9 When computing correlation coefficients from residuals of regression models, the degrees of freedom are further diminished by the size of the conditioning set.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
We provide R code for a demonstration of this type of test for the simple model in Figure 1, but for now we defer any further numerical examples to later sections. Here it suffices to say that the d-separation constraints and the implied (conditional) independencies can be statistically tested (either against zero or a meaningfully small value), and their effect size can be observed.

Effect Size and Tests for t-Separation Constraints
The parametric statistical tests of vanishing tetrads follow naturally from the definition of the tetrad as a difference between two products of covariances. There are several statistical tests that can be applied. Wishart (1928) proposed a test statistic formed by dividing the value of the tetrad by an SE derived from the covariance matrix of tetrads. Other test statistics, for example, that of Kenny (1974), are based on canonical correlations. The test statistic of Wishart's test converges to a normal distribution in large samples. In small samples, where this convergence is questionable, bootstrapping of the standard error is highly recommended.
In addition to the significance tests, we may again consider an effect size and its confidence intervals. Tetrads computed from a correlation matrix are bounded by −1 and 1. As such, they are on a familiar effect size metric and their magnitude can be easily assessed. Like in the case of the d-separation constraint, we may want to perform a test of close fit against some nonzero but minimally acceptable value of misfit for the tetrad. This test of close fit is very similar to the one presented previously. Here we first standardize both the observed tetrad and the minimally acceptable value by dividing both by the standard error of the tetrad, and upon squaring use a noncentral χ² distribution to derive p values.
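As an illustration only, here is a Python sketch of a bootstrap-based tetrad test of the kind recommended above for small samples (the function names and all simulation details are our own, not the article's code).

```python
import numpy as np
from scipy import stats

def tetrad(S, a, b, c, d):
    """Tetrad tau = S[a,b] * S[c,d] - S[a,c] * S[b,d] of covariance matrix S."""
    return S[a, b] * S[c, d] - S[a, c] * S[b, d]

def tetrad_test(data, a, b, c, d, reps=500, seed=0):
    """Wald-type test of a vanishing tetrad using a bootstrapped SE."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    t = tetrad(np.cov(data, rowvar=False), a, b, c, d)
    boots = [tetrad(np.cov(data[rng.integers(0, n, n)], rowvar=False), a, b, c, d)
             for _ in range(reps)]
    se = np.std(boots, ddof=1)
    return t, 2 * stats.norm.sf(abs(t / se))
```

For a one-factor model with loadings λ, the implied covariances are σ_ij = λ_i λ_j, so every tetrad vanishes exactly in the population; an extra shared component between two indicators breaks this.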
We provide R code to compute tetrads for the simple model in Figure 2. We will again defer actual numerical examples to later sections. For now, it again suffices to say that t-separation constraints and the implied vanishing tetrads can be easily parametrically tested (either against zero or a meaningfully small value), and their effect size can be observed.

Software
Whether a method becomes widely used and adopted by applied researchers depends in no small part on software. To our knowledge, none of the currently available software programs that estimate SEMs offers even an option for local fit evaluation. This includes the open-source software lavaan (Rosseel, 2012), which we have used in our example code, but also all of the currently available commercial SEM software. The Web-based DAGitty software (Textor et al., 2011) provides an interface to draw a graph and returns a list of implied conditional independencies, but does not itself compute the tests. The accompanying R package "dagitty" by Textor et al. (2016) fills this gap and provides functions in R that perform all relevant tasks. In particular, the software can read a graphical model in either DAGitty or lavaan notation. The software can find every possible d-separation and t-separation constraint, and can display those before any data have been collected. Once data are collected, the program can perform all conditional independence and vanishing tetrad tests. Both normal-theory and bootstrapped standard errors are supported. The program reports significance tests, along with confidence intervals, the magnitude of the correlation or tetrad as a simple effect size measure, and an optional test of close fit. For d-separation constraints only, the software can perform both a parametric and a semiparametric conditional independence test, the latter based on local polynomial regression (LOESS).

Illustrative Examples
We will now present some intentionally simplified examples to showcase the basic behavior of local tests and compare it with other, more traditional forms of fit evaluation. Note that these examples are for illustrative purposes only. We do not attempt a full comparison, for which large simulation studies, and not simple examples, would be needed.
The presented local tests rely purely on the assumed graphical structure, and therefore they can be derived before data has been collected. This means that the researcher has a chance to think about what assumptions he or she is making, and whether these assumptions seem plausible, given the current theoretical knowledge. For example, a researcher may realize that his or her assumptions imply that two variables in the model must be independent (d-separated), given another set of variables in the model. A priori this may or may not be plausible, and the local test encourages this kind of critical thinking.
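To make the d-separation criterion itself concrete, here is a compact Python sketch of the check for a single implication (our own code, using the standard ancestral moral graph criterion; a full package such as dagitty additionally enumerates all implications of a model):

```python
from itertools import combinations

def d_separated(parents, x, y, given=frozenset()):
    """True if x and y are d-separated given `given` in the DAG described
    by `parents` (a dict mapping each node to its set of parents).

    Criterion: restrict the graph to ancestors of {x, y} and the
    conditioning set, moralize it (marry co-parents, drop directions),
    delete the conditioning nodes, and test whether x and y are still
    connected."""
    # ancestors of the relevant variables (including themselves)
    anc, stack = set(), list({x, y} | set(given))
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # moralized, undirected adjacency on the ancestral subgraph
    adj = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in parents.get(v, ()) if p in anc]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):  # marry co-parents
            adj[p].add(q)
            adj[q].add(p)
    # remove the conditioning nodes and search for a path from x to y
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False  # still connected, hence d-connected
        if v not in seen:
            seen.add(v)
            stack.extend(w for w in adj.get(v, ()) if w not in given)
    return True
```

For a chain X → M → Y, the function reports d-connection marginally and d-separation given M; for a collider X → C ← Y, the pattern reverses.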
Local tests can be performed immediately after data collection, even before the model itself has been fitted (as the local tests do not require global identification). This contrasts with the χ² test, and any procedure or fit index based on it, which requires a fitted model. Any local test that fails to be refuted during this stage of testing provides some evidential support for the model, and every local test that is rejected weakens support for the model, inviting the researcher to think about what part of the model is likely incorrect. The fact that the tests can be conducted without having to worry that a model may have trouble converging to a maximum likelihood solution is a potential advantage of local tests.
Local tests can also provide information about which part of a large SEM conflicts with the observed data. This is not to say that local tests will always be able to pinpoint the exact misspecification, but unlike a purely global test they can at least sometimes succeed in doing so. In this regard, local tests are similar, but not identical, to modification indices or standardized residuals. In cases in which it is known that the misspecification is due to a missing arrow, both modification indices and local tests can inform us which specific arrow needs to be added. But in other cases, in which the misspecification is, for example, due to a missing variable in the model, or several misoriented arrows, modification indices could be misleading. Local tests, on the other hand, do not immediately suggest certain arrows to be added, but inform the researcher which implications of his or her assumptions are violated. They therefore encourage researchers to think critically about these assumptions, why they could be violated, and how this violation could be remedied. This may result in the inclusion of another arrow, but it can also result in different changes to the model, for example, the inclusion of a latent variable. Finally, a difference between local tests and modification indices is that the latter always require a fitted model, whereas local tests can be computed before model fitting.
Through a small set of worked examples, we now demonstrate the behavior of both types of local tests discussed in this article.
We provide R code for all examples in the online supplemental material. This code includes the data-generating models, all standard output of the SEM software lavaan (Rosseel, 2012), and the local tests provided by the R package dagitty (Textor et al., 2016).

Identifying Misfit Location in Path Models
To demonstrate how local tests operate, we first use simple path models consisting only of manifest variables that are all assumed to be perfectly measured. Such models yield tests based on d-separation constraints. Assume that the true data-generating model looks like Figure 3a. In this example model, variable X has an indirect effect on Y that is mediated by the observed variables M2 and M3 (whose error terms are correlated with each other), and the unobserved variable U1. We only observe a proxy of U1, namely M1, which in this model is not a mediator, but simply caused by the unobserved U1. An applied researcher, however, proposes the model in Figure 3b that is identical to the true model, except that the error terms of M2 and M3 are uncorrelated with each other, and that U1 is not in the model, and in its place is the variable M1. This model (contrary to the truth) assumes that the effect of X on Y is fully mediated by M1, M2, and M3. It also incorrectly assumes the absence of any common cause of M2 and M3 other than X. The model of the researcher has certain implications, encoded in the following d-separation constraints. First, X is d-separated from Y given M1, M2, and M3. Second, every pair of the variables M1, M2, and M3 is d-separated given X. Together there are thus a total of four d-separation constraints.
We can now compare these constraints with the ones in the true model. The first constraint of the assumed model is that X and Y are d-separated from each other, given the three variables M1, M2, and M3. This does not hold in the true model (because the route that traverses U1 is still open). Likewise, the constraint that M2 and M3 are d-separated given X is incorrect, because in the true model these two variables are connected by a bidirected arrow. On the other hand, the two remaining constraints, that M1 is d-separated from both M2 and M3 given X, are correct in both the assumed and the true model. Hence, we would expect that two of the local tests will be violated, and that the remaining two tests will not be.
Based on the true model, we simulated a single dataset with 1,000 datapoints. We used standardized variables, and set all path coefficients to .4. We fitted the incorrect model, and observed simple measures of global fit and tests of local fit. The global χ² test of the model soundly rejected it, χ²(4) = 311.2, p < .001.
Other global fit indices also suggested misfit, CFI = .821, RMSEA = .277, and standardized root mean square residual (SRMR) = .129. There are a variety of other fit indices that could have been computed, but for demonstration purposes we only report this subset. Based on the global tests alone, we cannot determine whether the assumption of full mediation, or the assumption of residually uncorrelated mediators (both expressed in the d-separation constraints), is more likely violated.
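The RMSEA, incidentally, is a closed-form function of the χ² statistic, its degrees of freedom, and the sample size, so it is easy to reproduce by hand (a minimal sketch; programs differ in dividing by N or N − 1, and both conventions round to .277 in this example):

```python
import math

def rmsea(chisq, df, n):
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))
```

For instance, rmsea(311.2, 4, 1000) recovers the value reported above.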
The local tests yielded the following results. The constraint between variables M1 and M2 was not violated, r_M1M2.X = .02, p = .55, thus indicating that this particular restriction of the model is in agreement with the data. The small effect size (r = .02), and the nonsignificant test of close fit (p = .85 against a minimally acceptable value of .05), further strengthen this belief. Likewise, the constraint between variables M1 and M3 was not violated, r_M1M3.X = .02, p = .64, again with a very small effect size and a test of close fit with a high p value of .88. On the other hand, the constraint between variables M2 and M3 showed a strong violation, r_M2M3.X = .46, p < .001, indicating to the researcher that the model may be misspecified in a way that makes this conditional independence incorrect. The relatively large effect of r = .46, and a test of close fit with a p value much smaller than .001, corroborate this view. Lastly, the constraint between variables X and Y also showed a strong violation, r_XY.M1M2M3 = .33, p < .001 (test of close fit, p < .001), and points the researcher to another part of the model that exhibits misfit, namely that the three variables M1, M2, and M3 do not fully mediate the effect between X and Y.
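The mechanics behind such numbers are plain partial correlations of regression residuals. The following Python sketch mimics the setup (the structure follows Figure 3a, but the coefficient values, the shared-error construction via C, and all names are our own illustration, not the article's code):

```python
import numpy as np

def partial_corr(data, i, j, cond):
    """Correlate the OLS residuals of columns i and j after regressing
    each on the columns listed in `cond`."""
    def resid(k):
        y = data[:, k]
        X = np.column_stack([np.ones(len(y)), data[:, cond]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta
    return np.corrcoef(resid(i), resid(j))[0, 1]

rng = np.random.default_rng(1)
n = 20_000
X = rng.standard_normal(n)
U1 = 0.4 * X + rng.standard_normal(n)            # unobserved mediator
M1 = 0.4 * U1 + rng.standard_normal(n)           # observed proxy of U1
C = rng.standard_normal(n)                       # shared cause of M2/M3 errors
M2 = 0.4 * X + 0.5 * C + rng.standard_normal(n)
M3 = 0.4 * X + 0.5 * C + rng.standard_normal(n)
Y = 0.4 * U1 + 0.4 * M2 + 0.4 * M3 + rng.standard_normal(n)
d = np.column_stack([X, M1, M2, M3, Y])

# The researcher's model implies M1 _||_ M2 | X (true here) and
# M2 _||_ M3 | X (false here, due to the shared cause C).
r_ok = partial_corr(d, 1, 2, [0])   # M1 with M2, given X
r_bad = partial_corr(d, 2, 3, [0])  # M2 with M3, given X
print(round(r_ok, 3), round(r_bad, 3))
```

Here r_ok estimates the satisfied constraint, r_bad the violated one; the violated X-Y constraint can be checked the same way with the three Ms in the conditioning set.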
It is important that these local tests do not directly suggest any particular modification (e.g., adding a directed arrow), but point to the specific implications of conditional independence that are violated. This is a general feature of local tests that are based on d-separation. Instead of suggesting direct fixes to a model, they confront the researcher with the implications of the structural assumptions that were made in the model, and whether these are refuted by data. However, knowing the graphical rules that these tests are based on does provide some immediate clues as to which modifications could, and which could not, remedy the misfit. Specifically, if a d-separation implication is falsified, then possible reasons include that (a) a direct path between the separated variables is missing; (b) an element of the conditioning set is not perfectly measured; or (c) an element of the conditioning set is truly a collider on some path between the separated variables. We can immediately see, for instance, that possible changes that could repair the failed test r_XY.M1M2M3 = 0 include (a) adding a path from X to Y; (b) introducing a measurement model for one of the Ms; or (c) reversing an arrow from Y to an M. We also realize that actions that will not repair the implication include (a) adding any arrow between the Ms; (b) reversing an arrow from X to one of the Ms; or (c) introducing a measurement model for X and Y.
At this point, it is informative to compare and contrast the results of the local tests with those of the more commonly used modification indices and the inspection of the standardized residuals. Using the exact same data and model as earlier, we may also request modification indices. This model yields a total of 21 modification indices. The three largest indices all have a value of 226.3. They suggest either the addition of a bidirected arrow between M2 and M3, or the addition of a directed arrow either from M2 to M3 or vice versa. What these three modification indices share is that when their suggested change is implemented, the resulting model no longer violates the local test that suggested conditional independence between M2 and M3. In fact, if one were to compute a p value for this modification index (which is a one-degree-of-freedom χ² test), it would be numerically very similar to the corresponding local test. In this situation, the local test and the modification index are virtually identical.¹⁰ This will always be the case when a single, unique local test can be relaxed through the addition of a path or bidirected arrow. Likewise, the matrix of standardized residuals showed that the largest residual was between M2 and M3. That this fortunate behavior of the modification index is not guaranteed can be seen from the second local test. After modifying the model by adding an arrow between M2 and M3, 18 additional modification indices can be identified. Four of these indices have the same value of 52.9. One of these modification indices suggests connecting X and Y with a reciprocal directed arrow. The matrix of standardized residuals also suggests that the largest residual is contained in the covariance of X and Y, with all other residuals being quite small. The remaining three modification indices suggest connecting Y with one of the three mediators M1, M2, or M3 with a bidirected arrow. What these modification indices have in common is that, when implemented, the resulting model would no longer violate the other local test. However, what they also have in common is that they all suggest an incorrect modification of the model. In other words, none of the resulting models (even though better fitting) aligns with the true model.
This small example demonstrated that global tests can inform the applied researcher that a model does not fit the data. Local tests, on the other hand, inform the applied researcher which of the model's implied conditional independencies are violated. This is not to say that local tests can always identify a correct model, but they do identify local sources of misfit.

Identifying Misfit Location in Latent Variable Models
A second example involves latent variables, and hence the use of tetrad tests. Consider the data-generating model in Figure 4, in which a latent variable LX causes another latent variable LY. Both latent variables have three indicators each. However, some of the manifest indicators are correlated with each other. In particular, variable X1 is correlated with X3, indicating that the latent construct LX does not fully capture all relationships between the indicators. Also, X1 is correlated with Y1, indicating the potential presence of some shared method variance. Now suppose that a researcher fits a model that is identical to the true model but does not include the correlations among indicator variables. This model has additional constraints on various tetrads that are not present in the true model. We generated a single dataset with 1,000 datapoints from the true model, using again completely standardized variables. All path coefficients from latent variables to the indicators and the coefficient between the latents were set to .7. The correlations between individual items X1 and X3, and X1 and Y1, were set to .25. Fitting this model yielded a large χ² statistic, χ²(8) = 263, p < .001. Likewise, CFI (.892) and RMSEA (.179) suggested rejection. The SRMR suggested adequate fit (.049). In fact, even fitting an unrestricted model (the first step of the four-step testing procedure suggested by Mulaik and Millsap, 2000, in which every item is allowed to load on either of the two factors, essentially an exploratory factor analysis) yielded bad fit, χ²(4) = 150, p < .001. In addition to these global tests, we can also easily compute the implied vanishing tetrads of this model. This model yields a total of 27 vanishing tetrads.
Unlike in the case of d-separation constraints, tetrad tests are a bit more complicated to evaluate. One reason is that tetrad tests tend to be much more numerous; another is that they do not map onto conditional independencies between observed variables. The large number of tetrad tests means that adjustment for multiple testing is usually recommended. When we adjusted the p values of the 27 tetrad tests in our example model with the Bonferroni-Holm method, we obtained 12 significantly violated tetrads (at α = .05). For example, two of the most strongly violated tetrads were τ(X1, X2, Y1, X3) = −.13, p = 2.4 × 10^−17, and τ(X1, Y1, Y2, X3) = .10, p = 7.6 × 10^−15.
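The multiple-testing step can be sketched in Python as follows (the two-factor structure follows Figure 4, but the coefficients, the shared component B, the bootstrap settings, and the hand-rolled Holm adjustment are our own illustration):

```python
import numpy as np
from scipy import stats

def tetrad(S, a, b, c, d):
    return S[a, b] * S[c, d] - S[a, c] * S[b, d]

def boot_pvalue(data, quad, reps=500, rng=None):
    """Normal-theory p value for one tetrad using a bootstrapped SE."""
    rng = rng or np.random.default_rng(0)
    n = data.shape[0]
    t = tetrad(np.cov(data, rowvar=False), *quad)
    boots = [tetrad(np.cov(data[rng.integers(0, n, n)], rowvar=False), *quad)
             for _ in range(reps)]
    return t, 2 * stats.norm.sf(abs(t / np.std(boots, ddof=1)))

def holm(pvals):
    """Bonferroni-Holm step-down adjustment of a vector of p values."""
    m = len(pvals)
    order = np.argsort(pvals)
    adj, running = np.empty(m), 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])
        adj[idx] = min(running, 1.0)
    return adj

rng = np.random.default_rng(2)
n = 10_000
LX = rng.standard_normal(n)
LY = 0.7 * LX + rng.standard_normal(n)
B = rng.standard_normal(n)                       # shared X1/Y1 component
X1 = 0.7 * LX + 0.5 * B + rng.standard_normal(n)
X2 = 0.7 * LX + rng.standard_normal(n)
X3 = 0.7 * LX + rng.standard_normal(n)
Y1 = 0.7 * LY + 0.5 * B + rng.standard_normal(n)
Y2 = 0.7 * LY + rng.standard_normal(n)
Y3 = 0.7 * LY + rng.standard_normal(n)
d = np.column_stack([X1, X2, X3, Y1, Y2, Y3])

# tau(X2, Y2, Y3, X3) vanishes in the true model; tau(X1, Y1, Y2, X2) does not.
quads = [(1, 4, 5, 2), (0, 3, 4, 1)]
results = [boot_pvalue(d, q, rng=rng) for q in quads]
adj = holm([p for _, p in results])
```

The tetrad involving X1 and Y1 should remain flagged after the Holm adjustment, unlike the tetrad among the clean indicators.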
Even though their absolute effect size was somewhat modest, tests of close fit (against an arbitrary value of .05) also indicated that the observed tetrads were significantly larger in magnitude than this threshold, with p values far below .001, even after adjustment for Type I error inflation.
Both violated tetrad constraints postulate that the set I = {X1, X3} can be t-separated from another set J that includes Y1 (in the first tetrad, J = {X2, Y1}; in the second tetrad, J = {Y1, Y2}). This would indeed suggest adding the missing covariance between X1 and Y1. That same conclusion would also be reached when examining any set of most strongly violated tetrads. However, instead of adding residual covariances, another course of action is also suggested by the local tests: All of the significantly violated tetrads involve the indicator X1. Therefore, it would also appear reasonable to drop this indicator from the model altogether. Indeed, the result would also be a well-fitting model in this case.
In contrast to these tetrad tests, we may examine standardized residuals, and the 21 possible modification indices. The largest standardized residuals were concentrated on the covariances between X1 and Y1, X1 and Y3, and Y1 and X3. The modification index with the largest expected difference in the χ² statistic is the covariance between X1 and Y1, 232.91. It may be plausible to drop a variable such as X1 from the model based on this set of modification indices, but it may also be plausible to add a bidirected arrow. Reexamining the resulting modification indices yields two suggested modifications, either adding a covariance between X1 and X3, or a directed arrow from LY to X2, both with an expected change in the χ² statistic of 29.31. Therefore, one of the modification indices correctly identifies the second missing covariance, whereas the other one does not.

¹⁰ In the local test, the test statistic is simply the ratio between the estimate and its standard error. If we square this quantity, we get a test statistic that is very close, but not identical, to the corresponding modification index. The small difference arises because the modification index is in fact a score test, whereas in the regression model we use a Wald test.

Local Tests With Nonconverging Models
In the next example, we want to demonstrate how local tests can be used when a global model does not converge (and thus modification indices cannot be computed). Consider the model in Figure 5a.
For this model, we chose specific path coefficients that lead to convergence problems: The variable I in this model effectively acts as an instrumental variable to render the part involving X and M identified. However, the effect of J on X counters the effect of I on X because I and J are negatively correlated. The assumed model in Figure 5b does not include the effect of J on X. Therefore, the variable I will appear to be very weakly correlated with X and the model is therefore locally not identified.
Based on the true model, we simulated a single dataset with 1,000 datapoints, and set standardized path coefficients to the values shown in Figure 5a. We tried to fit the incorrect model, but as expected it would not converge. Without convergence, it was also impossible to observe any global fit indices, let alone any of the modification indices.
The local tests, on the other hand, can be performed without any problems. The assumed model implies five conditional independencies. One of these was very strongly violated. This implication states that J and X are independent given I (r_JX.I = .19, p < 10^−22), thus casting doubt on this particular independence. A test of close fit (against the threshold .05) also resulted in a very small p value, 1.45 × 10^−17. As we saw earlier, every violation of a local test can be remedied by adding an additional arrow (although without guarantees that this is the correct fix). A researcher faced with these local tests should question the violated assumptions, and think about ways in which they could have been violated. However, in this case, adding the direct arrow J → X is in fact the correct course of action, and the model that includes this additional arrow (that is, the true model) converges without problems, and shows no violations of the two remaining local tests.
We now give a second example of a nonconverging model, this time involving the use of latent variables and tetrad tests. At the same time, we use this example to illustrate the interplay between local tests and constraints on the model parameters. Consider the model in Figure 6a. A single latent variable U affects all observed variables X1 to X4. Both X1 and X2, and likewise X3 and X4, share an additional covariance. We chose the path coefficients as shown in Figure 6a. Note that one of the items has a very weak loading. To be able to estimate this particular model from data, some constraints need to be imposed, because otherwise the model is not identified. One possible constraint is shown in the assumed model in Figure 6b, in which the factor loading of X1 and the loading of X3 are forced to be identical, indicated by the shared letter c in the figure. Based on the true model, we simulated a single dataset with 100 datapoints, a smaller sample size deliberately chosen to force nonconvergence. The standardized path coefficients were set to the values shown in Figure 6a. Then, we attempted to fit the incorrect model (including the equality constraint, because a model without constraints cannot be estimated). This model did not converge.
In a next step, we computed the local tetrad tests. In this example, there is only one single tetrad test, τ(X1, X3, X4, X2) = .04, p = .11, indicating no violation. The small magnitude of the tetrad, and a test of close fit (against a value of .05) that yields a p value of .49, confirm this further. The local test only tests the implications of the structure of the model, and thus the equality constraint (which is necessary to estimate the model) does not influence the local test. The fact that the tetrad test does not show a violation bolsters faith in the actual structure of the model, and suggests that the equality constraint is the likely culprit responsible for the nonconvergence. Changing the equality constraint to another pair of loadings, for example, X1 and X4, yields a converged model with decent, although not perfect, fit, χ²(1) = 2.826, p = .093, CFI = .991, RMSEA = .135, and SRMR = .019. In this example, modification indices would not have helped to detect the problem even if the model had converged, as modification indices are based on the entire model, including any imposed constraints. Local tests only test the structure without regard to the imposed equality constraints, which correctly informs the researcher about the source of the problem.

Nonparametric Testing of d-Separation Constraints
Last, we give a small example to illustrate how local fit evaluation can be used to disentangle structural and distributional aspects when assessing model fit. Note that this is only possible for d-separation but not tetrad tests, since the latter only work with linear SEMs. To give the simplest possible example, we generated data from the three-variable mediation model shown in Figure 7a. Here, we imposed quadratic rather than linear dependencies between variables, as can be seen in Figure 7b and 7c, though we used additive Gaussian noise as in a linear SEM. We simulated a dataset with 1,000 datapoints in this manner. The mediation model fits poorly to this dataset, χ²(1) = 222.28, p < .001. RMSEA also indicates poor fit (.47), though CFI and SRMR do not (.957 and .016, respectively). The standard local test, which is based on linear partial correlation, also indicates a violation (r_XZ.Y = −.32, p < .001). This is because linear regression fails to capture the true shape of the functional relations between X, Y, and Z, which makes the residuals appear correlated, as can be seen in Figure 7d. However, when we instead use local polynomial regression to estimate the functional relations, the residuals are no longer correlated (r_XZ.Y = −.02, 95% CI [−.09, .04]; Figure 7e).
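A Python sketch of the same idea follows; the quadratic data-generating functions, coefficients, and noise levels here are our own choices for illustration, and the smoother is statsmodels' lowess rather than the article's exact procedure.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
n = 20_000
X = rng.standard_normal(n)
Y = 0.5 * X + 0.5 * X**2 + 0.2 * rng.standard_normal(n)  # quadratic mediator
Z = 0.5 * Y + 0.5 * Y**2 + 0.2 * rng.standard_normal(n)

def linear_resid(y, x):
    """Residuals from a straight-line fit of y on x."""
    return y - np.polyval(np.polyfit(x, y, 1), x)

def loess_resid(y, x):
    """Residuals from a local polynomial (LOESS) fit of y on x."""
    fit = lowess(y, x, frac=0.3, delta=0.01 * (x.max() - x.min()),
                 return_sorted=False)
    return y - fit

# The linear residual test wrongly signals X and Z dependent given Y ...
r_lin = np.corrcoef(linear_resid(X, Y), linear_resid(Z, Y))[0, 1]
# ... while the LOESS residuals are approximately uncorrelated.
r_loess = np.corrcoef(loess_resid(X, Y), loess_resid(Z, Y))[0, 1]
print(round(r_lin, 3), round(r_loess, 3))
```

The contrast between r_lin and r_loess reproduces the qualitative pattern described above: the implied independence X ⊥ Z | Y holds, but only the semiparametric residuals reveal it.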
In the face of such results (a failing parametric test whose semiparametric version passes), a researcher might conclude that the source of misfit is not the model structure, but rather the distribution of the data. In contrast, modification indices for our model suggest either adding a direct path or covariance between Z and Y, or a direct path from X to Z. All these modifications would lead to a saturated model with zero degrees of freedom, which of course can fit every covariance matrix. However, such modifications would obscure the true source of misfit, and lead to an incorrect model structure.
In summary, we tried to show through a series of simple examples how local tests can be used, and what kind of information they yield. We also contrasted them with the more commonly used modification indices. What we observe is that local tests first and foremost test the structure and the implied independencies of a model. In some instances, the local tests will align with the modification indices, in other instances they will be quite different. Local tests do not generally suggest additions of particular arrows in the model, but directly test the assumptions of the model. This information can then be used to revise a model by thinking about ways that a violated assumption could emerge. As we have demonstrated, local tests can be used even in cases in which global identification of a chosen model is not possible.

Practical Implications
Our examples so far served the purpose of illustrating the behavior of local tests, and were not meant to represent realistic SEM analyses. In particular, path models without any latent variables are rare (certainly in psychology). As discussed above, researchers who test full latent variable SEMs will most often not rely on a single global test to evaluate their model, but will more often use some variation of the two-step procedure (Anderson & Gerbing, 1988) to evaluate the measurement and structural portions of their models separately. In this final section, we will explore how the local fit evaluation ideas presented in the previous sections can be incorporated into such a two-step test of a full latent variable SEM, which could be considered a more realistic representation of practice than the examples given earlier.

d-Separation Testing of Latent Variable Models
Our examples so far suggest that tetrad testing is the only option available for latent variable SEMs. This approach has, however, some disadvantages: first, failing tetrad constraints are more difficult to interpret than failing conditional independencies, and second, complex SEMs can imply hundreds to thousands of vanishing tetrad constraints. Here, we describe a strategy that can be used to apply d-separation-based local fit evaluation to latent variable SEMs.
The key idea of our strategy is that, for each d-separation implication I of a path model M, it is possible to create another path model M_I that has I as its one and only implication. Sometimes this can be done by adding paths to a model, but in general this will not be possible. As a counterexample, consider the path model X → Y → Z ← W. This model implies that r_YW = 0 and r_XW = 0. To remove the second implication, we would need to add a path between X and W. However, every path that we add would change the first implication from r_YW = 0 to r_YW.X = 0.
Instead, the following algorithm always works. Let r_XY.Z = 0 be the desired implication. Then (1) create a saturated model by linking all variables with bidirected paths, (2) remove the path X ↔ Y, and (3) for every Z in the conditioning set Z, replace the bidirected paths Z ↔ X and Z ↔ Y by directed paths Z → X and Z → Y. It is easy to verify that the resulting model (1) implies r_XY.Z = 0; (2) does not imply r_XY.Z′ = 0 for any proper subset Z′ ⊂ Z or superset Z′ ⊃ Z; and (3) no variables except X and Y can be d-separated. If we apply this strategy to our mediation model from Figure 3, we get the four models shown in Figure 8.
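Under one reading of this construction (the orientation of the conditioning variables into X and Y, so that no conditioning node is a collider, is our interpretation of the middle step), the edge list of a single-implication model can be generated mechanically:

```python
from itertools import combinations

def single_implication_model(variables, x, y, cond):
    """Edge lists of a path model whose only testable implication is
    r_{xy.cond} = 0 (construction as interpreted above; a sketch).

    Returns (bidirected, directed): bidirected edges as frozensets,
    directed edges as (source, target) tuples."""
    bidirected = {frozenset(p) for p in combinations(variables, 2)}
    bidirected.discard(frozenset((x, y)))      # the single separation
    directed = set()
    for z in cond:                             # conditioning nodes must
        for v in (x, y):                       # not be colliders
            bidirected.discard(frozenset((z, v)))
            directed.add((z, v))
    return bidirected, directed
```

For the five variables of the mediation example, single_implication_model(vars, 'X', 'Y', ['M1', 'M2', 'M3']) yields the edge list of one of the four models of Figure 8.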
We can use such single-implication models to evaluate each d-separation constraint I entailed by the structural part of a latent variable SEM M separately by replacing the structural model with a single-implication model for I. As we shall explain below, this naturally leads to local versions of several known fit indices for latent variable SEMs.

Extending a Two-Step SEM Analysis by Local Fit Evaluation
In many latent variable SEMs, the measurement model contributes the vast majority of degrees of freedom. Since the global χ² is an additive statistic, good fit of the measurement model can obscure poor fit of the structural model. In the two-step procedure (Anderson & Gerbing, 1988), one establishes the validity of the measurement model separately before moving on to test the path model, by first testing a version of the model in which the structural part is saturated. Having established the measurement model, we can then assess the fit of the structural model using the χ² difference statistic. Comparative indices such as C9 and C10 (Lance et al., 2016) relate the fit F_M of the target model M to the fit F_SN of a structural null model, where SN stands for the structural null model and F denotes any fit measure that is monotone with respect to nestedness and increases with poorer fit. Such indices are always bounded between 0 and 1. We obtain a local version of any such index simply by replacing F_M with F_{M_I}.

To illustrate these ideas, we again used our partial mediation model from Figure 3. We now treat this model as a full latent variable SEM, in which each variable is measured by three indicators, and set the loading of each indicator to 0.8. The path coefficients in the structural part of the model were all kept at 0.4, and we generated a sample of 500 datapoints. We then fitted each of the single-implication models (shown in Figure 8), and compared their fit to the structurally saturated model and the original model. For each single-implication model fit, we computed the RMSEA_I as well as the C9 and C10 indices using F = χ²/df. The results of this analysis are shown in Table 2. First, we observe that while the overall model appears to fit well according to its RMSEA, the structural part actually fits poorly, as can be seen from the RMSEA-P of 0.17. Inspecting each implication points to the same two violated constraints that we identified in our path model analysis. All fit indices agree that the lacking conditional covariance between M2 and M3 is a more severe problem than the omission of the relevant variable U1.
Interestingly, using the cutoff value of .99 for C9 or the cutoff value of .01 for C10, as Lance et al. (2016) suggested, properly separates the wrong from the correct conditional independence implications in this example.
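The quantities used in this example can be computed directly from the χ² statistics of the fitted models. The following sketch illustrates the arithmetic, assuming the standard RMSEA formula √(max(χ² − df, 0)/(df·(N − 1))) and one common way of obtaining an RMSEA for the structural part only, namely applying that formula to the χ² difference statistic. The numerical values are hypothetical stand-ins for what an SEM program would report, not the values from Table 2.

```python
import numpy as np
from scipy.stats import chi2


def rmsea(chisq, df, n):
    """Standard RMSEA: sqrt(max(chisq - df, 0) / (df * (n - 1)))."""
    return np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))


def chisq_difference(chisq_full, df_full, chisq_sat, df_sat):
    """Chi-square difference test between two nested models; the
    structurally saturated model has fewer df than the full model."""
    d_chisq = chisq_full - chisq_sat
    d_df = df_full - df_sat
    return d_chisq, d_df, chi2.sf(d_chisq, d_df)


n = 500
# hypothetical fit statistics; in practice taken from SEM output
chisq_full, df_full = 95.0, 60   # full latent variable model
chisq_sat, df_sat = 50.0, 55     # structurally saturated model

d_chisq, d_df, p = chisq_difference(chisq_full, df_full, chisq_sat, df_sat)

# RMSEA of the structural part: apply the RMSEA formula to the
# difference statistic rather than to the global chi-square
rmsea_p = rmsea(d_chisq, d_df, n)
```

The same arithmetic yields the local versions: for a single-implication model, its own χ² and df are substituted into these formulas in place of the full model's.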
In summary, we have made suggestions for how local fit evaluation can be incorporated into state-of-the-art SEM analyses.
The key idea behind our suggestions is that implications can be tested individually by constructing specific single-implication models and comparing their fits to alternative models. This leads to natural local equivalents of a wide variety of fit indices, including but not limited to all indices in the recent taxonomy by Lance et al. (2016).
Figure 8. Generating single-implication path models. These four path models each imply a single one of the four vanishing partial covariances implied by the model in Figure 3. The unique partial covariance that is constrained to 0 is shown below each model.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Discussion
In this article, we explained the underlying logic of local fit evaluation, and showed how these tests can be derived using simple graphical criteria. There are three aspects of local tests that are potentially useful: We can enumerate them directly after designing our model, and directly judge whether current theoretical knowledge supports the implied constraint (though this is likely only feasible for d-separation constraints). We can perform the tests before fitting a model to data, which helps to test models that do not converge, and we can apply them after a model fit indicated a significant misfit to potentially pinpoint where the exact problem is. Manifest and latent variables yield different types of local tests, but both share the fact that they can be enumerated before data has been collected, and they can be parametrically tested before or after a model has been fitted.
Failure to converge can occur in practice for reasons including identification problems (global or local), numerical issues with optimization algorithms, small sample sizes, and specification errors. Faced with such problems, researchers may be tempted to achieve convergence by adjusting the model, because the SEM toolbox currently offers few other options. Local tests can help in such situations to find problems with the structural part of the initial model. Local tests work in such cases because they rest on very basic methodology: they are simple statistical tests of population parameters that can always be computed and, unlike maximum likelihood methods such as the χ² test, do not require any numerical optimization algorithm. Thus, a major advantage of local tests is that they are unaffected by convergence issues.
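To make this concrete, a vanishing-partial-correlation implication can be tested with nothing more than the sample covariance matrix and the Fisher z-transform; no iterative fitting is involved. The sketch below is ours, not taken from the article: the variable names and coefficients mimic a simple chain X → M → Y, which implies the constraint X ⊥ Y | M.

```python
import numpy as np
from scipy import stats


def partial_corr(data, x, y, given):
    """Partial correlation of columns x and y given the columns in
    `given`, computed from the inverse covariance (precision) matrix."""
    idx = [x, y] + list(given)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])


def fisher_z_test(data, x, y, given):
    """Two-sided test of H0: partial correlation = 0 via Fisher's
    z-transform. A closed-form computation, no optimization needed."""
    n, k = data.shape[0], len(given)
    r = partial_corr(data, x, y, given)
    z = np.sqrt(n - k - 3) * np.arctanh(r)
    return r, 2 * stats.norm.sf(abs(z))


rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=n)
M = 0.4 * X + rng.normal(size=n)   # X -> M
Y = 0.4 * M + rng.normal(size=n)   # M -> Y
d = np.column_stack([X, M, Y])

# test the implied constraint X _||_ Y | M
r, p = fisher_z_test(d, 0, 2, [1])
```

Because the data were generated from the chain model, the partial correlation should be close to zero and the test should typically not reject; this computation succeeds regardless of whether a full SEM fit would converge.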
Most researchers agree that model misfit, as indicated by a significant χ² or other fit measures, is a serious problem that should be investigated carefully. Faced with a nonfitting model, researchers could report various fit indices to argue that the problem is of minor importance, inspect whether measurement portions or structural portions of the model are responsible for the misfit, or in an extreme case simply discard the model outright. None of these options is completely satisfying. Even if adjunct fit indices indicate reasonable fit, this still does not help to understand which parts of the model caused the misfit. On the other hand, a model can still be useful or largely correct despite a minor misfit, and outright rejection of the whole model does not always appear warranted. For models that fail the χ² test but produce reasonable fit indices, one should nevertheless investigate and report the results of local tests to help the reader understand what the exact reasons for the misfit are, or in other words, which of the predictions of the model are least consistent with the data. Paired with examination of the absolute magnitude of the violations and with tests of close fit, the severity of the violations can be assessed and communicated.
Modification indices are related to local tests: Every locally testable implication of a model is derived from missing paths in the model, and thus a failing test can always be repaired by adding a path to the model (although this may influence the other tests and create new failed tests). Thus, a local test (of a converging model) indicates significant misfit if and only if one (or several) modification indices show significant improvement of fit. Why, then, should we apply local tests instead of familiar modification indices to converging models? As our examples show, modification indices can be misleading if the model fails to fit for a reason other than a missing or wrong path (such as non-normality or a lack of representation of measurement error), and may prevent the researcher from thinking about such other reasons at all. Since it is always possible to improve model fit by adding paths, the researcher may end up with an incorrect model if the reason for misfit was not a missing path in the first place. Thus, conceptually, local tests differ from modification indices in that they force the researcher to think about possible reasons for misfit. In this respect, local tests are more similar to the examination of standardized residuals between model-implied and sample covariances, another means of diagnosing model misfit.
The number of conditional independence constraints in a path model is moderate: it equals the number of missing paths or, in other words, the degrees of freedom (for identified models). However, the number of tetrad constraints for medium-sized or large latent variable models is huge and can appear daunting. An issue that arises in this context is the multiple testing problem. As is well known, conducting multiple significance tests carries with it an increased risk of Type I errors. With a large number of local tests, false positives become more frequent, and some Type I error adjustment appears warranted. Stringent adjustment, however, also reduces the statistical power of the tests. In addition, the statistical power of the local tests may vary widely within a single model. A test of a violation of large magnitude may have very high power, while the same model may have a violation of smaller magnitude, and consequently a test that is underpowered. It may be helpful to consider a priori what
magnitude of a violation can be reliably detected with sufficient power, given an assumed sample size and significance level. A complicating matter in the context of local tests is that many of the tests are not independent, and therefore p value adjustment techniques should be used that do not rely on the tests being independent. Independent subsets of local tests can be derived under the hypothesis that the postulated model is correct, which has been done both for conditional independencies (Shipley, 2000) and for tetrads (Bollen & Ting, 1993). However, this approach defeats the purpose of local fit evaluation as presented here, because we wish to test the individual implications of the model separately rather than the model as a whole. In that regard, the approach presented here is closer in spirit to the tetrad-based Bayesian posterior predictive checks discussed by Johnson and Bodner (2014), who also consider all tetrads rather than an independent subset. To help researchers navigate large numbers of tetrad tests, we suggest separately investigating the three levels of Kenny's tetrad typology (Kenny, 1979), which is also supported by the "dagitty" package. This may allow for easier interpretation of the results.
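One adjustment that remains valid under arbitrary dependence among the tests is the Holm step-down procedure, which therefore suits the situation described above. A minimal sketch (the p values are hypothetical results from five local tests, not values from any example in this article):

```python
import numpy as np


def holm_adjust(pvals):
    """Holm step-down adjusted p values. Controls the familywise error
    rate under arbitrary dependence among the tests, so it is safe for
    non-independent local tests."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        # step-down: multiply the k-th smallest p value by (m - k + 1),
        # then enforce monotonicity and cap at 1
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj


# hypothetical p values from five local tests
pvals = [0.001, 0.04, 0.03, 0.20, 0.008]
adjusted = holm_adjust(pvals)
```

If familywise error control is too strict for a large battery of tetrad tests, a false discovery rate procedure that is also valid under dependence, such as Benjamini-Yekutieli, is a natural alternative.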
In summary, we argue that local tests are a valuable addition to the applied researcher's toolbox. They are not meant to replace global tests, and in fact, this article neither argues for, nor provides evidence in favor of, abandoning global tests. Such an argument, if it were even sensible, would require large-scale simulation studies and analytic derivations. We consider local fit evaluation a supplement that can foster a more thorough account of model fit. We do believe that local fit evaluation can provide helpful diagnostic information, especially when models fail to converge or fail to fit. To facilitate local fit evaluation in practice, we have implemented the methods discussed here in the R package "dagitty," which is available for R on CRAN.