Florida Obsessive‐Compulsive Inventory and Children's Florida Obsessive Compulsive Inventory: A reliability generalization meta‐analysis

Abstract Background The Florida Obsessive Compulsive Inventory (FOCI) and its pediatric version, the Children's Florida Obsessive Compulsive Inventory (C‐FOCI), are instruments for evaluating obsessive‐compulsive symptomatology. Method A reliability generalization meta‐analysis was conducted to estimate the average reliability of the scores and to identify study characteristics that explain the heterogeneity among reliability estimates. A total of 23 and 20 independent samples were included in the meta‐analyses for the FOCI and the C‐FOCI, respectively; reliability was assessed with the Kuder–Richardson 20 (KR‐20) coefficient for the Symptom Checklists and Cronbach's α for the Symptom Severity subscales. Results We found an average KR‐20 of 0.826 for the FOCI's Symptom Checklist and an average α of 0.882 for the FOCI's Symptom Severity. An average KR‐20 of 0.740 was found for the C‐FOCI's Symptom Checklist, while an average α of 0.794 was found for the C‐FOCI's Symptom Severity. Moderator analyses showed that the source of the coefficients (i.e., whether they were reported by the authors of the primary study or estimated by the meta‐analysts) was an important variable for the FOCI Symptom Severity, and that the focus of the study (i.e., whether it was psychometric or applied) and the sample size were relevant for the C‐FOCI Symptom Checklist. Conclusions Given the brevity and ease of use of the FOCI and C‐FOCI, together with the reliabilities obtained here, both scales are well suited for screening purposes.

Although statements such as "this test is reliable" are often found in the literature, they are misleading: reliability is not an inherent property of a test, and it can vary each time the test is applied to a different sample of participants (Crocker & Algina, 1986; Streiner & Norman, 2008). Thus, any time a study makes use of a scale, the authors should report a reliability estimate obtained with the data at hand (Appelbaum et al., 2018). Unfortunately, researchers often fail to report reliability estimates based on their own participants' scores; instead, it is common to find references to the reliability obtained in the original test validation study. With this practice, known as reliability induction, researchers erroneously assume that their data will show the same reliability. Previous studies have reported reliability induction rates for psychological tests of between 54% and 78.6% (Vacha-Haase et al., 1999; Whittington, 1998). Checking the reliability of the scores of new test applications is important not only to ensure that the measure is reliable in that context but also because reliability affects effect sizes: if the test scores have lower reliability, effect sizes calculated with that instrument will be attenuated (Vacha-Haase et al., 2000).
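This attenuation follows from the classical correction-for-attenuation formula: the expected observed correlation between two measures is the true-score correlation shrunk by the square root of the product of their reliabilities,

$$r_{xy} = \rho_{xy}\sqrt{r_{xx}\,r_{yy}},$$

where $\rho_{xy}$ is the correlation between true scores and $r_{xx}$ and $r_{yy}$ are the reliabilities of the two measures. For example, a true correlation of .50 measured with reliabilities of .70 and .80 yields an expected observed correlation of only $.50\sqrt{.70 \times .80} \approx .37$.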
Nevertheless, we can use meta-analytic methods to obtain a representative reliability value for an instrument by integrating different reliability estimates obtained across studies. This is often referred to as reliability generalization (RG; Vacha-Haase, 1998). Furthermore, if there is heterogeneity across reliability estimates based on the same test, an RG meta-analysis allows us to examine whether some characteristics of the studies (i.e., moderators) might account for the variability of the reliability coefficients (Henson & Thompson, 2002; Rodriguez & Maeda, 2006; Sánchez-Meca et al., 2013). Examples of study characteristics that can affect the reliability are the mean and variability of test scores, the target population (community, undergraduate, clinical), or whether the original version or an adaptation (to a different language, culture, or country) of the test was applied.

| Objectives
The aim of this study was to conduct an RG meta-analysis to (a) estimate an average reliability for the FOCI and C-FOCI by integrating all the studies that have applied the scales and reported a reliability estimate with the data at hand; (b) assess whether there is substantial heterogeneity among reliability estimates for the same instrument and, if so, perform moderator analyses to identify study characteristics that account for such variability; and (c) estimate the reliability induction rate for the FOCI and C-FOCI.

| METHOD
Review methods and reporting were performed according to the Reliability Generalization Meta-Analysis (REGEMA) guidelines (Sánchez-Meca et al., 2021). The REGEMA checklist for the present meta-analysis is available at https://osf.io/zdhvk/. We did not preregister a study protocol.

| Literature search
To identify studies that could be relevant for the RG meta-analysis, we systematically searched Google Scholar, ProQuest, PsycInfo, and PubMed from 2007, the year the FOCI was published, to November 2020. We used the term "Florida Obsessive Compulsive Inventory" as the keyword to be found anywhere in the article; this keyword allowed us to identify studies that applied either the FOCI or the C-FOCI. In addition, the reference lists of studies that had applied either scale were consulted to identify additional studies.

| Eligibility criteria
To be included in this RG meta-analysis, studies had to meet three criteria: (a) to be an empirical study that applied the FOCI or the C-FOCI to one or more samples of participants; (b) to report a reliability estimate (internal consistency, temporal stability, or interrater agreement) computed with the data at hand, or to include enough information for us to calculate one; and (c) to be written in English or Spanish. In addition, for studies that applied the FOCI or the C-FOCI and did not report any reliability estimate with the data at hand, the corresponding author was contacted and, if they sent a reliability estimate, the study was also included in the meta-analysis. Single-case (N = 1) and case series studies were excluded. Participant samples could belong to any kind of target population (community, clinical, or subclinical), and papers could be either published or unpublished (e.g., a PhD dissertation).

| Instruments
A coding form was developed to extract relevant study characteristics and reliability estimates. The coding form is available at https://osf.io/yqvhs/ and contains information about all coded variables.

| Procedure
For the reliability estimates, we extracted the KR-20 coefficient for the Symptom Checklist and Cronbach's α for the Symptom Severity subscales. In addition, for the studies that had a reliability estimate available, we extracted data on the following potential moderators: (a) mean and standard deviation of the scores on each scale; (b) mean and standard deviation of participants' age; (c) study sample size; (d) gender distribution (% male); (e) publication year; (f) sample ethnicity (% Caucasian); (g) target population (community, undergraduate students, clinical, or mixed); (h) location of the study (country, continent); (i) test version (original or other); (j) administration format (paper-and-pencil or online); (k) study aim (psychometric or applied); and (l) if psychometric, the focus of the psychometric study (the FOCI or another scale). The coding form was applied to the studies that reported any reliability estimate of the FOCI or the C-FOCI with the data at hand. In addition, several sample characteristics were extracted from the studies that did not report reliability estimates: target population (clinical vs. nonclinical), mean and SD of total test scores, mean age, gender distribution (% male), and ethnicity (% Caucasian). These variables were used in a sensitivity analysis exploring whether the composition and variability of the inducing and reporting studies were similar. The information extraction was carried out by one author (A. S. L.).

| Data analysis
Separate meta-analyses were conducted for each subscale of the FOCI and the C-FOCI. Random-effects models were assumed, as heterogeneity among the reliability estimates was expected. The reliability coefficients were weighted by the inverse variance, using the restricted maximum likelihood (REML) method to estimate the between-studies variance. The 95% confidence intervals were computed according to the improved method proposed by Hartung and Knapp (2001). To normalize the sampling distribution and to stabilize the sampling variances of the reliability coefficients, we applied Bonett's (2002) transformation to the internal consistency coefficients,

$$L_i = \ln(1 - \alpha_i),$$

where $\alpha_i$ denotes the reliability coefficient of each study, ln is the natural logarithm, and $L_i$ is the transformed coefficient. After the statistical integration, the $L_i$ coefficients were back-transformed to the reliability metric ($\alpha_i = 1 - e^{L_i}$) to facilitate interpretation and, as a sensitivity analysis, we also conducted the meta-analyses with the untransformed coefficients (Sánchez-Meca et al., 2013).
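As a minimal sketch of this pooling step (the α values, item count, and sample sizes below are hypothetical; the authors' actual analysis script is available at https://osf.io/65zpj/), the model can be fitted in R with the metafor package:

```r
# Minimal sketch of the pooling step described above (hypothetical data).
library(metafor)

alpha <- c(0.88, 0.82, 0.91, 0.79, 0.85)  # hypothetical alpha coefficients
n     <- c(120, 86, 310, 54, 150)         # hypothetical sample sizes
k     <- 5                                # hypothetical number of items

# Bonett (2002) transformation: L_i = ln(1 - alpha_i), with sampling
# variance Var(L_i) = 2k / [(k - 1)(n_i - 2)]
yi <- log(1 - alpha)
vi <- 2 * k / ((k - 1) * (n - 2))

# Random-effects model: REML estimator of the between-studies variance,
# Knapp-Hartung adjustment for the confidence interval
res <- rma(yi, vi, method = "REML", test = "knha")

# Back-transform to the reliability metric: alpha = 1 - e^L
# (1 - e^x is decreasing, so the CI bounds swap)
round(c(mean  = 1 - exp(coef(res)),
        ci.lb = 1 - exp(res$ci.ub),
        ci.ub = 1 - exp(res$ci.lb)), 3)
```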
Heterogeneity among reliability estimates was assessed in each meta-analysis using the Q statistic, the I² index, and 95% credibility intervals, which estimate a plausible range for the reliability of any individual (or future) study (Riley et al., 2011). I² values of approximately 25%, 50%, and 75% have been suggested as reflecting low, moderate, and large heterogeneity, respectively (Higgins et al., 2003), although these benchmarks were proposed in the context of clinical trials and might not be applicable to RG studies. If there was evidence of heterogeneity, we performed moderator analyses to identify which study characteristics accounted for this heterogeneity. We applied mixed-effects meta-regression models for continuous moderators and subgroup analyses for categorical moderators, in both cases with the improved method proposed by Knapp and Hartung (2003). R² indices were calculated to estimate the amount of variance accounted for by each moderator (López-López et al., 2014). Furthermore, we selected the moderators that were either statistically significant or explained at least 10% of the variance (i.e., R² > 0.10) and fitted models with all of their possible combinations for every scale that presented considerable heterogeneity. To compare which model best fit the data, we used an information-theoretic approach based on the Akaike Information Criterion (AIC).
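Continuing the hypothetical sketch above, the heterogeneity statistics, 95% prediction (credibility) interval, and moderator analyses map onto metafor as follows; the moderator names and values are illustrative:

```r
# Sketch of the heterogeneity and moderator analyses (hypothetical data).
library(metafor)

alpha <- c(0.88, 0.82, 0.91, 0.79, 0.85, 0.76, 0.90)
n     <- c(120, 86, 310, 54, 150, 48, 500)
k     <- 5
yi <- log(1 - alpha)                 # Bonett-transformed coefficients
vi <- 2 * k / ((k - 1) * (n - 2))    # their sampling variances

res <- rma(yi, vi, method = "REML", test = "knha")
res$QE                               # Q statistic
res$I2                               # I^2 index
predict(res)                         # includes the 95% prediction interval

# Mixed-effects meta-regression for a continuous moderator (sample size),
# with the Knapp-Hartung adjustment for the moderator tests
res_n <- rma(yi, vi, mods = ~ n, method = "REML", test = "knha")
res_n$R2                             # % of variance accounted for

# Subgroup analysis for a categorical moderator (hypothetical labels)
pop <- c("clinical", "clinical", "community", "community",
         "clinical", "community", "clinical")
res_pop <- rma(yi, vi, mods = ~ factor(pop), method = "REML", test = "knha")
```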
For each model, we calculated the corrected version of the AIC (AICc), recommended for small sample sizes (Symonds & Moussalli, 2011), using maximum likelihood estimation. All statistical analyses were conducted using the metafor package in R (Viechtbauer, 2010), and the script is available at https://osf.io/65zpj/.

| RESULTS

After removing duplicate records and excluding, on the basis of title and abstract, references that were not empirical studies, we reviewed 189 references in full text, of which 133 did not apply the FOCI or the C-FOCI. Forty-two articles applied the FOCI, of which 13 studies (30.9%) reported some reliability estimate with the data at hand and were included in the meta-analysis. The remaining 29 studies (69.1%) induced the reliability from previous applications of the test (by omission or by report). However, four of these studies reported the mean and standard deviation of the scores, enabling us to calculate the KR-20 coefficient for the Symptom Checklist subscale (see the sketch below), and we obtained reliability estimates from three further studies after contacting the authors by email; these seven studies were therefore also included, adding up to a total of 20 studies and 23 independent samples for the FOCI. Furthermore, 14 studies applied the C-FOCI, of which 11 (78.6%) reported some reliability estimate with the data at hand and were included in the meta-analysis. The remaining three studies (21.4%) induced the reliability from previous applications of the test (by omission or by report). However, one study reported the mean and standard deviation of the scores, allowing us to calculate the KR-20 coefficient for the Symptom Checklist subscale, and one author sent us a reliability estimate, so these two studies were also included, adding up to a total of 13 studies with 21 independent samples. References for the studies included in the analyses are available at https://osf.io/6tvna/.

Figure 1. Flowchart of the study selection process.

The total sample size for the RG study of the FOCI was N = 4076 for the Symptom Checklist and N = 5898 for the Symptom Severity subscale, with sample sizes ranging from 18 to 986 (mean = 226; SD = 260) and from 47 to 1224 (mean = 281; SD = 305), respectively. Most studies were carried out in North America (52.2%), followed by Asia (26.1%), Oceania (13.0%), and Europe (8.7%). The majority of studies included only people with a clinical condition (56.52%), followed by nonclinical undergraduate/graduate students (34.78%), nonstudents without any known clinical condition (4.35%), and mixed samples (4.35%).
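A note on these calculations from summary statistics: the Kuder–Richardson estimate that can be computed from only the mean and standard deviation of total scores (rather than item-level responses) is the KR-21 variant, which additionally assumes equal item difficulties. A minimal sketch, with purely illustrative values:

```r
# KR-21: the Kuder-Richardson estimate computable from the number of items
# and the mean and SD of total scores alone (assumes equal item
# difficulties). All values below are illustrative, not from any study.
kr21 <- function(k, m, sd) {
  (k / (k - 1)) * (1 - (m * (k - m)) / (k * sd^2))
}

kr21(k = 20, m = 8.3, sd = 4.9)  # e.g., a 20-item checklist -> ~0.84
```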
Finally, concerning the reliability induction rates, 69.1% of the studies that applied the FOCI induced the reliability by omission (not making any reference to the reliability of the scale) or by report (explicitly making reference to the reliability shown in previous applications). Regarding the C-FOCI, 21.4% of the studies that applied it induced the reliability by omission or report. The database can be found at https://osf.io/pnfwq/.

| Mean reliability and heterogeneity
Although our purpose was to synthesize all types of reliability coefficients, the studies that applied the FOCI or the C-FOCI reported only α or KR-20 coefficients. Thus, our results are based on these two types of internal consistency coefficients. The average reliability was 0.826 for the FOCI Symptom Checklist, 0.882 for the FOCI Symptom Severity, 0.740 for the C-FOCI Symptom Checklist, and 0.794 for the C-FOCI Symptom Severity, with large heterogeneity among the coefficients of both FOCI subscales and of the C-FOCI Symptom Checklist. Results of the meta-regressions for the continuous moderators are reported in the supplementary materials (Table S3). In particular, the larger the sample size and the SD of the test scores, the larger the reliability.

| Analysis of moderator variables
Weighted analyses of variance on the transformed reliability coefficients were also carried out for each categorical moderator variable. Full moderator analysis results are available at https://osf.io/hgzmk/. There was strong evidence of a difference when comparing the mean reliability coefficients grouped by the source from which the coefficient was obtained (reported in the paper vs. obtained from the authors upon request) for the FOCI Symptom Severity subscale.

Figure 5. Forest plot of the α coefficients reported for the C-FOCI Symptom Severity. The lines at both sides of the diamond show the limits of the 95% prediction interval. C-FOCI, Children's Florida Obsessive Compulsive Inventory.

| Model comparison
Of all the possible moderator combinations, only the five best models (according to the AICc) are presented (see Table 2). Because no moderator was statistically significant or explained more than 10% of the variance for the FOCI Symptom Checklist, we did not perform any model comparison for this scale. For the FOCI Symptom Severity, the model that best fit our data was the one including the test adaptation and the source of the estimates as moderators, with an Akaike weight of 0.419. The next-best models achieved similar but slightly worse fit, and all of them also contained the source as a moderator. For the C-FOCI Symptom Checklist, the model with the lowest AICc was the one containing only the sample size as a moderator, with an Akaike weight of 0.55, achieving slightly better fit than the intercept-only model. Last, for the C-FOCI Symptom Severity, the best model was the one containing gender distribution (% male) as a moderator, also achieving slightly better fit than the intercept-only model.
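For reference, here is a sketch of how such AICc values and Akaike weights can be computed for metafor models (hypothetical data; the aicc_rma helper is ours, written from the Symonds & Moussalli formula, and candidate models are fit with ML because REML likelihoods are not comparable across models with different fixed effects):

```r
# Sketch of the AICc-based model comparison (hypothetical data).
library(metafor)

alpha <- c(0.88, 0.82, 0.91, 0.79, 0.85, 0.76, 0.90)
n     <- c(120, 86, 310, 54, 150, 48, 500)
k     <- 5
yi <- log(1 - alpha)
vi <- 2 * k / ((k - 1) * (n - 2))

m0 <- rma(yi, vi, method = "ML")              # intercept-only model
m1 <- rma(yi, vi, mods = ~ n, method = "ML")  # sample size as moderator

# AICc = -2*logLik + 2p + 2p(p + 1)/(N - p - 1), where p is the number of
# model parameters (fixed effects plus tau^2) and N the number of studies
aicc_rma <- function(m) {
  p <- m$parms
  -2 * c(logLik(m)) + 2 * p + 2 * p * (p + 1) / (m$k - p - 1)
}

aicc  <- sapply(list(m0 = m0, m1 = m1), aicc_rma)
delta <- aicc - min(aicc)
round(exp(-0.5 * delta) / sum(exp(-0.5 * delta)), 3)  # Akaike weights
```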

| DISCUSSION
We performed an RG meta-analysis to characterize how the reliability of the FOCI and the C-FOCI varies across studies. The FOCI showed an average reliability of 0.826 for the Symptom Checklist subscale and 0.882 for the Symptom Severity subscale, whereas the C-FOCI showed an average reliability of 0.740 for the Symptom Checklist subscale and 0.794 for the Symptom Severity subscale. According to Cicchetti (1994), these values can be considered good (FOCI) and fair (C-FOCI) reliability coefficients, respectively. If we compare these values with those obtained in RG meta-analyses of other OCD scales, such as the Yale–Brown Obsessive Compulsive Scale (Y-BOCS) or the original Padua Inventory (López-Pina et al., 2015; Núñez-Núñez et al., 2022; Rubio-Aparicio et al., 2020; Sánchez-Meca et al., 2011), we see that the FOCI and C-FOCI have similar or slightly better psychometric properties. Nonetheless, we want to emphasize again that the FOCI and C-FOCI are the only OCD scales that examine both OCD symptoms and their severity while being brief, simple, and self-reported. The average reliabilities obtained here are above 0.80 for the FOCI and above 0.70 for the C-FOCI. Nunnally and Bernstein (1994) consider 0.70 an acceptable standard for instruments used in basic research and quick assessments, 0.80 suitable for research comparing the scores of different groups, and 0.90 necessary for clinical settings where diagnostic or other important decisions are made. By these guidelines, both scales seem especially well suited for screening OCD symptoms in community populations and, in the case of the FOCI, also for research involving group comparisons.
Large heterogeneity among coefficients was found for both FOCI subscales and for the C-FOCI Symptom Checklist subscale, so we performed moderator analyses to identify which study characteristics might explain this variability. For the continuous moderators, we found that none were statistically associated with the reliability coefficients, although there was some evidence that sample size and the standard deviation of the scores accounted for a substantial part of the variability among the reliability estimates. The standard deviation of the scores has previously been found to be a source of systematic variation in reliability coefficients (Botella et al., 2010). Thus, what might be surprising here is that, even though the standard deviation explained a considerable portion of the variance in many cases, it did not reach statistical significance. Considering the scarce number of studies that reported the standard deviation of the test scores, this lack of statistical significance might be due to low statistical power. Another aspect to take into consideration is that most studies applied the FOCI or the C-FOCI to a particular type of sample (either people with a clinical condition or people without one, but not both). This might artificially restrict the standard deviations, making them very similar across studies and therefore masking their potential role as a moderator variable.

For the categorical moderators, we found that, for the FOCI Symptom Severity subscale, the source from which the coefficient was obtained (either reported in the paper or requested from the authors) was a significant moderator. Notably, studies that reported the coefficients showed higher reliability than studies whose coefficients our team requested from the authors. This finding can be interpreted as indicating reporting bias in the reliability coefficients, such that studies are less likely to report the reliability coefficient obtained with the data at hand when the observed value is small. We also found that, for the C-FOCI Symptom Checklist, studies whose psychometric focus was the C-FOCI showed higher reliability coefficients than studies whose psychometric focus was another scale. Lastly, another remarkable result was that the administration format was not a significant moderator for any of the scales; that is, there was no difference between reliability estimates obtained in paper-and-pencil format and those obtained online. Thus, both the FOCI and the C-FOCI might be used as online screening tools to assess OCD symptoms easily and quickly.
Model comparison also showed that, for the C-FOCI Symptom Checklist, sample size might be a relevant moderator of the reliability coefficients, with a positive association between sample size and the magnitude of the coefficients. Last, for the C-FOCI Symptom Severity, gender might also be an important moderator, since the model containing gender achieved the best fit. Specifically, there appears to be a negative association between the percentage of males in the sample and the reliability coefficients.
Finally, we also calculated the reliability induction rate for both scales. For the FOCI, we found a rather high induction rate (69.05% of the studies induced the reliability). This finding is in line with previous appraisals of this practice (Vacha-Haase et al., 1999; Whittington, 1998). Considering that the scale is rather recent, this result suggests that the practice of inducing the reliability of test scores from previous studies is still widespread. On the other hand, only 21.43% of the studies that applied the C-FOCI induced the reliability by omission or report. This considerably lower induction rate (compared with the FOCI and other scales) might be due to the fact that half of the studies that applied the C-FOCI had a psychometric focus (by contrast, only 5.25% of the studies that applied the FOCI had a psychometric focus), and it is reasonable to expect that psychometric studies of a test will calculate and report reliability coefficients.

| Limitations
Although many studies have applied the FOCI or the C-FOCI, the number that reported a reliability estimate with the data at hand is considerably smaller. This, together with the fact that missing data on potential moderator variables were common, may have limited the results of the moderator analyses. Another limitation was the failure of authors to report important study characteristics whose potential influence on reliability estimates could otherwise be investigated in an RG meta-analysis. In particular, many studies did not report the mean and standard deviation of the test scores, two very important moderators in the context of RG studies.
On another note, this RG meta-analysis was based on Cronbach's α (and KR-20, a special case of α for dichotomous items). Because Cronbach's α does not require multiple test administrations to estimate the reliability of a scale, it is the most commonly reported reliability coefficient (Hogan et al., 2000; Scherer & Teo, 2020). However, using Cronbach's α as a measure of reliability presents some drawbacks. First, the coefficient is not invariant to the number of items in the scale: a larger number of items will inflate Cronbach's α (Tavakol & Dennick, 2011). Second, it relies on assumptions that are usually not met in practice (τ-equivalence, uncorrelated item errors, unidimensionality, etc.), resulting in over- or underestimation of the reliability (McNeish, 2018). For these reasons, Cronbach's α has been criticized, and it is advised that future primary studies instead make use of composite reliability measures such as the omega coefficients, or measures of maximal reliability such as coefficient H, allowing future RG meta-analyses to be based on those coefficients (Scherer & Teo, 2020). Nonetheless, studies published to date have seldom used these alternative coefficients, so an RG meta-analysis based on them is not yet feasible for most measurement instruments.
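For illustration only (not part of the present meta-analysis), coefficients such as McDonald's ω can be estimated from item-level data in R, for example with the psych package; the data below are simulated from a single latent factor:

```r
# Sketch of estimating McDonald's omega from item-level data with the
# psych package (simulated single-factor data, purely for illustration).
library(psych)

set.seed(1)
f <- rnorm(200)                                   # latent factor scores
items <- data.frame(sapply(1:5, function(i) f + rnorm(200)))

omega(items, nfactors = 1)   # reports omega total, among other indices
```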