Gender biases in impressions from faces: Empirical studies and computational models

Trustworthiness and dominance impressions summarize trait judgments from faces. Judgments on these two key traits are negatively correlated with each other in impressions of female faces, implying less differentiated impressions of female faces. Here we test whether this holds across many trait judgments, and whether the less differentiated impressions of female faces originate from different facial information being used for male and female impressions or from different evaluation of the same information. Using multidimensional rating datasets and data-driven modeling, we show that (a) impressions of women are less differentiated and more valence-laden than impressions of men and (b) these impressions are based on similar visual information across face genders. Female face impressions were more highly intercorrelated and were better explained by valence (Study 1). These intercorrelations were higher when raters more strongly endorsed gender stereotypes. Despite this gender difference, male and female impression models, derived from separate trustworthiness and dominance ratings of male and female faces, were similar to each other (Study 2). Further, both male and female models could manipulate impressions of faces of both genders (Study 3). The results highlight the high-level, evaluative effect of face gender in impression formation: women are judged negatively to the extent that their looks do not conform to expectations, not because people use different facial information across genders but because people evaluate the same information differently across genders.


Analysis of the Effects of Raters' Gender Stereotype Endorsement
When testing for the effect of raters' gender stereotype endorsement (GSE) on facial impressions (see Table S2 for the traits used in the GSE questionnaire), we conducted an additional analysis on the structure of impressions using four GSE factor scores (each representing a specific subtype of gender stereotypes), in addition to the analyses reported in the main text (Study 1b). In the main-text analyses, we used the GSE sum score, the sum of each rater's responses to all items in the questionnaire (see the main text for details). The additional analysis yielded results consistent with the main analyses.
To examine whether the relationship between impression differentiation and rater GSE is affected by the gender- or valence-specificity of stereotypes, we computed four factor scores for each rater: stereotype gender [male/female] × stereotype valence [positive/negative]. Each factor score represented the extent to which a rater endorsed gender- and valence-specific stereotypes. The four-factor confirmatory factor model showed an acceptable, albeit minimal, level of fit (χ²(164) = 826.21, P < .001; CFI = 0.827; RMSEA = 0.093, 90% CI = [0.087, 0.099]; SRMR = 0.067). The factor loadings of this four-factor solution showed the expected relationships between the factors and the questionnaire responses (see Table S2 for details).
We then replicated the analyses reported in the main text using each GSE factor score rather than the GSE sum score. Although we explored the relationship between impression differentiation and rater GSE in relation to the specific subtypes of stereotype (gender × valence), people's attitudes towards the four subtypes (i.e., male-positive, male-negative, female-positive, and female-negative) go hand in hand (Glick & Fiske, 1996; Glick et al., 2000, 2004). That is, if one holds one subtype of gender stereotypes, say, stereotypes about stereotypically male and positive traits (e.g., "men are more analytical than women"), then the person is likely to hold the other three gender stereotype subtypes as well (e.g., "men are more hostile than women", "women are more gullible than men", "women are more nurturing than men").
Thus, we did not expect any uniquely distinct effect of the stereotype subtypes on the effect of GSE on impression differentiation. As expected, the results for all four subtypes (factor scores) were consistent with the results reported in the main text. First, as a rater's GSE factor score increased, regardless of the subtype of GSE, the correlations between impressions became stronger for both male and female faces, and the quadratic regression models explained more variance than the linear regression models did (e.g., male faces: F(1, 136) = 12.38, P < .001; female faces: F(1, 136) = 44.74, P < .001; Fig. S2). Second, across all factor scores, female face ratings had higher correlation coefficients (all ts > 2.14, Ps < .033) and a larger amount of variance explained by the valence component than male face ratings did (all ts > 7.69, Ps < .001; see Fig. S2 for details).
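The linear-versus-quadratic comparison above is a standard nested-model F-test. The sketch below illustrates it on synthetic data; the variable names (gse, corr_strength) and the simulated effect sizes are illustrative assumptions, not the study's actual data.

```python
import numpy as np

def nested_f_test(x, y):
    """F-test comparing a quadratic to a linear polynomial fit.

    Returns (F, df1, df2) for H0: the quadratic term adds nothing
    beyond the linear model.
    """
    n = len(x)
    # Design matrices: linear (intercept + x) vs quadratic (+ x^2)
    X1 = np.column_stack([np.ones(n), x])
    X2 = np.column_stack([np.ones(n), x, x**2])
    rss = []
    for X in (X1, X2):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        rss.append(resid @ resid)
    df1 = 1                   # one extra parameter (the x^2 term)
    df2 = n - X2.shape[1]     # residual df of the fuller model
    F = ((rss[0] - rss[1]) / df1) / (rss[1] / df2)
    return F, df1, df2

# Synthetic illustration (made-up numbers): correlation strength
# rising with GSE, with a mild curvilinear component, 139 raters
# so that the residual df matches F(1, 136).
rng = np.random.default_rng(0)
gse = rng.uniform(-2, 2, 139)
corr_strength = 0.4 + 0.1 * gse + 0.05 * gse**2 + rng.normal(0, 0.05, 139)
F, df1, df2 = nested_f_test(gse, corr_strength)
```

A significant F indicates that the quadratic term improves the fit, while the much larger linear coefficient is what makes the overall increase largely monotonic.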

Role of Raters' Gender Stereotypes about Women
It should be noted that in all four cases, although the quadratic models explained significantly more variance than the linear models, the magnitude of the quadratic effects was much smaller than the magnitude of the linear effects, and the increase in the intercorrelations of trait ratings as a function of GSE was largely monotonic (Fig. S2).

Analysis of the Effects of Rater Gender
When testing for the effect of raters' gender on facial impressions, we conducted an additional analysis using a 2 [face gender] × 2 [rater gender] repeated-measures ANOVA on the absolute values of the inter-impression correlation coefficients, in addition to the analyses reported in the main text (Study 1b). In the main-text analyses, we used Jennrich (1970) tests of matrix equality (see the main text for details) instead of an ANOVA because the dataset violates the assumption of sample independence. However, given that ANOVA is known to be rather robust to violations of independence, we report the additional result below in the Supplemental Material. The additional analysis yielded results consistent with the main analyses. To account for the repeated-measures design, we calculated generalized eta-squared (ηG²) as the measure of effect size for each effect (Olejnik & Algina, 2003).
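Generalized eta-squared partitions only the sums of squares that count as "measured" variance into the effect-size denominator. A minimal sketch for the simplified case of a one-way, purely within-subjects design (the actual analysis was a 2 × 2 mixed design, and the data below are synthetic):

```python
import numpy as np

def rm_anova_eta_g(data):
    """One-way repeated-measures ANOVA with generalized eta-squared.

    data: array of shape (n_subjects, k_conditions).
    Returns (F, eta_g). Following Olejnik & Algina (2003), for a
    purely within-subjects design
        eta_G^2 = SS_effect / (SS_effect + SS_subjects + SS_error).
    """
    n, k = data.shape
    grand = data.mean()
    ss_total = ((data - grand) ** 2).sum()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_err = ss_total - ss_cond - ss_subj
    df1, df2 = k - 1, (n - 1) * (k - 1)
    F = (ss_cond / df1) / (ss_err / df2)
    eta_g = ss_cond / (ss_cond + ss_subj + ss_err)
    return F, eta_g

# Toy data: 20 raters, |correlation| for male vs. female faces,
# with female-face correlations set higher (values are made up).
rng = np.random.default_rng(1)
data = rng.normal(0.0, 0.1, (20, 2)) + np.array([0.3, 0.6])
F, eta_g = rm_anova_eta_g(data)
```

Unlike partial eta-squared, ηG² keeps the subject variance in the denominator, which makes effect sizes comparable across between- and within-subjects designs.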

Building of Male and Female Face Impression Models in Study 2
A data-driven face modeling approach allows one to build models of impressions without a priori assumptions about the effect of specific facial features (e.g., the size of the nose) on impressions (Dotsch & Todorov, 2012; Funk, Walker, & Todorov, 2016; Gosselin & Schyns, 2001; Jack & Schyns, 2017; Mangini & Biederman, 2004; Oosterhof & Todorov, 2008; Todorov, Dotsch, Wigboldus, & Said, 2011; Walker & Vetter, 2009). In the standard, hypothesis-driven approach, different facial features are manipulated. However, the combinations of features proliferate rapidly as the number of features increases (Jack & Schyns, 2017), damaging the feasibility and/or statistical power of the investigation. For example, a simple factorial design investigating the effect of only ten binary facial features (e.g., a long vs. short nose) would lead to 2¹⁰ = 1,024 experimental conditions. The data-driven approach avoids this by presenting a relatively small number of faces (e.g., 300), which vary randomly in their features.
In Studies 2-3, we used the statistical face space model of FaceGen 3.2 (Singular Inversions), which captures the variance in a large sample of real human faces with 100 orthogonal dimensions (Todorov & Oosterhof, 2011). Each dimension represents the variance in a holistic combination of features. A single face is represented as a vector in the statistical space (i.e., an array of 100 numbers).
Using this approach, one can generate an unlimited number of faces by randomly sampling parameters and rendering the corresponding faces as images. Participants then judge the randomly sampled faces on a trait of interest (e.g., trustworthiness). One can model the trait judgment by extracting the changes in face parameters that correlate with changes in the judgment. With the resulting model, one can visualize which aspects of facial appearance change when the impression of the trait changes.
The trait model can be applied to any new face to make it appear more or less trait-like (e.g., trustworthy) by moving its parameters along the modeled judgment dimension. With these manipulated faces, one can study what types of facial cues (e.g., emotional facial gestures, perceived physical strength) predict the perceived level of the trait. In addition, a data-driven statistical face model allows one to vary a particular perceived trait of faces while specifically controlling for another trait (Oh, Buck, & Todorov, 2019; Todorov, Dotsch, Porter, & Oosterhof, 2013). For example, Oh and colleagues (2019) manipulated the perceived competence of faces while controlling for facial attractiveness, thereby effectively suppressing the halo effect underlying competence impressions. This procedure revealed that facial masculinity is one of the ingredients of competence impressions, exposing gender biases in competence impressions.
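The modeling pipeline above can be sketched numerically. This is a minimal toy version: the ratings are simulated from a hidden direction rather than collected from participants, and the specific choices (300 faces, standard-normal parameters, covariance-based estimation) are illustrative assumptions rather than the exact FaceGen procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
N_FACES, N_DIMS = 300, 100    # 300 random faces in a 100-dim face space

# 1) Randomly sample faces: each face is a 100-parameter vector.
faces = rng.normal(0.0, 1.0, (N_FACES, N_DIMS))

# 2) Trait ratings. In practice these come from participants; here we
#    simulate them from a hidden "true" direction plus noise.
true_direction = rng.normal(0.0, 1.0, N_DIMS)
true_direction /= np.linalg.norm(true_direction)
ratings = faces @ true_direction + rng.normal(0.0, 0.5, N_FACES)

# 3) Model the judgment: the trait vector is the rating-weighted mean
#    of the face parameters, i.e., the parameter-rating covariances.
z = (ratings - ratings.mean()) / ratings.std()
trait_model = (faces * z[:, None]).mean(axis=0)
trait_model /= np.linalg.norm(trait_model)

# 4) Apply the model: push a new face 2 SDs along the trait dimension
#    to make it appear more trait-like (e.g., more trustworthy).
new_face = rng.normal(0.0, 1.0, N_DIMS)
more_trait_face = new_face + 2.0 * trait_model
```

With ~300 rated faces the estimated trait vector already aligns well with the direction that generated the ratings, which is why a relatively small stimulus set suffices.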

Validation of Male and Female Face Impression Models in Study 3
To create the face stimuli for the validation studies, we manipulated the faces' levels of perceived trustworthiness and dominance. We added −3, −2, −1, 0, 1, 2, or 3 SDs to the trustworthiness or dominance value of 25 randomly generated male and 25 randomly generated female faces, using either the male or the female model. In other words, we moved the coordinates of the faces in the face space along one of the four gender-specific trait models.
This was a different approach from that of previous validation studies, which did not control for the gender of the faces: Todorov and colleagues (2013) manipulated the trait dimension of randomly generated faces to take specific values (i.e., −3, −2, −1, 0, 1, 2, and 3 SDs on each trait model). Such a procedure is inappropriate when validating gender-specific trait models that are inherently correlated with gender in raters' perception. For instance, male faces are perceived as more dominant than female faces, and female faces are perceived as more trustworthy than male faces (e.g., Sutherland et al., 2013; Studies 1a and 1b in the main text). Because of these correlations, the previous procedure would decrease gender-related differences between the male and female face sets, as it would project all the faces, regardless of their gender, onto the trait dimensions with the same values. As a result, we would essentially be generating less male-like male faces and less female-like female faces for the validation stimulus set.
Note that the parameters of the average male face and the parameters of the average female face used here were based on samples of actual male and female faces. That is, 3D laser scans of these male and female faces were used to construct the FaceGen statistical face space and extract the 100 face parameters. In the current project, by adding dimension values of −3 to +3 SDs to the original faces (rather than assigning the faces fixed values on the dimensions, as in the previous approach), we maintained the gender-related facial information in the male and female faces.
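The difference between assigning trait values and adding trait offsets can be made concrete. In the sketch below (toy vectors, not actual FaceGen parameters; the helper names are hypothetical), assignment collapses every face onto the same trait value, whereas adding an offset shifts each face relative to its own starting point and so preserves its gender-related position in the space.

```python
import numpy as np

rng = np.random.default_rng(7)
N_DIMS = 100
trait_model = rng.normal(0.0, 1.0, N_DIMS)
trait_model /= np.linalg.norm(trait_model)   # unit trait direction

face = rng.normal(0.0, 1.0, N_DIMS)          # e.g., a random female face

def assign_level(face, model, level):
    """Previous approach: remove the face's own trait component,
    then set it to `level`; all faces end at the same trait value."""
    residual = face - (face @ model) * model
    return residual + level * model

def add_offset(face, model, level):
    """Current approach: shift the face by `level` SDs along the trait
    direction, preserving its original position on that dimension."""
    return face + level * model

levels = [-3, -2, -1, 0, 1, 2, 3]
assigned = [assign_level(face, trait_model, s) for s in levels]
shifted = [add_offset(face, trait_model, s) for s in levels]
```

Projecting each manipulated face back onto the trait direction shows the contrast: the assigned faces score exactly −3 to +3, while the shifted faces score their original value plus −3 to +3.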
When cross-validating gender-specific impression models using ANOVAs, we calculated a generalized eta-squared (ηG 2 ) as the measure of the effect size of each effect (Olejnik & Algina, 2003) to account for the repeated measures design of the experiment.

Figure S1. The distribution of raters' responses to gender stereotype endorsement (GSE) questions in Study 1b. In every question, raters showed a bias away from the middle score (blue dotted line) in the direction consistent with gender stereotypes (ts > 7.02, Ps < .001). The larger purple dots (traits associated with women) and green dots (traits associated with men) denote the mean responses; the smaller black dots denote raw responses. All missing values were replaced using 10-nearest neighbor imputation.

For each gender, 300 faces were generated as variations of the gender-specific average face. The face shape and face reflectance were varied randomly.

Note. Numbers indicate pairwise Pearson coefficients between computational impression models (100 parameters each). Boldface indicates P < .001. Figure S5 visualizes the same information except for the bottom two rows in the table. The original trustworthiness and dominance models from Oosterhof & Todorov (2008) were built without taking face gender into account.

[Table: pairwise Pearson correlations between trait impression models; rows list each Trait Model, columns list the Male models and Female models]