PANDORA Talks: Personality and Demographics on Reddit

Personality and demographics are important variables in the social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset in three experiments, in which we leverage the more readily available data from other personality models to predict Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.


Introduction
Personality and demographics describe differences between people at the individual and group level. This makes them important for much of social sciences research, where they may be used as either target or control variables. One field that can greatly benefit from textual datasets with personality and demographic data is computational sociolinguistics (Nguyen et al., 2016), which uses NLP methods to study language use in society.
Conversely, personality and demographic data can be useful in the development of NLP systems. Recent advances in machine learning have brought significant improvements in the performance of NLP systems across many tasks, but these typically come at the cost of more complex and less interpretable models, often susceptible to biases (Chang et al., 2019). Biases are commonly caused by societal biases present in the data, and eliminating them requires a thorough understanding of the data used to train the model. One way to do this is to consider demographic and personality variables, as language use and interpretation are affected by both. Incorporating these variables into the design and analysis of NLP models can help interpret a model's decisions, avoid societal biases, and control for confounders.
The demographic variables of age, gender, and location have been widely used in computational sociolinguistics (Bamman et al., 2014; Peersman et al., 2011; Eisenstein et al., 2010), while in NLP there is ample work on predicting these variables or using them in other NLP tasks. In contrast, advances in text-based personality research are lagging behind. This can be traced to the fact that (1) personality-labeled datasets are scarce and (2) personality labels are much harder to infer from text than demographic variables such as age and gender. In addition, the few existing datasets have serious limitations: a small number of authors or comments, comments of limited length, non-anonymity, or topic bias. While most of these limitations have been addressed by the recently published MBTI9k Reddit dataset (Gjurković and Šnajder, 2018), this dataset still has two deficiencies. Firstly, it uses the Myers-Briggs Type Indicator (MBTI) model (Myers et al., 1990), which, while popular among the general public and in business, is discredited by most personality psychologists (Barbuto Jr, 1997). The alternative is the well-known Five Factor Model (or Big 5) (McCrae and John, 1992), which, however, is less popular, and thus labels for it are harder to obtain. The other deficiency of MBTI9k is the lack of demographics, which limits model interpretability and its use in sociolinguistics.
Our work seeks to address these problems by introducing a new dataset, Personality ANd Demographics Of Reddit Authors (PANDORA), the first dataset from Reddit labeled with personality and demographic data. PANDORA comprises over 17M comments written by more than 10k Reddit users, labeled with Big 5 and/or two other personality models (MBTI, Enneagram), alongside age, gender, location, and language. In particular, Big 5 labels are available for more than 1.6k users, who jointly produced more than 3M comments.
PANDORA provides exciting opportunities for sociolinguistic research and the development of NLP models. In this paper we showcase its usefulness through three experiments. In the first, inspired by work on domain adaptation and multitask learning, we show how the MBTI and Enneagram labels can be used to predict the labels of the well-established Big 5 model. We leverage the fact that more data is available for MBTI and Enneagram, and exploit the correlations between the traits of the different models and their manifestations in text. In the second experiment we demonstrate how a complete psycho-demographic profile can help pinpoint biases in gender classification. We show that a gender classifier trained on a large Reddit dataset fails to predict gender for users with certain combinations of personality traits more often than for other users. Finally, the third experiment showcases the usefulness of PANDORA for the social sciences: building on existing theories from psychology, we perform a confirmatory and exploratory analysis of the relationship between propensity for philosophy and certain psycho-demographic variables.
We also report baselines for personality and demographics prediction on PANDORA. We treat Big 5 and the other personality and demographic variables as targets for supervised machine learning, and evaluate a number of benchmark models with different feature sets. We make PANDORA available 1 to the research community, in the hope that it will stimulate further research.

Background and Related Work
Personality models and assessment. The Myers-Briggs Type Indicator (MBTI; Myers et al., 1990) and the Five Factor Model (FFM; McCrae and John, 1992) are the two most commonly used personality models. MBTI categorizes people into 16 personality types defined by four dichotomies: Introversion/Extraversion (way of gaining energy), Sensing/iNtuition (way of gathering information), Thinking/Feeling (way of making decisions), and Judging/Perceiving (preferences in interacting with others). The main criticism of MBTI focuses on its low validity (Bess and Harvey, 2002; McCrae and Costa, 1989). In contrast to MBTI, the FFM (McCrae and John, 1992) takes a dimensional approach to personality and describes people as positioned on the continuum of five personality traits (Big 5): Extraversion (outgoingness), Agreeableness (care for social harmony), Conscientiousness (orderliness and self-discipline), Neuroticism (tendency to experience distress), and Openness to Experience (appreciation for art and intellectual stimuli). Big 5 personality traits are generally assessed using inventories, i.e., personality tests. 2 Moreover, personality has been shown to relate to some demographic variables, including gender (Schmitt et al., 2008), age (Soto et al., 2011), and location (Schmitt et al., 2007). Results show that females score higher than males on agreeableness, extraversion, conscientiousness, and neuroticism (Schmitt et al., 2008), and that the expression of all five traits subtly changes over the lifetime (Soto et al., 2011). There is also evidence of correlations between MBTI and FFM (Furnham, 1996; McCrae and Costa, 1989).
After the MyPersonality dataset (Kosinski et al., 2015) became unavailable to the research community, subsequent research had to rely on the few smaller datasets based on essays (Pennebaker and King, 1999), personality forums, 3 Twitter (Plank and Hovy, 2015; Verhoeven et al., 2016), and a small portion of the MyPersonality dataset used in the PAN workshops (Celli et al., 2013, 2014; Rangel et al., 2015).
To the best of our knowledge, the only work that attempted to compare prediction models for both MBTI and Big 5 is that of Celli and Lepri (2018), carried out on Twitter data. However, they did not leverage the MBTI labels in the prediction of Big 5 traits, as their dataset contained no users labeled with both personality models.
As most recent personality prediction models are based on deep learning (Majumder et al., 2017; Xue et al., 2018; Rissola et al., 2019; Wu et al., 2020; Lynn et al., 2020; Vu et al., 2020), large-scale multi-labeled datasets such as PANDORA can be used to develop new architectures and minimize the risk of models overfitting to spurious correlations. User Factor Adaptation. Another important line of research that would benefit from datasets like PANDORA is debiasing based on demographic data (Liu et al., 2017; Zhang et al., 2018; Pryzant et al., 2018; Elazar and Goldberg, 2018; Huang and Paul, 2019). Current research focuses on demographics, with the exception of the work of Lynn et al. (2017), who use personality traits, albeit predicted ones. Different social media sites attract different types of users, and we expect more research of this kind on Reddit, especially considering that Reddit is the source of data for many studies on mental health (De Choudhury et al., 2016; Yates et al., 2017; Sekulic et al., 2018; Cohan et al., 2018; Turcan and McKeown, 2019; Sekulic and Strube, 2019).

PANDORA Dataset
Reddit is one of the most popular websites worldwide. Its users, Redditors, spend most of their online time on the site and generate more page views than users of other websites. This, along with the fact that users are anonymous and that the website is organized into more than a million different topics (subreddits), makes Reddit suitable for various kinds of sociolinguistic studies. To compile their MBTI9k Reddit dataset, Gjurković and Šnajder (2018) used the Pushshift Reddit dataset (Baumgartner et al., 2020) to retrieve comments dating back to 2015. We adopt MBTI9k as the starting point for PANDORA.
Ethical Research Statement. We follow the ethics code for psychological research, by which researchers may dispense with the informed consent of each participant for archival research, provided that disclosure of responses would not place participants at risk of criminal or civil liability or damage their financial standing, employability, or reputation, and that confidentiality is protected. As per the Reddit User Agreement, users agree not to disclose sensitive information about other users, and they consent that their comments are publicly available and exposed through an API to other services. Users may request to have their content removed; we have taken this into account by removing such content, and future requests will be treated in the same way and escalated to Reddit. Our study has been approved by an academic IRB.

MBTI and Enneagram Labels
Gjurković and Šnajder (2018) relied on flairs to extract the MBTI labels. Flairs are short descriptions with which users introduce themselves on various subreddits; on MBTI-related subreddits they typically report MBTI test results. Because MBTI labels are easily identifiable, they used regular expressions to obtain the labels from flairs (and occasionally from comments). We use their labels for PANDORA, but additionally manually label Enneagram types, which users also typically report in their flairs. In total, 9,084 users reported their MBTI type in a flair, and 793 additionally reported their Enneagram type. Table 1 shows the distribution of MBTI types and dimensions (we omit Enneagram due to space constraints).

Big 5 Labels
Obtaining Big 5 labels turned out to be more challenging. Unlike MBTI and Enneagram tests, Big 5 tests result in a score for each of the five traits. Moreover, the score format itself is not standardized: scores are reported in various formats, and typically not in flairs but in comments replying to posts that mention a specific online test. Normalization of scores poses a series of challenges. Firstly, different websites use different inventories (e.g., HEXACO, NEO PI-R, Aspect-scale), some of which are publicly available while others are proprietary. The different tests use different names for traits (e.g., emotional stability as the opposite of neuroticism) or use abbreviations (e.g., OCEAN, where O stands for openness, etc.). Secondly, test scores may be reported as raw scores, percentages, or percentiles. Percentiles may be calculated based on the distribution of users who took the test or on the distribution of specific groups of offline test-takers (e.g., students), in the latter case commonly adjusted for age and gender. Moreover, scores can be either numeric or descriptive, the former being in different ranges (e.g., -100 to 100, 0 to 100, 1 to 5) and the latter being different for each test (e.g., the descriptions typical and average may map to the same underlying score). On top of this, users may decide to copy-paste the results, describe them in their own words (e.g., rock-bottom for a low score), often misspelling the names of the traits, or combine both. Lastly, in some cases the results do not even come from inventory-based assessments but from text-based personality prediction services (e.g., Apply Magic Sauce and Watson Personality).
Extraction. The fact that Big 5 scores are reported in full-text comments rather than flairs, and that their form is not standardized, makes it difficult to extract the scores fully automatically. Instead, we opted for a semiautomatic approach as follows. First, we retrieved candidate comments containing the three traits most likely to be spelled correctly (agreeableness, openness, and extraversion). For each comment, we retrieved the corresponding post and determined which test it refers to based on the link provided, if a link was present. We first discarded all comments referring to text-based prediction services, and then used a set of regular expressions specific to the report format of each test to extract personality scores from the comment. Next, we manually verified all the extracted scores and the associated comments to ensure that the comments indeed refer to a Big 5 test report and that the scores had been extracted correctly. For about 80% of the reports the scores were extracted correctly, while for the remaining 20% we extracted the scores manually. This resulted in Big 5 scores for 1,008 users, covering reports from 12 different tests. Left out of this procedure were the comments for which the test is unknown, as they were replying to posts without a link to the test. To also extract scores from these reports, we trained a test identification classifier on the reports of the 1,008 users, using character n-grams as features, reaching an F1-macro score of 81.4% on held-out test data. We used this classifier to identify the tests referred to in the remaining comments and repeated the score extraction procedure. This yielded scores for an additional 600 users, for a total of 1,608 users.
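As an illustration of the test-identification step, the sketch below trains a character n-gram classifier to guess which inventory a score-report comment refers to. The report texts and labels are invented toy stand-ins, but the feature setup (Tf-Idf character n-grams with logistic regression) mirrors the one described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for verified score-report comments (the real classifier
# was trained on the reports of 1,008 users).
reports = [
    "My HEXACO results: honesty 62, emotionality 48, openness 71",
    "HEXACO says I am high on honesty-humility and low on emotionality",
    "NEO PI-R percentiles: N 30, E 55, O 80, A 45, C 60",
    "Took the NEO PI-R again, neuroticism dropped to the 25th percentile",
]
tests = ["hexaco", "hexaco", "neo", "neo"]

# Character 2-5-grams capture test names and report phrasing even when
# trait names are misspelled.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(reports, tests)

pred = clf.predict(["HEXACO report: openness 70, honesty 55"])[0]
```

Once the test is identified, the test-specific regular expressions can be applied to pull out the five scores.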
Normalization. To normalize the extracted scores, we first heuristically mapped the score descriptions of the various tests to numeric values in the 0-100 range in increments of 10. As mentioned, scores may refer to raw scores, percentiles, or descriptions. Both percentiles and raw scores are mostly reported on the same 0-100 scale, so we rely on information about which test was used to interpret a score correctly. Finally, we convert raw scores and percentages reported by Truity 4 and HEXACO 5 to percentiles based on score distribution parameters: HEXACO reports its distribution parameters publicly, while Truity provided us with the parameters of the distribution of their test-takers.
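Assuming the published parameters describe an approximately normal score distribution, the raw-score-to-percentile conversion can be sketched as follows. The mean and standard deviation below are illustrative values, not the actual HEXACO or Truity parameters.

```python
import math

def raw_to_percentile(raw: float, mean: float, sd: float) -> float:
    """Percentile (0-100) of `raw` under a normal N(mean, sd) distribution."""
    z = (raw - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Descriptive labels are first mapped heuristically to the 0-100 scale
# in increments of 10, e.g.:
DESCRIPTION_TO_SCORE = {"very low": 10, "low": 30, "average": 50,
                        "high": 70, "very high": 90}

# A raw score of 4.1 on a 1-5 scale, with illustrative parameters.
pct = raw_to_percentile(raw=4.1, mean=3.5, sd=0.6)
```

A score exactly at the mean maps to the 50th percentile, and one standard deviation above it to roughly the 84th.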
Finally, for all users with Big 5 labels, we retrieved all their comments from the year 2015 onward and added these to the MBTI dataset from §3.1. The resulting dataset consists of 17,640,062 comments written by 10,288 users. There are 393 users labeled with both Big 5 and MBTI.

Demographic Labels
To obtain age, gender, and location labels, we again turn to the textual descriptions provided in flairs. For each of the 10,288 users, we collected all the distinct flairs from all their comments in the dataset, and then manually inspected these flairs for age, gender, and location information. For users who reported their age in two or more flairs at different time points, we take the age from the most recent one. Additionally, we extract comment-level self-reports of users' age (e.g., I'm 18 years old) and gender (e.g., I'm female/male). As for location, users report it at different levels, mostly countries, states, and cities, but also continents and regions. We normalize location names, and map countries to country codes, countries to continents, and states to regions. Most users are from English-speaking countries, and are regionally evenly distributed across the US and Canada (cf. Appendix). Table 1 shows the average number per user. Lastly, Table 2 gives intersection counts between personality models and other demographic variables.
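The comment-level self-report extraction can be sketched with simple regular expressions. The two patterns below are simplified illustrations of the examples above, not the full pattern set used to build the dataset.

```python
import re

# Simplified illustrative patterns for self-reports such as
# "I'm 18 years old" and "I'm female/male".
AGE_RE = re.compile(r"\bI'?m (\d{2}) years? old\b", re.IGNORECASE)
GENDER_RE = re.compile(r"\bI'?m (?:a )?(female|male)\b", re.IGNORECASE)

def extract_self_reports(comment: str) -> dict:
    """Return any age/gender self-reports found in a comment."""
    out = {}
    if m := AGE_RE.search(comment):
        out["age"] = int(m.group(1))
    if m := GENDER_RE.search(comment):
        out["gender"] = m.group(1).lower()
    return out

info = extract_self_reports("I'm 18 years old and I'm female, btw.")
```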

Analysis
Table 1 and Figure 1 show the distributions of Big 5 scores per trait. 6 We observe that the average user in our dataset is average on neuroticism, more open, and less extraverted, agreeable, and conscientious. Furthermore, males are on average younger, less agreeable, and less neurotic than females. Similarly, Table 1 shows that MBTI users have a preference for introversion, intuition (i.e., openness), thinking (i.e., lower agreeableness), and perceiving (i.e., lower conscientiousness). This is not surprising if we look at Table 3, which shows high correlations between particular MBTI dimensions or Enneagram types and Big 5 traits. The correlations between Big 5 and MBTI follow the same pattern as those reported in existing psychological research (McCrae and Costa, 1989).

Experiments
Coupling linguistic data with psycho-demographic profiles sets the stage for many interesting research questions.

6 As noted by Gjurković and Šnajder (2018), due to the various selection biases involved, our dataset may not be representative of Reddit users, and it is certainly not representative of internet users or the general population.

Predicting Big 5 with MBTI/Enneagram
MBTI and Enneagram are considerably more popular than Big 5 among social media users. This makes it relatively easy to obtain MBTI and Enneagram labels ( §3.1) and develop well-performing prediction models using supervised machine learning. On the other hand, the validity of MBTI and Enneagram has been severely criticized (Barbuto Jr, 1997; Thyer, 2015), which is why they are virtually unused in psychological research. This experiment investigates whether we can combine the best of both worlds: leverage the more abundant MBTI/Enneagram labels in PANDORA to predict Big 5 traits from text. We hypothesize that the questionable psychological validity of MBTI/Enneagram labels can be compensated for by their number. We base this on the moderate to strong correlations observed between the personality models (Table 3) and the presence of a considerable number of users with multiple labels (Table 2). We frame the experiment as a domain adaptation task of transferring MBTI/Enneagram labels to Big 5 labels, using a simple domain adaptation approach from Daumé III (2007) (cf. Appendix for more details). We train four text-based MBTI classifiers on the subset of PANDORA users for which we have MBTI labels but no Big 5 labels. We then apply these classifiers to the subset of PANDORA users for which we have both MBTI and Big 5 labels, obtaining a type-level accuracy of MBTI prediction of 45%. Table 4 shows the correlations between gold Big 5 labels and predicted MBTI labels (cf. Appendix for Enneagram correlations). As expected, we observe lower overall correlations in comparison with the correlations on gold labels (Table 3). The main observable difference is that extraversion is now moderately correlated with the predicted MBTI intuitive dimension. As the majority of Big 5 traits significantly correlate with more than one MBTI dimension, we use these scores as features for training five regression models, one for each Big 5 trait.
Lastly, we apply both classifiers to the subset of PANDORA users for which we have Big 5 labels but no MBTI labels (serving as the domain adaptation target set). We use the MBTI classifiers to obtain scores for the four MBTI dimensions, and then feed these to the Big 5 models to obtain predictions for the five traits. The resulting correlations (Table 5) clearly indicate that MBTI-based predictions help in predicting Big 5 traits. Furthermore, the results justify the use of regression models, as predicted Big 5 traits are more correlated with gold Big 5 traits than predicted MBTI dimensions are, with the exception of conscientiousness, which is significantly correlated with the perceiving/judging MBTI dimension. For instance, predicted openness is a better predictor of openness than the intuitive dimension.
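The overall transfer setup can be sketched as follows, with random synthetic data standing in for the text-derived MBTI classifier probabilities and the gold Big 5 scores. One Ridge regressor is trained per Big 5 trait, matching the setup in Appendix C; the data shapes echo the subsets described there.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-in for step 2: per-user probabilities from the four MBTI
# dimension classifiers (I/E, S/N, T/F, J/P) on users with both labels.
mbti_probs = rng.random((382, 4))
# Synthetic gold Big 5 scores, constructed to correlate with the probs.
weights = rng.normal(size=(4, 5))
big5_gold = mbti_probs @ weights + 0.1 * rng.normal(size=(382, 5))

# One Ridge regression model per Big 5 trait.
trait_models = [Ridge().fit(mbti_probs, big5_gold[:, t]) for t in range(5)]

# Step 3: apply to target users who have Big 5 but no MBTI labels.
target_probs = rng.random((10, 4))
big5_pred = np.column_stack([m.predict(target_probs) for m in trait_models])
```

Because each trait correlates with several MBTI dimensions, the regression step can combine all four dimension scores rather than relying on the single most correlated one.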

Gender Classification Bias
Gender classification from text is a fundamental task in author profiling, and author profiling on social media in particular has recently received a lot of attention from the NLP community (Bamman et al., 2014; Sap et al., 2014; Ciot et al., 2013). Additionally, gender is often in the spotlight of research on fairness and bias in NLP (Sun et al., 2019). Biases are often introduced by demographic and other imbalances in training data. Here we look at the personality profile as a potential source of bias, and set out to investigate whether a simple gender classification model trained on Reddit exhibits biases that could be traced back to personality traits. This is an important issue, given that Reddit is often used as a source of data for training NLP models (Zhang et al., 2017; Cheng et al., 2017; Henderson et al., 2019; Sekulic and Strube, 2019).
To build a gender classifier, we retrieve a separate Reddit dataset and label it automatically for gender. To this end, we again rely on flairs, using the strings "/f/" and "/m/" as female and male gender indicators, respectively. 7 This method yields 98.5% precision on PANDORA. From the 34k users who used these patterns in their flairs, we sampled a balanced dataset of 24,954 users and retrieved over 30M of their comments, removing quoted text and all comments shorter than five words. Next, we aggregated the comments per user and divided the users into an 80%-20% train-test split. For classification, we use logistic regression with 500-dimensional SVD vectors derived from Tf-Idf word n-grams. The test accuracy of the classifier was 89.9%, and its accuracy on the 3,084 users from PANDORA with known gender was 89.3%. We now turn to bias analysis. On PANDORA, the classifier failed to predict the correct gender for 8.1% of male (142/1743) and 14.4% of female (192/1331) users. As this is a statistically significant difference (p<0.05 with a two-proportion Z-test), we conclude that the classifier is biased. To investigate this further, we divide male and female users into those for which the predictions were correct and those for which they were incorrect. We then test for statistically significant differences (using the two-proportion Z-test for binary variables and the Kruskal-Wallis H-test for continuous variables) in psycho-demographic variables between correctly and incorrectly classified cases for both groups. Results are shown in Table 6. Differences are statistically significant for the thinking and perceiving MBTI dimensions for both females and males, for the extraversion Big 5 trait for males, and for age for females. A thinking and perceiving preference makes females more likely to be misclassified as males, and the reverse holds for males. Furthermore, the gender of more extraverted males is more likely to be misclassified.
When it comes to age, younger females are more often in the misclassified group. These findings clearly indicate that a complete psycho-demographic profile is a useful tool for bias analysis of machine learning models trained on social media text.
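The significance check on the error rates can be sketched as a standard two-proportion Z-test on the counts reported above (142/1,743 misclassified males vs. 192/1,331 misclassified females), with the normal-CDF p-value computed via the error function.

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Z statistic and two-sided p-value for H0: p1 == p2 (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Male vs. female misclassification rates from the bias analysis above.
z, p = two_proportion_z(142, 1743, 192, 1331)
```

With these counts the female error rate is significantly higher, so p falls well below the 0.05 threshold used in the paper.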

Propensity for Philosophy
Our last experiment investigates the usefulness of PANDORA for research in the social sciences. One obvious type of use case is confirmatory studies, which aim to replicate existing theories and findings on a dataset obtained in a manner different from typical datasets in the field. Another type is exploratory studies that seek to identify new relations between psycho-demographic variables manifested in online talk. Here we present a use case of both types. We focus on the propensity for philosophy of Reddit users (manifested as a propensity for philosophical topics in online discussions), and seek to confirm its hypothesized positive relationship with openness to experience (Johnson, 2014; Dollinger et al., 1996), cognitive processing (e.g., insight), and readability index. We expect this to be confirmed since all four variables share a proneness to higher intellectual engagement. For the exploratory analysis, we extend our analysis to emotion variables.
We conducted the analysis using hierarchical regression with propensity for philosophical topics as the criterion variable and demographics, personality, emotions, cognitive processing, and text readability as predictors. As a measure of propensity for philosophical topics, we compute the "philosophy" feature (frequency of philosophical words) from Empath (Fast et al., 2016) for each user's comments. Similarly, for the predictors we compute the posemo, negemo, and insight features from LIWC (Pennebaker et al., 2015) and the Flesch-Kincaid Grade Level (F-K GL) readability score (Kincaid et al., 1975). 8 The emotion variables are included for the exploratory analysis. In the hierarchical regression analysis, demographics were added as control variables in the first step, Big 5 traits in the second step, emotion variables in the third step, and finally the insight feature (as a cognitive inclination variable) and the F-K GL readability index in the last step. The sample comprises 430 Reddit users, 273 males and 157 females, with a mean age of 26.79 (SD=7.954), all of whom had gold labels for gender, age, and Big 5.
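A hierarchical regression of this kind can be sketched as follows, with synthetic stand-ins for the actual predictors. The point of the sketch is the block-wise entry of predictors and the per-step increase in explained variance (the R-squared increments); the block names follow the steps described above.

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 430                                # sample size reported above
demo = rng.normal(size=(n, 2))         # step 1: age, gender (stand-ins)
big5 = rng.normal(size=(n, 5))         # step 2: Big 5 traits
emo = rng.normal(size=(n, 2))          # step 3: posemo, negemo
cog = rng.normal(size=(n, 2))          # step 4: insight, F-K GL
# Synthetic criterion: driven by one trait and one cognitive variable.
y = big5[:, 4] + 0.5 * cog[:, 0] + rng.normal(size=n)

blocks, r2_steps = [], []
for block in (demo, big5, emo, cog):   # predictors entered block by block
    blocks.append(block)
    r2_steps.append(r_squared(np.hstack(blocks), y))
delta_r2 = np.diff([0.0] + r2_steps)   # variance explained at each step
```

In-sample R-squared can only grow as blocks are added, so the increments show how much each block contributes beyond the previous ones.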
The analysis yields interesting results. 9 Firstly, as much as 41% of the variance in the "philosophy" feature is explained by the 11 predictors. Secondly, openness to experience, the readability index, and insight are, as expected, all significant positive predictors of the "philosophy" feature. Agreeableness was a significant negative predictor before the emotion variables were added. This is not surprising, as people low in agreeableness are less likely to pander to others, and agreeableness correlates significantly with both positive (.20) and negative emotions (-.13). Thirdly, the results imply alluring associations with the emotion variables. Negative emotions were clearly positive predictors of the frequency of discussing philosophical topics. Positive emotions, however, were a significant predictor only until the last step, when F-K GL was added to the model; this was due to a moderate correlation between posemo and F-K GL (-0.40). Lastly, males used words related to philosophy more frequently than females. To sum up, the hypothesis is confirmed, and the exploratory analysis yields interesting results that could motivate further research.

[Table: regression coefficients of the predictors at Steps 1-5]

Prediction Models
In this section we describe baseline models for predicting personality and demographic variables from user comments in PANDORA.
We consider the following sets of features: (1) N-grams: Tf-Idf-weighted 1-3 word n-grams and 2-5 character n-grams; (2) Stylistic: the counts of words, characters, syllables, mono- and polysyllabic words, long words, and unique words, as well as all readability metrics implemented in Textacy 10 ; (3) Dictionaries: Tf-Idf-weighted counts of categories from the LIWC (Pennebaker et al., 2015), Empath (Fast et al., 2016), and NRC Emotion Lexicon (Mohammad and Turney, 2013) dictionaries; (4) Gender: predictions of the gender classifier from §4.2; (5) Subreddit distributions: a matrix in which each row is the distribution of post counts across all subreddits for a particular user, reduced using PCA to 50 features per user; (6) Subreddit other: the counts of downs, score, gilded, and ups, as well as the controversiality score of a comment; (7) Named entities: the number of named entities per comment, as extracted using spaCy; 11 (8) Part-of-speech: counts for each part of speech; (9) Predictions (only for predicting Big 5 traits): MBTI/Enneagram predictions obtained by a classifier built on held-out data. Features (2), (4), and (6-9) are calculated at the level of individual comments and aggregated into the min, max, mean, standard deviation, and median values for each user.
We build six regression models (age and the Big 5 traits) and eight classification models (four MBTI dimensions, gender, region, Enneagram). We experiment with linear/logistic regression (LR) from sklearn (Pedregosa et al., 2011) and deep learning models (NN). We trained a separate NN model for each task. In each model, a single user is represented as a matrix whose rows represent the user's comments. The comments were encoded as 1024-dimensional vectors obtained from BERT (Devlin et al., 2019). The BERT comment vectors are fed into convolution layers, max pooling, and several fully connected layers. Hyperparameters and additional information can be found in the Appendix.
We evaluate the models using 5-fold cross-validation with a separate stratified split for each target. We use regression F-tests to select the top-K features, and optimize model hyperparameters and K on held-out data for each fold separately.
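As an illustration, the linear baseline (top-K feature selection by F-test, followed by a linear model, evaluated with 5-fold cross-validation) can be sketched on synthetic data. `f_classif` is used here because the toy target is binary; regression targets would use `f_regression`, as in the paper.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for (users x features)
# Binary toy target driven by the first two features plus noise.
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Selection is done inside the pipeline, so each CV fold selects its
# own top-K features, avoiding leakage from the test fold.
model = make_pipeline(
    SelectKBest(f_classif, k=10),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
mean_f1 = scores.mean()
```

In practice, K and the model hyperparameters would additionally be tuned on held-out data within each fold, as described above.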
Results are shown in Table 8. LR performs best when using only the n-gram features. The exception is the Big 5 trait predictions, which benefit considerably from adding the MBTI/Enneagram predictions as features, building on Section 4.1 and Table 3. Also, using 1000 comments rather than the last 100 (as in the NN) increased scores by up to 5 points.

Conclusion
The PANDORA dataset comprises 17M comments and personality and demographic labels for over 10k Reddit users, including 1.6k users with Big 5 labels. To our knowledge, this is the first Reddit dataset with Big 5 traits, and also the first covering multiple personality models (Big 5, MBTI, Enneagram). We showcased the usefulness of PANDORA with three experiments, showing (1) how the more readily available MBTI/Enneagram labels can be used to estimate Big 5 traits, (2) that a gender classifier trained on Reddit exhibits bias for users with certain personality traits, and (3) that certain psycho-demographic variables are good predictors of the propensity for philosophy of Reddit users. We also trained and evaluated benchmark prediction models for all psycho-demographic variables. The poor performance of the deep learning baseline models, the rich set of labels, and the large number of comments per user in PANDORA suggest that further efforts should be directed toward efficient user representations and more advanced deep learning architectures.

C Predicting Big 5 with MBTI/Enneagram
Here we describe in more detail the setup for predicting Big 5 labels using MBTI/Enneagram labels. We frame the experiment as a domain adaptation task of transferring MBTI/Enneagram labels to Big 5 labels, and use one of the simplest domain adaptation approaches, in which source classifier (MBTI) predictions are used as features and linearly interpolated on a development set containing both MBTI and Big 5 labels to make predictions on the Big 5 target set (cf. the PRED and LININT baselines from Daumé III (2007)). We first partition PANDORA into three subsets: comments of users for which we have both MBTI and Big 5 labels (M+B+, n=382), comments of users for which we have MBTI but no Big 5 labels (M+B-, n=8,691), and comments of users for which we have Big 5 but no MBTI labels (M-B+, n=1,588). We then proceed in three steps. In the first step, we train on M+B- four text-based MBTI classifiers, one for each MBTI dimension (logistic regression, optimized with 5-fold CV, using 7,000 filter-selected, Tf-Idf-weighted 1-5 word and character n-grams as features).
In the second step, we use the text-based MBTI classifiers to obtain MBTI labels on M+B+ (serving as the domain adaptation source set), observing a type-level accuracy of 45% (82.4% for one-off prediction). The classifiers output probabilities, which can be interpreted as scores on the corresponding MBTI dimensions. As the majority of Big 5 traits significantly correlate with more than one MBTI dimension, we use these scores as features for training five regression models, one for each Big 5 trait (Ridge regression optimized with 5-fold CV). Additionally, we performed a correlation analysis between Enneagram types and Big 5 traits; the results are shown in Table 14.
In the third step, we apply both sets of models to M-B+ (serving as the domain adaptation target set): we first use the MBTI classifiers to obtain scores for the four MBTI dimensions, and then feed these to the Big 5 regression models to obtain predictions for the five traits.

D Parameters of the DL Model
The models consist of three parts: a convolutional layer, a max-pooling layer, and several fully connected (FC) layers. Convolutional kernels are as wide as BERT's representation and slide vertically over the matrix to aggregate information from several consecutive comments. We tried different kernel sizes varying from 2 to 6, and different numbers of kernels M varying from 4 to 6. The outputs of the convolutional layer are first sliced into a fixed number K of slices and then subject to max pooling. This results in M vectors of length K per user, one for each kernel, which are passed to several FC layers with Leaky ReLU activations. Regularization (L2-norm and dropout) is applied only to the FC layers. Figures 2 and 3 show the learning curves for the logistic regression model with 1-gram features: the x-axis is the number of comments and the y-axis is the model's F1-macro score. Performance plateaus at around 1000 comments, with little significant change when the number of comments used is increased beyond that amount.
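The slice-then-max-pool step, which turns a variable-length stack of per-comment convolution outputs into a fixed-size M x K representation per user, can be sketched in NumPy. This is an interpretation of the description above; the shapes and the even slicing scheme are illustrative assumptions.

```python
import numpy as np

def slice_max_pool(conv_out: np.ndarray, k: int) -> np.ndarray:
    """conv_out: (M kernels, T positions) -> (M, k) via per-slice max.

    The T convolution positions are split into k roughly equal slices,
    and each slice is max-pooled, so users with different comment counts
    yield the same output shape.
    """
    m, t = conv_out.shape
    bounds = np.linspace(0, t, k + 1).astype(int)
    return np.stack(
        [conv_out[:, bounds[i]:bounds[i + 1]].max(axis=1) for i in range(k)],
        axis=1,
    )

rng = np.random.default_rng(0)
user_a = slice_max_pool(rng.normal(size=(4, 97)), k=8)   # 97 positions
user_b = slice_max_pool(rng.normal(size=(4, 213)), k=8)  # different length
```

Both users end up with the same M x K representation, which is what allows the fixed-size FC layers that follow.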