Guided by our modelling framework, our data analysis proceeded as follows. First, to describe the reliability of the mapping, we examined the variability in the mapping estimates within and across different corpora. Then, to address the specificity of the mapping, we performed a contrastive model comparison to identify which model best fits the data while penalizing overly complex models. Finally, we identified which levels of analysis contribute the most to the prediction of the model and supported the findings with a correlation and variability analysis.
Verifying the Bayesian inference models
Prior to the main analysis, we showed that our Bayesian multinomial logistic regression models perform comparably in the classification task to support vector machines (SVMs), which have been extensively used in emotion classification from audio25 (see the Methods for the hyperparameters used). Emotion classification performance is often expressed as unweighted average recall (UAR)26, which is the mean of the per-category recalls and therefore weights every emotion category equally, regardless of slight imbalances in the base rate of the categories. Using fourfold leave-speaker-out cross-validation, we showed that the SVM obtains a UAR score similar to that of the Bayesian regression model (25.5% and 22.7% UAR, respectively; Bayesian estimation of the mean paired difference, −4%; 89% credible interval, −12% to 4%), indicating that the Bayesian multinomial logistic regression performs comparably to a common baseline. Here we evaluated model prediction; in the main analysis, however, we use the Bayesian logistic regressions as inferential models. Thus, the objective is not to optimize model prediction for unseen data but rather to explore what the models have learned.
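As an illustration of the metric, the following minimal sketch computes UAR as the mean of the per-category recalls and contrasts it with plain accuracy. The function name and the toy labels are hypothetical and are not taken from the analysis code of this study.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    hits = defaultdict(int)    # correctly predicted items per true class
    totals = defaultdict(int)  # number of items per true class
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels (made up): four 'anger' items and two 'sadness' items.
y_true = ["anger", "anger", "anger", "anger", "sadness", "sadness"]
y_pred = ["anger", "anger", "anger", "sadness", "sadness", "anger"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case, the recall is 0.75 for 'anger' and 0.5 for 'sadness', so the UAR lies below the accuracy, which is dominated by the larger category.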
High reliability within corpora and poor reliability across corpora
We next fit a model that estimates a coefficient for each of the seven acoustic factors across the six emotions (Fig. 2a). On top of this ‘global mapping’, we computed a corpus-specific deviation from this coefficient (Fig. 2b). In doing so, we measured the variability of the mapping within a corpus and across corpora. The estimates are depicted in Fig. 2c. The variability within a corpus is characterized by the spread of the distribution of estimates. Wide distributions indicate more variability for the given estimate in a corpus (smaller dots indicate greater variability in Fig. 2b,c). Variability across corpora can be described by the overlap in the estimated distributions across corpora. If there is a poor overlap of the distributions, then there is a great deal of variability across corpora.
While the estimated emotion coefficients across corpora mostly match with empirical predictions from two reviews on emotion-specific acoustic profiles16,27 (Fig. 2a), there are some disagreements—for example, happiness is predicted to have a higher speech rate and sadness to have a lower pitch. Such differences are to be expected because the factor scores do not relate one-to-one to the raw acoustic features, and there is a large spread in the coefficients estimated for the different corpora (Fig. 2c). This variability across corpora is even more striking, as shrinkage in multilevel models pulls observations from small corpora or extreme observations closer to the grand mean.
In Fig. 2d, we zoom in on a single factor (RC2, loudness, for anger) and can see that the estimates for the coefficients are rather tight (that is, the distribution of estimates is narrow). This implies that the mapping of a certain acoustic factor to an emotion label is consistent within a corpus. However, across corpora, we can observe that the credible intervals of the distributions are only partially overlapping, which means that the estimates from one corpus to another often differ. If the mapping between acoustic features and emotion labels were identical across corpora, we would expect a greater degree of overlap. Note that high variability does not imply low emotion recognition but is merely a justification to use moderators in the analysis. Given the observed variability in the estimates across corpora, the next step is to investigate the origin of the variability.
The objective here is to show the convergence of evidence (or the lack thereof) across studies. In meta-analyses, each study is treated as an individual sample with its effect size and standard error. Some degree of variation across studies is expected due to minor sampling differences in the population, which should be smaller for larger sample sizes. Measuring the amount of heterogeneity among studies is key to the question of convergence, as large variability might indicate that studies measure distinct concepts or that moderators need to be included. We borrow the I² metric from meta-analysis, which describes the proportion of total variation in study estimates due to heterogeneity28 (see the Methods for the details). Here we compute I² separately for each factor and emotion and treat the estimates from single corpora as separate studies. The I² values are shown on the right of each subplot in Fig. 2c. The analysis confirms that there is a great deal of variability in estimates across corpora and that this variance is larger than what would be expected on the basis of sampling variance alone.
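For readers unfamiliar with the metric, the sketch below computes I² in its classical form, based on Cochran's Q with inverse-variance weights. This is an assumption for illustration: the exact computation used in this study is described in the Methods and may differ, and the function name and the toy estimates are hypothetical.

```python
def i_squared(estimates, std_errors):
    """Classical I^2: share of total variation attributed to heterogeneity."""
    # Inverse-variance weights: precise studies count more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0.0:
        return 0.0
    # Q is expected to be about df under sampling variance alone.
    return max(0.0, (q - df) / q)


# Made-up coefficient estimates from three corpora with equal standard errors.
heterogeneous = i_squared([0.2, 0.8, 1.4], [0.1, 0.1, 0.1])
homogeneous = i_squared([0.79, 0.80, 0.81], [0.1, 0.1, 0.1])
```

With these toy values, the widely spread estimates yield an I² close to 1, whereas the tightly clustered estimates yield 0 because their spread is smaller than expected from sampling variance alone.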
Models only assuming a global mapping are outperformed
Given that estimates across corpora are heterogeneous, we ran a series of models accounting for different moderators. Every model estimates a separate intercept for each corpus to account for possible imbalances in the base rate of emotions across corpora. Models are compared to each other using the widely applicable information criterion (WAIC), which provides an approximation of the out-of-sample deviance while penalizing overly complex models, which tend to overfit the data (Supplementary Methods 4). What matters is therefore the relative WAIC difference between competing models; lower WAIC values indicate a better model fit.
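To make the criterion concrete, the sketch below computes WAIC on the deviance scale from a matrix of pointwise log-likelihoods, following its standard definition (log pointwise predictive density minus a penalty equal to the summed posterior variance of the log-likelihood). The input layout, the function name and the toy numbers are assumptions for illustration; the procedure used in this study is described in Supplementary Methods 4.

```python
import math
from statistics import variance


def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i under posterior
    draw s (at least two draws are needed for the variance term).
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters (complexity penalty)
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        peak = max(column)
        # Log of the mean likelihood, computed stably via log-sum-exp.
        lppd += peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        # Sample variance of the log-likelihood across draws.
        p_waic += variance(column)
    return -2.0 * (lppd - p_waic)


# Made-up log-likelihoods: two posterior draws, three observations.
waic_poor = waic([[-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0]])
waic_good = waic([[-0.5, -0.5, -0.5], [-0.5, -0.5, -0.5]])
```

The model that assigns higher likelihood to the observations obtains the lower WAIC; a model whose log-likelihood varies strongly across posterior draws would be penalized through the variance term.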
As a lower boundary, we fit an intercept-only model estimating an intercept for each emotion and corpus. The ‘base’ model additionally estimates a coefficient for each acoustic factor. As shown in Fig. 3a, the base model is much better than the intercept-only model.
We then fit a series of models inspired by the emotion dialect theory10, on the basis of the ‘in-group’ effect. One way to model this membership is to add a group-level effect for languages and countries. As shown in Fig. 3b, the language model and the country model perform similarly well (the country model is slightly better). However, this initial approach was limited in that we treated languages and countries as discrete categories and ignored the proximity of different languages and countries to one another—for example, Dutch being linguistically closer to English than to Hindi. To model this proximity, we computed the Euclidean distances among languages and countries. Language distance is modelled as lexical distance29, and differences across countries are captured on the Hofstede cultural dimensions30. As depicted in Fig. 3c,d, the language and country trees reconstructed from the distances31 contain meaningful associations. For example, in the language tree, Brazilian Portuguese is closer to European Portuguese than it is to Spanish, and Romance languages are grouped together; for the country model, the Anglo-Saxon countries (the United States, Canada, Australia and New Zealand) are grouped together. However, models incorporating this complex hierarchical relationship did not converge. As a pragmatic solution, we therefore modelled ‘culture’ as the combination of the categories ‘language’ and ‘country’, as this enables useful distinctions (such as between American and Canadian English). As depicted in Fig. 3b, this model is better than the language or country model.
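The distance computation mentioned above can be sketched as follows. The country labels and scores are placeholders rather than real Hofstede values, and the helper name is hypothetical; the sketch only shows how pairwise Euclidean distances are obtained from score vectors, not how the trees in Fig. 3c,d were reconstructed.

```python
import math

# Placeholder scores on three cultural dimensions (not real Hofstede values).
scores = {
    "country_a": [0.0, 0.0, 0.0],
    "country_b": [3.0, 4.0, 0.0],
    "country_c": [3.0, 4.0, 12.0],
}


def pairwise_euclidean(vectors):
    """Euclidean distance for every unordered pair of named vectors."""
    names = sorted(vectors)
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


distances = pairwise_euclidean(scores)
```

A distance matrix of this kind is the usual input to a tree-building procedure, in which pairs with small distances are joined first.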
As shown in Fig. 3a, the culture model is outperformed by the corpus model from the reliability analysis (see the lower, non-overlapping WAIC value for the corpus model). This is because the grouping variable ‘corpus’ contains the same grouping information as ‘culture’ (each corpus is usually assigned to one country and one language) and additionally carries more specific information potentially relevant for the communication of emotion. For example, speakers are often recruited from the same area or institution (for example, the same city or university), targeting a more specific social group9. However, in contrast to ‘language’ or ‘country’, the grouping variable ‘corpus’ is an artificial construct that is transcended by a series of more realistic constructs, such as cultural proximity and social belonging. We therefore extend the culture in-group model (and not the corpus model) by adding sex and individual speaker differences. As shown in Fig. 3a, this ‘big’ model outperforms all other models.
The confusion matrices in Fig. 3e reveal that with increasing model complexity, the misclassifications by the model are reduced (darker diagonals), and hence the overall UAR per model increases (40.8% for base, 48.6% for the best in-group model and 69.8% for the final model). For example, in the base model, ‘happiness’ is often misclassified as ‘anger’ and ‘neutral’ as ‘sadness’. In contrast to the WAIC, confusion matrices do not penalize overfitting, and one would expect that with increasing model complexity, models will better fit (or even overfit) the data. However, group-level effects can have a regularizing effect due to shrinkage and hence reduce the risk of overfitting. The confusion matrices show that the models capture the trend in the data and are better at it with increasing model complexity.
Relevance of culture, sex and individual differences
To examine how the individual levels of the mapping contribute to the prediction of the model, we first obtained the model prediction on the data that the model was fitted on (as in Fig. 3e) and then measured how much each group level contributes to the value for the predicted emotion (Fig. 4a). For all emotions except ‘surprise’, individual differences have the greatest impact on the model prediction. The second most important level of analysis is culture for most emotions, followed by the global mapping or sex differences. Remarkably, only 20–25% of the model prediction originates from the global mapping, as depicted by the pie charts in the upper right corner of each panel in Fig. 4a.
As depicted in Fig. 4a, the intercepts (marked by the darker colours) play a subordinate role in the prediction of the emotion. In addition, the intercept of the corpus has the smallest contribution to the final prediction in all emotions except for ‘disgust’.
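One way to picture such a decomposition is sketched below. It assumes, purely for illustration, that the linear predictor for the predicted emotion is a sum of additive terms (one per level of analysis) and that the contribution of a level is its share of the summed absolute terms. This is not necessarily the measure used in this study, and the function name and the numbers are made up.

```python
def contribution_shares(terms):
    """Share of each additive term in the sum of absolute term values.

    `terms` maps a level of analysis to its additive contribution to the
    linear predictor of the predicted emotion (assumed decomposition).
    """
    total = sum(abs(value) for value in terms.values())
    if total == 0.0:
        return {level: 0.0 for level in terms}
    return {level: abs(value) / total for level, value in terms.items()}


# Made-up additive terms for a single stimulus and its predicted emotion.
terms = {
    "global": 0.5,
    "culture": -1.0,
    "sex": 0.25,
    "speaker": 2.0,
    "intercepts": 0.25,
}
shares = contribution_shares(terms)
```

Absolute values are used so that negative terms, which lower the predictor, still count as contributions; averaging such shares over all stimuli of an emotion would give one summary per emotion.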
Variability in coefficients is the largest for speakers and cultures
Whereas the previous analysis estimated the contributions of the different levels of analysis on the original data, the current variability analysis was performed on the model estimates alone, regardless of the data. We extracted the posterior estimates for each acoustic factor, each emotion and each group level and computed the average standard deviation as a metric of the variability of the estimates. As depicted in Fig. 4b, most variability can be found in the ‘speaker’ and ‘culture’ estimates. Overall, the first three acoustic factors (voice quality, loudness, and pitch and formants) show the most variability (see the subplot in the left panel of Fig. 4b). For the remaining factors (except RC7, MFCC 3), variability decreases with increasing component number. The variability results per emotion also show that the estimates for ‘speaker’ and ‘culture’ are the most variable. All estimates for the emotions are variable, although ‘surprise’, ‘anger’ and ‘sadness’ appear to be slightly more variable than the other three emotions (see the subplot in the right panel of Fig. 4b).
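A minimal sketch of an average-standard-deviation metric is given below. It reflects one possible reading, namely that the standard deviation is taken across the members of a group level (for example, across speakers) for each coefficient and then averaged over coefficients; the data layout, names and values are hypothetical.

```python
from statistics import mean, stdev


def average_sd(estimates_by_coefficient):
    """Mean, over coefficients, of the SD across group members.

    `estimates_by_coefficient` maps a coefficient label (factor and emotion)
    to the list of estimates for the members of one group level.
    """
    return mean([stdev(values) for values in estimates_by_coefficient.values()])


# Made-up estimates: three speakers and two sexes, two coefficients each.
speaker_estimates = {
    "RC1_anger": [-1.0, 0.0, 1.0],
    "RC2_anger": [-2.0, 0.0, 2.0],
}
sex_estimates = {
    "RC1_anger": [-0.5, 0.5],
    "RC2_anger": [-0.5, 0.5],
}

speaker_variability = average_sd(speaker_estimates)
sex_variability = average_sd(sex_estimates)
```

In this toy example, the speaker-level estimates are more spread out than the sex-level estimates, so the metric is larger for speakers.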
Confusion between the production of emotions across cultures, sexes and individuals
For the correlation analysis, we again used the coefficient estimates. We started by correlating the global mapping across emotions. As depicted in the upper left panel of Fig. 4c, ‘sadness’ is the only emotion with a distinct profile: it correlates strongly only with itself and weakly with all other emotions. Interestingly, the profiles of the other emotions correlate more strongly with each other, especially those for ‘fear’, ‘happiness’ and ‘surprise’.
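The idea of correlating emotion profiles can be sketched as follows, assuming that each emotion is represented by its vector of seven acoustic-factor coefficients and that profiles are compared with the Pearson correlation. The coefficients below are made up for illustration and do not correspond to the estimates in Fig. 4c.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up coefficient profiles over seven acoustic factors.
profiles = {
    "fear": [0.5, -0.2, 0.8, 0.1, -0.3, 0.0, 0.2],
    "surprise": [0.6, -0.1, 0.9, 0.0, -0.2, 0.1, 0.3],
    "sadness": [-0.4, 0.3, -0.7, 0.2, 0.1, -0.1, 0.0],
}

# Full correlation matrix, stored as a dictionary keyed by emotion pairs.
corr = {
    (a, b): pearson(profiles[a], profiles[b])
    for a in profiles
    for b in profiles
}
```

In such a matrix, the diagonal is 1 by construction, and a distinct profile shows up as a row with low or negative off-diagonal values.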
In three further analyses, we described the relationship between emotions across sexes, cultures and individuals. A first analysis showed that the mapping for a specific emotion correlates the most strongly with the mapping for the same emotion of the other sex (right panel of Fig. 4c). For instance, female anger is, on average, closer to male anger than to any other emotion. When compared with the global mapping, adding sex further increases the correlation among the profiles of ‘fear’, ‘happiness’ and ‘surprise’.
The addition of ‘culture’ or ‘speaker’ to the global mapping leads to a strong decrease in the overall correlations across emotions, indicating that the mapping for individual cultures and speakers is relatively distinct. The overall drop in correlation is greater for speakers than for cultures, confirming the pattern of results in the previous analyses (Fig. 4a). Nonetheless, the diagonals are mildly preserved, indicating that the mapping for a given emotion is more similar to the mapping for the same emotion in other speakers and cultures than to the mapping for a different emotion.