Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages


  • Sneha Das, Technical University of Denmark
  • Nicklas Leander Lund
  • Nicole Nadine Lønfeldt
  • Anne Katrine Pagsberg
  • Line H. Clemmensen



Speech emotion recognition, transferability, continuous metric learning, dimensional emotion model, low-resource machine learning


Speech emotion recognition (SER) refers to the technique of inferring the emotional state of an individual from speech signals. SER continues to garner interest due to its wide applicability. While the domain builds mainly on signal processing, machine learning, and deep learning methods, generalizing across languages remains a challenge. To improve cross-language performance, in this paper we propose a denoising autoencoder with semi-supervision using a continuous metric loss. The novelty of this work lies in the continuous metric learning formulation, which, to the best of our knowledge, is among the first proposals on the topic. Furthermore, we contribute labels corresponding to the dimensional emotion model, which were used to evaluate the quality of the embeddings (the labels will be made available by the time of publication). We show that the proposed method consistently outperforms the baseline in classification accuracy and in correlation with the dimensional variables.
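The abstract does not spell out the loss, but one plausible reading of a "continuous metric loss" with semi-supervision is: pairwise distances between embeddings are encouraged to match pairwise distances between the continuous dimensional labels (e.g. arousal/valence), added to the denoising-autoencoder reconstruction term on labeled batches only. The following numpy sketch illustrates that formulation; the function names, the squared-difference form of the metric term, and the weighting `alpha` are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def pairwise_dist(x):
    # Euclidean distance matrix between rows of x.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] - 2.0 * (x @ x.T) + sq[None, :]
    return np.sqrt(np.maximum(d2, 0.0))

def continuous_metric_loss(z, y):
    """Penalize mismatch between distances in embedding space (z)
    and distances in the continuous label space (y), e.g. rows of
    (arousal, valence) values."""
    dz = pairwise_dist(z)
    dy = pairwise_dist(y)
    mask = ~np.eye(len(z), dtype=bool)   # ignore self-pairs
    return np.mean((dz[mask] - dy[mask]) ** 2)

def semi_supervised_loss(x, x_rec, z, y=None, alpha=0.5):
    """Denoising-autoencoder objective: x_rec is decoded from a
    noise-corrupted copy of x but compared against the clean x.
    The metric term is added only when continuous labels y exist."""
    rec = np.mean((x_rec - x) ** 2)
    if y is None:
        return rec                        # unlabeled batch
    return rec + alpha * continuous_metric_loss(z, y)
```

Under this formulation, a perfect embedding (where embedding distances equal label distances) drives the metric term to zero, while unlabeled batches still contribute through reconstruction, which is how the semi-supervision would operate.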

