Article title

Audio Feature Space Analysis for Emotion Recognition from Spoken Sentences

Publication languages
EN
Abstracts
EN
An analysis of the low-level feature space for emotion recognition from speech is presented. The main goal was to determine how statistical properties computed from contours of low-level features influence emotion recognition from speech signals. We conducted several experiments to reduce and tune our initial feature set and to configure the classification stage. In the analysis of the audio feature space, we employed univariate feature selection using the chi-squared test. In the first stage of classification, a default set of parameters was selected for every classifier. For the classifier that obtained the best results with the default settings, the hyperparameters were tuned using cross-validation. Finally, we compared the classification results for two different languages to examine the differences between emotional states expressed in spoken sentences. The results show that, from an initial feature set containing 3198 attributes, the feature selection algorithm reduced the dimensionality by about 80%. The most dominant attributes selected at this stage were based on the mel and Bark frequency scale filterbanks, with their variability described mainly by variance, median absolute deviation, and standard and average deviations. The classification accuracy of the tuned SVM classifier was 72.5% and 88.27% for emotional spoken sentences in Polish and German, respectively.
Pages
271-277
Physical description
Bibliography: 24 items, figures, tables, charts.
Authors
  • Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Poland
author
  • Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Poland
Bibliography
  • 1. Anagnostopoulos C. N., Iliou T., Giannoukos I. (2015), Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artificial Intelligence Review, 43: 155-177, doi: 10.1007/s10462-012-9368-5.
  • 2. Boersma P. (1993), Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound, Proceedings of the Institute of Phonetic Sciences, 17: 97-110.
  • 3. Boersma P., Weenink D. (2001), Praat, a system for doing phonetics by computer, Glot International, 5 (9/10): 341-345.
  • 4. Breiman L. (2001), Random forests, Machine Learning, 45 (1): 5-32, doi: 10.1023/A:1010933404324.
  • 5. Burkhardt F., Paeschke A., Rolfes M., Sendlmeier W., Weiss B. (2005), A database of German emotional speech, 9th European Conference on Speech Communication and Technology, 5: 1517-1520.
  • 6. Chang C.-C., Lin C.-J. (2011), LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2: 27:1-27:27, doi: 10.1145/1961189.1961199.
  • 7. Davis S., Mermelstein P. (1980), Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, 28 (4): 357-366, doi: 10.1109/TASSP.1980.1163420.
  • 8. Eyben F. (2016), Real-time speech and music classification by large audio feature space extraction, Springer, Cham, doi: 10.1007/978-3-319-27299-3.
  • 9. Feraru S. M., Zbancioc M. D. (2013), Emotion recognition in Romanian language using LPC features, [in:] 2013 E-Health and Bioengineering Conference (EHB), pp. 1-4, doi: 10.1109/EHB.2013.6707314.
  • 10. Hao M., Tianhao Y., Fei Y. (2019), The SVM based on SMO optimization for speech emotion recognition, [in:] 2019 Chinese Control Conference (CCC), pp. 7884-7888, doi: 10.23919/ChiCC.2019.8866463.
  • 11. Kathiresan T., Dellwo V. (2019), Cepstral derivatives in MFCCs for emotion recognition, [in:] 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pp. 56-60, doi: 10.1109/SIPROCESS.2019.8868573.
  • 12. Kuan T.-W., Tsai A.-C., Sung P.-H., Wang J.-F., Kuo H.-S. (2016), A robust BFCC feature extraction for ASR system, Artificial Intelligence Research, 5 (2): 14-23, doi: 10.5430/air.v5n2p14.
  • 13. Lee K. H., Kyun Choi H., Jang B. T., Kim D. H. (2019), A study on speech emotion recognition using a deep neural network, [in:] 2019 International Conference on Information and Communication Technology Convergence (ICTC), pp. 1162-1165, doi: 10.1109/ICTC46691.2019.8939830.
  • 14. Markel J. D., Gray A. H. Jr. (1976), Linear Prediction of Speech, Springer-Verlag, New York.
  • 15. Meng H., Yan T., Yuan F., Wei H. (2019), Speech emotion recognition from 3D log-mel spectrograms with deep learning network, IEEE Access, 7: 125868-125881, doi: 10.1109/ACCESS.2019.2938007.
  • 16. Mitrovic D., Zeppelzauer M., Breiteneder C. (2010), Features for content-based audio retrieval, Advances in Computers, 78: 71-150, doi: 10.1016/S0065-2458(10)78003-7.
  • 17. Pedregosa F. et al. (2011), Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12: 2825-2830, doi: 10.5555/1953048.2078195.
  • 18. Rajak R., Mall R. (2019), Emotion recognition from audio, dimensional and discrete categorization using CNNs, [in:] TENCON 2019 – 2019 IEEE Region 10 Conference (TENCON), pp. 301-305, doi: 10.1109/TENCON.2019.8929459.
  • 19. Rao K. S., Reddy V. R., Maity S. (2015), Language Identification Using Spectral and Prosodic Features, Springer Publishing Company, Incorporated.
  • 20. Slot K., Cichosz J., Bronakowski L. (2009), Application of voiced-speech variability descriptors to emotion recognition, [in:] 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1-5, doi: 10.1109/CISDA.2009.5356537.
  • 21. Swain M., Routray A., Kabisatpathy P. (2018), Databases, features and classifiers for speech emotion recognition: a review, International Journal of Speech Technology, 21: 93-120, doi: 10.1007/s10772-018-9491-z.
  • 22. Ververidis D., Kotropoulos C. (2006), Emotional speech recognition: Resources, features, and methods, Speech Communication, 48: 1162-1181, doi: 10.1016/j.specom.2006.04.003.
  • 23. Zhang H. (2004), The optimality of naive Bayes, [in:] Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2004.
  • 24. Zhu C., Ahmad W. (2019), Emotion recognition from speech to improve human-robot interaction, [in:] 2019 IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 370-375, doi: 10.1109/DASC/PiCom/CBDCom/CyberSciTech.2019.00076.
Notes
Record developed with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the "Społeczna odpowiedzialność nauki" ("Social Responsibility of Science") programme, module: popularisation of science and promotion of sport (2021).
Document type
YADDA identifier
bwmeta1.element.baztech-487c6017-05a6-483e-bd77-0f9b94441482