An approach to speaker emotion recognition based on several acoustic feature types and 1D convolutional neural networks is described. The focus is on selecting the best speech features, improving the baseline model configuration and integrating a gender classification network into the solution. The features include a Mel-scale spectrogram and MFCC-, Chroma-, prosodic- and pitch-related features. In particular, the question of whether to use 2-D feature maps or to reduce them to 1-D vectors by averaging is resolved experimentally. The well-known speech datasets RAVDESS, TESS, CREMA-D and SAVEE are used in the experiments. The best-performing model turned out to consist of two convolutional networks for gender-aware emotion classification and one gender classifier. The Chroma features were found to be redundant, and even detrimental, when the other speech features are available. The F1 score of the proposed solution reached 73.2% on the RAVDESS dataset and 66.5% on all four datasets combined, improving on the baseline model by 7.8% and 3%, respectively. This approach is an alternative to other proposed models, which report accuracy scores of 60% to 71% on the RAVDESS dataset.
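As an illustration of the feature types mentioned above, the following sketch extracts a Mel-scale spectrogram, MFCC and Chroma maps with librosa and reduces each 2-D map to a 1-D vector by averaging over time (one of the two options compared in the abstract). The sampling rate, feature dimensions and file path are assumptions for illustration, not parameters taken from the paper.

```python
# Illustrative sketch (not the paper's exact pipeline): extract 2-D acoustic
# feature maps with librosa and reduce them to 1-D vectors by time-averaging.
# Sampling rate, n_mels, n_mfcc and the input path are assumed values.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)

    # 2-D feature maps: rows are feature bins, columns are time frames.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Reduce each map to a 1-D vector by averaging over the time axis,
    # then concatenate the vectors into a single feature vector.
    return np.concatenate([m.mean(axis=1) for m in (mel, mfcc, chroma)])

# Hypothetical usage on one RAVDESS recording (path is a placeholder):
# features = extract_features("ravdess/Actor_01/03-01-01-01-01-01-01.wav")
# print(features.shape)  # (128 + 40 + 12,) = (180,)
```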