With the development of the entertainment industry, the need for immersive and emotionally impactful sound design has grown. Spatial sound is a potential next step toward improving listeners’ audio experiences in terms of emotional engagement. Hence, the relationship between spatial audio characteristics and listeners’ emotional responses has been the main focus of several recent studies. This paper provides a systematic overview of these studies, including an analysis of the methodologies and technologies they commonly employ. The survey was undertaken using four literature repositories, namely Google Scholar, Scopus, IEEE Xplore, and the AES E-Library. The reviewed papers were selected according to the empirical validity and quality of the reported studies. According to the survey outcomes, there is growing evidence of a positive influence of selected spatial audio characteristics on listeners’ affective responses. However, more data are required to build reliable, universal, and useful models explaining this relationship. Furthermore, two research trends on this topic were identified: the studies undertaken so far can be classified as either technology-oriented or technology-agnostic, depending on the research questions or experimental factors examined. Prospective future research directions are identified and discussed, including better utilization of scene-based paradigms, the application of affective computing techniques, and the exploration of the emotional effects of dynamic changes in spatial audio scenes.
Owing to its many real-life applications, the recognition of emotions from speech signals constitutes a popular research topic. In traditional speech emotion recognition methods, audio features are typically aggregated over a fixed-duration time window, potentially discarding information conveyed by speech at other time scales. By contrast, the proposed method aggregates audio features simultaneously over time windows of several different lengths (a multi-time-scale approach), thereby potentially making better use of information carried at the phonemic, syllabic, and prosodic levels than the traditional approach. A genetic algorithm is employed to optimize the feature extraction procedure. The features aggregated over the different time windows are subsequently classified by an ensemble of support vector machine (SVM) classifiers. To enhance the generalization of the method, a data augmentation technique based on pitch shifting and time stretching is applied. According to the obtained results, the developed method outperforms the traditional one on the selected datasets, demonstrating the benefits of multi-time-scale feature aggregation.
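To make the multi-time-scale idea concrete, the following is a minimal Python sketch of one possible reading of the pipeline: per-scale feature aggregation followed by a soft-voting SVM ensemble, with the pitch-shifting and time-stretching augmentation mentioned above. All window lengths, the choice of MFCC features, and every function name are illustrative assumptions rather than the authors’ implementation, and the genetic-algorithm optimization of the feature extraction step is omitted for brevity.

```python
# Minimal sketch of multi-time-scale feature aggregation with an SVM ensemble.
# All names, features, and parameter values are illustrative assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical window lengths targeting phonemic, syllabic, and prosodic scales.
WINDOWS_S = [0.1, 0.5, 2.0]

def aggregate_features(y, sr, window_s, hop=512):
    """Aggregate frame-level MFCCs over fixed windows of `window_s` seconds."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    frames_per_win = max(1, int(window_s * sr / hop))
    stats = []
    for start in range(0, mfcc.shape[1], frames_per_win):
        chunk = mfcc[:, start:start + frames_per_win]
        # Per-window mean and standard deviation of each coefficient.
        stats.append(np.concatenate([chunk.mean(axis=1), chunk.std(axis=1)]))
    # Average over the utterance to obtain one fixed-size vector per time scale.
    return np.mean(stats, axis=0)

def augment(y, sr):
    """Pitch shifting and time stretching, as described in the abstract."""
    yield y
    yield librosa.effects.pitch_shift(y=y, sr=sr, n_steps=2)
    yield librosa.effects.time_stretch(y=y, rate=0.9)

def train_ensemble(clips, labels, sr=16000):
    """Train one SVM per time scale on the augmented training set."""
    ensemble = []
    for w in WINDOWS_S:
        X, t = [], []
        for y, lab in zip(clips, labels):
            for y_aug in augment(y, sr):
                X.append(aggregate_features(y_aug, sr, w))
                t.append(lab)
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
        clf.fit(np.array(X), t)
        ensemble.append(clf)
    return ensemble

def predict(ensemble, y, sr=16000):
    """Soft voting: average the class probabilities of the per-scale SVMs."""
    probs = [clf.predict_proba(aggregate_features(y, sr, w).reshape(1, -1))
             for clf, w in zip(ensemble, WINDOWS_S)]
    return int(np.mean(probs, axis=0).argmax())
```

Training one classifier per time scale and averaging their class probabilities is only one way to form the ensemble; the abstract does not specify how the per-scale decisions are combined, so the soft-voting scheme shown here is an assumption.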