Article title

Single-ended quality measurement of a music content via convolutional recurrent neural networks

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
The paper examines the use of a Convolutional Bidirectional Recurrent Neural Network (CBRNN) for the problem of quality measurement of music content. The key contribution of this approach, compared to existing research, is that the examined model detects acoustic anomalies without requiring a reference (clean) signal. Since real music content may include various combinations of instrumental sounds, speech, singing voice, or audio effects, it is more complex to analyze than clean speech or artificial signals, especially without a comparison to known reference content. The presented results should be treated as a proof of concept, since only selected types of artefacts are covered in this paper (quantization defects, missing sound, distortion of the gain characteristic, extra noise sounds). However, the described model can easily be expanded to detect other impairments or used as a pre-trained model in transfer learning. Several experiments were performed to examine the model's efficiency and are reported in the paper. The raw audio samples were transformed into Mel-scaled spectrograms and fed to the model as input, first on their own and then along with additional features (Zero Crossing Rate, Spectral Contrast). According to the obtained results, overall accuracy increases significantly (by 10.1%) when Spectral Contrast information is provided together with the Mel-scaled spectrograms. The paper also examines the influence of the recurrent layers on the effectiveness of the artefact classification task.
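As a rough illustration of the pipeline outlined in the abstract, the sketch below computes the named input features with librosa and assembles a small convolutional bidirectional recurrent classifier with Keras [37]. It is a hypothetical reconstruction under stated assumptions: the sampling rate, layer sizes, pooling scheme, feature stacking, and the five-class output (clean plus the four listed artefact types) are illustrative choices, not the configuration reported in the paper.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def extract_features(path, sr=22050, n_mels=128):
    """Compute the inputs named in the abstract: a Mel-scaled
    spectrogram plus Spectral Contrast and Zero Crossing Rate.
    Default hop lengths keep the frame counts aligned."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels),
        ref=np.max)                                           # (n_mels, frames)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (7, frames)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, frames)
    return mel, contrast, zcr

def build_cbrnn(frames, n_feats=136, n_classes=5):
    """Illustrative CBRNN: convolutional blocks condense the feature
    axis, a bidirectional GRU models the time axis, and a softmax
    head classifies the artefact type. n_feats = 128 Mel bands
    + 7 Spectral Contrast bands + 1 ZCR value per frame (an assumed
    stacking; the paper may combine the features differently)."""
    inp = layers.Input(shape=(frames, n_feats, 1))     # (time, feature, 1)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((1, 4))(x)                 # pool feature axis only
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((1, 4))(x)
    x = layers.Reshape((frames, -1))(x)                # one vector per frame
    x = layers.Bidirectional(layers.GRU(64))(x)        # the recurrent part
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```

Stacking the three feature matrices per frame, e.g. np.vstack([mel, contrast, zcr]).T[..., np.newaxis], yields the (frames, 136, 1) tensor the model above expects; concatenation along the feature axis is one plausible way to supply the auxiliary features alongside the Mel-scaled spectrogram.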
Year
Pages
721–733
Physical description
Bibliography: 41 items, figures, tables, formulas
Authors
  • Wrocław University of Science and Technology, Chair of Electronic and Photonic Metrology, B. Prusa 53/55, 50-317 Wrocław, Poland
  • Wrocław University of Science and Technology, Chair of Electronic and Photonic Metrology, B. Prusa 53/55, 50-317 Wrocław, Poland
Bibliography
  • [1] Gilski, P., & Stefański, J. (2017). Transmission Quality Measurements in DAB+ Broadcast System. Metrology and Measurement Systems, 24(4), 675-683. https://doi.org/10.1515/mms-2017-0050
  • [2] Rix, A. W., Beerends, J. G., Kim, D. S., Kroon, P., & Ghitza, O. (2006). Objective assessment of speech and audio quality - technology and applications. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1890-1901. https://doi.org/10.1109/TASL.2006.883260
  • [3] International Telecommunication Union. (2017). Audio Definition Model (Recommendation ITU-R BS.2076-1). https://www.itu.int/rec/R-REC-BS.2076-1-201706-S/en
  • [4] International Telecommunication Union. (2019). General methods for the subjective assessment of sound quality (Recommendation ITU-R BS.1284-2). https://www.itu.int/rec/R-REC-BS.1284-2-201901-I/en
  • [5] Thiede, T., Treurniet, W. C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J. G., & Colomes, C. (2000). PEAQ-The ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1-2), 3-29.
  • [6] International Telecommunication Union. (2011). Perceptual Objective Listening Quality Assessment (Recommendation ITU-T P.863). https://www.itu.int/rec/T-REC-P.863-201101-S/en.
  • [7] Sloan, C., Harte, N., Kelly, D., Kokaram, A. C., & Hines, A. (2017). Objective assessment of perceptual audio quality using ViSQOLAudio. IEEE Transactions on Broadcasting, 63(4), 693-705. https://doi.org/10.1109/TBC.2017.2704421
  • [8] Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2125-2136. https://doi.org/10.1109/TASL.2011.2114881
  • [9] Hines, A., Gillen, E., Kelly, D., Skoglund, J., Kokaram, A., & Harte, N. (2015). ViSQOLAudio: An objective audio quality metric for low bitrate codecs. The Journal of the Acoustical Society of America, 137(6), EL449-EL455. https://doi.org/10.1121/1.4921674
  • [10] Plapous, C., Marro, C., & Scalart, P. (2006). Improved Signal-to-Noise Ratio Estimation for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 2098-2108. https://doi.org/10.1109/TASL.2006.872621
  • [11] Li, Z., Wang, J. C., Cai, J., Duan, Z., Wang, H. M., & Wang, Y. (2013). Non-reference audio quality assessment for online live music recordings. Proceedings of the 21st ACM international conference on Multimedia, Spain, 63-72. https://doi.org/10.1145/2502081.2502106
  • [12] Akhtar, Z., & Falk, T. H. (2017). Audio-visual multimedia quality assessment: A comprehensive survey. IEEE Access, 5, 21090-21117. https://doi.org/10.1109/ACCESS.2017.2750918
  • [13] Kim, D. S. (2005). ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transactions on Speech and Audio Processing, 13(5), 821-831. https://doi.org/10.1109/TSA.2005.851924
  • [14] Kates, J. M., & Arehart, K. H. (2010). The hearing-aid speech quality index (HASQI). Journal of the Audio Engineering Society, 58(5), 363-381. http://www.aes.org/e-lib/browse.cfm?elib=15451.
  • [15] Mahdi, A. E., & Picovici, D. (2010). New single-ended objective measure for non-intrusive speech quality evaluation. Signal, Image and Video Processing, 4, 23-38. https://doi.org/10.1007/s11760-008-0092-1
  • [16] Falk, T. H., Zheng, C., & Chan, W. Y. (2010). A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Transactions on Audio, Speech, and Language Processing, 18(7), 1766-1774. https://doi.org/10.1109/TASL.2010.2052247
  • [17] Orcik, L., Voznak, M., Rozhon, J., Rezac, F., Slachta, J., Toral-Cruz, H., & Lin, J. C. W. (2017). Prediction of speech quality based on resilient backpropagation artificial neural network. Wireless Personal Communications, 96(4), 5375–5389. https://doi.org/10.1007/s11277-016-3746-2
  • [18] Falk, T. H., & Chan, W. Y. (2006). Single-ended speech quality measurement using machine learning methods. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1935-1947. https://doi.org/10.1109/TASL.2006.883253
  • [19] Babić, D., Pul, M., Vranješ, M., & Peković, V. (2017, October). Real-time audio and video artifacts detection tool. International Conference on Smart Systems and Technologies (SST), Croatia, 251-256. https://doi.org/10.1109/SST.2017.8188704
  • [20] Shen, J., Li, Q., & Erlebacher, G. (2011). Hybrid no-reference natural image quality assessment of noisy, blurry, JPEG2000, and JPEG images. IEEE Transactions on Image Processing, 20(8), 2089-2098. https://doi.org/10.1109/TIP.2011.2108661
  • [21] Li, Y., Po, L. M., Cheung, C. H., Xu, X., Feng, L., Yuan, F., & Cheung, K. W. (2015). No-reference video quality assessment with 3D shearlet transform and convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 26(6), 1044-1057. https://doi.org/10.1109/TCSVT.2015.2430711
  • [22] Chen, C., Izadi, M., & Kokaram, A. (2016, October). A perceptual quality metric for videos distorted by spatially correlated noise. Proceedings of the 24th ACM international conference on Multimedia, Netherlands, 1277-1285. https://doi.org/10.1145/2964284.2964302
  • [23] Jung, S., Park, J., & Lee, S. (2019). Polyphonic sound event detection using convolutional bidirectional LSTM and synthetic data-based transfer learning. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), United Kingdom, 885-889. https://doi.org/10.1109/ICASSP.2019.8682909
  • [24] Cichosz, P. (2015). Data Mining Algorithms: Explained Using R. John Wiley & Sons.
  • [25] Zhao, H., Zarar, S., Tashev, I., & Lee, C. H. (2018). Convolutional-Recurrent Neural Networks for Speech Enhancement. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Canada, 2401-2405. https://doi.org/10.1109/ICASSP.2018.8462155
  • [26] Sang, J., Park, S., & Lee, J. (2018). Convolutional recurrent neural networks for urban sound classification using raw waveforms. 26th European Signal Processing Conference (EUSIPCO), Italy, 2444-2448. https://doi.org/10.23919/EUSIPCO.2018.8553247
  • [27] Portsev, R. J., & Makarenko, A. V. (2018). Convolutional Neural Networks for Noise Signal Recognition. IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Denmark. https://doi.org/10.1109/MLSP.2018.8516920
  • [28] Goodfellow, I., Bengio, Y., & Courville, A. (2017). Deep Learning. London. The MIT Press.
  • [29] Rafii, Z., Liutkus, A., Stöter, F. R., Mimilakis, S. I., & Bittner, R. (2018). The MUSDB18 corpus for music separation. https://doi.org/10.5281/zenodo.1117372
  • [30] Sturm, B. L. (2014). The state of the art ten years after a state of the art: Future research in music information retrieval. Journal of New Music Research, 43(2), 147-172. https://doi.org/10.1080/09298215.2014.894533
  • [31] Lu, Y. C., Wu, C. W., Lu, C. T., & Lerch, A. (2016, July). An unsupervised approach to anomaly detection in music datasets. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, United States, 749-752. https://doi.org/10.1145/2911451.2914700
  • [32] Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), United States. https://doi.org/10.7916/D8NZ8J07
  • [33] Defferrard, M., Benzi, K., Vandergheynst, P., & Bresson, X. (2017). FMA: A Dataset For Music Analysis. 18th International Society for Music Information Retrieval Conference (ISMIR), China. https://arxiv.org/abs/1612.01840.
  • [34] Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. Proceedings of the 21st ACM international conference on Multimedia (MM '13), New York, 411-412. https://doi.org/10.1145/2502081.2502245
  • [35] Min, X., Zhai, G., Zhou, J., Farias, M. C., & Bovik, A. C. (2020). Study of Subjective and Objective Quality Assessment of Audio-Visual Signals. IEEE Transactions on Image Processing, 29, 6054-6068. https://doi.org/10.1109/TIP.2020.2988148
  • [36] Camastra, F., & Vinciarelli, A. (2015). Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Springer-Verlag London. https://doi.org/10.1007/978-1-4471-6735-8
  • [37] Chollet, F., et al. (2015). Keras. GitHub. https://github.com/fchollet/keras
  • [38] Mesaros, A., Heittola, T., & Virtanen, T. (2016). Metrics for polyphonic sound event detection. Applied Sciences, 6(6), 162. https://doi.org/10.3390/app6060162
  • [39] Gulhane, S. R., Badhe, S. S., & Shirbahadurkar, S. D. (2018). Cepstral (MFCC) feature and spectral (Timbral) features analysis for musical instrument sounds. 2018 IEEE global conference on wireless computing and networking (GCWCN), India, 109-113. https://doi.org/10.1109/GCWCN.2018.8668628
  • [40] Khonglah, B. K., & Prasanna, S. M. (2017). Clean speech/speech with background music classification using HNGD spectrum. International Journal of Speech Technology, 20(4), 1023-1036. https://doi.org/10.1007/s10772-017-9464-7
  • [41] Wu, Y., & Lee, T. (2018, April). Reducing model complexity for DNN based large-scale audio classification. IEEE international conference on acoustics, speech and signal processing (ICASSP), Canada, 331-335. https://doi.org/10.1109/ICASSP.2018.8462168
Notes
Record created with funding from the Ministry of Science and Higher Education (MNiSW), agreement No. 461252, under the "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) programme, module: popularisation of science and promotion of sport (2021).
Document type
YADDA identifier
bwmeta1.element.baztech-512eecad-19b1-4311-8bf8-81d833ffe17e