Experimental research on the impact of similarity function selection on the quality of keyword spotting in speech signal

Laszko, Łukasz

doi:10.5604/01.3001.0013.6598

Article - details

Article title

Experimental research on the impact of similarity function selection on the quality of keyword spotting in speech signal

Authors

Laszko Łukasz

Identifiers

DOI

10.5604/01.3001.0013.6598

Title variants

Eksperymentalne badanie wpływu wyboru funkcji podobieństwa na jakość wykrywania słów w sygnale mowy

Publication languages

Abstracts

The paper describes an evaluation of the application of selected similarity functions in the task of keyword spotting. Experiments were carried out in the Polish language. The research results can be used to improve already existing keyword spotting methods, or to develop new ones.

W pracy przedstawiono ocenę zastosowania wybranych funkcji podobieństwa w zadaniu wykrywania słów kluczowych. Przeprowadzono eksperymenty dla języka polskiego. Wyniki badań można wykorzystać do ulepszenia już istniejących metod wykrywania słów kluczowych lub do opracowania nowych.

Keywords

keyword spotting; signal similarity; quality of detection; dynamic time warping; textual query

wykrywanie słów kluczowych; podobieństwo sygnałów; wskaźniki jakości wykrycia; odkształcanie skali czasu; kwerenda tekstowa

Publisher

Instytut Teleinformatyki i Automatyki, Wydział Cybernetyki, Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego

Journal

Przegląd Teleinformatyczny

Year

2019

Volume

Vol. 7, No. 3-4 (49)

Pages

57-81

Physical description

Bibliography: 79 items, figures, tables, charts.

Contributors

author

Laszko Łukasz

lukasz.laszko@wat.edu.pl

Institute of Teleinformatics and Cybersecurity, Faculty of Cybernetics, MUT, ul. gen. Sylwestra Kaliskiego 2, 00-908 Warsaw, Poland

Bibliography

[1] AMGOUD L., DAVID V., DODER D., Similarity Measures Between Arguments Revisited. In: Kern-Isberner G., Ognjanović Z. (eds) Symbolic and Quantitative Approaches to Reasoning with Uncertainty, ECSQARU 2019, Lecture Notes in Computer Science, Vol. 11726, pp. 98-107, DOI 10.1007/978-3-030-29765-7_1.
[2] BHATTACHARYYA A., On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society. Vol. 35, 1943, pp. 99-109.
[3] BASENER W., FLYNN M., Microscene evaluation using the Bhattacharyya distance. Proc. of SPIE 10780, Honolulu, 2018, DOI 10.1117/12.2327004.
[4] BOYTSOV L., Indexing methods for approximate dictionary searching: Comparative analysis. Journal of Experimental Algorithmics, Vol. 16, Article 1.1, May 2011, pp. 1-91, DOI 10.1145/1963190.1963191.
[5] CHANG H.Y., An SVM Kernel With GMM-Supervector Based on the Bhattacharyya Distance for Speaker Recognition. IEEE Signal Processing Letters, 2009, Vol. 16, Issue 1, pp. 49-52, DOI 10.1109/LSP.2008.2006711.
[6] CHEN B., WANG H.-M., CHIEN L.-F., LEE L.-S., A*-Admissible Key-Phrase Spotting With Sub-Syllable Level Utterance Verification. The 5th International Conference on Spoken Language Processing, Incorporating The 7th Australian International Speech Science and Technology Conference, Sydney, Australia, 1998, pp. 783-786.
[7] CHEN Y.-I., WU CH.-H., YAN G.-L., Utterance Verification Using Prosodic Information for Mandarin Telephone Speech. 1999 IEEE International Conference on Acoustics, Speech and Signal Processing. Keyword Spotting Proceedings, ICASSP '99, Vol. 2, Phoenix, AZ, USA, pp. 697-700, DOI 10.1109/ICASSP.1999.759762.
[8] CHICCO D., Ten quick tips for machine learning in computational biology. BioData Mining, Vol. 10, No. 35, 2017, pp. 1-17, DOI 10.1186/s13040-017-0155-3.
[9] CHINCHOR N., MUC-4 Evaluation Metrics. In Proceedings of the Fourth Message Understanding Conference, 1992, pp. 22-29, http://www.aclweb.org/anthology-new/M/M92/M92-1002.pdf.
[10] DEB K., Introduction to Evolutionary Multiobjective Optimization. In: Branke J., Deb K., Miettinen K., Słowiński R. (eds) Multiobjective Optimization. Lecture Notes in Computer Science, Vol. 5252, 2008, Springer, Berlin, Heidelberg, pp. 59-96, DOI 10.1007/978-3-540-88908-3_3.
[11] DUIN R.P.W., PĘKALSKA E., The Dissimilarity Representation for Structural Pattern Recognition. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2011, pp. 1-24, DOI 10.1007/978-3-642-25085-9_1.
[12] DUIN R.P.W., PĘKALSKA E., Non-euclidean dissimilarities: Causes and informativeness. In proc. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 2010, LNCS, Vol. 6218, Springer, Heidelberg, pp. 324-333, DOI 10.1007/978-3-642-14980-1_31.
[13] DUBUISSON M.P., JAIN A.K., A Modified Hausdorff distance for object matching. In ICPR94, Jerusalem, Israel, 1994, pp. 566-568.
[14] FAWCETT T., An Introduction to ROC Analysis. Pattern Recognition Letters, Vol. 27, No. 8, 2006, pp. 861-874, DOI 10.1016/j.patrec.2005.10.010.
[15] FOOTE J., An Overview of Audio Information Retrieval. ACM Multimedia Systems, Vol. 7, 1998, pp. 2-10, DOI 10.1.1.39.6339
[16] FUKUNAGA K., Introduction to Statistical Pattern Recognition. 2nd Edition, Elsevier Inc, 1990, DOI 10.1016/C2009-0-27872-X
[17] GÜNDOĞDU B., Keyword Search for Low Resource Languages. PhD Thesis, Bogazici University, 2017.
[18] GÜNDOĞDU B., SARAÇLAR M., Distance metric learning for posteriorgram based keyword search. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, 2017, pp. 5660-5664, DOI 10.1109/ICASSP.2017.7953240
[19] GUPTA K., GUPTA D., An analysis on LPC, RASTA and MFCC techniques in Automatic Speech Recognition. 2016 6th International Conference - Cloud System and Big Data Engineering System (Confluence), Noida, 2016, pp. 493-497, DOI 10.1109/CONFLUENCE.2016.7508170
[20] GUPTA P., PUROHIT G.N., RATHORE M., Number Plate Extraction using Template Matching Technique. International Journal of Computer Applications, Vol. 88, No. 3, 2014, pp. 40-44, DOI 10.5120/15336-3670
[21] HAASDONK B., BAHLMANN C., Learning with distance substitution kernels. In Pattern Recognition – Proc. of the 26th DAGM Symposium, 2004, pp. 220-227, DOI 10.1007/978-3-540-28649-3_27
[22] HAFEN R.P., HENRY M.J., Speech information retrieval: a review. Multimedia Systems, Vol. 18, No. 6, 2012, pp. 499-518.
[23] HELLINGER E., (in German) Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, Vol. 136, 1909, pp. 210–271, DOI 10.1515/crll.1909.136.210
[24] HENRIKSON J., Completeness and total boundedness of the Hausdorff metric. MIT Undergraduate Journal of Mathematics, 1999, pp. 69-80.
[25] HIGGINS A., WOHLFORD R., Keyword recognition using template concatenation. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '85, Tampa, FL, USA, 1985, pp. 1233-1236, DOI 10.1109/ICASSP.1985.1168253
[26] HOBSON A., CHENG B-K., A comparison of the Shannon and Kullback information measures. Journal of Statistical Physics, Vol. 7, No. 4, 1973, pp. 301-310, DOI 10.1007/BF01014906.
[27] HOLYOAK K.J., THAGARD P., Mental Leaps: Analogy in Creative Thought. A Bradford Book series, MIT Press, 1996.
[28] JANSEN A., DURME VAN B., Efficient Spoken Term Discovery Using Randomized Algorithms. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, 2011, pp. 401-406, DOI 10.1109/ASRU.2011.6163965.
[29] JANSEN B., RIEH S.Y., The Seventeen Theoretical Constructs of Information Searching and Information Retrieval. Journal of the American Society for Information Science and Technology, Vol. 61, No. 8, 2010, pp. 1517-1534, DOI 10.1002/asi.21358.
[30] JENSEN J.H., ELLIS D.P.W., CHRISTENSEN M.G., JENSEN S.H., Evaluation of Distance Measures Between Gaussian Mixture Models of MFCCs. Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007, Vienna, 2007, pp. 107-108.
[31] KAILATH T., The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Transactions on Communication Technology, 1967, Vol. 15, No. 1, pp. 52-60, DOI 10.1109/TCOM.1967.1089532
[32] KAMIŃSKA D., SAPIŃSKI T., ANBARJAFARI G., Efficiency of chosen speech descriptors in relation to emotion recognition. EURASIP Journal on Audio, Speech, and Music Processing (2017), Vol. 3, pp. 1-9, DOI 10.1186/s13636-017-0100-x
[33] KASSAMBARA A., Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning (Multivariate Analysis). Vol. 1, CreateSpace Independent Publishing Platform, 2017.
[34] KESHET J., GRANGIER D., BENGIO S.A., Discriminative keyword spotting. Speech Communication, 2009, Vol. 51, No. 4, pp. 317-329, DOI 10.1016/j.specom.2008.10.002.
[35] KORŽINEK D., MARASEK K., BROCKI Ł., WOŁK K., Polish Read Speech Corpus for Speech Tools and Services. Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure, No. 136, Linköping University Electronic Press, Linköpings universitet, 2017, pp. 54-62.
[36] KULLBACK S., LEIBLER R.A., On information and sufficiency. Annals of Mathematical Statistics, Vol. 22, No. 1, 1951, pp. 79-86, DOI 10.1214/aoms/1177729694.
[37] KULLBACK S., Information theory and statistics. Dover Books on Mathematics, New Edition, 1997.
[38] KWIATKOWSKI W., (in Polish) Klasyfikacja metodą grupowania cech z uwzględnieniem ich wzajemnej korelacji. Biuletyn Instytutu Automatyki i Robotyki, No. 14, 2000, pp. 139-146.
[39] KWIATKOWSKI W., (in Polish) Metody automatycznego rozpoznawania wzorców. Instytut Automatyki i Robotyki, WAT, 1st Edition, Warszawa, 2001.
[40] KWIATKOWSKI W., (in Polish) Wykrywanie anomalii bazujące na wskazanych przykładach. Przegląd Teleinformatyczny, No. 1-2, 2018, pp. 3-21.
[41] KWIATKOWSKI W., (in Polish) Wstęp do cyfrowego przetwarzania sygnałów. BEL Studio, WAT, Warszawa, 2003.
[42] LASZKO Ł., Word detection in recorded speech using textual queries. Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 5, pp. 849-853, DOI 10.15439/2015F341.
[43] LASZKO Ł., Using formant frequencies to word detection in recorded speech. Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pp. 797-801, DOI 10.15439/2016F518.
[44] LASZKO Ł., Developing keyword spotting method for the Polish language. Communication Papers of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 17, pp. 123-127, DOI 10.15439/2018F178.
[45] LEBRET R., COLLOBERT R., Word Embeddings through Hellinger PCA. 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL, 2014, pp. 482-490, DOI 10.3115/v1/E14-1051.
[46] LI H., HAN J., ZHENG T., ZHENG G., Mandarin keyword spotting using syllable based confidence features and SVM. 2nd International Conference on Intelligent Control and Information Processing, Harbin, 2011, pp. 256-259, DOI 10.1109/ICICIP.2011.6008243.
[47] LI W., BILLARD A., BOURLARD H., Keyword Detection for Spontaneous Speech. 2nd International Congress on Image and Signal Processing, Tianjin, 2009, pp. 1-5, DOI 10.1109/CISP.2009.5303824
[48] LIU D., CHO S., SUN D., QIU Z., A Spearman correlation coefficient ranking for matching-score fusion on speaker recognition. TENCON 2010 - 2010 IEEE Region 10 Conference, Fukuoka, 2010, pp. 736-741, DOI 10.1109/TENCON.2010.5686608
[49] MATTHEWS B.W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure, Vol. 405, No. 2, 1975, pp. 442-451, DOI 10.1016/0005-2795(75)90109-9
[50] MANNING CH.D., RAGHAVAN P., SCHÜTZE H., Introduction to Information Retrieval. Cambridge University Press, 2008.
[51] MIETTINEN K., Introduction to Multiobjective Optimization: Noninteractive Approaches. In: Branke J., Deb K., Miettinen K., Słowiński R. (eds) Multiobjective Optimization. Lecture Notes in Computer Science, Vol 5252, 2008, Springer, Berlin, Heidelberg, pp. 1-26, DOI 10.1007/978-3-540-88908-3_1
[52] MIETTINEN K., RUIZ F., WIERZBICKI A.P., Introduction to Multiobjective Optimization: Interactive Approaches. In: Branke J., Deb K., Miettinen K., Słowiński R. (eds) Multiobjective Optimization. Lecture Notes in Computer Science, Vol 5252, 2008, Springer, Berlin, Heidelberg, pp. 27-57, DOI 10.1007/978-3-540-88908-3_2
[53] MITRA V., HOUT VAN J., FRANCO H., VERGYRI D., LEI Y., et al., Feature Fusion for High-Accuracy Keyword Spotting. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7143-7147.
[54] MOHAMED S.S., ABDALLA A., JOHN R.I., New Entropy-Based Similarity Measure between Interval-Valued Intuitionistic Fuzzy Sets. Axioms, Vol. 8, No. 2, 2019, Article-Number 73, DOI 10.3390/axioms8020073.
[55] MUSCARIELLO A., GRAVIER G., BIMBOT F., Audio keyword extraction by unsupervised word discovery. In Proceedings of the Interspeech, 2009, pp. 2843–2847.
[56] MÜLLER M., Information Retrieval for Music and Motion. Springer Berlin Heidelberg New York, 2007.
[57] NIELSEN F., A generalization of the Jensen divergence: The chord gap divergence. arXiv preprint, 2017, pp. 1-13, https://arxiv.org/abs/1709.10498
[58] PARDO L., Statistical Inference Based on Divergence Measures. Statistics: A Series of Textbooks and Monographs, 1st Edition, Chapman and Hall/CRC, 2006.
[59] PARK A.S., GLASS J.R., Unsupervised pattern discovery in speech. IEEE Trans. on Audio, Speech and Language Processing, 2008, Vol. 16, No. 1, pp. 186-197.
[60] PONTIUS R.G., KANGPING S., The total operating characteristic to measure diagnostic ability for multiple thresholds. International Journal of Geographical Information Science, Vol. 28, No. 3, 2014, pp. 570-583, DOI 10.1080/13658816.2013.862623
[61] POWERS D.M.W., Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, Vol. 2, No. 1, 2007, pp. 37-63.
[62] QIAO Y., MINEMATSU N., A Study on Invariance of f-Divergence and Its Application to Speech Recognition. IEEE Transactions on Signal Processing, 2010, Vol. 58, No. 7, pp. 3884-3890, DOI 10.1109/TSP.2010.2047340.
[63] RAIELI R., Introducing Multimedia Information Retrieval to libraries. Italian Journal of Library, Archives, and Information Science, Vol. 7, No. 3, 2016, pp. 9-42, DOI 10.4403/jlis.it-11530.
[64] SAMMUT C., WEBB G.I. (eds.), Encyclopedia of Machine Learning and Data Mining. 2nd Edition, Springer, 2017.
[65] SASAKI Y., The truth of the F-measure. 2007, 5 pages, Web resource available at https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf.
[66] SCHÖLKOPF B., The Kernel Trick for Distances. Advances in neural information processing systems, Vol. 13, 2000, pp. 301-307.
[67] SINGH A., YADAV A., RANA A., K-means with Three different Distance Metrics. International Journal of Computer Applications, Vol. 67, No. 10, 2013, pp. 13-17, DOI 10.5120/11430-6785.
[68] SINGHAL A., Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 24, No. 4, 2001, pp. 35-43.
[69] STEHMAN S.V., Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, Vol. 62, No. 1, 1997, pp. 77-89, DOI 10.1016/S0034-4257(97)00083-7.
[70] TABIBIAN S., AKBARI A., NASERSHARIF B., Improved dynamic match phone lattice search for Persian spoken term detection system in online and offline applications. International Journal of Speech Technology, March 2019, Vol. 22, Issue 1, pp. 205-217, DOI 10.1007/s10772-019-09594-w.
[71] TÜSKE Z., NOLDEN D., SCHLÜTER R., NEY H., Multilingual MRASTA features for low-resource keyword search and speech recognition systems. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7349-7353.
[72] WILPON J.G., RABINER L.R., LEE C., GOLDMAN E.R., Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 1990, Vol. 38, No. 11, pp. 1870-1878, DOI 10.1109/29.103088.
[73] YOUDEN W.J., Index for rating diagnostic tests. Cancer, Vol. 3, 1950, pp. 32-35, DOI 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3
[74] ZEDDELMANN VON D., KURTH F., MÜLLER M., Perceptual audio features for unsupervised key-phrase detection. Proc. ICASSP2010, 2010, pp. 257-260, DOI 10.1109/ICASSP.2010.5495974.
[75] ZHANG Y., Unsupervised Speech Processing with Applications to Query-by-Example Spoken Term Detection. PhD thesis, Massachusetts Institute of Technology, 2013.
[76] ZHANG Y., GLASS J.R., Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, Merano, 2009, pp. 398-403, DOI 10.1109/ASRU.2009.5372931.
[77] ZHU X., PENN G., RUDZICZ F., Summarizing multiple spoken documents: finding evidence from untranscribed audio. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Vol. 2, 2009, pp. 549-557.
[78] ZIELIŃSKI T.P., (in Polish) Cyfrowe przetwarzanie sygnałów od teorii do zastosowań. Wydawnictwa Komunikacji i Łączności, Warszawa, 2005.
[79] ZIÓŁKO B., GAŁKA J., SKURZOK D., JADCZYK T., Modified Weighted Levenshtein Distance in Automatic Speech Recognition. Krajowa Konferencja Zastosowań Matematyki w Biologii i Medycynie, Krynica, 2010, pp. 116-120.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-b1755e2e-97bf-4ffe-9e87-676385689ac1