Article title

Audio-Visual Speech Processing System for Polish Applicable to Human-Computer Interaction

Authors
Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
This paper describes an audio-visual speech recognition system for the Polish language, together with a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems, with its three main areas: audio feature extraction, visual feature extraction and, subsequently, audio-visual speech integration. We present MFCC features for the audio stream with the standard HMM modeling technique, and then describe appearance-based and shape-based visual features. Subsequently, we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult audio environments. With the Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), we improve system accuracy by reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio drops to 0 dB.
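As background for the model-fusion technique mentioned in the abstract: in a multistream HMM, the audio and visual streams each keep their own emission model, and the state likelihood is a weighted product of the per-stream likelihoods. A minimal sketch, in LaTeX, of the standard formulation from the AVASR literature (cf. [43] and [45] in the bibliography below); the exact variant used in the paper may differ:

\[
b_j(\mathbf{o}_t) = \prod_{s \in \{A,V\}} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}\!\left(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm},\, \boldsymbol{\Sigma}_{jsm}\right) \right]^{\lambda_s}, \qquad \lambda_A + \lambda_V = 1,\ \lambda_s \ge 0.
\]

Here \(b_j\) is the emission likelihood of HMM state \(j\), \(\mathbf{o}_{st}\) is the audio (\(s = A\)) or visual (\(s = V\)) feature vector at time \(t\), the inner sum is a Gaussian mixture with \(M_s\) components, and the exponents \(\lambda_s\) weight each modality; raising \(\lambda_V\) as the acoustic SNR degrades is what enables gains such as the reported WER reduction at 0 dB. Feature concatenation, by contrast, corresponds to the degenerate single-stream case with stacked audio-visual vectors.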
Publisher
Journal
Year
Pages
41–63
Physical description
Bibliography: 57 items; figures, charts, tables.
Contributors
author
  • AGH University of Science and Technology, Krakow, Poland
Bibliography
  • [1] Adjoudani A., Benoit C.: On the integration of auditory and visual parameters in an HMM-based ASR. In: Speechreading by humans and machines, pp. 461–471. Springer, 1996.
  • [2] Barker J., Marxer R., Vincent E., Watanabe S.: The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 504–511. 2015. URL http://dx.doi.org/10.1109/ASRU.2015.7404837.
  • [3] Bilmes J., Bartels C.: Graphical model architectures for speech recognition. In: Signal Processing Magazine, IEEE, vol. 22(5), pp. 89–100, 2005. ISSN 1053-5888. URL http://dx.doi.org/10.1109/MSP.2005.1511827.
  • [4] Castrillón Santana M., Déniz Suárez O., Hernández Tejera M., Guerra Artal C.: ENCARA2: Real-time Detection of Multiple Faces at Different Resolutions in Video Streams. In: Journal of Visual Communication and Image Representation, pp. 130–140, 2007.
  • [5] Chan M.T., Zhang Y., Huang T.S.: Real-time lip tracking and bimodal continuous speech recognition. In: Multimedia Signal Processing, 1998 IEEE Second Workshop on, pp. 65–70. IEEE, 1998.
  • [6] Cootes T., Edwards G., Taylor C.: Active appearance models. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23(6), pp. 681–685, 2001. ISSN 0162-8828. URL http://dx.doi.org/10.1109/34.927467.
  • [7] Cootes T.F., Taylor C.J., Cooper D.H., Graham J.: Active shape models – their training and application. In: Computer vision and image understanding, vol. 61(1), pp. 38–59, 1995.
  • [8] Cox S.J., Harvey R., Lan Y., Newman J.L., Theobald B.J.: The challenge of multispeaker lip-reading. In: AVSP, pp. 179–184. Citeseer, 2008.
  • [9] Czap L.: Lip representation by image ellipse. In: The Proceedings of the 6th International Conference on Spoken Language Processing (Volume IV). 2000.
  • [10] Dupont S., Luettin J.: Audio-visual speech modeling for continuous speech recognition. In: Multimedia, IEEE Transactions on, vol. 2(3), pp. 141–151, 2000. http://dx.doi.org/10.1109/6046.865479.
  • [11] Frankel J., Wester M., King S.: Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language, vol. 21(4), pp. 620–640, 2007. http://dx.doi.org/10.1016/j.csl.2007.03.002.
  • [12] Gałka J., Ziółko M.: Wavelet parametrization for speech recognition. In: Proceedings of an ISCA tutorial and research workshop on non-linear speech processing NOLISP 2009, VIC. 2009.
  • [13] Gopinath R.A.: Maximum likelihood modeling with Gaussian distributions for classification. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 2, pp. 661–664. IEEE, 1998.
  • [14] Gowdy J., Subramanya A., Bartels C., Bilmes J.: DBN based multi-stream models for audio-visual speech recognition. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP ’04). IEEE International Conference on, vol. 1, pp. I-993–I-996. 2004. ISSN 1520-6149. URL http://dx.doi.org/10.1109/ICASSP.2004.1326155.
  • [15] Gurbuz S., Tufekci Z., Patterson E., Gowdy J.N.: Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP ’01). 2001 IEEE International Conference on, vol. 1, pp. 177–180. IEEE, 2001.
  • [16] Harte N., Gillen E.: TCD-TIMIT: An audio-visual corpus of continuous speech. In: IEEE Transactions on Multimedia, vol. 17(5), pp. 603–615, 2015.
  • [17] Hazen T.: Visual model structures and synchrony constraints for audio-visual speech recognition. In: Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14(3), pp. 1082–1089, 2006. ISSN 1558-7916. URL http://dx.doi.org/10.1109/TSA.2005.857572.
  • [18] Hazen T.J., Saenko K., La C.H., Glass J.R.: A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In: Proceedings of the 6th international conference on Multimodal interfaces, pp. 235–242. ACM, 2004.
  • [19] Hermansky H.: Perceptual linear predictive (PLP) analysis of speech. In: Journal of the Acoustical Society of America, vol. 87(4), pp. 1738–1752, 1990.
  • [20] Hernando J.: Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition. In: Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 2, pp. 1267–1270. IEEE, 1997.
  • [21] Igras M., Ziółko B., Jadczyk T.: Audiovisual database of Polish speech recordings. In: Studia Informatica, vol. 33(2B), pp. 163–172, 2013.
  • [22] Kubanek M., Bobulski J., Adrjanowicz L.: Lip tracking method for the system of audio-visual Polish speech recognition. In: Artificial Intelligence and Soft Computing, pp. 535–542. Springer, 2012.
  • [23] Lan Y., Theobald B.J., Harvey R., Ong E.J., Bowden R.: Improving visual features for lipreading. In: AVSP, pp. 7–3. 2010.
  • [24] Luettin J., Thacker N.A.: Speechreading using probabilistic models. In: Computer Vision and Image Understanding, vol. 65(2), pp. 163–178, 1997.
  • [25] Marcheret E., Libal V., Potamianos G.: Dynamic Stream Weight Modeling for Audio-Visual Speech Recognition. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4, pp. IV-945–IV-948. 2007. ISSN 1520-6149. URL http://dx.doi.org/10.1109/ICASSP.2007.367227.
  • [26] Matthews I., Potamianos G., Neti C., Luettin J.: A comparison of model and transform-based visual features for audio-visual LVCSR. In: Multimedia and Expo, 2001. ICME 2001. IEEE International Conference on, pp. 825–828, 2001. http://dx.doi.org/10.1109/ICME.2001.1237849.
  • [27] McCowan I., Gatica-Perez D., Bengio S., Lathoud G., Barnard M., Zhang D.: Automatic analysis of multimodal group actions in meetings. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27(3), pp. 305–317, 2005.
  • [28] McGurk H., MacDonald J.: Hearing lips and seeing voices. Nature, vol. 264, pp. 746–748, 1976. http://dx.doi.org/10.1038/264746a0.
  • [29] Messer K., Kittler J., Sadeghi M., Marcel S., Marcel C., Bengio S., Cardinaux F., Sanderson C., Czyz J., Vandendorpe L., et al.: Face verification competition on the XM2VTS database. In: Audio-and Video-Based Biometric Person Authentication, pp. 964–974. Springer, 2003.
  • [30] Minka T., Winn J., Guiver J., Webster S., Zaykov Y., Yangel B., Spengler A., Bronskill J.: Infer.NET 2.6, 2014. Microsoft Research Cambridge. http://research.microsoft.com/infernet.
  • [31] Missaoui O., Frigui H.: Optimal feature weighting for the Continuous HMM. In: Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pp. 1–4. IEEE, 2008.
  • [32] Mroueh Y., Marcheret E., Goel V.: Deep Multimodal Learning for Audio-Visual Speech Recognition. arXiv preprint arXiv:1501.05396, 2015.
  • [33] Murphy K.P.: Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California, Berkeley, 2002.
  • [34] Nakamura S., Ito H., Shikano K.: Stream weight optimization of speech and lip image sequence for audio-visual speech recognition. In: Sixth International Conference on Spoken Language Processing, ICSLP 2000, pp. 20–24, 2000.
  • [35] Neti C., Potamianos G., Luettin J., Matthews I., Glotin H., Vergyri D., Sison J., Mashari A.: Audio visual speech recognition. Tech. rep., IDIAP, 2000.
  • [36] Newman J.L., Cox S.J.: Automatic visual-only language identification: A preliminary study. In: Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pp. 4345–4348. IEEE, 2009.
  • [37] Ngiam J., Khosla A., Kim M., Nam J., Lee H., Ng A.Y.: Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. 2011.
  • [38] Noda K., Yamaguchi Y., Nakadai K., Okuno H.G., Ogata T.: Audio-visual speech recognition using deep learning. In: Applied Intelligence, vol. 42(4), pp. 722–737, 2015.
  • [39] O'Donovan A., Duraiswami R., Neumann J.: Microphone arrays as generalized cameras for integrated audio visual processing. In: Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1–8, 2007.
  • [40] Palecek K., Chaloupka J.: Audio-visual speech recognition in noisy audio environments. In: Telecommunications and Signal Processing (TSP), 2013 36th International Conference on, pp. 484–487. IEEE, 2013.
  • [41] Papandreou G., Maragos P.: Adaptive and constrained algorithms for inverse compositional active appearance model fitting. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. IEEE, 2008.
  • [42] Petajan E.D.: Automatic lipreading to enhance speech recognition (speech reading). Ph.D. thesis, University of Illinois at Urbana-Champaign, 1984.
  • [43] Potamianos G., Graf H.P.: Discriminative training of HMM stream exponents for audio-visual speech recognition. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 6, pp. 3733–3736. IEEE, 1998.
  • [44] Potamianos G., Neti C.: Improved ROI and within frame discriminant features for lipreading. In: Image Processing, 2001. Proceedings. 2001 International Conference on, vol. 3, pp. 250–253. IEEE, 2001.
  • [45] Potamianos G., Neti C., Gravier G., Garg A., Senior A.: Recent advances in the automatic recognition of audiovisual speech. In: Proceedings of the IEEE, vol. 91(9), pp. 1306–1326, 2003. ISSN 0018-9219. URL http://dx.doi.org/10.1109/JPROC.2003.817150.
  • [46] Rabiner L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, vol. 77(2), pp. 257–286, 1989.
  • [47] Rosenblum L.D., Saldana H.M.: Time-varying information for visual speech perception. In: Hearing by eye II, pp. 61–81, 1998.
  • [48] Saenko K., Livescu K.: An asynchronous DBN for audio-visual speech recognition. In: Spoken Language Technology Workshop, 2006. IEEE, pp. 154–157. 2006. URL http://dx.doi.org/10.1109/SLT.2006.326841.
  • [49] Schwartz J.L., Robert-Ribes J., Escudier P.: Ten years after Summerfield: a taxonomy of models for audio-visual fusion in speech perception. In: Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech, pp. 85–108, 1998.
  • [50] Shivappa S.T., Trivedi M.M., Rao B.D.: Audiovisual information fusion in human–computer interfaces and intelligent environments: A survey. In: Proceedings of the IEEE, vol. 98(10), pp. 1692–1715, 2010.
  • [51] Summerfield A.Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. In: B. Dodd, R. Campbell, eds., Hearing by eye: The psychology of lip-reading. Lawrence Erlbaum Associates, Inc, 1987.
  • [52] Teissier P., Robert-Ribes J., Schwartz J.L., Guérin-Dugué A.: Comparing models for audiovisual fusion in a noisy-vowel recognition task. In: Speech and Audio Processing, IEEE Transactions on, vol. 7(6), pp. 629–642, 1999.
  • [53] Tremain T.E.: The government standard linear predictive coding algorithm: LPC-10. In: Speech Technology, vol. 1(2), pp. 40–49, 1982.
  • [54] Viola P., Jones M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1, pp. 1–9, 2001.
  • [55] Young S., Evermann G., Gales M., Hain T., Kershaw D., Liu X.A., Moore G., Odell J., Ollason D., Povey D., et al.: The HTK book (for HTK version 3.4), Cambridge University Engineering Department, 2006.
  • [56] Ziółko M., Gałka J., Ziółko B., Jadczyk T., Skurzok D., Masior M.: Automatic speech recognition system dedicated for Polish. In: Proceedings of the Interspeech 2011 Conference, Florence, Italy, pp. 3315–3316, 2011.
  • [57] Ziółko M., Ziółko B., Skurzok D.: Ortfon2 – tool for orthographic to phonetic transcription. In: Human language technologies as a challenge for computer science and linguistics: Proceedings of the 7th Language & Technology Conference, November 27–29, 2015, Poznań, Poland, pp. 115–119, 2015.
Notes
Record created under agreement 509/P-DUN/2018 with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science-dissemination activities (2018).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-fb05464d-bae6-41fd-a8b3-ac89c9821521