Wizyjna synteza mowy dla języka polskiego z wykorzystaniem klatek kluczowych

Janicki, A.; Bloch, J.; Taylor, K.

Article details

Article title

Wizyjna synteza mowy dla języka polskiego z wykorzystaniem klatek kluczowych

Authors

Janicki A., Bloch J., Taylor K.

Title variants

Visual speech synthesis for Polish using keyframe-based animation

Abstract

This paper describes the design of a visual speech synthesis system for Polish. The system uses the Xface toolkit with animation based on keyframe interpolation. The paper describes the creation of the "Karol" head model and a set of Polish visemes. Half-visemes were proposed for synthesizing fast visual speech, and the idea was verified in tests. It was then combined with omitting selected keyframes. Subjective tests showed that the animations generated by the proposed system were rated as quite natural (3.9 on the MOS scale) and well synchronized with the audio signal.

Keywords

visual speech synthesis, talking heads, viseme, keyframes, subjective tests

Publisher

Wydawnictwo SIGMA-NOT

Journal

Elektronika : konstrukcje, technologie, zastosowania

Year

2010

Volume

Vol. 51, no. 12

Pages

26-29

Physical description

Bibliography: 17 items, charts.

Contributors

author

Janicki A.

author

Bloch J.

author

Taylor K.

Warsaw University of Technology, Institute of Telecommunications

References

[1] Theobald B.-J., Fagel S., Bailly G., Elisei F.: LIPS2008: Visual Speech Synthesis Challenge. Proc. Interspeech 2008, Brisbane, Australia, pp. 2310-2313, September 2008.
[2] Beskow J., Karlsson I., Kewley J., Salvi G.: SYNFACE - A Talking Head Telephone for the Hearing-Impaired, in: K. Miesenberger et al. (Eds.): "Computers Helping People with Special Needs", LNCS Vol. 3118, pp. 1178-1185, Springer-Verlag Berlin Heidelberg, 2004.
[3] Takács G., Tihanyi A., Bárdi T., Feldhoffer G., Srancsik B.: Signal Conversion from Natural Audio Speech to Synthetic Visible Speech. Proc. ICSES 2006, Łódź, pp. 261-264, September 2006.
[4] Edge J. D., Hilton A., Jackson P.: Model-Based Synthesis of Visual Speech Movements from 3D Video. EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, article ID 597267, 12 pp., 2009. doi:10.1155/2009/597267.
[5] McGurk H., MacDonald J.: Hearing lips and seeing voices. Nature, Vol. 264 (5588), pp. 746-748, 1976.
[6] Beskow J., Nordenberg M.: Data-driven synthesis of expressive visual speech using an MPEG-4 talking head. Proc. Interspeech 2005, Lisbon, Portugal, pp. 793-796, September 2005.
[7] Tao J.: Realistic Visual Speech Synthesis Based on Hybrid Concatenation Method. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 3, March 2009.
[8] Ezzat T., Poggio T.: Visual Speech Synthesis by Morphing Visemes. International Journal of Computer Vision, Vol. 38, No. 1, June 2000.
[9] Wang L., Qian X., Ma L., Qian Y., Chen Y., Soong F.: A Real-Time Text to Audio-Visual Speech Synthesis System. Proc. Interspeech 2008, Brisbane, Australia, pp. 2338-2341, September 2008.
[10] Leszczyński M., Skarbek W.: Viseme Classification for Talking Head Application, in: A. Gagalowicz and W. Philips (Eds.): CAIP 2005, LNCS Vol. 3691, pp. 773-780, Springer-Verlag Berlin Heidelberg, 2005.
[11] Balci K.: Xface: MPEG-4 based open source toolkit for 3D facial animation. Proc. Advanced Visual Interfaces 2004, Gallipoli, Italy, 2004.
[12] Singular Inversions Inc., "FaceGen-3D Human Faces", http://www.facegen.com
[13] Pelachaud C. et al.: Modelling an Italian Talking Head. Audio-Visual Speech Processing, Scheelsminde, Denmark, September 2001.
[14] Wójcik A.: Wizyjna synteza mowy (in Polish). Engineering thesis (unpublished), Warsaw, 2007.
[15] Styczek I.: Logopedia (in Polish), Warsaw, PWN, 1981.
[16] Cohen M. M., Massaro D. W.: Modeling coarticulation in synthetic visual speech, in: N. M. Thalmann & D. Thalmann (Eds.) "Models and Techniques in Computer Animation", Tokyo, Springer-Verlag, 1993.
[17] Minnis S., Breen A.: Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis. Proc. of the 6th International Conference on Spoken Language Processing, Beijing, China, vol. II, pp. 759-762, 2000.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BWAW-0006-0005