An Effective Speaker Clustering Method using UBM and Ultra-Short Training Utterances

Hossa, R.; Makowski, R.

doi:10.1515/aoa-2016-0011

Artykuł - szczegóły

Tytuł artykułu

An Effective Speaker Clustering Method using UBM and Ultra-Short Training Utterances

Autorzy

Hossa R. , Makowski R.

Treść / Zawartość

Pełne teksty:

Pobierz

Identyfikatory

DOI

10.1515/aoa-2016-0011

Warianty tytułu

Języki publikacji

Abstrakty

The same speech sounds (phones) produced by different speakers can sometimes exhibit significant differences. Therefore, it is essential to use algorithms compensating these differences in ASR systems. Speaker clustering is an attractive solution to the compensation problem, as it does not require long utterances or high computational effort at the recognition stage. The report proposes a clustering method based solely on adaptation of UBM model weights. This solution has turned out to be effective even when using a very short utterance. The obtained improvement of frame recognition quality measured by means of frame error rate is over 5%. It is noteworthy that this improvement concerns all vowels, even though the clustering discussed in this report was based only on the phoneme a. This indicates a strong correlation between the articulation of different vowels, which is probably related to the size of the vocal tract.

Słowa kluczowe

automatic speech recognition interindividual difference compensation speaker clustering universal background model GMM weighting factor adaptation

Wydawca

Instytut Podstawowych Problemów Techniki PAN
Komitet Akustyki PAN
Polskie Towarzystwo Akustyczne

Czasopismo

Archives of Acoustics

Rocznik

2016

Tom

Vol. 41, No. 1

Strony

107--118

Opis fizyczny

Bibliogr. 26 poz., rys., tab., wykr.

Twórcy

autor

Hossa R.

robert.hossa@pwr.edu.pl

Signal Processing Systems Department, Faculty of Electronics, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland

autor

Makowski R.

ryszard.makowski@pwr.edu.pl

Signal Processing Systems Department, Faculty of Electronics, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland

Bibliografia

1. Anderson T.W. (2003), An Introduction to Multivariate Statistical Analysis, 3rd ed., John Wiley & Sons Inc, New York.
2. Basseville M. (1989), Distance measures for signal processing and pattern recognition, Signal Processing, 18, 349–369.
3. Bishop C.M. (2006), Pattern Recognition and Machine Learning, Springer, New York.
4. Chu S.M., Tang H., Huand T.S. (2009a), Locality preserving speaker clustering, Proceedings of IEEE International Conference on Multimedia and Expo, pp. 494–497, Mexico.
5. Chu S.M., Tang H., Huang T.S. (2009b), Fisher-voice and semi-supervised speaker clustering, International Conference on Acoustics, Speech and Signal Processing, pp. 4089–4092, Taipei.
6. De La Torre A., Peinada A.M., Segura J.C., Perez-Cordoba J.L., Benitez M.C., Rubio A.J. (2005), Histogram equalization of speech representation for robust speech recognition, IEEE Transaction on Speech and Audio Processing, 13, 355–366.
7. Duda R., Hart P., Stork D. (2000), Pattern Classification, 2-nd ed., John Wiley & Sons Inc., New York.
8. Hazen T.J. (2000), A comparison of novel techniques for rapid speaker adaptation, Speech Communication, 31, 15–33.
9. He X., Niyogi P. (2003), Locality Preserving Projections, Advances in Neural Information Processing Systems, 16, Vancuver.
10. Iyer A.N., Ofoegbu U.O., Yantorno R.E., Smoliński B.Y. (2006), Blind Speaker Clustering, International Symposium on Intelligent Signal Processing and Communications Systems, pp. 343–346, Yonago.
11. Jassem W. (1973), Fundamentals of Acoustic Phonetics [in Polish: Podstawy fonetyki akustycznej ], PWN, Warszawa.
12. Kosaka T., Sagayama S. (1994), Tree-structured speaker clustering for fast speaker adaptation, Procedings of International Conference on Acoustics, Speech and Signal Processing, pp. 245—248, Ostendorf.
13. Kuhn R., Junqua J.-C., Nguyen P., Niedzielski N. (2000), Rapid speaker adaptation in eigenvoice space, IEEE Transaction on Speech and Audio Processing, 8, 695–707.
14. Liu D., Kubala F. (2004), Online Speaker Clustering, Procedings of International Conference on Acoustics, Speech and Signal Processing, pp. 333–336, Quebec.
15. Lu Z., Hui Y.V., Lee A.H. (2003), Minimum Hellinger distance estimation for finite Poisson regression models and its applications, Biometrics, 59, 1016–1026.
16. Mehrabani M., Hansen J.H.L. (2013), Singing speaker clustering based on subspace learning in the GMM mean supervector space, Speech Communication, 55, 653–666.
17. Makowski R. (2011), Automatic speech recognition – selected problems [in Polish: Automatyczne rozpoznawanie mowy – wybrane zagadnienia], Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław.
18. Makowski R., Hossa R. (2014), Automatic speech signal segmentation based on innovations adaptive filter, International Journal on Applied Mathematics and Computer Science, 24, 259–270.
19. Mrówka P., Makowski R. (2008), Normalization of speaker individual characteristics and compensation of linear transmission distortions in command recognition systems, Archives of Acoustics, 33, 221–242.
20. Naito M., Deng L., Sagisaka Y. (2002), Speaker clustering for speech recognition using vocal track parameters, Speech Communication, 36, 305–315.
21. Reynolds D.A., Rose R.C. (1995), Robust text-independent speaker identification using gaussian mixture speaker models, IEEE Transaction on Speech and Audio Processing, 3, 72–83.
22. Reynolds D.A., Quatieri T.F., Dunn R.B. (2000), Speaker verification using adaptive gaussian mixture models, Digital Signal Processing, 10, 19–41.
23. Stafylakis T., Katsouros V., Carayannis G. (2006), The segmental Bayesian Information Criterion and its applications to Speaker diarization, IEEE Selected Topics in Signal Processing, 4, 857–866.
24. Tang H., Chu S.M., Hasegawa-Johnson M., Huang T.S. (2012), Partially Supervised Speaker Clustering, IEEE Transaction on Pattern Analysis and Machine Intelligence, 34, 959–971.
25. Tranter S., Reynolds D. (2006), An overwiew of Autmatic Speaker Diarization Systems, IEEE Transaction Audio, Speech and Language Processing, 14, 1557–1565.
26. Tsai W-H., Cheng S-S., Wang H-M. (2007), Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation, IEEE Transaction on Audio, Speech and Language Processing, 15, 1461–1474.

Uwagi

Opracowanie ze środków MNiSW w ramach umowy 812/P-DUN/2016 na działalność upowszechniającą naukę.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-8845ebdb-24c1-4202-9a70-f0630fac4473