The article presents research on the automatic whispery speech recognition. The main task was to find dependences between a number of triphone classes (number of leaves in decision tree) and the total number of Gaussian distributions and therefore, to determine optimal values, for which the quality of speech recognition is best. Moreover, it was found, how these dependences differ between normal and whispery speech, what was not done earlier, and this is the innovative part of this work. Based on the performed experiments and obtained results one can say that the number of triphone classes (number of leaves) for whispered speech should be significantly lower than for normal speech.
2
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
In this paper, the automatic speech recognition task has been presented. Used toolkits, libraries and prepared speech corpus have been described. The obtained results suggest, that using different acoustic models for normal speech and whispered speech can reduce word error rate. The optimal training steps has been also selected. Thanks to the additional simulations it has been found that used corpus (over 9 hours of normal speech and the same of the whispery speech) is definitely too small and must be enlarged in the future.
PL
W artykule przedstawiono automatyczne rozpoznawanie mowy. Wykorzystane narzędzia, biblioteki i korpus opisano w artykule. Uzyskane wyniki wskazują, że wykorzystując różne modele akustyczne dla mowy zwykłej i szeptanej uzyskuje się polepszenie skuteczności rozpoznawania mowy. W wyniku wykonanych badań wskazano również optymalną kolejność kroków treningu. Dzięki dodatkowym obliczeniom stwierdzono, że użyty korpus (ponad 9 godzin zwykłej mowy i drugie tyle szeptu) jest zdecydowanie za mały do dobrego wytrenowania systemu rozpoznawania mowy i w przyszłości musi zostać powiększony.
3
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
Formant frequencies are important cues for characterizing whispered speech. However, it is difficult to exactly estimate its formant by the conventional linear prediction coding algorithm. The main reason is that the formant bandwidth of a whisper is wider than that of voiced speech. This brings up the pole interaction problem that then leads to the result that one or more real roots are regarded as spurious and deleted from the original LP polynomial. To reduce the degradation of pole interactions, an improved root-finding formant estimation algorithm has been proposed. In this algorithm, the whisper formant bandwidth is modified to make the spectral energy of the remained formant polynomial equal to that of the original LP polynomial. Experimental results with six Chinese whispered monophthong phonemes show that the formant frequencies obtained by the proposed algorithm produce a more reliable formant spectrum than the one that does not consider the pole interaction effect.
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.