Language modeling plays an important role in Large Vocabulary Continuous Speech Recognition (LVCSR) and significantly influences the choice of the correct word-sequence hypothesis. The paper begins by reviewing popular models used in this field as a basis for further considerations. Its main scope is the modeling of the Polish language for LVCSR purposes. Because word order in Slavonic languages is generally not as strict (or as important) as it is, for example, in English, less emphasis should be placed on model elements such as the trigram statistical model, in which word order is taken into account. We instead apply Head-driven Phrase Structure Grammar (HPSG) and propose automatic methods for obtaining constraints, i.e. the general rules required by this grammar.
This paper investigates the idea of multilevel correction of handwriting OCR results for medical texts. The correction is performed using different levels of linguistic knowledge. Three types of models are presented and their results compared: n-gram language models over word-form and base-form sequences, a morpho-syntactic model based on a tagger, and a model of correction by parsing. The parsing model combines a deterministic Czech parser adapted for Polish with a Structured Language Model based on lexicalised binary parse trees produced in a left-to-right manner. Contrary to initial expectations, the best correction result, improving the 82% accuracy of the word-level classifier to 92.98% overall accuracy, was achieved with the n-gram language models: the richer the description of language expressions in a model, the worse the results obtained. This outcome is to a large extent caused by the specific characteristics of the processed medical documents.
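The n-gram correction described above can be sketched as rescoring the word-level classifier's candidate lists with a language model. The following is a minimal illustrative sketch, not the paper's implementation: the toy corpus, the candidate lists, the add-alpha smoothing, and the greedy left-to-right decoding are all assumptions introduced here for demonstration.

```python
from collections import defaultdict

def train_bigram(corpus, alpha=0.1):
    """Count bigrams over a toy corpus with add-alpha smoothing;
    return a conditional probability function P(cur | prev)."""
    bigrams = defaultdict(int)
    unigrams = defaultdict(int)
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
            vocab.add(cur)
    V = len(vocab)
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * V)
    return prob

def correct(candidates, prob):
    """Greedily pick, left to right, the candidate most probable
    given the previously chosen word (a simplification of full
    Viterbi decoding over the candidate lattice)."""
    prev, out = "<s>", []
    for cands in candidates:
        best = max(cands, key=lambda w: prob(prev, w))
        out.append(best)
        prev = best
    return out

# Hypothetical training sentences (medical-style Polish phrases):
corpus = [["pacjent", "zgłasza", "ból", "głowy"],
          ["pacjent", "zgłasza", "ból", "brzucha"]]
prob = train_bigram(corpus)

# The OCR word classifier proposes confusable alternatives per position;
# the language model picks the sequence that fits its statistics.
candidates = [["pacjent", "pacjenl"], ["zgłasza"], ["bol", "ból"], ["głowy"]]
print(correct(candidates, prob))  # → ['pacjent', 'zgłasza', 'ból', 'głowy']
```

In the paper's setting the same idea operates over word-form and base-form sequences with far larger models; the sketch only shows why local n-gram statistics can repair character-level OCR confusions such as "bol" vs "ból".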