Multilevel correction of OCR of medical texts

Piasecki, M.

Abstract

The paper investigates multilevel correction of the results of handwriting OCR of medical texts. The correction is performed at different levels of linguistic knowledge. Three types of models are presented and their results compared: n-gram language models of word form and base form sequences, a morpho-syntactic model based on a tagger, and a model of correction by parsing. The parsing model combines a deterministic Czech parser adapted for Polish with the Structured Language Model, which is based on lexicalised binary parse trees produced in a left-to-right manner. Contrary to initial expectations, the best correction result, an increase from the 82% accuracy of the word-level classifier to 92.98% overall accuracy, was achieved with the n-gram language models. The richer the description of language expressions in a model, the worse the results obtained. This outcome is to a large extent caused by the specific characteristics of the processed medical documents.
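
As a rough illustration of the n-gram approach named in the abstract, the sketch below rescores word candidates from a word-level classifier with a bigram language model. It is not the paper's implementation: the function names, the add-one smoothing, the `lm_weight` parameter, and the toy English corpus are all assumptions made for this example, and classifier probabilities are assumed to be strictly positive.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenised sentences, with a start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(word | prev) (smoothing choice is an assumption)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def correct(candidates, unigrams, bigrams, lm_weight=1.0):
    """Viterbi search over per-position candidate lists.

    candidates: one list per word position, each holding
    (word, classifier_probability) pairs with probability > 0.
    Returns the sequence maximising classifier log-probability
    plus the weighted bigram log-probability.
    """
    vocab_size = len(unigrams)
    # best[word] = (score, path) of the best path ending in `word`
    best = {"<s>": (0.0, [])}
    for position in candidates:
        new_best = {}
        for word, prob in position:
            options = [
                (score + math.log(prob)
                 + lm_weight * bigram_logprob(prev, word, unigrams, bigrams, vocab_size),
                 path + [word])
                for prev, (score, path) in best.items()
            ]
            new_best[word] = max(options, key=lambda o: o[0])
        best = new_best
    return max(best.values(), key=lambda o: o[0])[1]

# Hypothetical toy corpus and classifier output, for illustration only.
corpus = [["patient", "has", "fever"],
          ["patient", "has", "cough"],
          ["the", "patient", "has", "pain"]]
unigrams, bigrams = train_bigram(corpus)
candidates = [[("patent", 0.6), ("patient", 0.4)],
              [("has", 0.9), ("his", 0.1)],
              [("fever", 0.5), ("lever", 0.5)]]
corrected = correct(candidates, unigrams, bigrams)
```

In this toy case the classifier's top choice for the first word is "patent", but the bigram statistics favour "patient", so the language model overrides the word-level decision.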

Keywords

handwriting OCR; medical documents; language model; tagger; parser; Polish

Publisher

University of Silesia, Institute of Informatics, Computer Systems Department

Journal

Journal of Medical Informatics & Technologies

Year

2007

Volume

Vol. 11

Pages

263–273

Physical description

Bibliography: 18 items; figures; tables.

Contributors

author

Piasecki M.

maciej.piasecki@pwr.wroc.pl

Institute of Applied Informatics, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland

Document type

Bibliographic record

YADDA identifier

bwmeta1.element.baztech-article-PWA4-0007-0028