Article title

Construction of a medical corpus based on information extraction results

Publication languages
EN
Abstracts
EN
The paper presents a method for automatically constructing a semantically annotated corpus from the results of a rule-based information extraction (IE) application. Corpus construction relies on existing programs for text tokenization and morphological analysis, whose results are combined with domain-related correction rules. We reuse the specialized IE system to obtain a corpus annotated at the semantic level. The texts included in the corpus are Polish free-text clinical data. We present the documents (diabetic patients' discharge records), the structure of the corpus annotation, and the methods for obtaining the annotations. Initial evaluations based on manual verification of a selected data subset are also presented. The corpus, once manually corrected, is intended for developing supervised machine learning models for IE applications.
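One way to picture how IE results could become token-level semantic annotations is sketched below. This is only an illustration under stated assumptions, not the authors' system: the whitespace tokenizer stands in for a real tokenizer and morphological analyser, and the sample sentence, the character offsets, and the labels `DIABETES_TYPE` and `DURATION` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenizer; a stand-in for a real tokenizer (assumption)."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        end = start + len(piece)
        tokens.append(Token(piece, start, end))
        pos = end
    return tokens


def project_spans(tokens: List[Token],
                  spans: List[Tuple[int, int, str]]) -> List[str]:
    """Map character-offset IE spans (start, end, label) to BIO token labels."""
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, tok in enumerate(tokens):
            if tok.start < end and tok.end > start:  # token overlaps the span
                labels[i] = ("I-" if inside else "B-") + label
                inside = True
    return labels


# Hypothetical fragment of a discharge record and hypothetical IE output.
text = "Cukrzyca typu 2 od 5 lat"
ie_spans = [(0, 15, "DIABETES_TYPE"), (16, 24, "DURATION")]

tokens = tokenize(text)
labels = project_spans(tokens, ie_spans)
for tok, lab in zip(tokens, labels):
    print(f"{tok.text}\t{lab}")
```

Token-level labels of this kind are a common input format for supervised sequence models, which is the intended use of the corpus stated in the abstract; the actual annotation structure is described in the paper itself.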
Year
Pages
337-360
Physical description
Bibliography: 28 items
Authors
author
  • Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BATC-0008-0007