Automatic protein abbreviations discovery and resolution from full-text scientific papers: the PRAISED framework

Toti, D; Atzeni, P; Polticelli, F

Abstract

This paper describes a methodology for discovering and resolving protein name abbreviations in the full-text versions of scientific articles. The methodology is implemented in the PRAISED framework, whose ultimate purpose is to build a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, which identifies all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which applies a number of syntactic and semantic criteria to match each abbreviation with its potential explanation; and iii) a dictionary-based protein name identification phase, which selects only those abbreviations belonging to the protein science domain. A local copy of the UniProt database serves as the source repository of all known proteins. The PRAISED implementation has been tested against several known annotated corpora, such as the Medstract Gold Standard Corpus, the AB3P Corpus, the BioText Corpus and the Ao and Takagi Corpus. Compared with the most relevant similar methods, it achieved high levels of recall and extremely fast performance while maintaining promising levels of precision and overall F-measure. This comparison covers only Phases 1 and 2, since those methods stop at expanding abbreviations and perform no entity recognition. The entity recognition performed in the last phase, by contrast, gives PRAISED an effective strategy for protein discovery, moving beyond existing context-free techniques. The implementation also addresses the complexity of full-text papers rather than the simpler abstracts more commonly used. Accordingly, the whole PRAISED process (Phases 1, 2 and 3) has also been tested against a manually annotated subset of full-text papers retrieved from the PubMed repository, again with significant results.
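
The three phases named in the abstract can be illustrated with a minimal sketch. It is not the PRAISED implementation: the abstract does not specify the metrics or criteria used, so Phase 1 is approximated here by a parenthesised short-form pattern, Phase 2 by right-to-left character matching in the style of the Schwartz and Hearst algorithm [6], and Phase 3 by lookup in a tiny hard-coded set standing in for a local UniProt copy. All function names and the `TOY_PROTEIN_NAMES` set are hypothetical.

```python
import re

# Assumption: a toy stand-in for a local UniProt copy (lowercase protein names).
TOY_PROTEIN_NAMES = {"tumor necrosis factor", "heat shock protein"}


def find_candidates(text):
    """Phase 1 (sketch): parenthesised tokens of 2-10 characters with an uppercase letter."""
    candidates = []
    for match in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", text):
        short = match.group(1)
        if any(ch.isupper() for ch in short):
            candidates.append((short, match.start()))
    return candidates


def resolve(short, preceding):
    """Phase 2 (sketch): match the short form's characters right to left in the
    preceding text; the first character must start a word."""
    l_idx = len(preceding) - 1
    for s_idx in range(len(short) - 1, -1, -1):
        ch = short[s_idx].lower()
        if not ch.isalnum():
            continue
        while l_idx >= 0 and (
            preceding[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and preceding[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # no explanation found in the window
        l_idx -= 1
    start = preceding.rfind(" ", 0, l_idx + 1) + 1
    return preceding[start:]


def extract_protein_abbreviations(text, protein_names=TOY_PROTEIN_NAMES):
    """Run the three phases and keep only abbreviations whose long form is a known protein."""
    results = {}
    for short, pos in find_candidates(text):
        words = text[:pos].split()
        # Search window size follows the Schwartz and Hearst heuristic.
        window = " ".join(words[-min(len(short) + 5, len(short) * 2):])
        long_form = resolve(short, window)
        # Phase 3 (sketch): dictionary-based filter.
        if long_form and long_form.lower() in protein_names:
            results[short] = long_form
    return results


sample = ("Levels of tumor necrosis factor (TNF) and heat shock protein (HSP) "
          "rose, unlike in the control group (CG).")
print(extract_protein_abbreviations(sample))
# {'TNF': 'tumor necrosis factor', 'HSP': 'heat shock protein'}
```

In this sketch "CG" is resolved to "control group" in the second step but discarded in the third, which mirrors the role the abstract assigns to the dictionary-based phase: restricting the resolved abbreviations to the protein domain.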

Keywords

proteins, abbreviations, data mining, extraction, resolution

Publisher

Uniwersytet Jagielloński - Collegium Medicum
Index Copernicus Sp. z o.o.

Journal

Bio-Algorithms and Med-Systems

Year

2012

Volume

Vol. 8, no. 1

Pages

13-51

Physical description

Bibliography: 24 items; figures, tables, charts.

Contributors

author

Toti D

Department of Computer Science and Automation, University of Roma Tre, 00146 Rome, Italy

author

Atzeni P

Department of Computer Science and Automation, University of Roma Tre, 00146 Rome, Italy

author

Polticelli F

polticel@uniroma3.it

Department of Biology, University of Roma Tre, 00146 Rome, Italy
National Institute of Nuclear Physics, Roma Tre Section, 00146 Rome, Italy
Department of Computer Science and Automation, University of Roma Tre, 00146 Rome, Italy

References

[1] Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M, Rumshisky A: Automatic Extraction of Acronym-meaning Pairs from MEDLINE Databases, Medinfo 10, 371-375, 2001.
[2] Taghva K, Gilbreth J: Recognizing acronyms and their definitions, International Journal on Document Analysis and Recognition 1, 191-198, 1999.
[3] Yeates S: Automatic extraction of acronyms from text, In Third New Zealand Computer Science Research Students' Conference 117-124, 1999.
[4] Larkey L, Ogilvie P, Price A, Tamilio B: Acrophile: An Automated Acronym Extractor and Server, In Proceedings of the ACM Digital Libraries conference 205-214, 2000.
[5] Park Y, Byrd RJ: Hybrid Text Mining for Finding Abbreviations and Their Definitions, In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 2001.
[6] Schwartz A, Hearst M: A simple algorithm for identifying abbreviation definitions in biomedical texts, In Proceedings of the Pacific Symposium on Biocomputing (PSB), 2003.
[7] Yu H, Hripcsak G, Friedman C: Mapping abbreviations to full forms in biomedical articles, J. Am. Med. Inform. Assoc. 9, 262-272, 2002.
[8] Nadeau D, Turney PD: A Supervised Learning Approach to Acronym Identification. In 18th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, 2005.
[9] Chang JT, Schütze H, Altman RB: Creating an Online Dictionary of Abbreviations from MEDLINE, J. Am. Med. Inform. Assoc. 9, 612-620, 2002.
[10] Yoshida M, Fukuda K, Takagi T: PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary, Bioinformatics 16, 169-175, 2000.
[11] Fukuda K, Tsunoda T, Tamura A, Takagi T: Toward Information Extraction: Identifying protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing 705-716, 1998.
[12] Kuo C, Ling MHT, Lin KT, Hsu CN: BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature, BMC Bioinformatics 10, S7, 2009.
[13] Ao H, Takagi T: ALICE: an algorithm to extract abbreviations from MEDLINE, J. Am. Med. Inform. Assoc. 12, 576-586, 2005.
[14] Gawlik M, Strable C: Comparison of abbreviation recognition algorithms, http://acm.mscs.mu.edu/wiki-reu/index.php/User:Mgawlik
[15] Hersh W, Voorhees E: TREC genomics special issue overview, Information Retrieval 12, 1-15, 2009.
[16] Atzeni P, Polticelli F, Toti D: An Automatic Identification and Resolution System for Protein-related Abbreviations in Scientific Papers, In EvoBio, 2011.
[17] http://www.medstract.org/index.php?f=gold-standard
[18] http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/
[19] http://biotext.berkeley.edu/data.html
[20] http://3.uvdb.dbcls.jp/ALICE/corpus download.html
[21] http://www.uniprot.org/
[22] http://lucene.apache.org
[23] http://alias-i.com/lingpipe/
[24] http://biotext.berkeley.edu/code/abbrev/ExtractAbbrev.java

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-a940b0c8-c252-4afc-9dcf-fb145004ca47