From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Sagot, B.; Boullier, P.

Article - details

Article title

From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Authors

Sagot B., Boullier P.

Selected full texts from this journal

https://journals.pan.pl/acs/

Conference

Human Language Technologies as a challenge for Computer Science and Linguistics (2; 21-23.04.2005; Poznań, Poland)

Abstracts

We present a robust full-featured architecture to preprocess text before parsing. This architecture, called SxPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It includes sequentially named-entity recognition, tokenization and sentence boundaries detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-words processing, re-accentuation and un-/re-capitalization. Though our system currently deals with the French language, almost all components are in fact language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices, all words being present in the lexicon. It has been applied on a large scale during a French parsing evaluation campaign and during experiments of large corpora parsing, showing both good efficiency and very satisfying precision and recall.

Keywords

raw text processing; named entities recognition; spelling correction; ambiguous tokenization

Publisher

Polish Academy of Sciences, Committee of Automatic Control and Robotics

Journal

Archives of Control Sciences

Year

2005

Volume

Vol. 15, no. 4

Pages

655-664

Physical description

Bibliography: 5 items, figures, tables.

Contributors

author

Sagot B.

benoit.sagot@inria.fr

INRIA, Projet Atoll, Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France

author

Boullier P.

pierre.boullier@inria.fr

INRIA, Projet Atoll, Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France

Bibliography

[1] F. Barthelemy, P. Boullier, P. Deschamp and E. De La Clergerie: Guided parsing of range concatenation languages. In Proc. ACL'01. Toulouse, France, (2001), 42-49.
[2] L. Clement and E. De La Clergerie: Terminology and other language resources - Morpho-Syntactic Annotation Framework (MAF). ISO TC37SC4 WG2 Working Draft. 2004.
[3] G. Grefenstette and P. Tapanainen: What is a word, what is a sentence? Problems of tokenization. In Proc. 3rd Conf. on Computational Lexicography and Text Research. Budapest, Hungary, (1994).
[4] K. Kukich: Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), (1992), 377-439.
[5] D. Maynard, V. Tablan, C. Ursu, H. Cunningham and Y. Wilks: Named entity recognition from diverse text types. In Proc. RANLP'2001, Tzigov Chark, Bulgaria, (2001).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BSW3-0021-0035