Article title

From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Authors
Conference
Human Language Technologies as a challenge for Computer Science and Linguistics (2; 21-23.04.2005; Poznań, Poland)
Publication languages
EN
Abstracts
EN
We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SxPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It applies, in sequence, named-entity recognition, tokenization and sentence-boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation and un-/re-capitalization. Although our system currently deals with French, almost all components are in fact language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which all words are present in the lexicon. It has been applied on a large scale during a French parsing evaluation campaign and in experiments on parsing large corpora, showing both good efficiency and very satisfying precision and recall.
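To illustrate what a word lattice is and why non-deterministic multi-word processing produces one, here is a minimal Python sketch. It is not SxPipe's implementation: the `WordLattice` class, the `build_lattice` function, and the toy multi-word lexicon are all hypothetical names introduced for this example. Each sentence becomes a directed acyclic graph over token positions, and an ambiguous multi-word unit adds an alternative edge instead of replacing the single-token reading.

```python
class WordLattice:
    """A DAG over token positions; each edge carries one lexicon form.

    Hypothetical structure for illustration only, not SxPipe's own format.
    """

    def __init__(self):
        self.edges = {}  # start position -> list of (end position, form)

    def add(self, start, end, form):
        self.edges.setdefault(start, []).append((end, form))

    def paths(self, start, final):
        """Yield every sequence of forms leading from start to final."""
        if start == final:
            yield []
            return
        for end, form in self.edges.get(start, []):
            for rest in self.paths(end, final):
                yield [form] + rest


def build_lattice(tokens, multiwords):
    """Build a lattice that keeps both single-token and multi-word readings.

    tokens:     list of token strings for one sentence
    multiwords: dict mapping a tuple of tokens to a single lexicon form
                (a toy stand-in for a real lexicon)
    """
    lattice = WordLattice()
    # One edge per individual token.
    for i, tok in enumerate(tokens):
        lattice.add(i, i + 1, tok)
    # One extra edge per multi-word unit found in the toy lexicon.
    for i in range(len(tokens)):
        for j in range(i + 2, len(tokens) + 1):
            form = multiwords.get(tuple(tokens[i:j]))
            if form is not None:
                lattice.add(i, j, form)
    return lattice


tokens = ["pomme", "de", "terre", "cuite"]
multiwords = {("pomme", "de", "terre"): "pomme_de_terre"}
lattice = build_lattice(tokens, multiwords)
readings = list(lattice.paths(0, len(tokens)))
for reading in readings:
    print(" ".join(reading))
```

The lattice keeps both readings of the sentence, so the choice between the compound and its parts is left to the parser rather than being made during preprocessing.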
Year
Pages
655--664
Physical description
Bibliography: 5 items, figures, tables.
Contributors
author
  • INRIA, Projet Atoll, Domaine de Voluceau, Rocquencourt B.P. 105, 78153 Le Chesnay Cedex, France
author
  • INRIA, Projet Atoll, Domaine de Voluceau, Rocquencourt B.P. 105, 78153 Le Chesnay Cedex, France
Bibliography
  • [1] F. Barthelemy, P. Boullier, P. Deschamp and E. De La Clergerie: Guided parsing of range concatenation languages. In Proc. ACL'01. Toulouse, France, (2001), 42-49.
  • [2] L. Clement and E. De La Clergerie: Terminology and other language resources - Morpho-Syntactic Annotation Framework (MAF). ISO TC37SC4 WG2 Working Draft. 2004.
  • [3] G. Grefenstette and P. Tapanainen: What is a word, what is a sentence? Problems of tokenization. In Proc. 3rd Conf. on Computational Lexicography and Text Research. Budapest, Hungary, (1994).
  • [4] K. Kukich: Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), (1992), 377-439.
  • [5] D. Maynard, V. Tablan, C. Ursu, H. Cunningham and Y. Wilks: Named entity recognition from diverse text types. In Proc. RANLP'2001, Tzigov Chark, Bulgaria, (2001).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BSW3-0021-0035