Article title

Massive multi-lingual corpus compilation: Acquis Communautaire and totale

Conference
Human Language Technologies as a challenge for Computer Science and Linguistics (2; 21-23.04.2005; Poznań, Poland)
Publication languages
EN
Abstracts
EN
Large, uniformly encoded collections of texts (corpora) are an invaluable source of data, not only for linguists but also for Language Technology tools. Multilingual parallel corpora are especially useful, as they enable, e.g., the induction of translation knowledge in the shape of multilingual lexica or full-fledged machine translation models. But parallel corpora, especially large ones, are still scarce and have so far been difficult to acquire. Recently, however, a large new source of parallel texts has become available on the Web: it contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different languages. The paper discusses the compilation of this text collection into the massively multilingual JRC-Acquis corpus, which is freely available for research use. Next, the text annotation tool "totale", which performs multilingual text tokenization, tagging and lemmatisation, is presented. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. We describe the MULTEXT-East corpus and lexicons, which have been used to train totale for seven languages, and the application of the tool to the Slovene part of the JRC-Acquis corpus.
Pages
529--540
Physical description
Bibliography: 14 items, figures, tables.
Authors
author
  • Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
author
  • European Commission-Joint Research Centre, I-21020 Ispra (VA), Italy
author
  • European Commission-Joint Research Centre, I-21020 Ispra (VA), Italy
  • European Commission-Joint Research Centre, I-21020 Ispra (VA), Italy
Bibliography
  • [1] S. Armstrong, M. Kempen, D. McKelvie, D. Petitpierre, R. Rapp and H. Thompson: Multilingual Corpora for Cooperation. Proc. 1st Int. Conf. on Language Resources and Evaluation, ELRA, Paris, (1998), 579-980.
  • [2] T. Brants: TnT - A Statistical Part-of-Speech Tagger. Proc. 6th Applied Natural Language Processing Conf., Seattle, WA, USA, (2000), 224-231.
  • [3] L. Dimitrova, T. Erjavec, N. Ide, H.-J. Kaalep, V. Petkevic and D. Tufis: MULTEXT-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. Proc. COLING-ACL'98, Montreal, Quebec, Canada, (1998).
  • [4] W. Gale and K. W. Church: A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1), (1993), 75-102.
  • [5] P. Di Cristo: MtSeg: The Multext multilingual segmenter tools. MULTEXT Deliverable MSG 1, Version 1.3.1. CNRS, Aix-en-Provence. http://www.lpl.univ-aix.fr/projects/multext/MtSeg/, (1996).
  • [6] P. Danielsson and D. Ridings: Practical Presentation of a "Vanilla" Aligner. TELRI Newsletter No. 5, Institut für Deutsche Sprache, Mannheim. http://nl.ijs.si/telri/Vanilla/, (1997).
  • [7] T. Erjavec and S. Dzeroski: Machine Learning of Language Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence, 18(1), (2004), 17-41.
  • [8] T. Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. 4th Int. Conf. on Language Resources and Evaluation, ELRA, Paris, France, (2004), 1535-1538.
  • [9] P. Koehn: Europarl: A Multilingual Corpus for Evaluation of Machine Translation. http://people.csail.mit.edu/people/koehn/publications/europarl/, (2002).
  • [10] T. McEnery, A. Wilson, P. Sanchez-Leon and A. Nieto-Serrano: Multilingual Resources in European Languages: Contributions of the Crater Project. Literary and Linguistic Computing, 12(4), (1997).
  • [11] B. Pouliquen, R. Steinberger and C. Ignat: Automatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus. Proc. of the Workshop Ontologies and Information Extraction at Eurolan'2003, Bucharest, Romania, (2003).
  • [12] B. Pouliquen, R. Steinberger and C. Ignat: Automatic Linking of Similar Texts across Languages. In: Recent Advances in Natural Language Processing III. John Benjamins Publishers, Amsterdam, (2004).
  • [13] S. Manandhar, S. Dzeroski and T. Erjavec: Learning Multilingual Morphology with CLOG. Proc. of Inductive Logic Programming, 8th Int. Workshop ILP-98 (Lecture Notes in Artificial Intelligence 1446). Springer-Verlag, Berlin, (1998), 135-144.
  • [14] C. M. Sperberg-McQueen and L. Burnard (Eds.): Guidelines for Electronic Text Encoding and Interchange, The XML Version of the TEI Guidelines. The TEI Consortium, http://www.tei-c.org/, (2002).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BSW3-0021-0023