Results found: 2

Search results
Searched for:
in keywords: tokenisation
1
Massive multi lingual corpus compilation: Acquis Communautaire and totale
EN
Large, uniformly encoded collections of texts, corpora, are an invaluable source of data, not only for linguists, but also for Language Technology tools. Especially useful are multilingual parallel corpora, as they enable, e.g., the induction of translation knowledge in the shape of multilingual lexica or full-fledged machine translation models. But parallel corpora, especially large ones, are still scarce, and have been, so far, difficult to acquire; recently, however, a large new source of parallel texts has become available on the Web, which contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different languages. The paper discusses the compilation of this text collection into the massively multilingual JRC-Acquis corpus, which is freely available for research use. Next, the text annotation tool "totale", which performs multilingual text tokenization, tagging and lemmatisation, is presented. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. We describe the MULTEXT-East corpus and lexicons, which have been used to train totale for seven languages, and the application of the tool to the Slovene part of the JRC-Acquis corpus.
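The pipelined architecture mentioned in the abstract (tokenization, then tagging, then lemmatisation) can be illustrated with a minimal Python sketch. This is not the totale implementation: the function names, the token representation, and the toy lexicons below are all assumptions made for illustration, and the simple lookups merely stand in for the trained models the abstract describes.

```python
import re
from typing import Callable, Dict, List, Tuple

Token = Dict[str, str]                      # hypothetical token record
Stage = Callable[[List[Token]], List[Token]]


def tokenise(text: str) -> List[Token]:
    """Split text into word and punctuation tokens (naive, language-agnostic)."""
    return [{"form": m} for m in re.findall(r"\w+|[^\w\s]", text)]


def make_tagger(tag_lexicon: Dict[str, str], default: str = "X") -> Stage:
    """Build a toy tagger: a lexicon lookup standing in for a trained model."""
    def tag(tokens: List[Token]) -> List[Token]:
        return [{**t, "tag": tag_lexicon.get(t["form"].lower(), default)}
                for t in tokens]
    return tag


def make_lemmatiser(lemma_lexicon: Dict[Tuple[str, str], str]) -> Stage:
    """Build a toy lemmatiser keyed on (word form, tag); falls back to the form."""
    def lemmatise(tokens: List[Token]) -> List[Token]:
        return [{**t, "lemma": lemma_lexicon.get((t["form"].lower(), t["tag"]),
                                                 t["form"].lower())}
                for t in tokens]
    return lemmatise


def annotate(text: str, stages: List[Stage]) -> List[Token]:
    """Run the pipeline: tokenise first, then apply each stage in order."""
    tokens = tokenise(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Toy, made-up lexicons (assumptions, not MULTEXT-East data).
tagger = make_tagger({"corpora": "N", "are": "V", "useful": "A", ".": "PUNCT"})
lemmatiser = make_lemmatiser({("corpora", "N"): "corpus", ("are", "V"): "be"})
result = annotate("Corpora are useful.", [tagger, lemmatiser])
```

Because the lemmatiser reads the tag written by the previous stage, the order of the stages matters; this dependency is what makes the design a pipeline rather than a set of independent annotators.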
EN
The aim of this paper is to provide a corpus-based analysis of one type of Czech proper nouns (the Zubří type). We will argue that the adequate annotation (lemmatisation and morphological tagging) of proper nouns of the Zubří type depends on several circumstances: 1) the coverage of the dictionary of the automatic analyser; 2) the accurate description of the variability of inflexional forms; 3) the non-trivial disambiguation of numerous homonymous word forms. We believe that while meeting the first two conditions is possible, adequate disambiguation goes beyond the possibilities of automatic morphological analysis.