To support the development of balanced corpora, the notion of the “expectations” of future corpus users was introduced (Králík, 2001). Based on several statistical studies of such expectations, the textual structure of SYN2000, the synchronic part of the Czech National Corpus (CNC), was proposed and implemented. The present article discusses two new studies of expectations (Aktér 2001 and ČJ 2001) and draws out important implications for future work on the CNC. Table 1 and Table 2 show that expectations are stable in the categories of fiction [krásná literatura] and newspapers and magazines [noviny + časopisy]. Although respondents' daily contact with administrative texts is stable (see Table 3), the distribution of these texts is closely bound to other non-fiction topics, which is why no special treatment of administrative texts is proposed. The expectations concerning newspapers and magazines are stable (Table 5), yet they changed radically over 1996–2001 when the first and last surveys are compared (Table 6). Over the same period, a marked rise in interest in fiction was noted (Table 6). These shifts can be attributed to natural societal development. A strong reduction in newspaper texts and a strong increase in fiction texts are therefore proposed (Table 7 and Table 8).
2
The joint project of the Hungarian Academy of Sciences in Budapest and the Czech Academy of Sciences in Prague, Computational Lexicology and Dialogue Research, has not only inspired specific approaches to new linguistic research but has also directed attention toward the history of Hungarian and Czech linguistic description. Some previously hidden parallels in lexicography, grammar and corpus projects have been discovered and discussed. This paper gives an overview of the main similarities in the phases of cultivation of the two languages and reveals, among other things, the important unifying role of the European style of education and scholarly work. In addition, a brief historical outline shows Czech and Hungarian as subjects of linguistic research that held similar positions and solved their specific problems in historically parallel ways. This information makes it possible to present new projects in corpus linguistics in a broader historical context.
CS
The joint project of the Hungarian Academy of Sciences in Budapest and the Academy of Sciences of the Czech Republic in Prague, Computational Lexicology and Dialogue Research, has not only inspired specific approaches in new linguistic research but has also turned attention to the history of the linguistic description of Hungarian and Czech. Some previously hidden parallels in the development of lexicography, grammar and corpus projects have thus become a subject of interest. This article gives an overview of the main similarities in the key stages of the cultivation of the two languages and, among other things, confirms the importance of the unifying role of the European type of education and scholarship. In addition, a brief historical overview shows Czech and Hungarian as objects of linguistic inquiry with a similar standing and, in historical parallels, with a similar approach to their own specific problems. Awareness of these facts makes it possible to place new projects in corpus linguistics, too, in a broader historical context.
3
The Czech Academic Corpus was created during the 1970s and 1980s at the Czech Language Institute under the supervision of Marie Těšitelová. The main motivation for building it (a total of 540 thousand word tokens) was to obtain the quantitative characteristics of contemporary Czech. The corpus is structurally annotated on two levels: the morphological level and the syntactic-analytical level. The first stochastic experiments in the morphological tagging of Czech were performed on the corpus at the beginning of the 1990s, and with them the corpus-based processing of Czech was launched. At the end of the 1990s, work on the Prague Dependency Treebank started (independently of the corpus), and its first edition was published in 2001. With future releases of the treebank in mind, we have decided to convert the corpus into a treebank-like format. This article focuses on the twenty-year history of the Czech Academic Corpus. Special attention is devoted to thus far unpublished facts about the corpus annotation. The conversion steps resulting in the first version of the Czech Academic Corpus are described in detail.