Results found: 14

Search results
Searched in keywords: quantitative linguistics
The article is an attempt to reconstruct the field of German-language discourse studies and to analyse it critically. Because the discourse-analysis models examined in the article draw very strongly on Michel Foucault’s concept, they are regarded as post-Foucault. The authors present the main threads in German-language discourse studies: (1) approaches whose objective is to formulate a theoretical-methodological basis for a post-Foucault discourse analysis (these are primarily “discipline-specific” schools of discourse analysis, linguistic and sociological, as well as the programme of so-called critical discourse analysis); (2) “dispositive” approaches, which constitute a novelty in the debate over the discourse category and regard the dispositive category as a way of finding a supradiscursive “system.” The authors also reflect on critical remarks about the various threads in the studies, including those formulated by the scholars themselves. The main conclusion of the authors’ reconstruction is that German-language discourse studies tend to understand the category of discourse quite narrowly, within specific disciplines, and that an integrated, interdisciplinary model of discourse analysis is therefore lacking.
The article raises the problem of the possibility of carrying out discourse analysis by means of corpus-based and quantitative methods. Discourse studies use various research methods, which are mostly qualitative. Corpus-based and statistics-based linguistics, on the other hand, offer many tools that can be used to study discourse, from electronic concordancing to calculations that make it possible to discover similarity between texts and genres, marked lexis, keywords, etc. The author briefly describes these methods in the article. She also provides basic information about the structure of a specialist corpus and the selection of a representative sample of the texts that constitute a given discourse.
This article discusses the automatic extraction of relevant words from sets of texts. The author briefly presents three methods designed to extract words from a corpus on the basis of their frequency, or words whose co-occurrence is not random. First, he focuses on the keyword-analysis method; then he discusses the Zeta method developed by John Burrows and Hugh Craig; the third method covered in the article is topic modelling, which has recently become very popular and consists in finding clusters of words that co-occur in similar contexts. Topic modelling was originally intended for quick content searches in large collections of documents. Using 100 Polish novels, the article shows how this method can be used for linguistic studies.
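The keyword-analysis step mentioned above is commonly implemented with a log-likelihood keyness score that compares a word's frequency in a target corpus against a reference corpus; a minimal sketch of that standard statistic (the function name and the illustrative frequencies are mine, not taken from the article):

```python
import math

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Dunning-style log-likelihood keyness of a single word.

    freq_target / freq_ref: the word's frequency in the target and
    reference corpora; size_target / size_ref: total token counts.
    """
    total = size_target + size_ref
    expected_t = size_target * (freq_target + freq_ref) / total
    expected_r = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll
```

A word whose distribution matches the reference exactly scores zero; the more its target frequency departs from expectation, the higher the score.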
This article presents a statistical and comparative analysis of four spelling conventions that represent different stages in the development of the Polish graphic system: the graphic system of a late-medieval manuscript (a hand-written text), the standard spelling convention typical of the first half of the sixteenth century, the accepted standard modern spelling of the first half of the twentieth century, and the innovative set of graphic features used in electronic media. The statistical characterization encompasses dispersion and entropy in the first and second rows of letters, as well as in two-element sets (dyads). The analysis shows that: 1) as regards the degree of differentiation of the distribution of signs, the history of Polish spelling prior to the solidification of the modern standard manifested a self-organizing tendency based on a reduction of letter signs and two-element letter combinations (ligatures) with a frequency of 1; 2) the innovative solutions used in the set of graphic features characteristic of electronic media do not violate the statistical proportions between letters and their dyads specific to the modern standard graphic system; 3) in information-theoretic terms, the transformations of the graphic substance (graphic system) within the analysed chronological timeframe amounted to neither progress (evolution) nor degradation.
The article presents the results of a quantitative analysis of the vocabulary attested in the Polish translation of the work of Pietro de‘ Crescenzi (Piotr Krescentyn), Księgi o gospodarstwie (original title: Opus ruralium commodorum), against the statistical characteristics of the translations of the New Testament from the latter half of the sixteenth century and the beginning of the seventeenth century, and of the text of Wizerunek własny żywota człowieka poczciwego (The Life of the Honest Man) by Mikołaj Rej. The following statistical parameters were studied: the number of words, the number of entries, the K quantity factor, the arithmetic mean of the frequency of entries, the dispersion of entries, the vocabulary-originality parameter I, the distribution of the autosemantic parts of speech (lexical categories) and the distribution of individual autosemantic parts of speech. The analysis shows that, in statistical terms, the vocabulary of Księgi... is closer to Wizerunek than to the translations of the New Testament: in comparison to the latter, Księgi... is characterized by a far more ample vocabulary, a greater number of entries and autosemantic words, and a greater number of attested nouns, adverbs in particular (the latter group is characterized by a higher percentage of entries with high frequency). Księgi..., as compared to the translations of the New Testament, also abounds in participles, whose repertory is more numerous with regard not only to words but also to entries. The analysis confirms the observation made by Władysław Kuraszkiewicz on the dependence between the quantity of vocabulary and the content of a text: the closer it is to real life and the more diverse the aspects of life it refers to, the richer the vocabulary used for its rendering.
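The simplest of the parameters listed above (word count, entry count, arithmetic mean of entry frequency) reduce to counting types and tokens; a minimal sketch (the function name and sample input are illustrative, and the K factor, dispersion and originality parameter I are not implemented here):

```python
from collections import Counter

def vocab_stats(tokens):
    """Basic quantitative profile of a tokenized text.

    Returns the number of words (tokens), the number of entries
    (distinct types), and the arithmetic mean frequency of entries.
    """
    frequencies = Counter(tokens)
    n_tokens = len(tokens)        # running words
    n_entries = len(frequencies)  # distinct entries
    mean_frequency = n_tokens / n_entries
    return n_tokens, n_entries, mean_frequency
```

Comparing such profiles across texts of similar length is what allows the kind of vocabulary-richness comparison the abstract describes.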
This article investigates whether there is some distinctive way in which the characters of Bolesław Prus’s novel “The Doll” speak. It is well known that the author endowed his characters with different social backgrounds, views and ethics. Literary critics have often emphasized differences in vocabulary and the numerous language stylizations across social classes, but in most cases only the most characteristic words were taken into consideration, and function words (prepositions, conjunctions, personal pronouns and others) were omitted from the research. The aim of this study, in contrast to previous work, is to examine individual characters’ ways of speaking by measuring the frequencies of the most frequent words and of given parts of speech. The tests were performed using the Delta method together with several data-visualization techniques. The results show significant variation among individual idiolects.
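The Delta method named above (Burrows's Delta) compares z-scored relative frequencies of the most frequent words; a minimal sketch under that standard definition (the function name and toy profiles are mine, not the article's data):

```python
import statistics

def burrows_delta(profile_a, profile_b, reference_profiles):
    """Burrows's Delta between two texts' relative-frequency profiles.

    Each profile maps a most-frequent word to its relative frequency;
    reference_profiles (one dict per text in the comparison corpus)
    supplies the per-word mean and standard deviation used for z-scoring.
    """
    words = reference_profiles[0].keys()
    total = 0.0
    for word in words:
        values = [p[word] for p in reference_profiles]
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        z_a = (profile_a[word] - mu) / sigma
        z_b = (profile_b[word] - mu) / sigma
        total += abs(z_a - z_b)  # mean absolute z-score difference
    return total / len(words)
```

Lower Delta means more similar usage of the high-frequency vocabulary, which is exactly the signal exploited when comparing characters' idiolects.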
Kvantitativní určení lexikálního jádra jazyka
The exploitation of hapax legomena, i.e. word or lemma types which occur in a corpus only once, is usually overlooked in language description. These types cannot be systematically used for the vast majority of analyses, as they do not provide a basis for any type of generalization. On the other hand, the overall number of hapaxes can be used as an indicator of the lexical periphery of the language system. This paper suggests that the ratio between the number of hapaxes and the number of all types in relation to the growing corpus size (the hapax-type ratio, HTR) can be used to delimit the lexical core of a language. Previous research (Fengxiang 2010) has shown that HTR in English has the shape of a pipe or chibouque, which means that the rates of emergence of new hapaxes and new types in the process of building a corpus differ before and after a certain size is reached. In a hypothetical small corpus (a few sentences) the hapax-type ratio will be equal to one (each word type is also a hapax). As texts are added to the corpus (up to a few million words), the hapax-type ratio decreases from its maximal value (= 1) to a local minimum: the number of new words, including hapaxes, is continuously increasing, but the majority of added tokens are new instances of words already present in the corpus. After this turning point is reached, extending the corpus increases the ratio, because the number of hapaxes grows at a faster pace than the number of non-hapaxes (i.e. types with a frequency higher than one). This empirical finding, tested on corpora of Czech and English, brings us closer to an exact determination of the range of the core lexicon. Subsequently, we can deduce the approximate size of a corpus sufficient for compiling a dictionary that covers the core lexicon.
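The HTR defined above is straightforward to compute for a single sample; a minimal sketch (the pipe-shaped curve from the paper would be obtained by evaluating it on successively larger slices of a corpus):

```python
from collections import Counter

def hapax_type_ratio(tokens):
    """Hapax-type ratio (HTR): the share of types occurring exactly once."""
    frequencies = Counter(tokens)
    hapaxes = sum(1 for count in frequencies.values() if count == 1)
    return hapaxes / len(frequencies)
```

For a tiny sample of all-distinct words the ratio is 1, matching the paper's starting point; repeated words pull it down.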
The article aims to recreate the tourist’s view on the basis of a quantitative observation of the lexical layer of the texts. The subject of the research is the most frequent lexis, excerpted in the form of a frequency list from Listy z podróży po Włoszech by Konstanty Gaszyński. The collected material consists of 300 words. The most common vocabulary was grouped into nine semantic-lexical circles and analyzed using the research tools of cultural linguistics. The research indicates that, on the basis of an analysis of the most common lexis, it is possible to reconstruct both the points of view present in the texts and the correlated profiles of Italian space.
Epistemik und Faktizität in Pressediskursen
The article describes a corpus study together with its theoretical background. The basis of the research is a press corpus of 27 million tokens in which the epistemic expressions are annotated. Grammaticalized and lexical epistemic means of expression are consistently kept apart on the basis of the relevant literature. The next step is to investigate, with a fully automatic method, which expressions often occur together in a text. Semantic subclasses of epistemic expressions are defined on the basis of these quantitatively attested solidarities. Finally, concrete text examples illustrate how the epistemic expressions structure a text.
Underestimated and often questioned in Poland, Witold Mańczak’s theory of irregular sound change due to frequency has, from the beginning, been well received and widely considered abroad. It has contributed noticeably to the development of quantitative methods in linguistics. This widely acknowledged statistical tool is frequently used by linguists all over the world to study and explain many different linguistic phenomena, synchronic as well as diachronic, in which the frequency parameter is regularly involved.
Kvantitativněonomastická analýza Haškova Švejka a jeho tří parafrází
This paper is a quantitative onomastic analysis of four works featuring the literary figure of Švejk. These are the original novel by Jaroslav Hašek, two retellings of the story from the period of WWII (1941 and 1945), and a 1970s–80s version by Martin Petiška, who moved the story to Czechoslovakia after the 1948 Communist coup. The analysis relies on the thematic concentration of text (with an emphasis on the role of proper nouns as thematic words) and collocation analysis of the proper noun Švejk. Its aim is twofold: first, to determine whether there are quantitatively detectable differences between the original work and its paraphrases concerning the use of proper names; and second, to determine whether there is a difference between the paraphrases and the original as regards the literary figure of Švejk. The analysis shows that the paraphrases reduce the text-type richness of the original, while maintaining Švejk as the protagonist. The paraphrases employ a popular interpretation of the main character (Švejk as a “smart idiot”), leaving aside other aspects of Hašek’s multi-layered novel. Partial shifts in the portrayal of the main character are related to his intellectualization or increased emotionality.
Korpus OnomOs: principy a příklady aplikací
The study introduces OnomOs, a new corpus of Czech texts with annotation of proper names. The corpus was compiled by onomasticians from the Department of Czech Language, Faculty of Arts, University of Ostrava, and made available by the Institute of the Czech National Corpus, Faculty of Arts, Charles University in Prague. The paper briefly discusses the content and structure of the corpus, the selection of texts for inclusion, and the onomastic-geographical classification of the identified names. The text consists chiefly of three preparatory analyses, which focus on the most frequent surnames, collocations found in Western and Eastern countries in the pre-1989 period, and the declension patterns of three types of onyms. In the summary, further possibilities of onomastic corpus research are presented.
The paper focuses on frequency and collocation analyses of Česko (“Czechia”), the short geographical name of the Czech Republic, in the opinion-journalism section of the eight-version SYN corpus, which comprises texts from the period 1990−2018. Within the scope of the research, the period was divided into several sections delineated by breakthrough political and cultural events (the Czech Republic entering NATO, the Czech Republic entering the EU, the climax of the first season of the Pop Idol-based contest Czechia Is Looking for a SuperStar, etc.). The frequency analysis is based on relative frequencies expressed in i.p.m. (instances per million); collocational strength is computed using the logDice index, which is easy to interpret linguistically and independent of corpus size. The goal of the study is to capture the basic motivations that led to the popularisation of the name and its expansion in the given discourse (e.g. the influence of other one-word names of states, sports commentaries, popular contests, and generational change). In sum, the name Česko is employed in a variety of contexts, and its usage can be seen as unmarked.
The notes provide elements of a new quantitative theory of unsupervised learning from pragmatic language communication. It is argued that a suitable quantitative inference framework free from paradoxes should be based on minimum description length (MDL), interpreted as a simplified algorithmic complexity, rather than on classical frequentist probability. Furthermore, it is argued that the recently observed non-extensivity of entropy in meaningful symbolic sequences can arise if and only if unsupervised acquisition of the MDL theories for these sequences produces infinite theories and the unsupervised acquisition is optimal as well. This result rigorously undermines the belief that a finite formal theory of natural language could be constructed by the hands of any experts. On the other hand, unsupervised machine learning is pointed out as a feasible, and the only right, way of implementing language competence in AIs. From this perspective, a promising compression-learning algorithm by de Marcken, its efficiency and its extensions are discussed. Important parallels with research in cognitive science and statistical physics are pointed out as well. Thus, the notes may be of interest not only to computer scientists and linguists but also to other statistical and symbolic theorists.