Results found: 3

Search results
Searched in keywords: authorship attribution
EN
When the patterns to be recognised are described by continuous features, discretisation becomes an optional or a necessary step in initial data pre-processing. The characteristics of the data, in particular the distribution of data points in the input space, can significantly influence the transformation of real-valued attributes into nominal ones, and hence the performance of classification systems that use them. If the data comprise several separate sets, discretisation becomes more complex, because different numbers of intervals and different ranges can be constructed for the same variables. The paper presents research on irregularities in data distribution observed in the context of discretisation. Selected discretisation methods were applied, and their effect on the performance of decision algorithms induced in the classical rough set approach was investigated. The input space was defined by measurable style markers which, used as characteristic features, allow stylometric authorship attribution to be treated as a classification task.
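The abstract does not say which discretisation methods were selected, so the following is only a minimal sketch of the general problem it describes, assuming two common unsupervised methods (equal-width and equal-frequency binning). The function names and the toy `train` and `test` values are hypothetical and are not taken from the paper. The sketch shows how cut points derived independently from two separate sets give different intervals for the same variable.

```python
from bisect import bisect_right


def equal_width_cuts(values, bins):
    """Return bins-1 cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equal_frequency_cuts(values, bins):
    """Return cut points so that each interval holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def discretise(values, cuts):
    """Map each real value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical values of one style marker in two separate sets.
train = [0.10, 0.12, 0.15, 0.20, 0.22, 0.80]  # skewed: one distant point
test = [0.05, 0.30, 0.36, 0.40, 0.45, 0.50]

train_cuts = equal_width_cuts(train, 3)
test_cuts = equal_width_cuts(test, 3)

# The same test values receive different nominal codes depending on
# which set the intervals were constructed from.
codes_with_train_cuts = discretise(test, train_cuts)
codes_with_test_cuts = discretise(test, test_cuts)

# Equal-width binning leaves the middle interval of the skewed set empty,
# while equal-frequency binning spreads the points evenly.
train_equal_width = discretise(train, train_cuts)
train_equal_freq = discretise(train, equal_frequency_cuts(train, 3))
```

In this toy case the skewed `train` set pushes almost every value into the first equal-width interval, which is one simple way an irregular distribution can distort the resulting nominal attribute.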
EN
This paper addresses authorship attribution based on the content of conversations from instant messaging applications. The results presented refer to a corpus of conversations conducted in Polish. Stylometric features were extracted from a standardised model of the corpus and divided into four groups: word and message length distributions, character frequencies, the tf-idf matrix, and features extracted on the basis of turns (conversational features). The users' stylometric feature vectors were compared in pairs using the Euclidean, cosine and Manhattan metrics. CMC (cumulative match characteristic) curves were used to analyse the significance of the feature groups and the effectiveness of the metrics in identifying similar speech styles. The best results were obtained for the tf-idf matrix features compared using the cosine distance and for the turn-based features compared using the Manhattan metric.
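The pairwise comparison and CMC analysis described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the `gallery` and `probes` dictionaries, the user names and the three-element vectors are all hypothetical. It assumes each user has one reference vector and one query vector, and reports for each rank the fraction of queries whose true author appears at that rank or better.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cmc_curve(probes, gallery, metric):
    """Cumulative identification rate for ranks 1..len(gallery).

    probes and gallery map a user name to a feature vector; every probe
    user is assumed to be present in the gallery.
    """
    hits = [0] * len(gallery)
    for user, vec in probes.items():
        # Order gallery users from most to least similar to the probe.
        ranked = sorted(gallery, key=lambda g: metric(vec, gallery[g]))
        rank = ranked.index(user)
        # A match at this rank also counts for all higher ranks.
        for r in range(rank, len(gallery)):
            hits[r] += 1
    return [h / len(probes) for h in hits]


# Hypothetical stylometric vectors (for example, three normalised feature values).
gallery = {"anna": [1.0, 0.0, 0.0], "bartek": [0.0, 1.0, 0.0], "celina": [0.0, 0.0, 1.0]}
probes = {"anna": [0.9, 0.1, 0.0], "bartek": [0.1, 0.8, 0.1], "celina": [0.6, 0.0, 0.5]}

curves = {m.__name__: cmc_curve(probes, gallery, m)
          for m in (euclidean, manhattan, cosine_distance)}
```

Comparing the curves in `curves` for each feature group and metric is one way to see which combination places the true author near the top of the ranking most often.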
EN
The paper presents WebSty, an open, web-based system for stylometric analysis that is part of the CLARIN-PL research infrastructure. WebSty does not require local installation, can be used from any web browser, offers rich configuration options, and runs on a computing cluster. We discuss the underlying ideas of the system, its architecture, the pipeline of language tools for processing Polish, and its integration with systems for clustering, for visualising clustering results, and for identifying the features with the strongest discriminative power. The techniques used for feature weighting and for measuring text similarity are also briefly reviewed. In conclusion, we present a preliminary evaluation of WebSty on a corpus of 1000 literary works and report on the results of the first research applications of WebSty. Although the system initially focused on processing Polish texts, we also briefly discuss its development towards a multilingual system, which already supports English, German and Hungarian.