Open Stylometric System WebSty : Integrated Language Processing, Analysis and Visualisation

Piasecki, M.; Walkowiak, T.; Eder, M.

doi:10.12921/cmst.2018.0000007

Selected full texts from this journal

http://cmst.eu/

Abstract

The paper presents an open, web-based system for stylometric analysis named WebSty, which is part of the CLARIN-PL research infrastructure. WebSty does not require local installation, can be used from any web browser, offers rich configuration options, and runs on a computing cluster. We discuss the underlying ideas of the system, its architecture, a pipeline of language tools for processing Polish, and its integration with systems for clustering, visualizing the clustering results, and identifying the features with the strongest discriminative power. The techniques used for feature weighting and text similarity measurement are also briefly reviewed. In conclusion, we present a preliminary evaluation of WebSty on a corpus of 1000 literary works and report on the results of the first research applications of WebSty. Although the system initially focused on processing Polish texts, we also briefly discuss its development towards a multilingual system, which already supports English, German and Hungarian.
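As an illustration of the feature weighting and text similarity steps mentioned in the abstract, the following minimal sketch computes Burrows's Delta (reference [35]) over the most frequent words of a toy corpus. It is not the WebSty implementation, and the abstract does not say which measures WebSty uses by default. The function names, the toy documents, and the use of the population standard deviation are assumptions made for this example only.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(documents, n):
    """Return the n most frequent words across all documents (hypothetical helper)."""
    totals = Counter()
    for tokens in documents.values():
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized document."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_table):
    """Feature weighting: standardize every word's frequency across the corpus."""
    names = list(freq_table)
    columns = list(zip(*(freq_table[name] for name in names)))
    means = [mean(col) for col in columns]
    # Assumption: population std; a constant column gets std 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [(f - m) / s for f, m, s in zip(freq_table[name], means, stds)]
        for name in names
    }


def burrows_delta(z_a, z_b):
    """Burrows's Delta: mean absolute difference of z-scored word frequencies."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Toy corpus (invented for illustration): a1 and a2 share function-word habits.
docs = {
    "a1": "the cat and the dog and the bird".split(),
    "a2": "the dog and the cat and the fish".split(),
    "b1": "a cat of a dog of a bird".split(),
}
vocab = most_frequent_words(docs, 4)
z = z_scores({name: relative_frequencies(tokens, vocab) for name, tokens in docs.items()})

for pair in [("a1", "a2"), ("a1", "b1")]:
    print(pair, burrows_delta(z[pair[0]], z[pair[1]]))
```

A smaller Delta means more similar usage of the most frequent words. The resulting pairwise distances could then be passed to a clustering or visualization step of the kind the abstract describes.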

Keywords

stylometry; language technology infrastructure; web application; authorship attribution; style analysis; CLARIN

Publisher

Institute of Bioorganic Chemistry Scientific Publishers OWN, Polish Academy of Sciences

Journal

Computational Methods in Science and Technology

Year

2018

Volume

Vol. 24, No. 1

Pages

43–58

Physical description

Bibliography: 43 items; figures.

Contributors

author

Piasecki M.

maciej.piasecki@pwr.wroc.pl

Faculty of Computer Science and Management, Wrocław University of Science and Technology

author

Walkowiak T.

Faculty of Electronics, Wrocław University of Science and Technology

author

Eder M.

Institute of Polish Language, Polish Academy of Sciences, and Pedagogical University of Kraków

Bibliography

[1] P. Juola Authorship attribution. Foundations and Trends in Information Retrieval 1(3), 233–334 (2006).
[2] M. Koppel, J. Schler, S. Argamon Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9–26 (2009).
[3] E. Stamatatos A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009).
[4] X. Le, I. Lancashire, G. Hirst, R. Jokel Longitudinal detection of dementia through lexical and syntactic changes in writing: a case study of three British novelists. Literary and Linguistic Computing 26(4), 435–461 (2011).
[5] Signature Stylometric System (accessed Apr. 2017). Web page of the system. URL: http://www.philocomp.net/humanities/signature.htm.
[6] L. Maurer (accessed Apr. 2017) Web page of the StyleTool program. URL: https://github.com/lnmaurer/StyleTool.
[7] S. Bird, E. Klein, E. Loper, (2009) Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media, URL: http://www.nltk.org/book_1ed/.
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
[9] JGAAP (accessed Apr. 2017) Web page of the application. URL: https://github.com/evllabs/JGAAP.
[10] A. McDonald, S. Afroz, A. Caliskan, A. Stolerman, R. Greenstadt (2012) Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization. PETS 2012.
[11] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2017).
[12] S. Sinclair, G. Rockwell and the Voyant Tools Team (2012) Voyant Tools (web application). URL: http://docs.voyant-tools.org.
[13] C.D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S.J. Bethard, D. McClosky, The Stanford CoreNLP Natural Language Processing Toolkit. Association for Computational Linguistics (ACL) 2014 – System Demonstrations, ACL (2014).
[14] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit. Web page of the system. URL: http://mallet.cs.umass.edu (2002).
[15] LATtice - Application for Visualizing Linguistic Variation (accessed Apr. 2017) Web page of the application. URL: http://winedarksea.org/?p=1285
[16] M. Eder, J. Rybicki, M. Kestemont Stylometry with R: a package for computational text analysis. R Journal, 8(1): 107–121, http://journal.r-project.org/archive/2016-1/eder-rybicki-kestemont.pdf (2016).
[17] P. Wittenburg, et al. Resource and Service Centres as the Backbone for a Sustainable Service Infrastructure. In: Proceedings of the International Conference on Language Resources and Evaluation, pp. 60–63. European Language Resources Association (2010).
[18] M. Marcińczuk, J. Kocoń, M. Janicki, Liner2 - A Customizable Framework for Proper Names Recognition for Polish. Studies in Computational Intelligence, vol. 467, pp. 231–253 (2013).
[19] E. Wolff Microservices: Flexible Software Architectures, Addison-Wesley (2016).
[20] M. Bell SOA Modeling Patterns for Service-Oriented Discovery and Analysis. Wiley & Sons. (2010).
[21] T. Walkowiak Language Processing Modelling Notation – orchestration of NLP microservices. In: Advances in Dependability Engineering of Complex Systems: Proceedings of the Twelfth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX, 2017, Springer International Publishing, pp. 464–473 (2017).
[22] C. Peltz, Web services orchestration and choreography. Computer 36(10), 46–52 (2003).
[23] T. Parr, K. Fisher LL(*): the foundation of the ANTLR parser generator. ACM SIGPLAN Notices 46(6), 425–436 (2011).
[24] T. Walkowiak, M. Pol The impact of administrator working hours on the reliability of the Centre of Language Technology. Journal of Polish Safety and Reliability Association 6(1), 167–174 (2017).
[25] A. Radziszewski A tiered CRF tagger for Polish, Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence 467, 215–230 (2013).
[26] G. Salton, M.J. McGill Introduction to modern information retrieval. McGraw-Hill. ISBN 978-0070544840, 1986.
[27] P. Pantel & D. Ravichandran (2004) Automatically Labeling Semantic Classes. In S. Dumais, D. Marcu & S. Roukos (Eds.) HLT-NAACL 2004: Main Proceedings, Association for Computational Linguistics, 2004, pp. 321–328.
[28] T. Hastie, R. Tibshirani, J. Friedman The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York, 2009.
[29] T. Landauer & S. Dumais (1997) A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition. Psychological Review, 1997, 104, pp. 211-240.
[30] M. Piasecki, S. Szpakowicz & B. Broda (2009) A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej. URL: http://www.dbc.wroc.pl/Content/4220/Piasecki_Wordnet.pdf
[31] M. Eder, J. Rybicki, K. Młynarczyk, M. Oleksy, R. Borys, M. Maryl, M. Piasecki, 1000 Novels Corpus, CLARIN-PL digital repository, http://hdl.handle.net/11321/312.
[32] L.R. Dice, "Measures of the Amount of Ecologic Association Between Species". Ecology. 26(3), 297–302 (1945).
[33] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons". Kongelige Danske Videnskabernes Selskab. 5(4), 1–34 (1948).
[34] M. Eder Taking stylometry to the limits: benchmark study on 5,281 texts from Patrologia Latina. In: Digital Humanities 2015: Conference Abstracts http://dh2015.org/abstracts (2015).
[35] J.F. Burrows, “Delta”: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17(3), 267–287 (2002).
[36] S. Argamon Interpreting Burrows’s Delta: geometric and probabilistic foundations. Literary and Linguistic Computing 23(2), 131–147 (2008).
[37] P. Gärdenfors, Conceptual Spaces – The Geometry of Thought. The MIT Press, (2000).
[38] Y. Zhao and G. Karypis Hierarchical Clustering Algorithms for Document Datasets. Data Mining and Knowledge Discovery, 10(2), 1 (2005).
[39] I. Borg, P. Groenen Modern Multidimensional Scaling – Theory and Applications, Springer Series in Statistics, 1997.
[40] L.J.P. van der Maaten, G. Hinton Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008).
[41] M. Belkin, P. Niyogi Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15(6), 1373–1396 (2003).
[42] A. Przepiórkowski, M. Bańko, R.L. Górski, B. Lewandowska-Tomaszczyk (Eds.) (2012) Narodowy Korpus Języka Polskiego [Eng.: National Corpus of Polish]. Wydawnictwo Naukowe PWN.
[43] M. Maryl, M. Piasecki, K. Młynarczyk (2016) Where Close and Distant Readings Meet: Text Clustering Methods in Literary Analysis of Weblog Genres. In M. Eder & J. Rybicki (Eds.) Digital Humanities 2016 Conference Abstracts, Jagiellonian University and Pedagogical University, pp. 273–275.

Notes

Record prepared under agreement 509/P-DUN/2018, financed from funds of the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2018).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-808cd3a8-6b47-4daa-a7fc-eaba422a1863