Article title

Temporal predictive regression models for linguistic style analysis

Publication languages
EN
Abstracts
EN
This study models general and individual language change over several decades. A timeline prediction task was used to identify interesting temporal features. Our previous work achieved high accuracy in predicting publication year, using lexical features marked for syntactic context. In this study, we use four feature types (character, word stem, part-of-speech, and word n-grams) to predict publication year, and then use the associated models to determine constant and changing features in individual and general language use. We do this for two corpora: one containing texts by two different authors, published over a fifty-year period, and a reference corpus containing a variety of text types, representing general language style over the same temporal span as the two authors. Our linear regression models achieve good accuracy with the two-author data set and very good results with the reference corpus, bringing to light interesting features of language change.
Pages
175-222
Physical description
Bibliography: 31 items, figures, tables, charts.
Authors
author
  • School of Computer Science and Statistics, Trinity College Dublin, Ireland
author
  • School of Computer Science and Statistics, Trinity College Dublin, Ireland
Bibliography
  • [1] Alex Ayres (2010), The Wit and Wisdom of Mark Twain, Harper Collins.
  • [2] Nina Baym (1981), Melodramas of Beset Manhood: How Theories of American Fiction Exclude Women Authors, American Quarterly, 33 (2): 123-139, ISSN 00030678, 10806490, http://www.jstor.org/stable/2712312.
  • [3] Joseph Warren Beach (1918), The Method of Henry James, Yale University Press.
  • [4] Walter Blair (1963), Reviewed Work: Twain and the Image of History by Roger B. Salomon, American Literature, 34 (4): 578-580, http://www.jstor.org/stable/2923090.
  • [5] Van Wyck Brooks (1920), The Ordeal of Mark Twain, New York: Dutton.
  • [6] Henry Seidel Canby (1951), Turn West, Turn East: Mark Twain and Henry James, Biblo & Tannen Publishers.
  • [7] Walter Daelemans (2013), Explanation in Computational Stylometry, in Computational Linguistics and Intelligent Text Processing, pp. 451-462, Springer.
  • [8] Mark Davies (2012), The 400 Million Word Corpus of Historical American English (1810-2009), in English Historical Linguistics 2010: Selected Papers from the Sixteenth International Conference on English Historical Linguistics (ICEHL 16), Pécs, 23-27 August 2010, pp. 231-61.
  • [9] Maciej Eder, Mike Kestemont, and Jan Rybicki (2013), Stylometry with R: A Suite of Tools, in Digital Humanities 2013: Conference Abstracts, pp. 487-89, University of Nebraska-Lincoln, Lincoln, NE, http://dh2013.unl.edu/abstracts/.
  • [10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani (2001), The Elements of Statistical Learning, volume 1, Springer Series in Statistics, Springer, Berlin.
  • [11] Jerome Friedman, Trevor Hastie, and Robert Tibshirani (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 33 (1): 1-22, http://www.jstatsoft.org/v33/i01/.
  • [12] David L Hoover (2007), Corpus Stylistics, Stylometry, and the Styles of Henry James, Style, 41 (2): 174-203.
  • [13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013), An Introduction to Statistical Learning, volume 112, Springer.
  • [14] Henry James (1884), The Art of Fiction, Longmans, Green and Company.
  • [15] Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt (2012), RTextTools: Automatic Text Classification via Supervised Learning, http://CRAN.R-project.org/package=RTextTools, R package version 1.3.9.
  • [16] Michael J. Kane, John Emerson, and Stephen Weston (2013), Scalable Strategies for Computing with Massive Data, Journal of Statistical Software, 55 (14): 1-19, http://www.jstatsoft.org/v55/i14/.
  • [17] Carmen Klaussner and Carl Vogel (2015), Stylochronometry: Timeline Prediction in Stylometric Analysis, in Max Bramer and Miltos Petridis, editors, Research and Development in Intelligent Systems XXXII, pp. 91-106, Springer International Publishing, Cham.
  • [18] Moshe Koppel, Jonathan Schler, and Shlomo Argamon (2011), Authorship Attribution in the Wild, Language Resources and Evaluation, 45 (1): 83-94, doi: 10.1007/s10579-009-9111-2, http://dx.doi.org/10.1007/s10579-009-9111-2.
  • [19] Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow (2007), Measuring Differentiability: Unmasking Pseudonymous Authors, Journal of Machine Learning Research, 8: 1261-1276, ISSN 1532-4435, http://dl.acm.org/citation.cfm?id=1314498.1314541.
  • [20] Max Kuhn (2014), Caret: Classification and Regression Training, http://CRAN.R-project.org/package=caret, with contributions from: Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer and the R Core Team, R package version 6.0-30.
  • [21] Spyros Makridakis, Steven C Wheelwright, and Rob J Hyndman (2008), Forecasting Methods and Applications, John Wiley & Sons.
  • [22] Meik Michalke (2014), koRpus: An R Package for Text Analysis, http://reaktanz.de/?c=hacking&s=koRpus, (Version 0.05-4).
  • [23] James W Pennebaker and Lori D Stone (2003), Words of Wisdom: Language Use Over the Life Span, Journal of Personality and Social Psychology, 85 (2): 291-301.
  • [24] Revolution Analytics and Steve Weston (2014), foreach: Foreach looping construct for R, http://CRAN.R-project.org/package=foreach, R package version 1.4.2.
  • [25] Paolo Rosso, Francisco M. Rangel Pardo, Martin Potthast, Efstathios Stamatatos, Michael Tschuggnall, and Benno Stein (2016), Overview of PAN’16 – New Challenges for Authorship Analysis: Cross-Genre Profiling, Clustering, Diarization, and Obfuscation, in Experimental IR Meets Multilinguality, Multimodality, and Interaction – 7th International Conference of the CLEF Association, CLEF 2016, Évora, Portugal, September 5-8, 2016, Proceedings, pp. 332-350.
  • [26] Helmut Schmid (1994), Probabilistic Part-of-Speech Tagging Using Decision Trees, in Proceedings of International Conference on New Methods in Language Processing, volume 12, pp. 44-49, Manchester, UK.
  • [27] Joseph A. Smith and Colleen Kelly (2002), Stylistic Constancy and Change across Literary Corpora: Using Measures of Lexical Richness to Date Works, Computers and the Humanities, 36 (4): 411-430, http://www.jstor.org/stable/30204686.
  • [28] Efstathios Stamatatos (2012), On the Robustness of Authorship Attribution Based on Character N-gram Features, Journal of Law & Policy, 21: 421-439.
  • [29] Thomas M. Walsh and Thomas D. Zlatic (1981), Mark Twain and the Art of Memory, American Literature, 53 (2): 214-231, http://www.jstor.org/stable/2926100.
  • [30] James D. Williams (1965), The Use of History in Mark Twain’s ‘A Connecticut Yankee’, PMLA, 80 (1): 102-110, http://www.jstor.org/stable/461131.
  • [31] Hui Zou and Trevor Hastie (2005), Regularization and Variable Selection via the Elastic Net, Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67 (2): 301-320, http://www.jstor.org/stable/3647580.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-e8bdcf64-250a-4ace-b00a-edf9a5eb5972