Document content mining for authors’ identification task

Łukasik, Sz.; Haręza, M.; Kaczor, M.

Selected full texts from this journal

http://repozytorium.biblos.pk.edu.pl/resources/35433

Title variants

Eksploracja treści dokumentów w problemie identyfikacji autorów

Abstracts

This paper deals with automatic authorship attribution through document content analysis. The approach selects sets of suitable features that capture an author's specific use of grammar, punctuation, or vocabulary, and then applies a chosen classification algorithm. The paper first reviews the text characteristics that can be employed for this purpose, then presents the results of computational experiments involving feature selection and examines classifier performance on the author identification problem. It concludes with a discussion and proposals for further research.
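
The two-step approach outlined in the abstract (extract style features, then classify) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the word and punctuation lists, the function names `extract_features`, `train_centroids`, and `predict_author`, and the nearest-centroid classifier are hypothetical and are not taken from the paper, which does not state its feature set or classifier in this record.

```python
import math
import re
from collections import Counter

# Hypothetical, illustrative feature inventory (not the paper's feature set).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "not")
PUNCTUATION = (",", ".", ";", ":", "!", "?", "-")


def extract_features(text):
    """Map a text to a fixed-length vector of simple stylometric features."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)   # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    word_counts = Counter(words)
    char_counts = Counter(text)

    # Vocabulary use: relative frequency of each function word.
    features = [word_counts[w] / n_words for w in FUNCTION_WORDS]
    # Punctuation use: relative frequency of each mark per character.
    features += [char_counts[p] / n_chars for p in PUNCTUATION]
    # Vocabulary richness (type-token ratio) and mean word length.
    features.append(len(word_counts) / n_words)
    features.append(sum(len(w) for w in words) / n_words)
    return features


def train_centroids(labelled_texts):
    """Average the feature vectors of each author's training texts."""
    grouped = {}
    for author, text in labelled_texts:
        grouped.setdefault(author, []).append(extract_features(text))
    return {
        author: [sum(column) / len(vectors) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict_author(centroids, text):
    """Return the author whose centroid is closest in Euclidean distance."""
    vector = extract_features(text)
    return min(centroids, key=lambda author: math.dist(vector, centroids[author]))
```

The sketch leaves the features unscaled, so mean word length tends to dominate the distance. It also omits grammatical (parse-based) features and the feature selection step that the abstract mentions, so it should be read only as an outline of the pipeline's shape.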

Keywords

author identification; feature selection; classification

Publisher

Wydawnictwo Politechniki Krakowskiej im. Tadeusza Kościuszki

Journal

Czasopismo Techniczne. Automatyka

Year

2013

Volume

Vol. 110, issue 1-AC

Pages

3-15

Physical description

Bibliography: 24 items; equations, tables, charts.

Contributors

author

Łukasik Sz.

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology; Systems Research Institute, Polish Academy of Sciences

author

Haręza M.

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology

author

Kaczor M.

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-052eb1c5-8292-4097-ae9f-45279469406f