Identifying age groups of Twitter users based on the specific characteristics of textposts

Najman, Krzysztof; Migdał-Najman, Kamila; Raca, Katarzyna; Majkowska, Agata

doi:10.59139/ws.2024.10.4

Article details

Journal

Wiadomości Statystyczne. The Polish Statistician

2024 | vol. 69 | no. 10 | pp. 59-74

Article title

Identifying age groups of Twitter users based on the specific characteristics of textposts

Authors

Krzysztof Najman, Kamila Migdał-Najman, Katarzyna Raca, Agata Majkowska

Title variants

PL

Identyfikacja grup wieku użytkowników Twittera na podstawie charakterystyki wiadomości tekstowych

Abstracts

PL

Dane (wiadomości) tekstowe stanowią znaczną część wszystkich danych zamieszczanych w Internecie. Jedną z informacji, które badacze chcieliby uzyskać o autorach wiadomości tekstowych, jest ich wiek, ponieważ ma on duże znaczenie z perspektywy badań marketingowych, społecznych czy ekonomicznych. Nie zawsze jednak data urodzenia jest udostępniana publicznie. Z badań językowych wynika, że przedstawiciele różnych grup wieku posługują się odmiennym słownictwem i innymi formami gramatycznymi. Wydaje się, że mogą je różnicować również sposoby formatowania wiadomości tekstowych i poprawność zapisu tekstu. Celem badania omawianego w artykule jest wyodrębnienie grup wieku autorów wpisów na Twitterze (obecnie X) na podstawie elementów zwykle usuwanych z tekstów analizowanych metodami text mining, takich jak emotikony, znaki interpunkcyjne i słowa, które nie są nośnikami treści (ang. stopwords). Przeanalizowano prawie 3 mln tweetów w języku angielskim opublikowanych przed lipcem 2020 r. Badanie wykazało, że wyodrębnione cechy w niewielkim stopniu różnicują grupy wiekowe. Najbardziej specyficznym stylem pisania wiadomości wyróżniają się najmłodsi użytkownicy Internetu.

EN

Textual data (textposts) account for a significant portion of all data posted on the Internet. One piece of information that researchers seek about the authors of textposts is their age, which is important for marketing, social and economic research but is not always made public. Language research shows that representatives of different age groups tend to use distinct vocabulary and grammatical forms. Presumably, the formatting of textposts and the correctness of the text itself may also differentiate user age groups. The aim of the research presented in this article is to distinguish the age groups of the authors of Twitter (currently X) posts using elements typically eliminated from texts during text mining, such as emoticons, punctuation marks and words that are not content carriers (stopwords). The study analysed nearly 3 million tweets in English posted before July 2020. The research shows that the extracted elements differentiate the age groups only to a small extent. The youngest users stand out the most, with the most distinctive writing style.
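
The kind of features the abstract refers to can be illustrated with a minimal sketch. It counts emoticons, punctuation marks and stopwords in a single post. The function name `style_features`, the tiny stopword set and the emoticon pattern are all assumptions made for illustration; the record does not describe the lexicons or tooling the authors actually used.

```python
import re
import string

# Hypothetical, deliberately tiny lists; the article's own lexicons are not given here.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}
# Matches simple Western emoticons such as :) ;-) =D :P (an assumed, incomplete pattern).
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")


def style_features(tweet: str) -> dict:
    """Count elements that text-mining pipelines usually discard."""
    emoticons = EMOTICON_RE.findall(tweet)
    # Remove emoticons first so their characters are not counted as punctuation.
    stripped = EMOTICON_RE.sub(" ", tweet)
    punctuation = [ch for ch in stripped if ch in string.punctuation]
    words = re.findall(r"[a-z']+", stripped.lower())
    stopwords = [w for w in words if w in STOPWORDS]
    return {
        "emoticons": len(emoticons),
        "punctuation": len(punctuation),
        "stopwords": len(stopwords),
        "words": len(words),
    }


features = style_features("It is the best day :) !!!")
```

Per-post counts like these could then be aggregated per author and compared across age groups; how the study itself aggregated and compared them is not stated in this record.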

Keywords

EN

Twitter, text mining, user age

PL

Twitter, text mining, wiek użytkowników

Publisher

Główny Urząd Statystyczny

Dates

published

2024

Contributors

author

Krzysztof Najman

Uniwersytet Gdański, Wydział Zarządzania / University of Gdańsk, Faculty of Management

https://orcid.org/0000-0003-1673-2858

author

Kamila Migdał-Najman

Uniwersytet Gdański, Wydział Zarządzania / University of Gdańsk, Faculty of Management

https://orcid.org/0000-0003-4106-2964

author

Katarzyna Raca

Uniwersytet Gdański, Wydział Zarządzania / University of Gdańsk, Faculty of Management

https://orcid.org/0000-0003-1760-1673

author

Agata Majkowska

Uniwersytet Gdański, Wydział Zarządzania / University of Gdańsk, Faculty of Management

https://orcid.org/0000-0003-0336-3598

Identifiers

DOI

10.59139/ws.2024.10.4

Biblioteka Nauki

59148849

YADDA identifier

bwmeta1.element.ojs-doi-10_59139_ws_2024_10_4
