Article title
Authors
Content
Full texts:
Identifiers
Title variants
Publication languages
Abstracts
Analysis of the shape of a Laplacian spectrogram is a new line of research in graph spectral clustering. More precisely, we observed that (properly normalized) plots of the eigenvalues of sub-Laplacians characterizing different groups of documents differ in shape. Thus, by computing the distance between these plots, we can cluster documents and classify new observations. This idea is developed in a number of our papers and, as such, can be considered a pioneering approach to cluster analysis. In an attempt to explain why it is so useful, in this paper we consider the hypothesis that the shape of a spectrogram can be attributed to the writing style of the authors of the documents in a cluster. We explore this hypothesis for several models of word distribution. In particular, we assume that writing style is reflected in the word distribution of the texts of an author or a group of authors. We check whether changing the parameters of the widely accepted log-normal word distribution model in fact changes the Laplacian eigenvalue spectrogram in such a way as to distinguish between document groups. We found that varying each of the distribution parameters does lead to distinct groups of documents. These findings justify the use of Laplacian spectrograms to distinguish (cluster or classify) groups of documents.
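The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made only for illustration: the function names, cosine similarity as the document graph, the symmetric normalized Laplacian, resampling the sorted eigenvalues onto a common grid as the "proper normalization", and a root-mean-square distance between the resulting curves. In this simplified generator the location parameter of the log-normal cancels when the weights are normalized to probabilities, so only the shape parameter `sigma` is varied here.

```python
import numpy as np

def sample_documents(n_docs, vocab_size, doc_len, mu, sigma, rng):
    """Draw bag-of-words count vectors with log-normal word weights (assumed model)."""
    weights = rng.lognormal(mean=mu, sigma=sigma, size=vocab_size)
    probs = weights / weights.sum()  # normalization cancels the scale exp(mu)
    return rng.multinomial(doc_len, probs, size=n_docs).astype(float)

def laplacian_spectrogram(counts, n_points=50):
    """Sorted normalized-Laplacian eigenvalues, resampled to n_points values."""
    # Cosine similarity between documents, with self-similarity removed.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} S D^{-1/2}.
    deg = sim.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(sim)) - inv_sqrt[:, None] * sim * inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    # Resample onto a common grid so groups of different sizes are comparable.
    grid = np.linspace(0.0, 1.0, n_points)
    positions = np.linspace(0.0, 1.0, len(eigvals))
    return np.interp(grid, positions, eigvals)

def spectrogram_distance(a, b):
    """Root-mean-square difference between two resampled spectrograms."""
    return float(np.linalg.norm(a - b) / np.sqrt(len(a)))

rng = np.random.default_rng(0)
group_a = sample_documents(60, 500, 200, mu=0.0, sigma=1.0, rng=rng)
group_b = sample_documents(80, 500, 200, mu=0.0, sigma=2.0, rng=rng)
spec_a = laplacian_spectrogram(group_a)
spec_b = laplacian_spectrogram(group_b)
print("distance between groups:", spectrogram_distance(spec_a, spec_b))
```

Under these assumptions, a new group of documents could be assigned to whichever known group has the nearest spectrogram. The synthetic data only show the mechanics and do not reproduce the results reported in the paper.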
Year
Volume
Pages
5–13
Physical description
Bibliography: 20 items, tables, charts.
Authors
author
- Institute of Computer Science of Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
author
- Institute of Computer Science of Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
author
- Institute of Computer Science of Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
author
- Institute of Computer Science of Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
author
- Institute of Computer Science of Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Notes
Record prepared with funding from MNiSW (the Polish Ministry of Science and Higher Education), agreement no. POPUL/SP/0154/2024/02, under the programme "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II), module: Popularisation of Science (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-d371109c-cf18-42f7-9022-a94de040e9a5