PageRank Topic Model : Estimation of Multinomial Distributions using Network Structure Analysis Methods

Ikegami, K.; Ohsawa, Y.

doi:10.3233/FI-2018-1664

Article - details

Article title

PageRank Topic Model : Estimation of Multinomial Distributions using Network Structure Analysis Methods

Authors

Ikegami K., Ohsawa Y.

Selected full texts from this journal

https://fi.episciences.org/

Identifiers

DOI

10.3233/FI-2018-1664

Abstracts

It is useful to extract review sentences based on an assigned viewpoint for purposes such as summarization tasks. Previous studies have considered review extraction using semi-supervised learning or association mining. However, we approach this task using a clustering method. In particular, we focus on a topic model as a clustering method. In the conventional topic model, after randomly initializing the word distribution and the topic distribution, these distributions are estimated in order to minimize the perplexity using Gibbs sampling or variational Bayes. We introduce a new method called the PageRank topic model (PRTM) for estimating multinomial distributions over topics and words using network structure analysis methods. PRTM extracts topics by focusing on the co-occurrence relationships of words and it does not need randomly initialized values. Therefore, it can calculate unique word and topic distributions. In experiments using synthetic data, we showed that PRTM can infer an appropriate number of topics by clustering short sentences, and it was particularly effective when the sentences were covered by a small number of topics. Furthermore, in a real-world review data experiment, we showed that PRTM performed better with a shorter runtime compared with other models that infer the number of topics.
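The idea of deriving a word distribution from co-occurrence structure, with no random initialization, can be illustrated with a minimal sketch. This is not the PRTM algorithm from the paper: it only runs a standard PageRank power iteration on a sentence-level word co-occurrence graph, starting from a uniform vector, so repeated runs give the same distribution. The function name `cooccurrence_pagerank`, the whitespace tokenization, the damping value, and the example sentences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_pagerank(sentences, damping=0.85, tol=1e-10, max_iter=1000):
    """Return a PageRank distribution over the words of a co-occurrence graph.

    Illustrative only: a plain PageRank power iteration, not the PRTM method.
    """
    # Undirected weighted graph: the weight of an edge is the number of
    # sentences in which the two words appear together.
    weights = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        vocab.update(words)
        for u, v in combinations(words, 2):
            weights[u][v] += 1.0
            weights[v][u] += 1.0

    vocab = sorted(vocab)
    n = len(vocab)
    if n == 0:
        return {}

    # Deterministic uniform start: no random initialization is involved.
    rank = {w: 1.0 / n for w in vocab}
    out_weight = {w: sum(weights[w].values()) for w in vocab}

    for _ in range(max_iter):
        # Probability mass on isolated words (no edges) is spread uniformly.
        dangling = sum(rank[w] for w in vocab if out_weight[w] == 0.0)
        new_rank = {w: (1.0 - damping) / n + damping * dangling / n
                    for w in vocab}
        for u in vocab:
            if out_weight[u] == 0.0:
                continue
            for v, w_uv in weights[u].items():
                new_rank[v] += damping * rank[u] * w_uv / out_weight[u]
        delta = sum(abs(new_rank[w] - rank[w]) for w in vocab)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical review-like sentences, used only to exercise the function.
sentences = [
    "battery life is great",
    "battery life is short",
    "screen is bright",
]
ranks = cooccurrence_pagerank(sentences)
```

The returned values are non-negative and sum to one, so they can be read as a multinomial distribution over words. How PRTM itself extracts topics and estimates the topic distribution is described in the paper and is not reproduced here.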

Keywords

network analysis, review data, sentence clustering, topic model

Publisher

Polskie Towarzystwo Matematyczne

Journal

Fundamenta Informaticae

Year

2018

Volume

Vol. 159, no. 3

Pages

257-277

Physical description

Bibliography: 35 items; figures, tables, charts.

Contributors

author

Ikegami K.

kenchin110100@gmail.com

Department of Systems Innovation, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan

author

Ohsawa Y.

ohsawa@sys.t.u-tokyo.ac.jp

Department of Systems Innovation, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan

Notes

Record prepared under agreement 509/P-DUN/2018, financed from Ministry of Science and Higher Education (MNiSW) funds allocated to science dissemination activities (2018).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-fe9cc527-9c65-434e-9ed4-22b6fbb9f5a2