Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

Jankowski, M.

doi:10.5604/01.3001.0013.1458

Article - details

Article title

Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

Authors

Jankowski M.

Content

Full texts:

17_28_MJankowski_Ensemble_CSMM_8_2018.pdf


Identifiers

DOI

10.5604/01.3001.0013.1458

Title variants

Metody oparte o Ensemble do klasyfikacji danych wygenerowanych przez model Latent Dirichlet Allocation

Publication languages

Abstracts

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One of the problems is that a data scientist has to choose the number of topics manually. This step requires some prior analysis. A few methods have been proposed to automate this step, but none of them works very well if LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model in the prediction phase while reducing the need to find a single best number of topics. We have also analyzed a few methods of estimating the number of topics.

Topic modelling is a popular method of text analysis. One of the most popular topic modelling algorithms is LDA (Latent Dirichlet Allocation) [14]. Recently, many extensions of this model have been proposed that make it possible to process large volumes of data. One problem with using LDA is that the number of topics must be chosen before the algorithm is run. This step requires prior analysis and the involvement of a data analyst. Several methods have been developed to automate this step, but none of them works well when LDA is used for dimensionality reduction before classification. In this paper, we propose an approach based on an ensemble of multiple models. Such a model avoids the problem of choosing a single best LDA model. We show that this approach yields a lower classification error. We also propose two new methods for choosing the number of topics when only a single model is to be used.
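
The abstracts do not say how the individual models are combined at prediction time. As a minimal sketch of one common option, assuming each LDA model (one per candidate topic count) feeds its own classifier and the resulting class probabilities are simply averaged, the combination step could look like this. The function names and the probability values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Sequence


def average_probabilities(
    per_model_probs: Dict[int, List[Sequence[float]]]
) -> List[List[float]]:
    """Average class-probability rows across models.

    Keys are topic counts (one LDA model per key); each value holds one
    row of class probabilities per document. Averaging is an assumed
    combination rule, not necessarily the one used in the paper.
    """
    models = list(per_model_probs.values())
    if not models:
        raise ValueError("need at least one model")
    n_docs = len(models[0])
    n_classes = len(models[0][0])
    averaged = []
    for d in range(n_docs):
        row = [
            sum(m[d][c] for m in models) / len(models)
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(avg_probs: List[List[float]]) -> List[int]:
    """Pick the class index with the highest averaged probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in avg_probs]


# Hypothetical classifier outputs for three documents and two classes,
# produced on top of LDA models with 10, 20 and 50 topics.
PROBS_BY_TOPIC_COUNT = {
    10: [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]],
    20: [[0.70, 0.30], [0.40, 0.60], [0.35, 0.65]],
    50: [[0.80, 0.20], [0.20, 0.80], [0.40, 0.60]],
}

ENSEMBLE_PROBS = average_probabilities(PROBS_BY_TOPIC_COUNT)
ENSEMBLE_LABELS = predict_labels(ENSEMBLE_PROBS)
print(ENSEMBLE_LABELS)
```

In this toy input the 10-topic model alone would assign the third document to class 0, while the averaged ensemble assigns it to class 1, which illustrates why no single topic count has to be trusted on its own.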

Keywords

dimensionality reduction; classification; machine learning; natural language processing; topic modelling

classification; dimensionality reduction; topic modelling

Publisher

Institute of Computer and Information Systems, Faculty of Cybernetics, Military University of Technology

Journal

Computer Science and Mathematical Modelling

Year

2018

Volume

No. 8

Pages

17–28

Physical description

Bibliography: 25 items, tables, charts.

Contributors

author

Jankowski M.

maciej.jankowski@wat.edu.pl

Military University of Technology, Faculty of Cybernetics, W. Urbanowicza 2, 00-908 Warsaw, Poland

Bibliography

[1] Hofmann Th., “Probabilistic latent semantic indexing”, in: SIGIR’99: Proceedings of the 22nd annual international SIGIR conference on research and development in information retrieval, pp. 50–57, ACM, NY, USA, 1999.
[2] Griffiths T., Steyvers M., “Finding scientific topics”, in: Proceedings of the National Academy of Sciences of the United States of America 101, pp. 5228–5235, 2004.
[3] Teh Y.W., Jordan M.I., Beal M.J., Blei D.M., “Hierarchical Dirichlet Processes”, Journal of the American Statistical Association, Vol. 101, No. 476, 1566–1581 (2006).
[4] Breiman L., “Random forests”, Machine learning, 45.1, 5–32 (2001).
[5] Chang J., lda: Collapsed Gibbs Sampling Methods for Topic Models, Package v. 1.4.2, https://CRAN.R-project.org/package=lda (2015).
[6] Arun R., Suresh V., Veni Madhavan C.E., Narasimha Murthy M.N., “On finding the natural number of topics with Latent Dirichlet Allocation: some observations”, in: PAKDD’10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, pp. 391–402, Springer, Berlin 2010.
[7] Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, Sheng Tang, “A density-based method for adaptive LDA model selection”, Neurocomputing, Vol. 72, Issue 7–9, 1775–1781 (2008).
[8] Steyvers M., Griffiths T., “Probabilistic topic models”, in: T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning, Lawrence Erlbaum, 2007.
[9] Deveaud R., Sanjuan E., Bellot P., “Accurate and effective latent concept modeling for ad hoc information retrieval”, Document numérique, Vol. 17(1), 61–84 (2014).
[10] Jianqing Fan, Fang Han, Han Liu, “Challenges of Big Data analysis”, National Science Review, No. 1, 293–314 (2014).
[11] Guyon I., Gunn S., Nikravesh M., Zadeh L.A. (Eds.), Feature Extraction. Foundations and Applications, Springer, 2006.
[12] Deerwester S., Dumais S.T., Harshman R., “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, Vol. 41(6), 391–407 (1990).
[13] Lee J.A., Verleysen M., Nonlinear Dimensionality Reduction, Springer, 2007.
[14] Blei D.M., Ng A.Y., Jordan M., “Latent dirichlet allocation”, Journal of Machine Learning Research, No. 3, 993–1022 (2003).
[15] Heinrich G., “Parameter estimation for text analysis”, Technical report, 2005.
[16] MacKay D.J.C., Peto L.C.B., “A hierarchical Dirichlet language model”, Natural Language Engineering, Vol. 1, Issue 3, 289–307 (1995).
[17] Wallach H.M., “Topic Modeling: Beyond Bag-of-Words”, in: ICML’06 Proceedings of the 23rd international conference on Machine learning, pp. 977–984, ACM, NY, USA, 2006.
[18] Minka Th., Lafferty J., “Expectation-propagation for the generative aspect model”, in: UAI’02 Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp. 352–359, Morgan Kaufmann Publishers Inc., San Francisco, USA, 2002.
[19] Dempster A.P., Laird N.M., Rubin D.B., “Maximum likelihood from incomplete data via the EM algorithm”, Journal of the Royal Statistical Society. Series B, Vol. 39, No. 1, 1–38 (1977).
[20] Hastie T., Tibshirani R., Friedman J., The Elements of Statistical Learning, Springer, 2008.
[21] Asuncion A., Welling M., Smyth P., Yee Whye Teh, “On smoothing and inference for topic models”, in: UAI’09 Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 27–34, AUAI Press Arlington, USA, 2009.
[22] Minka T.P., “Estimating a Dirichlet distribution”, Annals of Physics, Vol. 2000, No. 8, 1–13 (2003).
[23] Jurafsky D., Martin J.H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Upper Saddle River, New Jersey 2006.
[24] Buntine W., “Estimating Likelihoods for Topic Models”, in: Advances in Machine Learning, ACML 2009, Lecture Notes in Computer Science, Vol. 5828, pp. 51–64, Springer, Berlin 2009.
[25] Blei D.M., McAuliffe J.D., “Supervised topic models”, in: Advances in neural information processing systems 20 (NIPS 2007), J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), pp. 121–128, Cambridge, MA: MIT Press, 2008.

Notes

Record prepared under agreement 509/P-DUN/2018 with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2019).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-a5c53026-f0a5-4258-8d42-df36d3aff52a