Article title

Building semantic user profile for Polish web news portal

Publication languages
EN
Abstracts
EN
The aim of this research is to construct meaningful user profiles that best describe user interests in the context of the media content they browse. We use two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We train our models on a collection of news articles in Polish and compare them with a model built on a general language corpus. We compare the performance of these algorithms on two practical tasks. First, we perform a qualitative analysis of the semantic relationships for similar-article retrieval, and then we evaluate the predictive performance of distinct feature combinations for user gender classification. We apply the algorithms to a real-world dataset from the Polish news service Onet. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the gender classification task, the best performance is obtained with a combination of features: topics from the article text and word embeddings from the title.
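The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors in `TOY_VECTORS` are hand-made illustrative values rather than trained Word2Vec embeddings, the topic distribution passed to `combined_features` stands in for the output of an LDA model, and all function names are hypothetical.

```python
import math

# Hand-made 3-dimensional vectors for illustration only (assumption:
# a real system would load trained Word2Vec embeddings instead).
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "election": [0.0, 0.9, 0.2],
    "vote": [0.1, 0.8, 0.3],
}


def title_embedding(title, vectors):
    """Average the vectors of the known words in a title.

    Returns a zero vector when no word of the title is in the vocabulary.
    """
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in title.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_features(topic_distribution, title, vectors):
    """Concatenate article-level topic weights with the title embedding."""
    return list(topic_distribution) + title_embedding(title, vectors)


sports_title = title_embedding("Football match", TOY_VECTORS)
other_sports_title = title_embedding("match", TOY_VECTORS)
politics_title = title_embedding("Election vote", TOY_VECTORS)

# Hypothetical two-topic distribution for the article body.
features = combined_features([0.7, 0.3], "Football match", TOY_VECTORS)
```

In this toy setup, `cosine(sports_title, other_sports_title)` is higher than `cosine(sports_title, politics_title)`, which mirrors the title-comparison use case. The `features` vector is the kind of input a gender classifier could be trained on.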
Publisher
Journal
Year
Pages
307–332
Physical description
Bibliography: 35 items, figures, tables.
Authors
  • Grupa Onet-RAS Polska
Notes
Record prepared under agreement 509/P-DUN/2018, financed from funds of the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2018).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-8d4eeacf-5f6a-4123-8d14-da9951e941c8