Article title

Improving Short Text Classification using Information from DBpedia Ontology

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
With the emergence of social networks and micro-blogs, a huge number of short text documents is generated on a daily basis, and effective tools are needed to organize and classify them. These short text documents have an extremely sparse representation, which is the main cause of poor classification performance. We propose a new approach in which we identify relevant concepts in short text documents using the DBpedia Spotlight framework and enrich the text with information derived from the DBpedia ontology, which reduces the sparseness. We have developed six variants of text enrichment methods and tested them on four short text datasets using seven classification algorithms. The obtained results were compared with those of the baseline approach, with each other, and with some state-of-the-art text classification methods. Besides classification performance, the influence of the concept similarity threshold and of the size of the training data was also evaluated. The results show that the proposed text enrichment approach significantly improves the classification of short texts and is robust with respect to different input sources, domains, and sizes of available training data. The proposed text enrichment methods proved to be beneficial for the classification of short text documents, especially when only a small number of documents is available for training.
Publisher
Year
Pages
261-297
Physical description
Bibliography: 57 items, figures, tables, charts.
Authors
  • UM FERI, Koroška cesta 46, Maribor, Maribor, Slovenia
  • UM FERI, Koroška cesta 46, Maribor, Maribor, Slovenia
Bibliography
  • [1] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 2002. 34(1):1-47. doi:10.1145/505282.505283. URL http://dl.acm.org/citation.cfm?id=505282.505283.
  • [2] Aggarwal C, Zhai C. A survey of text classification algorithms. In: Mining Text Data. Springer US. ISBN 9781461432234, 2012 pp. 163-222. doi:10.1007/978-1-4614-3223-4_6.
  • [3] Jurado F, Rodriguez P. An experience in automatically building lexicons for affective computing in multiple target languages. Computer Science and Information Systems, 2019. 16(1):273-287. doi:10.2298/CSIS171001036J. URL http://www.comsis.org/archive.php?show=ppr644-1710.
  • [4] Phan XH, Nguyen LM, Horiguchi S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceeding of the 17th international conference on World Wide Web - WWW ’08. ACM Press, New York, New York, USA. ISBN 9781605580852, 2008 p. 91. doi:10.1145/1367497.1367510. URL http://dl.acm.org/citation.cfm?id=1367497.1367510.
  • [5] Gabrilovich E, Markovitch S. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In: Proceedings of The 21st National Conference on Artificial Intelligence (AAAI). AAAI Press. ISBN 9781577352815, 2006 pp. 1301-1306. doi:10.1.1.66.3456. URL http://dl.acm.org/citation.cfm?id=1597348.1597395.
  • [6] Wang X, Chen R, Jia Y, Zhou B. Short Text Classification Using Wikipedia Concept Based Document Representation. In: 2013 International Conference on Information Technology and Applications. IEEE. ISBN 978-1-4799-2877-4, 2013 pp. 471-474. doi:10.1109/ITA.2013.114. URL http://ieeexplore.ieee.org/document/6710030/.
  • [7] Yun J, Jing L, Yu J, Huang H. A multi-layer text classification framework based on two-level representation model. Expert Systems with Applications, 2012. 39(2):2035-2046. doi:10.1016/j.eswa.2011.08.027. URL http://www.sciencedirect.com/science/article/pii/S0957417411011389.
  • [8] Mouriño-García MA, Pérez-Rodríguez R, Anido-Rifón L, Vilares-Ferro M. Wikipedia-based hybrid document representation for textual news classification. Soft Computing, 2018. 22(18):6047-6065. doi:10.1007/s00500-018-3101-5. URL http://link.springer.com/10.1007/s00500-018-3101-5.
  • [9] Mendes P, Jakob M, Bizer C. DBpedia: A Multilingual Cross-domain Knowledge Base. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012). European Language Resources Association (ELRA), 2012 URL http://www.aclweb.org/anthology/L12-1323.
  • [10] Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2014. 1:1-5.
  • [11] Mendes P, Jakob M. DBpedia spotlight: shedding light on the web of documents. Proceedings of the 7th International Conference on Semantic Systems, 2011. pp. 1-8.
  • [12] Rizzo G, Troncy R, Hellmann S, Bruemmer M. NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In: CEUR Workshop Proceedings, volume 937. 2012.
  • [13] Mendes PN, Jakob M, García-Silva A, Bizer C. DBpedia spotlight. In: Proceedings of the 7th International Conference on Semantic Systems - I-Semantics ’11. ACM Press, New York, New York, USA. ISBN 9781450306218, 2011 pp. 1-8. doi:10.1145/2063518.2063519. URL http://dl.acm.org/citation.cfm?id=2063518.2063519.
  • [14] Hahm Y, Park J, Lim K, Kim Y, Hwang D, Choi KS. Named Entity Corpus Construction using Wikipedia and DBpedia Ontology. In: Calzolari N (Conference Chair), Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland. ISBN 978-2-9517408-8-4, 2014.
  • [15] García-Silva A, Jakob M, Mendes PN, Bizer C. Multipedia: enriching DBpedia with multimedia information. In: K-CAP. 2011.
  • [16] Agerri R, Artola X, Beloki Z, Rigau G, Soroa A. Big data for Natural Language Processing: A streaming approach. Knowledge-Based Systems, 2014. doi:10.1016/j.knosys.2014.11.007. URL http://linkinghub.elsevier.com/retrieve/pii/S0950705114003992.
  • [17] Flisar J, Podgorelec V. Document Enrichment Using DBPedia Ontology for Short Text Classification. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, WIMS ’18. ACM, New York, NY, USA. ISBN 978-1-4503-5489-9, 2018 pp. 8:1-8:9. doi:10.1145/3227609.3227649. URL http://doi.acm.org/10.1145/3227609.3227649.
  • [18] Vitale D, Ferragina P, Scaiella U. Classification of short texts by deploying topical annotations. Lecture Notes in Computer Science, 2012. 7224 LNCS:376-387. doi:10.1007/978-3-642-28997-2_32.
  • [19] Bekkerman R, Allan J. Using bigrams in text categorization. Technical Report IR-408, Center of Intelligent Information Retrieval, UMass . . . , 2004.
  • [20] Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13. Curran Associates Inc., USA. ISBN 2150-8097, 2013 pp. 3111-3119. doi:10.1162/jmlr.2003.3.4-5.951. arXiv:1310.4546, URL http://arxiv.org/abs/1310.4546.
  • [21] Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. Journal of the American society for information science, 1990. 41(6):391-407.
  • [22] Yao D, Bi J, Huang JH, Zhu J. A word distributed representation based framework for large-scale short text classification. 2015 International Joint Conference on Neural Networks (IJCNN), 2015. pp. 1-7.
  • [23] de Buenaga Rodríguez M, Hidalgo JMG, Díaz-Agudo B. Using WordNet to Complement Training Information in Text Categorization. CoRR, 1997. cmp-lg/9709007. URL http://arxiv.org/abs/cmp-lg/9709007.
  • [24] Elberrichi Z, Rahmoun A, Bentaalah MA. Using WordNet for Text Categorization, 2006.
  • [25] Nezreg H, Lehbab H, Belbachir H. Conceptual Representation Using WordNet for Text Categorization. International Journal of Computer and Communication Engineering, 2014. 3(1):27-30. doi:10.7763/IJCCE.2014.V3.286. URL http://www.ijcce.org/index.php?m=content&c=index&a=show&catid=40&id=341.
  • [26] Hotho A, Staab S, Stumme G. Wordnet improves Text Document Clustering. In: Proc. of the SIGIR 2003 Semantic Web Workshop. 2003 pp. 541-544.
  • [27] Sun A. Short text classification using very few words. In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’12. ACM Press, New York, New York, USA. ISBN 9781450314725, 2012 p. 1145. doi:10.1145/2348283.2348511. URL http://dl.acm.org/citation.cfm?doid=2348283.2348511.
  • [28] Janik M, Kochut KJ. Wikipedia in Action: Ontological Knowledge in Text Categorization. 2008 IEEE International Conference on Semantic Computing, 2008. pp. 268-275. doi:10.1109/ICSC.2008.53. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4597201.
  • [29] Wang P, Domeniconi C. Building semantic kernels for text classification using wikipedia. Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD 08, 2008. p. 713. doi:10.1145/1401890.1401976. URL http://dl.acm.org/citation.cfm?doid=1401890.1401976.
  • [30] Zhu Y, Li L, Luo L. Learning to classify short text with topic model and external knowledge. Lecture Notes in Computer Science, 2013. 8041 LNAI:493-503. doi:10.1007/978-3-642-39787-5_41.
  • [31] Chen M, Jin X, Shen D. Short text classification improved by learning multi-granularity topics. In: IJCAI International Joint Conference on Artificial Intelligence. ISBN 9781577355120, 2011 pp. 1776-1781. doi:10.5591/978-1-57735-516-8/IJCAI11-298.
  • [32] Ferragina P, Scaiella U. Fast and Accurate Annotation of Short Texts with Wikipedia Pages. IEEE Software, 2012. 29(1):70-75. doi:10.1109/MS.2011.122. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6035657.
  • [33] Romero SAP, Becker K. Experiments with Semantic Enrichment for Event Classification in Tweets. In: Proceedings - 2016 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2016. ISBN 9781509044702, 2017 pp. 503-506. doi:10.1109/WI.2016.0084.
  • [34] Leal J. Using proximity to compute semantic relatedness in RDF graphs. Computer Science and Information Systems, 2013. 10(4):1727-1746. doi:10.2298/CSIS121130060L. URL http://www.doiserbia.nb.rs/Article.aspx?ID=1820-02141300060L.
  • [35] Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S. DBpedia - A Crystallization Point for the Web of Data. Web Semant., 2009. 7(3):154-165. doi:10.1016/j.websem.2009.07.002. URL http://dx.doi.org/10.1016/j.websem.2009.07.002.
  • [36] DBPedia. DBpedia, 2018. URL http://dbpedia.org/.
  • [37] Yang L, Li C, Ding Q, Li L. Combining Lexical and Semantic Features for Short Text Classification. Procedia Computer Science, 2013. 22:78-86. doi:10.1016/j.procs.2013.09.083. URL http://linkinghub.elsevier.com/retrieve/pii/S1877050913008764.
  • [38] Daiber J, Jakob M, Hokamp C, Mendes PN. Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th International Conference on Semantic Systems - I-SEMANTICS ’13. ACM Press, New York, New York, USA. ISBN 9781450319720, 2013 p. 121. doi:10.1145/2506182.2506198. arXiv:1011.1669v3, URL http://doi.acm.org/10.1145/2506182.2506198.
  • [39] Maldonado EdS, Shihab E, Tsantalis N. Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt. IEEE Transactions on Software Engineering, 2017. 43(11):1044-1062. doi:10.1109/TSE.2017.2654244. URL http://ieeexplore.ieee.org/document/7820211/.
  • [40] Sahami M, Heilman TD. A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th international conference on World Wide Web. ACM, 2006 pp. 377-386.
  • [41] Batanovic V, Bojic D. Using part-of-speech tags as deep-syntax indicators in determining short-text semantic similarity. Computer Science and Information Systems, 2015. 12(1):1-31. doi:10.2298/CSIS131127082B. URL http://www.doiserbia.nb.rs/Article.aspx?ID=1820-02141400082B.
  • [42] Manning C, Raghavan P, Schütze H. Introduction to information retrieval. Natural Language Engineering, 2010. 16(1):100-103.
  • [43] Hotho A, Nürnberger A, Paaß G. A Brief Survey of Text Mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 2005. 20:19-62. doi:10.1111/j.1365-2621.1978.tb09773.x. URL http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf.
  • [44] McCallum A, Nigam K. A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Workshop on Learning for Text Categorization, 1998. pp. 41-48. doi:10.1.1.46.1529. 0-387-31073-8, URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.9324&rep=rep1&type=pdf.
  • [45] Salton G (ed.). The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.
  • [46] Hui GG, Wang H, Bell D, Bi Y, Greer K. Using kNN Model-based Approach for Automatic Text. In: Proc. of ODBASE’03, the 2nd International Conference on Ontologies, Database and Applications of Semantics, LNCS. 2003 pp. 986-996.
  • [47] Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D. Top 10 algorithms in data mining. Knowledge and Information Systems, 2008. 14(1):1-37. doi:10.1007/s10115-007-0114-2. URL http://link.springer.com/10.1007/s10115-007-0114-2.
  • [48] Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997. 55(1):119-139. doi:10.1006/jcss.1997.1504. URL http://www.sciencedirect.com/science/article/pii/S002200009791504X.
  • [49] Breiman L. Random Forests. Mach. Learn., 2001. 45(1):5-32. doi:10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
  • [50] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 2009. 11(1):10-18. doi:10.1145/1656274.1656278. URL http://doi.acm.org/10.1145/1656274.1656278.
  • [51] Demšar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 2006. 7:1-30. URL http://dl.acm.org/citation.cfm?id=1248547.1248548.
  • [52] Wang P, Hu J, Zeng HJ, Chen Z. Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems, 2009. 19(3):265-281. doi:10.1007/s10115-008-0152-4. URL http://link.springer.com/10.1007/s10115-008-0152-4.
  • [53] Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of machine Learning research, 2003. 3(Jan):993-1022.
  • [54] Strub F, De Vries H, Mary J, Piot B, Courvile A, Pietquin O. End-to-end optimization of goal-driven and visually grounded dialogue systems. IJCAI International Joint Conference on Artificial Intelligence, 2017. pp. 2765-2771. doi:10.1162/153244303322533223. arXiv:1703.05423, URL http://arxiv.org/pdf/1301.3781v3.pdf.
  • [55] Flisar J, Podgorelec V. Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding. IEEE Access, 2019. 7:106475-106494. doi:10.1109/ACCESS.2019.2933318.
  • [56] Mouriño-García MA, Pérez-Rodríguez R, Anido-Rifón L, Vilares-Ferro M. Wikipedia-based hybrid document representation for textual news classification. Soft Computing, 2018. 22(18):6047-6065.
  • [57] Colas F, Brazdil P. Comparison of SVM and Some Older Classification Algorithms in Text Classification Tasks. In: Bramer M (ed.), Artificial Intelligence in Theory and Practice. Springer US, Boston, MA. ISBN 978-0-387-34747-9, 2006 pp. 169-178.
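Notes
Record prepared with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2020).
Document type
Bibliography
YADDA identifier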
bwmeta1.element.baztech-a660db1b-13f8-4aba-8d71-54fc4133918b