Using Word Embeddings for Italian Crime News Categorization

Bonisoli, Giovanni; Rollo, Federica; Po, Laura

doi:10.15439/2021F118

Article - details

Article title

Using Word Embeddings for Italian Crime News Categorization

Authors

Bonisoli Giovanni, Rollo Federica, Po Laura

Selected full texts from this journal

http://annals-csis.org

Identifiers

DOI

10.15439/2021F118

Title variants

Conference

16th Federated Conference on Computer Science and Information Systems (2-5 September 2021, online)

Publication languages

Abstracts

Several studies have shown that embeddings improve results in many NLP tasks, including text categorization. In this paper, we focus on how word embeddings can be used to categorize newspaper articles about crimes according to the type of crime they report. Our approach was tested on an Italian dataset of 15,361 crime news articles, combining different Word2Vec models and applying both supervised and unsupervised Machine Learning categorization algorithms. The tests show very promising results.
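
The general idea in the abstract, representing each article through word embeddings and then assigning a crime category, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three-dimensional vectors in `TOY_VECTORS` are hand-written stand-ins rather than trained Word2Vec embeddings, averaging word vectors is only one common way to obtain a document vector, and the nearest-centroid rule with cosine similarity is a simple supervised classifier chosen for brevity, not necessarily one of the algorithms the paper evaluates.

```python
import math

# Toy 3-dimensional "embeddings": hypothetical values written by hand,
# NOT vectors produced by a trained Word2Vec model.
TOY_VECTORS = {
    "furto": [0.9, 0.1, 0.0],      # theft
    "ladro": [0.8, 0.2, 0.1],      # thief
    "rubato": [0.85, 0.05, 0.1],   # stolen
    "droga": [0.1, 0.9, 0.0],      # drugs
    "spaccio": [0.0, 0.95, 0.1],   # drug dealing
    "cocaina": [0.1, 0.85, 0.05],  # cocaine
}


def document_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(labelled_docs):
    """Compute one mean document vector (centroid) per category label."""
    grouped = {}
    for tokens, label in labelled_docs:
        vec = document_vector(tokens)
        if vec is not None:
            grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(tokens, centroids):
    """Return the label whose centroid is most similar to the document vector."""
    vec = document_vector(tokens)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical, already tokenized training articles with their categories.
TRAIN = [
    (["ladro", "furto"], "theft"),
    (["rubato", "furto"], "theft"),
    (["droga", "spaccio"], "drug dealing"),
    (["cocaina", "spaccio"], "drug dealing"),
]
CENTROIDS = fit_centroids(TRAIN)
```

Calling `predict(["ladro", "rubato"], CENTROIDS)` assigns the article to the category whose centroid is closest to it. In a realistic setting the toy dictionary would be replaced by vectors from a Word2Vec model trained on Italian text, and the classifier by whichever supervised or unsupervised algorithm is under evaluation.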

Keywords

artificial intelligence; natural language processing; text analysis

Publisher

Polskie Towarzystwo Informatyczne

Journal

Annals of Computer Science and Information Systems

Year

2021

Volume

Vol. 25

Pages

461-470

Physical description

26 references, illustrations, tables

Contributors

author

Bonisoli Giovanni

204058@studenti.unimore.it

Enzo Ferrari Engineering Department, University of Modena and Reggio Emilia, Italy

author

Rollo Federica

federica.rollo@unimore.it

Enzo Ferrari Engineering Department, University of Modena and Reggio Emilia, Italy

author

Po Laura

laura.po@unimore.it

Enzo Ferrari Engineering Department, University of Modena and Reggio Emilia, Italy

Notes

1. Track 3: Advances in Information Systems and Technology

2. Session: 27th Conference on Knowledge Acquisition and Management

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-58774571-da5a-4181-a9b7-e12be1ce6202