Similarity detection based on document matrix model and edit distance algorithm

Niewiarowski, Artur

doi:10.24423/cames.277

Artykuł - szczegóły

Tytuł artykułu

Similarity detection based on document matrix model and edit distance algorithm

Autorzy

Niewiarowski Artur

Wybrane pełne teksty z tego czasopisma

http://cames.ippt.gov.pl/

Identyfikatory

DOI

10.24423/cames.277

Warianty tytułu

Języki publikacji

Abstrakty

This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.

Słowa kluczowe

plagiarism detection plagiarism system edit distance Levenshtein distance similarity measure text mining information retrieval

wykrywanie plagiatu system plagiatu edycja odległości odległość Levenshteina miara podobieństwa eksploracja tekstu wyszukiwanie informacji

Wydawca

Instytut Podstawowych Problemów Techniki PAN

Czasopismo

Computer Assisted Methods in Engineering and Science

Rocznik

2019

Tom

vol. 26, no. 3-4

Strony

163--175

Opis fizyczny

Bibliogr. 18 poz., tab., wykr.

Twórcy

autor

Niewiarowski Artur

artur.niewiarowski@pk.edu.pl

Cracow University of Technology Department of Computer Science Faculty of Computer Science and Telecommunications Warszawska 24, 31-155 Cracow, Poland

Bibliografia

[1] S. Albitar, S. Fournier, B. Espinasse. An effective TF/IDF-based text-to-text semantic similarity measure fortext classification. In: Proceedings of the Web information systems engineering – WISE 2014. Part I, pp. 105–114,15th International Conference, Thessaloniki, Greece, October 12–14, 2014.
[2] J. Dawson. Suffix removal and word conflation. ALLC Bulletin,2(3): 33–46, 1974.
[3] A. Dziob, M. Piasecki. Implementation of the verb model in plWordNet 4.0. In: Proceedings of the 9th Global Wordnet Conference, pp. 114–123, Singapore, 8–12 January 2018.
[4] R. Gabrys, E. Yaakobi, O. Milenkovic. Codes in the Damerau distance for DNA storage. In:2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 2644–2648, 2016, doi:10.1109/ISIT.2016.7541778.
[5] R. Krovetz. Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACMSIGIR conference on Research and development in information retrieval, Pittsburgh, USA, pp. 191–202, 1993.
[6] V.I. Levenshtein. Binary codes for correcting dropouts, inserts, and symbol substitutions [in Russian: Двоичиые коды с исправлением вышадеий, вставок и замещений символов]. Reports of the Academy of Sciences of the USSR,163(4): 845–848, 1965.
[7] J. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics,11(1–2): 22–31, 1968.
[8] M.M. Maulana, R. Arifudin, A. Alamsyah. Autocomplete and spell checking Levenshtein distance algorithm togetting Text Suggest Error Data Searching in Library. Scientific Journal of Informatics,5(1): 67–75, 2018.
[9] M. Maziarz, M. Piasecki, E. Rudnicka. plWordnet – the process of making a thesaurus [in Polish: Słowosieć – polski wordnet. Proces tworzenia tezaurusa].Polonica,34: 79–98, 2014.
[10] A. Niewiarowski. Short text similarity algorithm based on the edit distance and thesaurus [in Polish: Algorytm podobieństwa krótkich fragmentów tekstów oparty na odległości edycyjnej i słowniku wyrazów bliskoznacznych].Czasopismo Techniczne. Nauki Podstawowe/Technical Transactions/Fundamental Sciences.1: 159–173, 2016.
[11] M. Piasecki, T. Walkowiak, M. Eder. Open stylometric system WebSty: integrated language processing, analysis and visualisation. Computational Methods in Science and Technology,24(1): 43–58, 2018.
[12] M.F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems,14(3):130–137, 1980.
[13] T.A. Runkler, J.C. Bezdek. Web mining with relational clustering: International Journal of Approximate Rea-soning,32(2–3): 217–236, 2003.
[14] R.S. Sandeep, Ch. Vinay, M.S. Hemant. Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies,4(2): 265–269, 2013.
[15] V. Sandulescu, M. Ester. Detecting Singleton review spammers using semantic similarity. In:ProceedingWWW’15 Companion Proceedings of the 24th International Conference on World Wide Web, pp. 971–976,May 18–22, 2015, Florence, Italy.
[16] B. Wahiba, K. Abdessalem. A new stemmer to improve information retrieval. International Journal of Network Security & Its Applications,5(4): 143–154, 2013.
[17] D. Weiss. A survey of freely available Polish stemmers and evaluation of their applicability in information retrieval. In:2nd Language and Technology Conference, pp. 216–221, Poznań, Poland, 2005.
[18] D. Weiss. Stempelator: A hybrid stemmer for the Polish language. Research Report RA-002/05. Institute of Computing Science, Poznań University of Technology, Poland, 2005.

Uwagi

Opracowanie rekordu ze środków MNiSW, umowa Nr 461252 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2020).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-98997a47-aadd-4458-9c59-e11dc311e0cf