PL EN


Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników
Tytuł artykułu

Similarity detection based on document matrix model and edit distance algorithm

Wybrane pełne teksty z tego czasopisma
Identyfikatory
Warianty tytułu
Języki publikacji
EN
Abstrakty
EN
This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.
Rocznik
Strony
163--175
Opis fizyczny
Bibliogr. 18 poz., tab., wykr.
Twórcy
  • Cracow University of Technology Department of Computer Science Faculty of Computer Science and Telecommunications Warszawska 24, 31-155 Cracow, Poland
Bibliografia
  • [1] S. Albitar, S. Fournier, B. Espinasse. An effective TF/IDF-based text-to-text semantic similarity measure fortext classification. In: Proceedings of the Web information systems engineering – WISE 2014. Part I, pp. 105–114,15th International Conference, Thessaloniki, Greece, October 12–14, 2014.
  • [2] J. Dawson. Suffix removal and word conflation. ALLC Bulletin,2(3): 33–46, 1974.
  • [3] A. Dziob, M. Piasecki. Implementation of the verb model in plWordNet 4.0. In: Proceedings of the 9th Global Wordnet Conference, pp. 114–123, Singapore, 8–12 January 2018.
  • [4] R. Gabrys, E. Yaakobi, O. Milenkovic. Codes in the Damerau distance for DNA storage. In:2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 2644–2648, 2016, doi:10.1109/ISIT.2016.7541778.
  • [5] R. Krovetz. Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACMSIGIR conference on Research and development in information retrieval, Pittsburgh, USA, pp. 191–202, 1993.
  • [6] V.I. Levenshtein. Binary codes for correcting dropouts, inserts, and symbol substitutions [in Russian: Двоичиые коды с исправлением вышадеий, вставок и замещений символов]. Reports of the Academy of Sciences of the USSR,163(4): 845–848, 1965.
  • [7] J. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics,11(1–2): 22–31, 1968.
  • [8] M.M. Maulana, R. Arifudin, A. Alamsyah. Autocomplete and spell checking Levenshtein distance algorithm togetting Text Suggest Error Data Searching in Library. Scientific Journal of Informatics,5(1): 67–75, 2018.
  • [9] M. Maziarz, M. Piasecki, E. Rudnicka. plWordnet – the process of making a thesaurus [in Polish: Słowosieć – polski wordnet. Proces tworzenia tezaurusa].Polonica,34: 79–98, 2014.
  • [10] A. Niewiarowski. Short text similarity algorithm based on the edit distance and thesaurus [in Polish: Algorytm podobieństwa krótkich fragmentów tekstów oparty na odległości edycyjnej i słowniku wyrazów bliskoznacznych].Czasopismo Techniczne. Nauki Podstawowe/Technical Transactions/Fundamental Sciences.1: 159–173, 2016.
  • [11] M. Piasecki, T. Walkowiak, M. Eder. Open stylometric system WebSty: integrated language processing, analysis and visualisation. Computational Methods in Science and Technology,24(1): 43–58, 2018.
  • [12] M.F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems,14(3):130–137, 1980.
  • [13] T.A. Runkler, J.C. Bezdek. Web mining with relational clustering: International Journal of Approximate Rea-soning,32(2–3): 217–236, 2003.
  • [14] R.S. Sandeep, Ch. Vinay, M.S. Hemant. Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies,4(2): 265–269, 2013.
  • [15] V. Sandulescu, M. Ester. Detecting Singleton review spammers using semantic similarity. In:ProceedingWWW’15 Companion Proceedings of the 24th International Conference on World Wide Web, pp. 971–976,May 18–22, 2015, Florence, Italy.
  • [16] B. Wahiba, K. Abdessalem. A new stemmer to improve information retrieval. International Journal of Network Security & Its Applications,5(4): 143–154, 2013.
  • [17] D. Weiss. A survey of freely available Polish stemmers and evaluation of their applicability in information retrieval. In:2nd Language and Technology Conference, pp. 216–221, Poznań, Poland, 2005.
  • [18] D. Weiss. Stempelator: A hybrid stemmer for the Polish language. Research Report RA-002/05. Institute of Computing Science, Poznań University of Technology, Poland, 2005.
Uwagi
Opracowanie rekordu ze środków MNiSW, umowa Nr 461252 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2020).
Typ dokumentu
Bibliografia
Identyfikator YADDA
bwmeta1.element.baztech-98997a47-aadd-4458-9c59-e11dc311e0cf
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.