Short text similarity algorithm based on the edit distance and thesaurus

Niewiarowski, A.

doi:10.4467/2353737XCT.16.149.5760

Article - details

Article title

Short text similarity algorithm based on the edit distance and thesaurus

Authors

Niewiarowski A.

Selected full texts from this journal

http://repozytorium.biblos.pk.edu.pl/resources/35440

Identifiers

DOI

10.4467/2353737XCT.16.149.5760

Title variants

Algorytm podobieństwa krótkich fragmentów tekstów oparty na odległości edycyjnej i słowniku wyrazów bliskoznacznych

Publication languages

Abstracts

This paper proposes a method for comparing short texts that uses the Levenshtein distance algorithm and a thesaurus to analyse the terms contained in the texts, instead of the popular methods that rely on a glossary of grammatical variations. The tested texts contain a variety of nouns and verbs, together with grammatical and orthographical mistakes. The proposed algorithm is used to estimate the similarity of such texts. The described technique is compared with the cosine, Dice and Jaccard distances, computed with the term frequency method. The proposed approach is competitive with the well-known stemming and lemmatization algorithms.
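The general idea of combining edit distance with a thesaurus can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in this record. The word-level normalization, the best-match averaging, the toy synonym groups in TOY_THESAURUS, and the function names are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical words."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical toy thesaurus: each set is one group of synonyms.
TOY_THESAURUS = [
    {"car", "automobile", "vehicle"},
    {"quick", "fast", "rapid"},
]


def variants(word: str, thesaurus) -> set:
    """Return the word together with every synonym listed for it."""
    result = {word}
    for group in thesaurus:
        if word in group:
            result |= group
    return result


def directed_similarity(words_a, words_b, thesaurus) -> float:
    """Average, over words_a, of the best match among words_b and their synonyms."""
    targets = set()
    for word in words_b:
        targets |= variants(word, thesaurus)
    total = 0.0
    for word in words_a:
        total += max(
            word_similarity(candidate, target)
            for candidate in variants(word, thesaurus)
            for target in targets
        )
    return total / len(words_a)


def text_similarity(text_a: str, text_b: str, thesaurus=TOY_THESAURUS) -> float:
    """Symmetric similarity in [0, 1] for two short texts (assumed scoring scheme)."""
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    forward = directed_similarity(words_a, words_b, thesaurus)
    backward = directed_similarity(words_b, words_a, thesaurus)
    return (forward + backward) / 2
```

In this sketch a misspelled word such as "quik" still scores close to its intended synonym group, because the edit distance is taken against every synonym of the words on the other side, without stemming or lemmatizing either text.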

The article presents a proposed method for comparing short text fragments, based on the Levenshtein distance algorithm and a thesaurus. The compared texts contain inflected terms as well as deliberate orthographic and grammatical errors. The described mechanism is compared with popular text comparison methods such as the cosine, Dice and Jaccard distances, for which the vector values are computed with the term frequency method. Using a thesaurus in the mechanism is an alternative to the well-known stemming and lemmatization algorithms in text data analysis.
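The baseline measures named in the abstracts can be sketched in Python as follows. The sketch assumes the vector-space forms of the Dice and Jaccard coefficients applied to raw term-frequency vectors, and whitespace tokenization. The record does not state which exact variants the paper uses, so these formulas and the function names are assumptions. Each function returns a similarity; the corresponding distance is 1 minus that value.

```python
from collections import Counter
from math import sqrt


def tf_vector(text: str) -> Counter:
    """Raw term-frequency vector: token -> number of occurrences."""
    return Counter(text.lower().split())


def dot(u: Counter, v: Counter) -> float:
    """Dot product over the terms that both vectors share."""
    return sum(u[term] * v[term] for term in u.keys() & v.keys())


def cosine_similarity(u: Counter, v: Counter) -> float:
    denominator = sqrt(dot(u, u)) * sqrt(dot(v, v))
    return dot(u, v) / denominator if denominator else 0.0


def dice_similarity(u: Counter, v: Counter) -> float:
    denominator = dot(u, u) + dot(v, v)
    return 2 * dot(u, v) / denominator if denominator else 0.0


def jaccard_similarity(u: Counter, v: Counter) -> float:
    denominator = dot(u, u) + dot(v, v) - dot(u, v)
    return dot(u, v) / denominator if denominator else 0.0


if __name__ == "__main__":
    clean = tf_vector("the cat sat")
    misspelled = tf_vector("teh cta sat")
    # Only the exactly matching token "sat" contributes to any of the scores.
    print(cosine_similarity(clean, misspelled))
    print(dice_similarity(clean, misspelled))
    print(jaccard_similarity(clean, misspelled))
```

Because these measures match tokens exactly, a misspelled or inflected term counts as a different term unless it is first normalized, for example by stemming or lemmatization.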

Keywords

Levenshtein distance algorithm, edit distance, thesaurus, measure of texts similarity, plagiarism detection, text mining, natural language processing, Natural Language Understanding, stemming, lemmatisation

Publisher

Wydawnictwo Politechniki Krakowskiej im. Tadeusza Kościuszki

Journal

Czasopismo Techniczne. Nauki Podstawowe

Year

2016

Volume

Y. 113, iss. 1-NP

Pages

159-173

Physical description

Bibliography: 18 items, formulas, charts, tables.

Contributors

author

Niewiarowski A.

aniewiarowski@pk.edu.pl

Institute of Computer Science, Faculty of Physics, Mathematics and Computer Science of Cracow University of Technology

Bibliography

[1] Niewiarowski A., Term frequency optimization for the vector space model, “Technical Transactions”, 9-M/2012, 155-165.
[2] Yih W., Meek Ch., Improving Similarity Measures for Short Segments of Text, Microsoft Research, USA 2007.
[3] Long-Scheng Cz., Chia-Wei Ch., A New Term Weighting Method by Introducing Class Information for Sentiment Classification of Textual Data, “Proceedings of the International MultiConference of Engineers and Computer Scientists”, IMECS 2011, 394-397.
[4] Metzler D., Dumais S., Meek Ch., Similarity Measures for Short Segments of Text, Microsoft Research, USA 2007.
[5] Piasecki M., Broda B., Semantic similarity measure of Polish nouns based on linguistic features, “Business Information Systems 10th International Conference, Lecture Notes in Computer Science, Springer”, BIS 2007, 381-390.
[6] Novay L.G., Novay Ch. W., Brussee R., Thesaurus Based Term Ranking for Keyword Extraction, “DEXA’10 Proceedings of the 2010 Workshops on Database and Expert Systems Applications, Computer Society”, 2010.
[7] Castillo Sequera J.L., Fernandez del Castillo Diez J.R., Gonzales Sotos L., A clustering algorithm based on a recursive function of distance and similarity, “IADIS European Conference Data Mining” 2011, 43-50.
[8] Szwed P., Concepts extraction from unstructured Polish texts: a rule based approach, “Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, Springer”, 355-364.
[9] Lovins J.B., Development of a Stemming Algorithm, “Mechanical Translation and Computational Linguistics”, vol. 11, nos. 1 and 2, 1968.
[10] Willett P., The Porter stemming algorithm: then and now, “Program”, Vol. 40, 219-223.
[11] Abramowicz W., Filipowska A., Małyszko J., Wagner T., Lemmatization of Multi-Word Entity Names for Polish Language Using Rules Automatically Generated Based on the Corpus Analysis, “Language and Technology Conference”, 2009, 540-544.
[12] Niewiarowski A., Stanuszek M., Parallelization of the Levenshtein distance algorithm, “Technical Transactions”, 3-NP/2014, 109-122.
[13] Niewiarowski A., Działanie parsera Part-of-Speech Tagging w ujęciu mechanizmu Web Content Mining, 6th National Conference “Science and Industry”, 2011, 93-100.
[14] Niewiarowski A., Stanuszek M., The mechanism of identification and classification of content, “Studia Informatica”, Volume 34, Number 2B (112), 2013, 205-222.
[15] Niewiarowski A., Mechanism of plagiarism detection based on the variation of the Levenshtein distance algorithm, 5th National Conference “Science and Industry”, 2010, 86-103.
[16] Левенштейн В.И., Двоичные коды с исправлением выпадений, вставок и замещений символов, “Доклады Академии Наук СССР”, 163 (4), 1965, 845-848.
[17] Singhal A., Modern Information Retrieval: A Brief Overview, “Bulletin of the IEEE Computer Society Technical Committee on Data Engineering”, 24 (4), 2001, 35-43.
[18] Rajaraman A., Ullman J.D., Data Mining, “Mining of Massive Datasets”, Cambridge University Press, 2014, 1-17.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-c301e900-c1b8-420f-8e29-ca46cdee5739