Search results - BazTech

1

Niewiarowski Artur

Computer Assisted Methods in Engineering and Science

|

2019

|

vol. 26, no. 3-4

163--175

EN

This paper presents a new algorithm for measuring the similarity between two text documents. The method is built on a so-called “edit distance matrix” (similarity matrix), whose elements are computed with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. The proposed algorithm is fast and precise, and can be applied to very large documents (e.g., books, diploma works, newspapers). It also appears to be versatile across the most common European languages, such as Polish, English, German, French and Russian. The tool is intended for university employees and students who need to detect the level of similarity between analyzed documents. The results were confirmed by the tests reported in the article.
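The abstract does not give the formula used to fill the matrix, so the following is only a minimal sketch of the general idea, assuming a simple length-normalised score (`1 - distance / longest`) as a stand-in. The names `similarity` and `similarity_matrix` are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similarity_matrix(doc_a: list[str], doc_b: list[str]) -> list[list[float]]:
    """Entry [i][j] compares sentence i of doc_a with sentence j of doc_b."""
    return [[similarity(sa, sb) for sb in doc_b] for sa in doc_a]


if __name__ == "__main__":
    first = ["The cat sat on the mat.", "It was raining."]
    second = ["The cats sat on the mat.", "The sun was shining."]
    for row in similarity_matrix(first, second):
        print([round(value, 2) for value in row])
```

Because inflected or misspelled variants differ by only a few edits, they still score close to 1.0 here, which is the property that lets an edit-distance approach stand in for stemming or lemmatization.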

2

The evaluation of text string matching algorithms as an aid to image search

Ochelska-Mierzejewska J.

Journal of Applied Computer Science

|

2018

|

Vol. 26, no. 1

33--62

EN

The main goal of this paper is to analyse intelligent text string matching methods (like fuzzy sets and relations) and evaluate their usefulness for image search. The present study examines the ability of different algorithms to handle multi-word and multi-sentence queries. Eight different similarity measures (N-gram, Levenshtein distance, Jaro coefficient, Dice coefficient, Overlap coefficient, Euclidean distance, Cosine similarity and Jaccard similarity) are employed to analyse the algorithms in terms of time complexity and accuracy of results. The outcomes are used to develop a hierarchy of methods, illustrating their usefulness to image search. The search response time increases significantly in the case of data sets containing several thousand images. The findings indicate that the analysed algorithms do not fulfil the response-time requirements of professional applications. Due to its limitations, the proposed system should be considered only as an illustration of a novel solution with further development perspectives. The use of Polish as the language of experiments affects the accuracy of measures. This limitation seems to be easy to overcome in the case of languages with simpler grammar rules (e.g. English).
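Three of the set-based measures named in this abstract (Jaccard, Dice and Overlap) can be illustrated on character n-grams. This is a rough sketch under assumed definitions, not the implementation evaluated in the paper; the helper `ngrams` and the choice of bigrams are illustrative assumptions.

```python
def ngrams(text: str, n: int = 2) -> set[str]:
    """Set of character n-grams of the lower-cased text (assumed tokenisation)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """|A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set[str], b: set[str]) -> float:
    """2 * |A & B| / (|A| + |B|); 0.0 when both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set[str], b: set[str]) -> float:
    """|A & B| / min(|A|, |B|); 0.0 when either set is empty."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


if __name__ == "__main__":
    x, y = ngrams("night"), ngrams("nacht")
    print(jaccard(x, y), dice(x, y), overlap(x, y))
```

All three measures depend only on set sizes, so they are cheap per comparison; the cost in a search setting comes from comparing the query against every stored description.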

3

Niewiarowski A.

Czasopismo Techniczne. Nauki Podstawowe

|

2016

|

Y. 113, iss. 1-NP

159--173

EN

This paper proposes a method of comparing short texts that uses the Levenshtein distance algorithm and a thesaurus to analyse the terms in the texts, instead of the popular methods that rely on a glossary of grammatical variations. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. The similarity of such texts is estimated with the proposed new algorithm. The described technique is compared with the cosine, Dice and Jaccard distances built on the term frequency method. The proposed approach is competitive with well-known stemming and lemmatization algorithms.

PL

This article proposes a method for comparing short text fragments based on the Levenshtein distance algorithm and a thesaurus. The compared texts contain inflected terms as well as deliberate spelling and grammatical errors. The described mechanism is compared with popular text comparison methods such as the cosine, Dice and Jaccard distances, for which the vector values are computed using the term frequency method. Using a thesaurus in the mechanism is an alternative to well-known stemming and lemmatization algorithms in text data analysis.
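For context, the term-frequency baseline mentioned in this record can be sketched as cosine similarity over word-count vectors. This is a minimal illustration that assumes whitespace tokenisation; it is not the paper's experimental setup, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Word counts of the lower-cased text, split on whitespace (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    left = term_frequencies("the cat sat")
    right = term_frequencies("the cats sat")
    print(cosine_similarity(left, right))
```

With exact term matching, "cat" and "cats" count as unrelated terms, which is why such baselines are normally paired with stemming or lemmatization and why an edit-distance comparison of terms is a plausible alternative.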

4

Sentence sentiment classification using fuzzy word matching combined with fuzzy sentiment classifier

Pietras M.

Przegląd Elektrotechniczny

|

2015

|

Y. 91, no. 2

107--111

EN

This article focuses on the semantic tagging of content in terms of sentiment, which may often lead to ambiguities between the primary sense of a word and its meaning in a particular expression. To address this issue, a specially modified Levenshtein distance algorithm with suffix mitigation was used to measure the similarity of words. Sentence sentiment classification was based on a fuzzy logic approach and a fuzzy classifier. The presented method was tested experimentally through sentiment analysis of selected sentences in the Polish language. Limitations of the presented method and possible improvements are discussed.

PL

This article focuses on the semantic tagging of text content in terms of sentiment, which may often lead to ambiguity between the primary connotation of a word and its meaning in a given utterance. To address this issue, a specially modified Levenshtein distance algorithm that mitigates the influence of a word's inflectional ending was used to measure word similarity. Sentence sentiment classification was based on fuzzy logic and a fuzzy classifier approach. The presented method was verified experimentally through sentiment analysis of selected sentences in Polish. Limitations of the presented method and possible improvements are also discussed.

5

Invariant Levenstein Distance as an example of the Hausdorff Distance

Bartyzel K.

Journal of Applied Computer Science

|

2014

|

Vol. 22, no. 2

7--17

EN

The properties and applications of chain codes have been studied around the world for many years. One of their most important uses is describing object contours and, therefore, comparing objects. In previous work, the author presented a modification of the Levenshtein distance that allows Brownian strings to be compared. In this paper, the author extends the study of the developed distance and confirms that the obtained measure is a true mathematical metric.

6

Parallelization of the Levenshtein distance algorithm

Niewiarowski A., Stanuszek M.

Czasopismo Techniczne. Nauki Podstawowe

|

2014

|

Y. 111, iss. 3-NP

109--122

EN

This paper presents a method for the parallelization of the Levenshtein distance algorithm deployed on very large strings. The proposed approach was accomplished using .NET Framework 4.0 technology with a specific implementation of threads using the System.Threading.Task namespace library. The algorithms developed in this study were tested on a high performance machine using Xamarin Mono (for Linux RedHat/Fedora OS). The computational results demonstrate a high level of efficiency of the proposed parallelization procedure.

PL

This article presents a method for parallelizing the Levenshtein edit distance algorithm, intended for very large text strings. The proposed solution was implemented on the .NET Framework 4.0 platform using methods available in the System.Threading.Task namespace. The algorithms were tested on a high-performance computer using Xamarin Mono tools (for the Linux RedHat/Fedora OS). The results show significantly increased computational performance for the solutions presented in the article.
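The abstracts do not say how the work is split between threads. One common way to expose parallelism in the Levenshtein matrix, assumed here purely for illustration, is to process it by anti-diagonals: every cell on a given anti-diagonal depends only on the two preceding anti-diagonals, so those cells could be computed concurrently. The sketch below runs sequentially and only demonstrates that traversal order; it is not the .NET implementation described in the paper.

```python
def levenshtein_wavefront(a: str, b: str) -> int:
    """Edit distance computed anti-diagonal by anti-diagonal (sequential sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # deleting i characters of a
    for j in range(cols):
        d[0][j] = j  # inserting j characters of b

    # Anti-diagonal k contains the interior cells (i, j) with i + j == k.
    for k in range(2, rows + cols - 1):
        i_start = max(1, k - (cols - 1))
        i_end = min(rows - 1, k - 1)
        # Cells visited by this inner loop read only anti-diagonals k - 1
        # and k - 2, so a parallel version could hand each one to a worker.
        for i in range(i_start, i_end + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    print(levenshtein_wavefront("kitten", "sitting"))
```

In practice a parallel version would group cells into blocks rather than schedule single cells, since one cell is too little work to justify the cost of a task.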

7

Mechanizm identyfikacji i klasyfikacji treści [A mechanism for content identification and classification]

Niewiarowski A., Stanuszek M.

Studia Informatica

|

2013

|

Vol. 34, no. 2B

205--222

PL

This article describes a mechanism for identifying and classifying content, based on a term-weighting method that uses inverse document frequency, term frequency and the Levenshtein distance. The proposed mechanism was implemented in a program that analyses the topics and descriptions of diploma theses in order to select supervisors and reviewers automatically.

EN

This paper presents a mechanism for the identification and classification of content, based on a term-weighting method with inverse document frequency analysis and the Levenshtein distance technique. The proposed mechanism is applied to the analysis of topics and descriptions of selected diploma theses, for the automatic selection of supervisors and reviewers.

8

The use of methods of statistical analysis in signature recognition system based on Levenshtein distance

Pałys M., Doroz R., Porwik P.

Journal of Medical Informatics & Technologies

|

2012

|

Vol. 21

67--73

EN

The presented study continues previous work on adapting and using the Levenshtein method in a signature recognition process. Three methods based on the normalized Levenshtein measure were taken into consideration. The studies included an analysis and selection of appropriate signature features, on the basis of which the authenticity of a signature was later verified. Statistical tools were used to perform a comprehensive analysis. The χ² independence test was applied, which made it possible to determine the relationship between signature features and the error returned by the classifier.

9

Invariant Levenstein Distance for Comparison of Brownian Strings

Bartyzel K.

Journal of Applied Computer Science

|

2010

|

Vol. 18, no. 1

7--17

EN

The paper describes a new way of comparing Brownian contours. Brownian strings, one of the main symbolic chain structures, are an image segmentation technique in which a stochastically deformable contour fits around an object. In this technique, the chain code can be interpreted as a simple method for the linguistic description of the object. The Levenshtein distance, one of the known methods for comparing text strings, was generalised so that it is insensitive to rotation and mirror reflection of the compared shapes.

10

Signatures recognition method by using the normalized Levenshtein distances

Doroz R., Wróbel K., Porwik P.

Journal of Medical Informatics & Technologies

|

2009

|

Vol. 13

73--77

EN

This study examines the effectiveness of normalized Levenshtein metrics in the recognition of handwritten signatures. Three methods of normalizing the Levenshtein metric were taken into consideration. In addition, it was determined which signature features are most important when signatures are compared using this metric. The following signature features were examined: coordinates of signature points, pen pressure at successive points, and different types of pen speed. The influence of individual parameters of the Levenshtein algorithm on the obtained results was also determined, and the best method of normalization was selected.

11

Automatyczne sprawdzanie poprawności pisowni w języku polskim oparte na odległości Levenshteina [Automatic spellchecking in Polish based on the Levenshtein distance]

Dorosz K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2008

|

Vol. 12, iss. 1

29--40

PL

Commonly used spellchecking methods are based on the Levenshtein distance. To work, these methods require an inflectional dictionary of the language in which the checked words were written. Because these methods were originally created for English, they are not optimal for processing texts in Polish. This article presents characteristic features of the Polish language that affect the construction of a spellchecker, and proposes an adaptation of the Levenshtein distance method that takes these specific features into account. The new algorithm shows a qualitative improvement in correcting texts written in Polish.

EN

Today's widely used spellchecking methods are based on Levenshtein distance algorithms. An inflectional dictionary of the language is also needed in the spellchecking process. These methods are not optimal for spellchecking texts written in Polish, because they were invented for, and optimized for, English texts. This article describes the characteristics of the Polish language that affect spellchecking, and proposes a spellchecker implementation based on the Levenshtein distance that exploits these characteristics and improves the spellchecking of Polish texts.
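The abstract does not specify which features of Polish the adapted distance exploits. As one plausible illustration only, the sketch below assumes that substituting a letter for its diacritic variant (for example "z" and "ż") should cost less than an arbitrary substitution. The cost of 0.5, the pair table and the function names are all hypothetical choices, not the paper's algorithm.

```python
# Assumed pairs of base letters and their Polish diacritic variants.
DIACRITIC_PAIRS = {
    ("a", "ą"), ("c", "ć"), ("e", "ę"), ("l", "ł"), ("n", "ń"),
    ("o", "ó"), ("s", "ś"), ("z", "ź"), ("z", "ż"),
}


def substitution_cost(x: str, y: str) -> float:
    """0 for a match, 0.5 for a diacritic slip (assumed weight), 1 otherwise."""
    if x == y:
        return 0.0
    if (x, y) in DIACRITIC_PAIRS or (y, x) in DIACRITIC_PAIRS:
        return 0.5
    return 1.0


def weighted_levenshtein(a: str, b: str) -> float:
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,
                               current[j - 1] + 1.0,
                               previous[j - 1] + substitution_cost(ca, cb)))
        previous = current
    return previous[-1]


def suggest(word: str, dictionary: list[str], limit: int = 3) -> list[str]:
    """Dictionary entries closest to `word`, ties broken alphabetically."""
    ranked = sorted(dictionary, key=lambda entry: (weighted_levenshtein(word, entry), entry))
    return ranked[:limit]


if __name__ == "__main__":
    print(suggest("zolw", ["żółw", "wół", "zorza"]))
```

A linear scan over the dictionary is adequate for a demonstration; a real spellchecker working with a full inflectional dictionary would need an index or a pruned search to stay responsive.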