Wyniki wyszukiwania - BazTech

1

Time and Space Efficient Large Scale Link Discovery using String Similarities

Karampelas Andreas, Vouros George A.

Fundamenta Informaticae

|

2020

|

Vol. 172, nr 3

299--325

EN

This paper proposes and evaluates time and space efficient methods for matching entities in large data sets based on effectively pruning the candidate pairs to be matched, using edit distance as a string similarity metric. The paper proposes and compares three filtering methods that build on a basic blocking technique to organize the target data set, facilitating efficient pruning of dissimilar pairs. The proposed filtering methods are compared in terms of runtime and memory usage: the first method clusters entities and exploits the triangle inequality using the string similarity metric, in conjunction to the substring matching filtering rule. The second method uses only the substring matching rule, while the third method uses the substring matching rule in conjunction to the character frequency matching filtering rule. Evaluation results show the pruning power of the different filtering methods used, also in comparison to the string matching functionality provided in LIMES and SILK, which are state of the art frameworks for large scale link discovery.

2

Niewiarowski Artur

Computer Assisted Methods in Engineering and Science

|

2019

|

vol. 26, no. 3-4

163--175

EN

This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.

3

Alignment for rooted labeled caterpillars

Ukita Yoshiyuki, Yoshino Takuya, Hirata Kouichi

Annals of Computer Science and Information Systems

|

2019

|

Vol. 19

19--25

EN

A rooted labeled caterpillar (caterpillars, for short) is a rooted labeled tree transformed to a rooted path after removing all the leaves in it. In this paper, we design the algorithm to compute the alignment distance between caterpillars in O(h2λ3) time under the general cost function and in O(h2λ) time under the unit cost function, where h is the maximum height and λ is the maximum number of leaves in caterpillars.

4

Niewiarowski A.

Czasopismo Techniczne. Nauki Podstawowe

|

2016

|

Y. 113, iss. 1-NP

159--173

EN

This paper proposes a method of comparing the short texts using the Levenshtein distance algorithm and thesaurus for analysing terms enclosed in texts instead of popular methods exploiting the grammatical variations glossary. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. Based on the proposed new algorithm the similarity of such texts will be estimated. The described technique is compared with methods: Cosine distances, distance Dice and Jaccard distance constructed on the term frequency method. The proposition is competitive against well-known algorithms of stemming and lemmatization.

PL

Artykuł przedstawia propozycję metody porównywania krótkich fragmentów tekstów bazującą na algorytmie odległości Levenshteina i słowniku wyrazów bliskoznacznych. Porównywane teksty zawierają odmienione terminy oraz celowe błędy ortograficzne i gramatyczne. Opisany mechanizm zestawiony został z popularnymi metodami porównywania tekstów, takimi jak: odległości Kosinusowa, Dice’a i Jaccard’a, dla których wartości wektorów obliczane są metodą częstości terminów. Zastosowanie w mechanizmie słownika wyrazów bliskoznacznych jest alternatywą wobec znanych algorytmów określania rdzenia terminu i lematyzacji w analizie danych tekstowych.

5

Parallelization of the Levenshtein distance algorithm

Niewiarowski A., Stanuszek M.

Czasopismo Techniczne. Nauki Podstawowe

|

2014

|

R. 111, z. 3-NP

109--122

EN

This paper presents a method for the parallelization of the Levenshtein distance algorithm deployed on very large strings. The proposed approach was accomplished using .NET Framework 4.0 technology with a specific implementation of threads using the System. Threading.Task namespace library. The algorithms developed in this study were tested on a high performance machine using Xamarin Mono (for Linux RedHat/Fedora OS). The computational results demonstrate a high level of efficiency of the proposed parallelization procedure.

PL

Artykuł przedstawia metodę zrównoleglenia algorytmu analizy odległości edycyjnej Levenshteina dedykowaną bardzo dużym ciągom tekstowym. Zaproponowane rozwiązanie zostało zaimplementowane na platformie .NET Framework 4.0 z uwzględnieniem metod dostępnych w przestrzeni nazw System.Threading.Task. Zastosowane algorytmy przetestowano na komputerze wysokiej wydajności, w oparciu o narzędzia Xamarin Mono (dla SO Linux RedHat/ Fedora). Otrzymane wyniki pokazują znacząco zwiększoną wydajność obliczeń dla przedstawionych w artykule rozwiązań.

6

Application of bioinformatics methods to recognition of network threats

Kozakiewicz A., Felkner A., Kijowski P., Kruk T. J.

Journal of Telecommunications and Information Technology

|

2007

|

nr 4

23-27

EN

Bioinformatics is a large group of methods used in biology, mostly for analysis of gene sequences. The algorithms developed for this task have recently found a new application in network threat detection. This paper is an introduction to this area of research, presenting a survey of bioinformatics methods applied to this task, outlining the individual tasks and methods used to solve them. It is argued that the early conclusion that such methods are ineffective against polymorphic attacks is in fact too pessimistic.

7

Hirschberg's algorithm for approximate matching

Drozdek A.

Computer Science

|

2002

|

Vol. 4

91-100

EN

The Hirschberg algorithm was devised to solve the longest common subsequence problem. The paper discusses the way of adopting the algorithm to solve the string matching problem in linear space to determine edit distance for two strings and their alignment.

PL

Algorytm Hirschberga został podany w celu rozwiązania problemu najdłuższego wspólnego podciągu. Niniejszy artykuł prezentuje sposób zaadoptowania tego algorytmu do rozwiązania przy liniowych wymogach pamięciowych problemu wyszukiwania wzorca w celu znalezienia odległości edycyjnej dwóch tekstów i ich wyrównania.

8

A graph-edit algorithm for hand-drawn graphical document recognition and thier automatic introduction into CAD systems

Lladós J., Martí E.

Machine Graphics and Vision

|

1999

|

Vol. 8, No. 2

195-211

EN

In this work, a graph-based algorithm for symbol recognition in hand-drawn architectural plans has been described. The algorithm belongs to a prototype of man-machine interface consisting in the introduction of hand-drawndesigns to a CADsystem. Documentsand symbol prototypes are represented in terms of a Region Adjecency Graph (RAG) structure. Hence, the localization of symbol instaces in documents is performed by an error-tolerant subgraph isomorphism algorithm that looks for the minimum cost edit sequence that transforms a model graph to an input one. In this paper we describe this algorithm and the set of graph edit operations designed to transform RAGs. The main idea of the algorithm is to formulate the distance between two RAGs in terms of the string edit distance between the boundary strings of the corresponding regions. The main advantage of the algorithm is its ability to cope with distorted structuredand its invariance to rotation, translation and scaling.