Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników

Znaleziono wyników: 8

Liczba wyników na stronie
first rewind previous Strona / 1 next fast forward last
Wyniki wyszukiwania
Wyszukiwano:
w słowach kluczowych:  edit distance
help Sortuj według:

help Ogranicz wyniki do:
first rewind previous Strona / 1 next fast forward last
1
Content available remote Time and Space Efficient Large Scale Link Discovery using String Similarities
EN
This paper proposes and evaluates time and space efficient methods for matching entities in large data sets based on effectively pruning the candidate pairs to be matched, using edit distance as a string similarity metric. The paper proposes and compares three filtering methods that build on a basic blocking technique to organize the target data set, facilitating efficient pruning of dissimilar pairs. The proposed filtering methods are compared in terms of runtime and memory usage: the first method clusters entities and exploits the triangle inequality using the string similarity metric, in conjunction to the substring matching filtering rule. The second method uses only the substring matching rule, while the third method uses the substring matching rule in conjunction to the character frequency matching filtering rule. Evaluation results show the pruning power of the different filtering methods used, also in comparison to the string matching functionality provided in LIMES and SILK, which are state of the art frameworks for large scale link discovery.
2
Content available remote Similarity detection based on document matrix model and edit distance algorithm
EN
This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.
3
Content available remote Alignment for rooted labeled caterpillars
EN
A rooted labeled caterpillar (caterpillars, for short) is a rooted labeled tree transformed to a rooted path after removing all the leaves in it. In this paper, we design the algorithm to compute the alignment distance between caterpillars in O(h2λ3) time under the general cost function and in O(h2λ) time under the unit cost function, where h is the maximum height and λ is the maximum number of leaves in caterpillars.
4
Content available remote Short text similarity algorithm based on the edit distance and thesaurus
EN
This paper proposes a method of comparing the short texts using the Levenshtein distance algorithm and thesaurus for analysing terms enclosed in texts instead of popular methods exploiting the grammatical variations glossary. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. Based on the proposed new algorithm the similarity of such texts will be estimated. The described technique is compared with methods: Cosine distances, distance Dice and Jaccard distance constructed on the term frequency method. The proposition is competitive against well-known algorithms of stemming and lemmatization.
PL
Artykuł przedstawia propozycję metody porównywania krótkich fragmentów tekstów bazującą na algorytmie odległości Levenshteina i słowniku wyrazów bliskoznacznych. Porównywane teksty zawierają odmienione terminy oraz celowe błędy ortograficzne i gramatyczne. Opisany mechanizm zestawiony został z popularnymi metodami porównywania tekstów, takimi jak: odległości Kosinusowa, Dice’a i Jaccard’a, dla których wartości wektorów obliczane są metodą częstości terminów. Zastosowanie w mechanizmie słownika wyrazów bliskoznacznych jest alternatywą wobec znanych algorytmów określania rdzenia terminu i lematyzacji w analizie danych tekstowych.
5
Content available remote Parallelization of the Levenshtein distance algorithm
EN
This paper presents a method for the parallelization of the Levenshtein distance algorithm deployed on very large strings. The proposed approach was accomplished using .NET Framework 4.0 technology with a specific implementation of threads using the System. Threading.Task namespace library. The algorithms developed in this study were tested on a high performance machine using Xamarin Mono (for Linux RedHat/Fedora OS). The computational results demonstrate a high level of efficiency of the proposed parallelization procedure.
PL
Artykuł przedstawia metodę zrównoleglenia algorytmu analizy odległości edycyjnej Levenshteina dedykowaną bardzo dużym ciągom tekstowym. Zaproponowane rozwiązanie zostało zaimplementowane na platformie .NET Framework 4.0 z uwzględnieniem metod dostępnych w przestrzeni nazw System.Threading.Task. Zastosowane algorytmy przetestowano na komputerze wysokiej wydajności, w oparciu o narzędzia Xamarin Mono (dla SO Linux RedHat/ Fedora). Otrzymane wyniki pokazują znacząco zwiększoną wydajność obliczeń dla przedstawionych w artykule rozwiązań.
EN
Bioinformatics is a large group of methods used in biology, mostly for analysis of gene sequences. The algorithms developed for this task have recently found a new application in network threat detection. This paper is an introduction to this area of research, presenting a survey of bioinformatics methods applied to this task, outlining the individual tasks and methods used to solve them. It is argued that the early conclusion that such methods are ineffective against polymorphic attacks is in fact too pessimistic.
7
Content available remote Hirschberg's algorithm for approximate matching
EN
The Hirschberg algorithm was devised to solve the longest common subsequence problem. The paper discusses the way of adopting the algorithm to solve the string matching problem in linear space to determine edit distance for two strings and their alignment.
PL
Algorytm Hirschberga został podany w celu rozwiązania problemu najdłuższego wspólnego podciągu. Niniejszy artykuł prezentuje sposób zaadoptowania tego algorytmu do rozwiązania przy liniowych wymogach pamięciowych problemu wyszukiwania wzorca w celu znalezienia odległości edycyjnej dwóch tekstów i ich wyrównania.
EN
In this work, a graph-based algorithm for symbol recognition in hand-drawn architectural plans has been described. The algorithm belongs to a prototype of man-machine interface consisting in the introduction of hand-drawndesigns to a CADsystem. Documentsand symbol prototypes are represented in terms of a Region Adjecency Graph (RAG) structure. Hence, the localization of symbol instaces in documents is performed by an error-tolerant subgraph isomorphism algorithm that looks for the minimum cost edit sequence that transforms a model graph to an input one. In this paper we describe this algorithm and the set of graph edit operations designed to transform RAGs. The main idea of the algorithm is to formulate the distance between two RAGs in terms of the string edit distance between the boundary strings of the corresponding regions. The main advantage of the algorithm is its ability to cope with distorted structuredand its invariance to rotation, translation and scaling.
first rewind previous Strona / 1 next fast forward last
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.