Search results - BazTech

1

Applying a q-Gram based multiple string matching algorithm for approximate matching

Susik R.

Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska | 2017 | T. 7, nr 3 | 47-50

EN

We consider the application of a multiple pattern matching algorithm (Multi AOSO on q-Grams) to approximate pattern matching. We propose an on-line approach that translates the approximate pattern matching problem into a multiple pattern matching one (a technique called partitioning into exact search). The presented solution allows relatively fast searching for multiple patterns in a text with up to k differences (or mismatches). The paper compares the solution based on the MAG algorithm with the one from [4]. Experiments on DNA, English, Proteins and XML texts with up to k errors show that the newly proposed algorithm achieves relatively good results in practical use.

PL

Rozważamy zastosowanie algorytmu wyszukiwania wielu wzorców (Multi AOSO on q-Grams) do wyszukiwania przybliżonego. Proponujemy rozwiązanie on-line, upraszczające problem wyszukiwania przybliżonego do wyszukiwania wielu wzorców. Zaprezentowane rozwiązanie umożliwia relatywnie szybko wyszukiwać wiele wzorców dla odległości Levenshteina (lub Hamminga) z ograniczeniem do k. W artykule porównane jest rozwiązanie oparte na algorytmie MAG oraz [4]. Badania eksperymentalne przeprowadzone na zbiorach DNA, English, Proteins and XML z różnymi wartościami k wykazały, że zaproponowany algorytm osiąga relatywnie dobre wyniki w praktycznym zastosowaniu.
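The idea of partitioning into exact search mentioned in the abstract can be sketched as follows. This is only an illustration under simple assumptions, not the MAG-based method of the paper: the pattern is cut into k + 1 pieces (at least one of them must occur unchanged in any occurrence with at most k differences), the pieces are located with Python's built-in `str.find` instead of a multiple pattern matching algorithm, and each candidate area is verified with a plain dynamic-programming routine. The names `split_pattern`, `match_ends` and `partition_search` are hypothetical.

```python
def split_pattern(pattern, k):
    """Cut the pattern into k + 1 contiguous pieces of near-equal length."""
    parts = k + 1
    base, extra = divmod(len(pattern), parts)
    pieces, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        pieces.append((start, pattern[start:start + length]))
        start += length
    return pieces


def match_ends(pattern, text, k):
    """Return end positions in text of substrings within edit distance k
    of the pattern (column-wise dynamic programming, free start)."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        diag = col[0]  # value of the previous column, row 0 (always 0)
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch),  # match / substitution
                         old + 1,                        # extra text character
                         col[i - 1] + 1)                 # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def partition_search(pattern, text, k):
    """Filter with exact piece search, then verify only around the hits."""
    m, n = len(pattern), len(text)
    if m <= k:
        raise ValueError("pattern must be longer than k")
    ends = set()
    for offset, piece in split_pattern(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            # An occurrence containing this piece lies inside this window.
            lo = max(0, pos - offset - k)
            hi = min(n, pos - offset + m + k)
            ends.update(lo + e for e in match_ends(pattern, text[lo:hi], k))
            pos = text.find(piece, pos + 1)
    return sorted(ends)
```

The filter reports the same end positions as running `match_ends` over the whole text, but the quadratic verification is restricted to the neighbourhoods of exact piece occurrences.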

2

New algorithms for exact and approximate text matching

Grabowski Sz.

Zeszyty Naukowe. Rozprawy Naukowe / Politechnika Łódzka | 2009 | Z. 384 | 3-243

EN

The presented dissertation focuses on various exact and approximate matching problems for textual data. Text should be understood rather broadly, including natural language, molecular biology and music information retrieval data. The work consists of five chapters, each dedicated to a separate problem. In the order of presentation, they deal with exact string matching, approximate string matching, matching with gaps (which could be considered a subclass of approximate string matching problems, but the amount of material justifies a separate chapter), online compressed search and compressed full-text indexes. The author contributed to each of those research areas. Each chapter, however, starts with a presentation of the background to place the author's achievements in proper context. Many of the algorithms proposed in the dissertation are based on bit-parallelism, i.e., a modern technique of making use of individual bits in a CPU word. In particular, two new bit-parallel techniques have been presented, one for efficient matching in the average case, the other to reduce worst-case time complexities of bit-parallel algorithms making use of counters. Those are rather general techniques and they have been successfully applied to multiple known string matching problems. It has been shown that the problems of matching with gaps can be attacked from very different angles, and the arsenal of existing techniques in this area has been significantly expanded. The new results comprise the algorithmic techniques of bit-parallelism, sparse dynamic programming, compact bit-parallel NFA simulations and filtering. New algorithms are also presented for full-text searching in compressed data; they are both simple and efficient. Apart from theoretical analyses, most of the proposed algorithms have been experimentally validated on modern hardware and the achieved results usually place them among the most competitive ones.

PL

W rozprawie skupiono się na wybranych problemach wyszukiwania dokładnego i przybliżonego w tekście. Pojęcie tekstu winno być rozumiane szeroko, obejmując dane w językach naturalnych, sekwencje bioinformatyczne oraz zapisy utworów muzycznych (nutowe). Praca składa się z pięciu rozdziałów, z których każdy poświęcony jest osobnemu zagadnieniu. W kolejności, są to: wyszukiwanie dokładne, wyszukiwanie przybliżone, wyszukiwanie sekwencji z przerwami (ang. gaps), wyszukiwanie online w tekście skompresowanym oraz pełnotekstowe indeksy skompresowane. Rozprawa wnosi wkład w rozwój każdego z tych problemów. Każdy rozdział zaczyna się jednak od przedstawienia odnośnego stanu wiedzy. Wiele z zaproponowanych w pracy algorytmów wykorzystuje równoległość bitową, nowoczesną technikę obliczeń z wykorzystaniem poszczególnych bitów rejestru procesora. W szczególności, zaprezentowano dwie nowe techniki równoległości bitowej, jedną mającą na celu optymalizację przypadku średniego w wyszukiwaniu dokładnym, drugą redukującą złożoność w przypadku najgorszym w algorytmach wykorzystujących liczniki. Te dość ogólne techniki algorytmiczne zostały pomyślnie zastosowane w szeregu konkretnych znanych algorytmów wyszukiwania. W pracy pokazano, iż do wyszukiwania sekwencji z przerwami można stosować rozliczne podejścia, rozszerzając znacznie istniejący arsenał metod dla problemów tej kategorii. Nowe wyniki bazują m. in. na technikach algorytmicznych równoległości bitowej, programowania dynamicznego rzadkiego i oszczędnych bitowo-równoległych symulacjach automatów NFA. Przedstawiono również algorytmy wyszukiwania w danych skompresowanych, cechujące się zarówno wydajnością, jak i prostotą. Obok analiz teoretycznych, większość algorytmów zaimplementowano i poddano testom empirycznym, a osiągnięte wyniki zwykle pozwalają zaliczyć nowe metody do najefektywniejszych dla danych problemów.

3

On-line Approximate String Matching in Natural Language

Fredriksson K.

Fundamenta Informaticae | 2006 | Vol. 72, nr 4 | 453-466

EN

We consider approximate pattern matching in natural language text. We use the words of the text as the alphabet, instead of the characters as in traditional string matching approaches. Hence our pattern consists of a sequence of words. From the algorithmic point of view this has several advantages: (i) the number of words is much smaller than the number of characters, which in effect means a shorter text (fewer possible matching positions); (ii) the pattern effectively becomes shorter, so bit-parallel techniques become more applicable; (iii) the alphabet size becomes much larger, so the probability that two symbols (in this case, words) match is reduced. We extend several known approximate string matching algorithms for this scenario, allowing k_W insertions, deletions or substitutions of symbols (natural language words). We further extend the algorithms to allow k_C errors inside the pattern symbols (words) as well. The two error thresholds k_W and k_C can be applied simultaneously and independently. Hence we have in effect two alphabets, and perform approximate matching on both levels. From the application point of view the advantage is that the method is flexible, allowing simple solutions to problems that are hard to solve with traditional approaches. Finally, we extend the algorithms to handle multiple patterns at the same time. Depending on the search parameters, we obtain algorithms that run in linear or sublinear time and that perform the optimal number of word comparisons on average. We conclude with experimental results showing that the methods work well in practice.
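The two-level matching described in this abstract can be illustrated with a small sketch. It is a plain dynamic-programming version written for clarity, not one of the bit-parallel or sublinear algorithms of the paper. The assumptions are: words are obtained by whitespace splitting, the parameter `k_w` bounds the number of word-level insertions, deletions and substitutions, and two words count as equal when their character-level edit distance is at most `k_c`. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def word_level_search(pattern, text, k_w, k_c):
    """Return indices of text words at which an approximate occurrence
    of the word sequence `pattern` ends."""
    p_words, t_words = pattern.split(), text.split()
    m = len(p_words)
    col = list(range(m + 1))  # col[0] stays 0: an occurrence may start anywhere
    hits = []
    for j, word in enumerate(t_words):
        diag = col[0]
        for i in range(1, m + 1):
            # Symbols are words; they "match" if they differ by <= k_c characters.
            same = edit_distance(p_words[i - 1], word) <= k_c
            old = col[i]
            col[i] = min(diag + (0 if same else 1), old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k_w:
            hits.append(j)
    return hits
```

For example, with `k_w = 0` and `k_c = 2` the pattern "quick brown fox" is found in "the quikc brown fox jumps" despite the misspelled word, while with `k_w = 1` and `k_c = 0` it tolerates one extra word inside the phrase.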

4

Occurrence and Substring Heuristics for δ-Matching

Crochemore M., Iliopoulos C.S., Lecroq T., Pinzon Y.J., Plandowski W., Rytter W.

Fundamenta Informaticae | 2003 | Vol. 56, nr 1,2 | 1-21

EN

We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as the maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a-b|. We also consider (δ,γ)-matching, where γ is a bound on the total sum of the differences. We first consider "occurrence heuristics" by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider "substring heuristics". We present δ-matching algorithms that are fast on average provided that the pattern is "non-flat" and the alphabet interval is large. The pattern is "flat" if its structure does not vary substantially. The algorithms, named δ-BM1, δ-BM2 and δ-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. This is the first paper on the subject; previously only "occurrence heuristics" have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use δ-versions of suffix tries and subword graphs. Surprisingly, in the context of δ-matching subword graphs appear to be superior to compact suffix trees.
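The two matching conditions defined in this abstract (a bound on the maximum difference between aligned symbols, and optionally a bound on the total sum of differences) can be checked directly. The sketch below is a naive quadratic-time checker meant only to make the definitions concrete; it does not implement the occurrence or substring heuristics of the paper. The name `delta_gamma_matches` and the parameters `delta` and `gamma` are chosen for this illustration.

```python
def delta_gamma_matches(pattern, text, delta, gamma=None):
    """Return start positions where every aligned pair of integer symbols
    differs by at most `delta` and, if `gamma` is given, the differences
    sum to at most `gamma`."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    hits = []
    for start in range(len(text) - m + 1):
        diffs = [abs(p - t) for p, t in zip(pattern, text[start:start + m])]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            hits.append(start)
    return hits
```

With pitches encoded as integers, `delta_gamma_matches([60, 62, 64], [59, 62, 65, 61, 63, 64, 70], 1)` accepts the windows starting at positions 0 and 3; adding `gamma=1` rejects both, because each of them has a total difference of 2.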

5

Fast Multipattern Search Algorithms for Intrusion Detection

Kuri J., Navarro G., Mé L.

Fundamenta Informaticae | 2003 | Vol. 56, nr 1,2 | 23-49

EN

We present new search algorithms to detect the occurrences of any pattern from a given pattern set in a text, allowing in the occurrences a limited number of spurious text characters among those of the pattern. This is a common requirement in intrusion detection applications. Our algorithms exploit the ability to represent the search state of one or more patterns in the bits of a single machine word and update all the search states in a single operation. We show analytically and experimentally that the algorithms are capable of fast searching for large sets of patterns allowing a large number of spurious characters, yielding on our machine about a 75-fold improvement over the classical dynamic programming algorithm.
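The idea of keeping the search state in the bits of a machine word can be sketched with a Shift-And style automaton that tolerates spurious text characters between pattern characters. This is a simplified illustration, not the algorithms of the paper: it handles a single pattern rather than a pattern set, and a Python integer stands in for the machine word. The function name and the row layout (one bit vector per number of spurious characters used so far) are assumptions of this sketch.

```python
def search_with_insertions(pattern, text, k):
    """Return end positions in text of occurrences of pattern that contain
    at most k spurious text characters between pattern characters."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    final = 1 << (m - 1)
    # rows[i]: bit j set means pattern[0..j] is matched using <= i spurious chars.
    rows = [0] * (k + 1)
    hits = []
    for pos, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = 0  # previous value of the row above (one fewer spurious char)
        for i in range(k + 1):
            old = rows[i]
            advance = ((old << 1) | 1) & b  # text char extends a matched prefix
            skip = prev_old & ~final        # text char is spurious; prefix kept
            rows[i] = advance | skip
            prev_old = old
        if any(r & final for r in rows):
            hits.append(pos)
    return hits
```

Each text character updates all prefix states of a row with a shift, an OR and an AND, which is the source of the speed-up over cell-by-cell dynamic programming. The final bit is excluded from the skip transition so that an occurrence always ends with the last pattern character.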

6

Better Filtering with Gapped q-Grams

Burkhardt S., Kärkkäinen J.

Fundamenta Informaticae | 2003 | Vol. 56, nr 1,2 | 51-70

EN

A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
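A gapped q-gram filter for the k mismatches problem can be made concrete with a small sketch. The assumptions here are purely illustrative: a shape is written as a string such as "##-#", where "#" marks a compared position and "-" a gap; the threshold is found by brute force over all placements of k mismatches rather than with the efficient methods of the paper; and q-grams are compared position by position between the pattern and an equally long text window. All function names are hypothetical.

```python
from itertools import combinations


def shape_offsets(shape):
    """Offsets of the compared positions, e.g. '##-#' -> [0, 1, 3]."""
    return [i for i, c in enumerate(shape) if c == "#"]


def gapped_qgrams(s, shape):
    """All gapped q-grams of s for the given shape, in order of position."""
    offs = shape_offsets(shape)
    return [tuple(s[i + o] for o in offs) for i in range(len(s) - len(shape) + 1)]


def threshold(shape, m, k):
    """Minimum number of q-gram positions left untouched by any k mismatches
    in a pattern of length m (brute force, small inputs only)."""
    offs = shape_offsets(shape)
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for bad in combinations(range(m), k):
        bad = set(bad)
        surviving = sum(all(i + o not in bad for o in offs) for i in starts)
        best = min(best, surviving)
    return best


def passes_filter(pattern, window, shape, k):
    """True if the window shares at least `threshold` aligned q-grams with
    the pattern; windows within Hamming distance k always pass."""
    shared = sum(a == b for a, b in zip(gapped_qgrams(pattern, shape),
                                        gapped_qgrams(window, shape)))
    return shared >= threshold(shape, len(pattern), k)
```

For a contiguous shape of length q the brute-force threshold agrees with the classical bound m - q + 1 - kq whenever that bound is positive, for instance `threshold("###", 13, 3)` gives 2. A filter with threshold 0 accepts every window and is therefore useless, which is why the choice of shape matters.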