Wyniki wyszukiwania - BazTech

1

Time and Space Efficient Large Scale Link Discovery using String Similarities

Karampelas Andreas, Vouros George A.

Fundamenta Informaticae

|

2020

|

Vol. 172, nr 3

299--325

EN

This paper proposes and evaluates time and space efficient methods for matching entities in large data sets based on effectively pruning the candidate pairs to be matched, using edit distance as a string similarity metric. The paper proposes and compares three filtering methods that build on a basic blocking technique to organize the target data set, facilitating efficient pruning of dissimilar pairs. The proposed filtering methods are compared in terms of runtime and memory usage: the first method clusters entities and exploits the triangle inequality using the string similarity metric, in conjunction to the substring matching filtering rule. The second method uses only the substring matching rule, while the third method uses the substring matching rule in conjunction to the character frequency matching filtering rule. Evaluation results show the pruning power of the different filtering methods used, also in comparison to the string matching functionality provided in LIMES and SILK, which are state of the art frameworks for large scale link discovery.

2

Object oriented representation and processing of genomic data in relational database

Pelikant A.

Studia Informatica

|

2017

|

Vol. 38, nr 4

41--55

EN

The article shows a method of storing genomic data as the object in the relational database server. It presents a method of source data migration which is stored in the weakly determined text files on ftp servers. What is more, it describes the formal structure of the Common Language Runtime (CLR) class used to define user data type. Implementations of compulsory and optional methods are also presented. Furthermore, the paper shows a set of implemented matching algorithms and methods of using them to build adherence matrix. Finally, the paper – presents some efficiency tests which prove the advantages of the proposed algorithms.

PL

Artykuł prezentuje sposób zapisu danych opisujących genom w postaci obiektu składowanego w relacyjnym serwerze baz danych. Pokazano metodę migracji danych wejściowych, składowanych w postaci słabo zdeterminowanych plików tekstowych, a dostępnych na serwerach ftp. Opisano formalną konstrukcję obiektu z zastosowaniem CLR oraz implementacje metod obligatoryjnych i fakultatywnych. Przedstawiono oprogramowane algorytmy dopasowania oraz omówiono sposób ich wykorzystania do budowy macierzy przystawania. Artykuł zawiera wyniki kilku testów wydajnościowych potwierdzających zalety proponowanych metod.

3

Compact and hash based variants of the suffix array

Grabowski S., Raniszewski M.

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2017

|

Vol. 65, nr 4

407--418

EN

Full-text indexing aims at building a data structure over a given text capable of efficiently finding arbitrary text patterns, and possibly requiring little space. We propose two suffix array inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches due to significantly narrowed search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to Mäkinen’s compact suffix array (MakCSA), but working on fixed size blocks. Experiments on the widely used Pizza & Chili datasets show that SA-hash is about 2–3 times faster in pattern searches (counts) than the standard suffix array, for the price of requiring 0.2n–1.1n bytes of extra space, where n is the text length. FBCSA, in one of the presented variants, reduces the suffix array size by a factor of about 1.5–2, while it gets close in search times, winning in speed with its competitors known from the literature, MakCSA and LCSA.

4

Software tools to measure the duplication of information

Kostrzewski M.

Journal of Applied Computer Science

|

2014

|

Vol. 22, nr 1

61--71

EN

Data stored in average computer system usually is not unique, portions of stored data are duplicated. When duplicated data are stored in separate ﬁles containing source code of computer program of student homework, a possibility of cheating should be seriously considered. This paper presents software tools built, in order to detect re-use of pieces of code in supplied text ﬁles. Three aspects of information matching are considered: identity, similarity, and analogy. Built tools have proved useful in real life situations.

5

Wyszukiwanie dokładne i przybliżone ciągów tekstowych z wykorzystaniem transformaty Burrowsa Wheelera

Kajzer A., Pokrzywa R.

Studia Informatica

|

2012

|

Vol. 33, nr 1

77-89

PL

Niniejszy artykuł przedstawia sposoby rozwiązywania problemu wyszukiwania dokładnego i przybliżonego ciągów tekstowych z wykorzystaniem transformaty Burrowsa-Wheelera. Omówione zostało także działanie skompresowanych indeksów pełnotekstowych, dzięki którym algorytmy wyszukiwania tekstu moŹgą działać w ograniczonej przestrzeni. Głównymi strukturami danych, które zostały opisane w tekście, są Wavelet Tree oraz Bi-directional Burrows-Wheeler Transform.

EN

This paper describes how the problem of exact and approximate string matching can be resolved using Burrows-Wheeler Transform. The paper considers how compressed full-text indexes work and allow string matching algorithms to operate in a limited space. Two main data structures presented in the text are Wavelet Tree and Bi-directional Burrows Wheeler Transform.

6

Simple Random Access Compression

Fredriksson K., Nikitin F.

Fundamenta Informaticae

|

2009

|

Vol. 92, nr 1-2

63-81

EN

Given a sequence S of n symbols over some alphabet Σ of size σ, we develop new compression methods that are (i) very simple to implement; (ii) provide O(1) time random access to any symbol (or short substring) of the original sequence. Our simplest solution uses at most 2h+o(h) bits of space, where h = n(H_o(S)+1), and H_o(S) is the zeroth-order empirical entropy of S. We discuss a number of improvements and trade-offs over the basic method. For example, we can achieve n(H_k(S)+1)+o(n(H_k(S)+1)) bits of space, for k = o(logσ(n)). Several applications are discussed, including text compression, (compressed) full-text indexing and string matching.

7

External Memory Algorithms for String Problems

Roh K., Crochemore M., Iliopoulos C.S., Park K.

Fundamenta Informaticae

|

2008

|

Vol. 84, nr 1

17-32

EN

In this paper we present external memory algorithms for some string problems. External memory algorithms have been developed in many research areas, as the speed gap between fast internal memory and slow external memory continues to grow. The goal of external memory algorithms is to minimize the number of input/output operations between internal memory and external memory. These years the sizes of strings such as DNA sequences are rapidly increasing. However, external memory algorithms have been developed for only a few string problems. In this paper we consider five string problems and present external memory algorithms for them. They are the problems of finding the maximum suffix, string matching, period finding, Lyndon decomposition, and finding the minimum of a circular string. Every algorithm that we present here runs in a linear number of I/Os in the external memory model with one disk, and they run in an optimal number of disk I/Os in the external memory model with multiple disks.

8

Fast algorithm for the constrained longest common subsequence problem

Deorowicz S.

Theoretical and Applied Informatics

|

2007

|

Vol. 19, No. 2

91-102

EN

The problem of finding the constrained longest common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. Its goal is to find the longest subsequence of A and B such that P is a subsequence of it. The best known algorithms for its solving require time of order of a product of the sequence lengths. We introduce a novel approach in which time and space complexities depend also on the number of matches between A, B, and P. The time complexity is better for typical parameter settings and never worse.

PL

Problem wyznaczania najdłuższego wspólnego podciągu ciągów A i B, którego podciągiem jest podciąg P (CLCS) pojawił się˛ literaturze stosunkowo niedawno. Złożoności obliczeniowe najlepszych znanych algorytmów są˛ iloczynem długości wszystkich ciągów. W niniejszej pracy przedstawione jest nowy algorytm rozwiązywania tego problemu, dla którego złożoność´ obliczeniowa zależy od liczby dopasować pomiędzy ciągami A, B i P. Złożoność´ ta jest zwykle lepsza (a nigdy gorsza) niż najlepszych znanych do tej pory algorytmów.

9

Zastosowanie układów reprogramowalnych do realizacji wydajnego systemu przeciwdziałania włamaniom

Botowicz J., Wołowiec M., Buciak P., Sapiecha P.

Przegląd Telekomunikacyjny + Wiadomości Telekomunikacyjne

|

2006

|

nr 5

152-155

PL

Przedstawiono architekturę systemów przeciwdziałania włamaniom oraz zaprezentowano wyniki implementacji komponentów takiego systemu z użyciem układów reprogramowalnych. Przedstawione komponenty to: moduł wykrywania wzorców ataku oraz moduł śledzenia tras ataków sieciowych.

EN

In this article common architecture for Intrusion Prevention Systems has been presented. Using this architecture, two sample components: pattern matching module and IP traceback module were described.

10

Efficient String Matching in Huffman Compressed Texts

Fredriksson K., Tarhio J.

Fundamenta Informaticae

|

2004

|

Vol. 63, nr 1

1--16

EN

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n[(log2σ)/ b]) time, where n is the size of the compressed text in bytes, s is the size of the alphabet, and b is a user specified parameter. The method uses a variable size super-alphabet, with an average size of O([b/( H log2σ)]) characters, where H is the entropy of the text. Each super-character is processed in O(1) time. The algorithm uses O(2b) space and O(b2b) preprocessing time. The method can be easily augmented by auxiliary functions, which can e.g. decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n [(log2σ)/ Hω]), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time O(n[(log2 s)/ b]+t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n [(log2σ)/ b] é (m+s-1)/ω] u`+t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable length moves, i.e. it processes variable number of states at each step. The shift-or algorithm makes variable length shifts, effectively also processing variable number of states at each step. The number of states processed in O(1) time is O([b/( H log2σ)]). The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.

11

Compact Suffix Array - A Space-Efficient Full-Text Index

Mäkinen V.

Fundamenta Informaticae

|

2003

|

Vol. 56, nr 1,2

191-210

EN

Suffix array is a widely used full-text index that allows fast searches on the text. It is constructed by sorting all suffixes of the text in the lexicographic order and storing pointers to the suffixes in this order. Binary search is used for fast searches on the suffix array. Compact suffix array is a compressed form of the suffix array that still allows binary searches, but the search times are also dependent on the compression. In this paper, we give efficient methods for constructing and querying compact suffix arrays. We also study practical issues, such as the trade off between compression and search times, and show how to reduce the space requirement of the construction. Experimental results are provided in comparison with other search methods. With a large text corpora, the index took 1.6 times the size of the text, while the searches were only two times slower than from a suffix array.