A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least mmin times (mmin
2
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
In this work, we consider a special type of uncertain sequence called weighted string. In a weighted string every position contains a subset of the alphabet and every letter of the alphabet is associated with a probability of occurrence such that the sum of probabilities at each position equals 1. Usually a cumulative weight threshold 1/z is specified, and one considers only strings that match the weighted string with probability at least 1/z. We provide an O(nz)-time and O(nz)-space off-line algorithm, where n is the length of the weighted string and 1/z is the given threshold, to compute a smallest maximal palindromic factorization of a weighted string. This factorization has applications in hairpin structure prediction in a set of closely-related DNA or RNA sequences. Along the way, we provide an O(nz)-time and O(nz)-space off-line algorithm to compute maximal palindromes in weighted strings. Finally, we provide an experiment of our proposed algorithm.
3
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
In this paper we present external memory algorithms for some string problems. External memory algorithms have been developed in many research areas, as the speed gap between fast internal memory and slow external memory continues to grow. The goal of external memory algorithms is to minimize the number of input/output operations between internal memory and external memory. These years the sizes of strings such as DNA sequences are rapidly increasing. However, external memory algorithms have been developed for only a few string problems. In this paper we consider five string problems and present external memory algorithms for them. They are the problems of finding the maximum suffix, string matching, period finding, Lyndon decomposition, and finding the minimum of a circular string. Every algorithm that we present here runs in a linear number of I/Os in the external memory model with one disk, and they run in an optimal number of disk I/Os in the external memory model with multiple disks.
4
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
We introduce the notion of [lambda]-regularities in strings that consist of [lambda]-covers and [lambda]-seeds, and study three [lambda]-regularities problems- the [lambda]-cover problem, the general [lambda]-cover problem and the [lambda]-seed problem in this paper. [lambda]-regularities can be viewed as generalized string regularities in the sense that a set of [lambda] repetitive strings rather than a single repeated string are considered. We first present a general algorithm for computing all the [lambda]-combinations of a given string, since they serve as candidates for both [lambda]-covers and [lambda]-seeds. The running time of this algorithm is O(n2). Relying on this result, we answer the above mentioned three problems all in O(n2) time.
5
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings. The first algorithm is an improvement of Hirschberg's divide-and-conquer algorithm. The second algorithm is an improvement of Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operations and are shown to work very efficiently in practice.
6
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
In this paper, we study the pattern matching problem in given intervals. Depending on whether the intervals are given a priori for pre-processing, or during the query along with the pattern or, even in both the cases, we develop efficient solutions for different variants of this problem. In particular, we present efficient indexing schemes for each of the above variants of the problem.
7
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
The paper aims at tight upper bounds on the size of pattern classification circuits that can be used for a priori parameter settings in a machine learning context. The upper bounds relate the circuit size S(C) to n_L := .log_2mL., where mL is the number of training samples. In particular, we show that there exist unbounded fan-in threshold circuits with less than (a) [formula] gates for unbounded depth, (b) SL [formula] gates for small bounded depth, where in both cases all mL samples are classified correctly. We note that the upper bounds do not depend on the length n of input (sample) vectors. Since n_L << n in real-world problem settings, the upper bounds return values that are suitable for practical applications. We provide experimental evidence that the circuit size estimations work well on a number of pattern classification tasks. As a result, we formulate the conjecture that [formula] gates are sufficient to achieve a high generalization rate of bounded-depth classification circuits.
8
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
In this paper we introduce the Weighted Suffix Tree, an efficient data structure for computing string regularities in weighted sequences of molecular data. Molecular Weighted Sequences can model important biological processes such as the DNA Assembly Process or the DNA-Protein Binding Process. Thus pattern matching or identification of repeated patterns, in biological weighted sequences is a very important procedure in the translation of gene expression and regulation. We present time and space efficient algorithms for constructing the weighted suffix tree and some applications of the proposed data structure to problems taken from the Molecular Biology area such as pattern matching, repeats discovery, discovery of the longest common subsequence of two weighted sequences and computation of covers.
9
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
We consider a version of pattern matching useful in processing large musical data: d-matching, which consists in finding matches which are d-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a-b|. We also consider (d,g)-matching, where g is a bound on the total sum of the differences. We first consider ``occurrence heuristics'' by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider ``substring heuristics''. We present d-matching algorithms fast on the average providing that the pattern is ``non-flat'' and the alphabet interval is large. The pattern is ``flat'' if its structure does not vary substantially. The algorithms, named d-BM1, d-BM2 and d-BM3 can be thought as members of the generalized Boyer-Moore family of algorithms. The algorithms are fast on average. This is the first paper on the subject, previously only ``occurrence heuristics'' have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use d-versions of suffix tries and subword graphs. Surprisingly, in the context of d-matching subword graphs appear to be superior compared with compact suffix trees.
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.