A New Method for Symbolic Sequences Analysis. An Application to Long Sequences

Kozarzewski, B.

doi:10.12921/cmst.2014.20.03.93-100

Article - details

Article title

A New Method for Symbolic Sequences Analysis. An Application to Long Sequences

Authors

Kozarzewski B.

Selected full texts from this journal

http://cmst.eu/

Identifiers

DOI

10.12921/cmst.2014.20.03.93-100

Abstracts

A method is proposed for decomposing a symbolic sequence into a set of consecutive, distinct, non-overlapping strings (words) of various lengths. Representing the sequence as a set of words allows set-theoretic notions to be used. The main result is a new definition of similarity between any two sequences over a given alphabet; no prior sequence alignment is necessary. Two applications of the set-of-words representation are described. In the first, the similarity measure is used to prepare centroids for the K-means algorithm, which yields a high-performance grouping method for long DNA sequences. The second concerns the statistical analysis of word attributes. It is shown that the similarity, complexity, and correlation function of word attributes across sequences of digits of the fractional parts of some irrational numbers support the suggestion that these sequences are instances of a random sequence of decimal digits.
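
The abstract does not spell out the decomposition rule or the similarity formula, so the following Python fragment is only an illustrative sketch of the general idea, not the author's method. It assumes a Lempel-Ziv-style incremental parsing (each word is the shortest string not yet seen) and a Dice-type set coefficient, both suggested by the cited references but not confirmed by this record. The names decompose and dice_similarity are hypothetical.

```python
def decompose(sequence):
    """Split a sequence into consecutive, distinct, non-overlapping words.

    Assumption: Lempel-Ziv-style incremental parsing, where each new word is
    the shortest string that has not appeared as a word before. The paper's
    actual rule may differ.
    """
    words = []
    seen = set()
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ""
    # A trailing fragment that repeats an earlier word is dropped,
    # so every word in the result stays distinct.
    return words


def dice_similarity(seq_a, seq_b):
    """Dice-type coefficient between the word sets of two sequences.

    Assumption: similarity = 2 * |A & B| / (|A| + |B|); this stands in for
    the paper's own definition, which is not given in the abstract.
    """
    set_a = set(decompose(seq_a))
    set_b = set(decompose(seq_b))
    if not set_a and not set_b:
        return 1.0  # two empty sequences are treated as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


if __name__ == "__main__":
    print(decompose("ACGTACGT"))
    print(dice_similarity("ACGT", "ACGTACGT"))
```

Because both sequences are reduced to sets of words, the comparison needs no alignment: sequences of different lengths can be compared directly, and the resulting values could serve as input to a clustering step such as K-means.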

Keywords

similarity and distance measures; clustering; DNA sequences; irrational numbers

Publisher

Institute of Bioorganic Chemistry Scientific Publishers OWN, Polish Academy of Sciences

Journal

Computational Methods in Science and Technology

Year

2014

Volume

Vol. 20, No. 3

Pages

93-100

Physical description

Bibliography: 16 items, figures.

Contributors

author

Kozarzewski B.

bkozarzewski@wsiz.rzeszow.pl

University of Information Technology and Management, ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland

Bibliography

[1] A. Lempel, J. Ziv, On the complexity of finite sequences, IEEE Trans. Inform. Theory 22, 75-81 (1976).
[2] D.-G. Ke, Q.-Y. Tong, Easily adaptable complexity measure for finite time series, Phys. Rev. E 77, 066215 (2008).
[3] B. Kozarzewski, A method for nucleotide sequences analysis, CMST 18 (1), 5-10 (2012).
[4] M.-S. Yang, K.-L. Wu, A Similarity-Based Robust Clustering Method, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (4), 434-448 (2004).
[5] Y. Liu, The Numerical Characterization and Similarity Analysis of DNA Primary Sequences, Internet Electronic Journal of Molecular Design 1, 675-684 (2002).
[6] J. Wen, C. Li, Similarity analysis of DNA sequences based on the LZ complexity, Internet Electronic Journal of Molecular Design 6, 1-12 (2007).
[7] A. Kelil, S. Wang, Q. Jiang, R. Brzezinski, A general measure of similarity for categorical sequences, Knowl. Inf. Syst. 24, 197-220 (2010), DOI 10.1007/s10115-009-0237-8.
[8] S. Kumar, A. Filipski, Multiple sequence alignment: In pursuit of homologous DNA positions, Genome Research 17, 127-135 (2007).
[9] S. Vinga, J. Almeida, Alignment-free sequence comparison - a review, Bioinformatics 19, 513-523 (2003).
[10] L.R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology 26 (3), 297-302 (1945).
[11] T. Kanungo, N.S. Netanyahu, A.Y. Wu, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. Pattern Analysis and Machine Intelligence 24 (7), 881-892 (2002).
[12] B. Kozarzewski, A representative set method for symbolic sequence clustering, CMST 19 (2), 35-47 (2013).
[13] G.P. Dresden, Three Transcendental Numbers from the Last Non-Zero Digits of n^n, F_n, and n!, Mathematics Magazine 81 (2), 96 (2007).
[14] D. Bailey, P. Borwein, S. Plouffe, On the rapid computation of various polylogarithmic constants, Mathematics of Computation 66 (218), 903-913 (1997).
[15] D.H. Bailey, A Compendium of BBP-type Formulas for Mathematical Constants, http://crd-legacy.lbl.gov/~dhbailey/dhbpapers/bbp-formulas.pdf (2013).
[16] R. Nemiroff, J. Bonnell, http://apod.nasa.gov/htmltest/rjn_dig.html, http://apod.nasa.gov/htmltest/gifcity/e.2mil, http://apod.nasa.gov/htmltest/gifcity/sqrt2.2mil.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-e36cd84b-0fd2-4a06-9374-0750a053f39a