Article title

Numerical Representation of Symbolic Data

Authors
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
A method for the direct numerical representation of symbolic data is proposed. The method starts by parsing a sequence into an ordered set (spectrum) of distinct, non-overlapping short strings of symbols (words). Next, the word spectrum is mapped onto a vector of binary components in a high-dimensional linear space. The numerical representation allows some arithmetic operations on symbolic data, among them a meaningful average spectrum of two sequences. As a test, the new numerical representation is used to build centroid vectors for the k-means clustering algorithm, where it significantly improves clustering quality. Its advantage over the conventional approach is a high rate of correct clustering for several kinds of real character sequences, such as novels, DNA, and proteins.
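The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the parsing rule, the vector layout, or the averaging formula, so everything here is an assumption made for illustration only: the parser uses an LZ78-style rule (each word is the shortest string not yet in the spectrum), the binary vector is indexed by the sorted union of the two spectra, and the average is a plain componentwise mean. The function names `parse_spectrum`, `binary_vector`, and `average_vector` are hypothetical and do not come from the paper.

```python
def parse_spectrum(sequence):
    """Parse a sequence into distinct, non-overlapping words (its spectrum).

    Assumption: an LZ78-style rule, where each new word is the shortest
    string, read from the current position, that is not yet in the spectrum.
    """
    spectrum = []
    seen = set()
    word = ""
    for symbol in sequence:
        word += symbol
        if word not in seen:
            seen.add(word)
            spectrum.append(word)
            word = ""
    # A trailing fragment that repeats an earlier word is dropped,
    # so every word in the spectrum stays distinct.
    return spectrum


def binary_vector(spectrum, vocabulary):
    """Map a spectrum onto a 0/1 vector with one component per vocabulary word."""
    present = set(spectrum)
    return [1 if word in present else 0 for word in vocabulary]


def average_vector(u, v):
    """Componentwise mean of two vectors (an assumed form of the 'average spectrum')."""
    return [(a + b) / 2 for a, b in zip(u, v)]


spec_a = parse_spectrum("ACGTACGGA")
spec_b = parse_spectrum("ACCGTTACG")

# Assumption: the vector space is spanned by the sorted union of both spectra.
vocabulary = sorted(set(spec_a) | set(spec_b))

vec_a = binary_vector(spec_a, vocabulary)
vec_b = binary_vector(spec_b, vocabulary)
centroid = average_vector(vec_a, vec_b)

print(spec_a)      # ['A', 'C', 'G', 'T', 'AC', 'GG']
print(spec_b)      # ['A', 'C', 'CG', 'T', 'TA']
print(vocabulary)  # ['A', 'AC', 'C', 'CG', 'G', 'GG', 'T', 'TA']
print(centroid)    # [1.0, 0.5, 1.0, 0.5, 0.5, 0.5, 1.0, 0.5]
```

In this sketch a component of 1.0 marks a word shared by both sequences and 0.5 marks a word found in only one of them. A vector of this kind could serve as a k-means centroid, although the paper's actual construction may differ.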
Contributors
  • University of Information Technology and Management ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
Bibliography
  • [1] C. Notredame, Recent progress in multiple sequence alignment: a survey, Pharmacogenomics 3(1), 131-144 (2002).
  • [2] M. Randić, M. Vračko, On the similarity of DNA primary sequences, Journal of Chemical Information and Computer Sciences 40, 599-606 (2000).
  • [3] S. Vinga and J. Almeida, Alignment-free sequence comparison – a review, Bioinformatics, 19(4), 513-523 (2003).
  • [4] A. Kelil, S. Wang, Q. Jiang, R. Brzezinski, A general measure of similarity for categorical sequences, Knowl. Inf. Syst. 24, 197-220 (2010), DOI: 10.1007/s10115-009-0237-8.
  • [5] B. Kozarzewski, A method for nucleotide sequences analysis, CMST 18(1), 5-10 (2012).
  • [6] T. Kanungo, N.S. Netanyahu, A.Y. Wu, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. Pattern Analysis and Machine Intelligence 24(7), 881-892 (2002).
  • [7] B. Kozarzewski, A New Method for Symbolic Sequences Analysis, CMST 20(3), 93-100 (2014), DOI: 10.12921/cmst.2014.20.03.93-100.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-828dd289-8396-4d53-a400-4caad8bd94da