Article title

Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions

Publication languages
EN
Abstracts
EN
The analysis of protein coding regions of DNA sequences is one of the most fundamental applications in bioinformatics. A number of model-independent approaches have been developed for differentiating between the protein-coding and non-protein-coding regions of DNA. However, these methods are often based on univariate analysis algorithms, which leads to the loss of joint information among the four nucleotides of DNA. In this article, we introduce a method based on the noise-assisted multivariate empirical mode decomposition (NA-MEMD) and the modified Gabor-wavelet transform (MGWT). The NA-MEMD algorithm, as a multivariate analysis tool, is used to reconstruct the numerical sequence under analysis, since it enables a matched-scale decomposition across all variables and eliminates mode mixing. By virtue of NA-MEMD, the MGWT method achieves a stable improvement in overall identification performance. We compare our method with other Digital Signal Processing (DSP) methods on two representative DNA sequences and three benchmark datasets. The results reveal that our method enhances the spectra of the analyzed sequences and improves the robustness of MGWT across different DNA sequences, thus achieving higher identification accuracy for protein coding regions than the other methods applied. In addition, a further comparative experiment with a model-dependent method (AUGUSTUS) on the recently proposed benchmark dataset G3PO confirms the superiority of model-independent methods (especially NA-MEMD-MGWT) for identifying coding regions in poor-quality DNA sequences.
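As a rough illustration of the spectral step only, the following minimal Python sketch maps a DNA string to four binary indicator sequences (one per nucleotide) and sums, over nucleotides and a few scales, the squared magnitude of a Gabor-wavelet response whose frequency is fixed at the period-3 component. The indicator mapping, the scale values, the 1/a normalization, and the function names are assumptions made for illustration and are not taken from the article; the NA-MEMD reconstruction step is not reproduced here.

```python
import numpy as np


def indicator_sequences(seq):
    """Map a DNA string to four binary indicator sequences (A, C, G, T).

    Assumed mapping for illustration; the article may use a different one.
    """
    seq = seq.upper()
    return {b: np.array([1.0 if ch == b else 0.0 for ch in seq]) for b in "ACGT"}


def gabor_period3_spectrum(seq, scales=(10, 20, 40)):
    """Position-wise period-3 energy from a fixed-frequency Gabor wavelet.

    For each scale a, the kernel is exp(-t^2 / (2 a^2)) * exp(j * 2*pi/3 * t) / a.
    Only the Gaussian width changes with a; the frequency stays at 2*pi/3.
    The scale values are hypothetical defaults, not the article's settings.
    """
    n = len(seq)
    w0 = 2.0 * np.pi / 3.0
    spectrum = np.zeros(n)
    for x in indicator_sequences(seq).values():
        for a in scales:
            half = int(4 * a)  # truncate the Gaussian at about 4 standard deviations
            t = np.arange(-half, half + 1)
            psi = np.exp(-t**2 / (2.0 * a**2)) * np.exp(1j * w0 * t) / a
            # Full convolution, then keep the n samples aligned with the input.
            full = np.convolve(x, psi, mode="full")
            coeff = full[half:half + n]
            spectrum += np.abs(coeff) ** 2
    return spectrum
```

Peaks in the returned array mark positions with strong period-3 content, the property that DSP-based methods of this kind exploit to separate coding from non-coding regions. A threshold on the array would then give candidate coding regions, and the choice of threshold is left open here.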
Authors
author
  • State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China
author
  • State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China
  • Inertial Technology and Integrated Navigation Laboratory, Beihang University, Beijing, China
author
  • State Key Laboratory of Industrial Control Technology, Zhejiang University, 310027 Hangzhou, China
author
  • State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China
Bibliography
  • [1] Salzberg SL. Next-generation genome annotation: we still struggle to get it right. BioMed Central 2019.
  • [2] Zhang MQ. Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 2002;3(9):698–709.
  • [3] Marhon SA, Kremer SC. Gene prediction based on DNA spectral analysis: a literature review. J Comput Biol 2011;18(4):639–76.
  • [4] Ramachandran P, Lu WS, Antoniou A. Filter-based methodology for the location of hot spots in proteins and exons in DNA. IEEE Trans Biomed Eng 2012;59(6):1598–609.
  • [5] Dougherty ER. Genomic signal processing and statistics, vol. 2. Hindawi Publishing Corporation; 2005.
  • [6] Salzberg S, Delcher AL, Fasman KH, Henderson J. A decision tree system for finding genes in DNA. J Comput Biol 1998;5(4):667–80.
  • [7] Casimiro-Soriguer CS, Rubio A, Jimenez J, Pérez-Pulido AJ. Ancient evolutionary signals of protein-coding sequences allow the discovery of new genes in the Drosophila melanogaster genome. BMC Genomics 2020;21(1):1–10.
  • [8] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res 2019;29(12):2073–87.
  • [9] Piovesan A, Antonaros F, Vitale L, Strippoli P, Pelleri MC, Caracausi M. Human protein-coding genes and gene feature statistics in 2019. BMC Res Notes 2019;12(1):315.
  • [10] Marhon SA, Kremer SC. A dynamic representation-based, de novo method for protein-coding region prediction and biological information detection. Digit Signal Process 2015;46:10–8.
  • [11] Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics 2020;21:1–20.
  • [12] Tsonis AA, Elsner JB, Tsonis PA. Periodicity in DNA coding sequences: implications in gene evolution. J Theor Biol 1991;151(3):323–31.
  • [13] Datta A, Dougherty ER. Introduction to genomic signal processing with control. CRC Press; 2018.
  • [14] Saini S, Dewan L. Comparison of numerical representations of genomic sequences: choosing the best mapping for wavelet analysis. Int J Appl Comput Math 2017;3(4):2943–58.
  • [15] Kumar MR, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020.
  • [16] Kumar MR, Vaegae NK. Walsh code based numerical mapping method for the identification of protein coding regions in eukaryotes. Biomed Signal Process Control 2020;58:101859.
  • [17] Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett 1992;68(25):3805.
  • [18] Song NY, Yan H. Short exon detection in DNA sequences based on multifeature spectral analysis. EURASIP J Adv Signal Process 2011;2011:1–8.
  • [19] Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics 1997;13(3):263–70.
  • [20] Kotlar D, Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res 2003;13(8):1930–7.
  • [21] Vaidyanathan P, Yoon BJ. Digital filters for gene prediction applications. Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, 2002, vol. 1; 2002. pp. 306–10.
  • [22] Hota MK, Srivastava VK. Identification of protein coding regions using antinotch filters. Digit Signal Process 2012;22(6):869–77.
  • [23] Mena-Chalco J, Carrer H, Zana Y, Cesar Jr RM. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM Trans Comput Biol Bioinform 2008;5(2):198–207.
  • [24] Marhon SA, Kremer SC. Prediction of protein coding regions using a wide-range wavelet window method. IEEE/ACM Trans Comput Biol Bioinform 2015;13(4):742–53.
  • [25] Das L, Nanda S, Das J. An integrated approach for identification of exon locations using recursive Gauss Newton tuned adaptive Kaiser window. Genomics 2019;111(3):284–96.
  • [26] Choong MK, Yan H. Multi-scale parametric spectral analysis for exon detection in DNA sequences based on forward-backward linear prediction and singular value decomposition of the double-base curves. Bioinformation 2008;2(7):273.
  • [27] Chen B, Ji P. Visualization of the protein-coding regions with a self adaptive spectral rotation approach. Nucleic Acids Res 2011;39(1):e3.
  • [28] Lei Y, Lin J, He Z, Zuo MJ. A review on empirical mode decomposition in fault diagnosis of rotating machinery. Mech Syst Signal Process 2013;35(1–2):108–26.
  • [29] Huang NE. Hilbert-Huang transform and its applications, vol. 16. World Scientific; 2014.
  • [30] Sharma SD, Sharma SN, Saxena R. Identification of short exons disunited by a short intron in eukaryotic DNA regions. IEEE/ACM Trans Comput Biol Bioinform 2019.
  • [31] Liu G, Luan Y. Identification of protein coding regions in the eukaryotic DNA sequences based on Marple algorithm and wavelet packets transform. Abstract and Applied Analysis, vol. 2014; 2014.
  • [32] Zhang WF, Yan H. Exon prediction using empirical mode decomposition and Fourier transform of structural profiles of DNA sequences. Pattern Recognit 2012;45(3):947–55.
  • [33] Huang NE, Shen Z, Long SR, Wu MC, Shih HH, Zheng Q, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc R Soc Lond Ser A: Math Phys Eng Sci 1998;454(1971):903–95.
  • [34] Flandrin P, Rilling G, Goncalves P. Empirical mode decomposition as a filter bank. IEEE Signal Process Lett 2004;11(2):112–4.
  • [35] Bajaj V, Pachori RB. Classification of seizure and nonseizure EEG signals using empirical mode decomposition. IEEE Trans Inf Technol Biomed 2011;16(6):1135–42.
  • [36] Srinivasan R, Rengaswamy R, Miller R. A modified empirical mode decomposition (EMD) process for oscillation characterization in control loops. Control Eng Pract 2007;15(9):1135–48.
  • [37] ur Rehman N, Park C, Huang NE, Mandic DP. EMD via MEMD: multivariate noise-aided computation of standard EMD. Adv Adapt Data Anal 2013;5(02):1350007.
  • [38] Park C, ur Rehman N, Ahrabian A, Mandic DP, Looney D. Classification of motor imagery BCI using multivariate empirical mode decomposition. IEEE Trans Neural Syst Rehabil Eng 2012;21(1):10–22.
  • [39] Lang X, ur Rehman N, Zhang Y, Xie L, Su H. Median ensemble empirical mode decomposition. Signal Process 2020;107686.
  • [40] Wu Z, Huang NE. Ensemble empirical mode decomposition: a noise-assisted data analysis method. Adv Adapt Data Anal 2009;1(01):1–41.
  • [41] Rehman N, Mandic DP. Multivariate empirical mode decomposition. Proc R Soc A: Math Phys Eng Sci 2010;466(2117):1291–302.
  • [42] Mandic DP, ur Rehman N, Wu Z, Huang NE. Empirical mode decomposition-based time-frequency analysis of multivariate signals: the power of adaptive data analysis. IEEE Signal Process Mag 2013;30(6):74–86.
  • [43] Park C, Looney D, Kidmose P, Ungstrup M, Mandic DP. Time-frequency analysis of EEG asymmetry using bivariate empirical mode decomposition. IEEE Trans Neural Syst Rehabil Eng 2011;19(4):366–73.
  • [44] ur Rehman N, Mandic DP. Filter bank property of multivariate empirical mode decomposition. IEEE Trans Signal Process 2011;59(5):2421–6.
  • [45] Rilling G, Flandrin P, Gonçalves P, Lilly JM. Bivariate empirical mode decomposition. IEEE Signal Process Lett 2007;14(12):936–9.
  • [46] Rilling G, Flandrin P, Goncalves P. On empirical mode decomposition and its algorithms. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, vol. 3; 2003. pp. 8–11.
  • [47] Epp TA, Dixon IM, Wang HY, Sole MJ, Liew CC. Structural organization of the human cardiac α-myosin heavy chain gene (MYH6). Genomics 1993;18(3):505–9.
  • [48] Marhon SA, Kremer SC. Protein coding region prediction based on the adaptive representation method. 2011 24th Canadian Conference on Electrical and Computer Engineering (CCECE); 2011. pp. 000415–8.
  • [49] Rogic S, Mackworth AK, Ouellette FB. Evaluation of gene-finding programs on mammalian sequences. Genome Res 2001;11(5):817–32.
  • [50] Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics 1996;34(3):353–67.
  • [51] Shakya DK, Saxena R, Sharma SN. An adaptive window length strategy for eukaryotic CDS prediction. IEEE/ACM Trans Comput Biol Bioinform 2013;10(5):1241–52.
  • [52] Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008;18(2):310–23.
  • [53] Florquin K, Saeys Y, Degroeve S, Rouze P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res 2005;33(13):4255–64.
  • [54] Abbasi O, Rostami A, Karimian G. Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform. BMC Bioinform 2011;12(1):430.
  • [55] Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003;19(suppl_2):ii215–2.
  • [56] Akhtar M, Ambikairajah E, Epps J. Optimizing period-3 methods for eukaryotic gene prediction. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing; 2008. pp. 621–4.
  • [57] Burge C. The GENSCAN web server at MIT; 2019.
  • [58] ur Rehman N, Aftab H. Multivariate variational mode decomposition. IEEE Trans Signal Process 2019;67(23):6039–52.
  • [59] Yang L, Tang YY, Lu Y, Luo H. A fractal dimension and wavelet transform based method for protein sequence similarity analysis. IEEE/ACM Trans Comput Biol Bioinform 2014;12(2):348–59.
  • [60] Yang L, Wei P, Zhong C, Meng Z, Wang P, Tang YY. A fractal dimension and empirical mode decomposition-based method for protein sequence analysis. Int J Pattern Recognit Artif Intell 2019;33(11):1940020.
  • [61] Zeng P, Chen J, Meng Y, Zhou Y, Yang J, Cui Q. Defining essentiality score of protein-coding genes and long noncoding RNAs. Front Genet 2018;9:380.
  • [62] Talyan S, Andrade-Navarro MA, Muro EM. Identification of transcribed protein coding sequence remnants within lincRNAs. Nucleic Acids Res 2018;46(17):8720–9.
  • [63] Tripodi IJ, Chowdhury M, Dowell R. ATAC-seq signal processing and recurrent neural networks can identify RNA polymerase activity. BioRxiv 2019;531517.
Notes
Record prepared with funding from MNiSW, agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2021).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-a55e510a-a2b2-4de9-bf8f-765d5e337a20