Article title

A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions

Publication languages
EN
Abstracts
EN
The fundamental step in genomic signal processing applications is to assign a mathematical descriptor to each nucleotide {A, T, G, C} of a DNA molecule to obtain a discrete representation. This representation should preserve the biological information of the gene when it is analyzed with digital signal processing tools. To this end, a novel binary representation of a DNA sequence, combining structural and chemical information of the original sequence, is proposed for the identification of protein coding regions in eukaryotes. The identification model comprises two stages: numerical encoding in the first stage, and analysis of biological behavior through digital signal processing algorithms in the second. In the first stage, a new numerical encoding method based on Walsh codes of order 4 is proposed to obtain a 1-D binary discrete sequence. In the second stage, the modified Gabor wavelet transform (MGWT) is applied to the discretized DNA sequence for spectrum analysis. The optimal gene numerical encoding, together with the multiresolution approach of MGWT, readily identifies the structures of coding regions in unknown gene sequences. The proposed model is validated by analyzing prediction efficiency in terms of statistical metrics such as sensitivity, specificity, and accuracy at both the sequence and database levels. Furthermore, the results are compared with state-of-the-art encoding methods by plotting receiver operating characteristic (ROC) curves over all classification thresholds. An area under the curve (AUC) of 0.86 at the sequence level and 0.84 at the database level is achieved. The performance metrics indicate that the proposed encoding method performs better than the other numerical encoding methods considered.
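The two-stage idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the assignment of Walsh codes to nucleotides in `WALSH4` is a hypothetical choice (the abstract does not give the mapping), `gabor_power` is a simplified Gaussian-windowed projection onto the period-3 component rather than the full MGWT, and `confusion_metrics` assumes the usual ROC-style definitions of sensitivity, specificity, and accuracy.

```python
import cmath
import math

# Length-4 Walsh codes written with 0/1 chips (rows of the 4x4 Hadamard matrix).
# ASSUMPTION: this base-to-code assignment is hypothetical, not taken from the paper.
WALSH4 = {
    "A": (0, 0, 0, 0),
    "T": (0, 1, 0, 1),
    "G": (0, 0, 1, 1),
    "C": (0, 1, 1, 0),
}
CHIPS = 4  # samples produced per nucleotide


def encode(seq):
    """Concatenate the 4-chip code of every nucleotide into one binary sequence."""
    signal = []
    for base in seq.upper():
        signal.extend(WALSH4[base])
    return signal


def gabor_power(signal, center, period, scale):
    """Squared magnitude of the signal projected onto a Gaussian-windowed
    complex exponential of the given period, centred at sample `center`."""
    half = int(3 * scale)  # truncate the Gaussian at about 3 standard deviations
    lo = max(0, center - half)
    hi = min(len(signal), center + half + 1)
    acc = 0j
    for n in range(lo, hi):
        t = n - center
        window = math.exp(-0.5 * (t / scale) ** 2)
        acc += signal[n] * window * cmath.exp(-2j * math.pi * t / period)
    return abs(acc) ** 2


def period3_profile(seq, scales=(24.0, 48.0, 96.0)):
    """Period-3 power per nucleotide position, summed over several window scales.
    ASSUMPTION: the scales are arbitrary illustrative values."""
    signal = encode(seq)
    period = 3 * CHIPS  # three nucleotides span 12 samples after encoding
    return [
        sum(gabor_power(signal, c, period, s) for s in scales)
        for c in range(0, len(signal), CHIPS)
    ]


def confusion_metrics(predicted, actual):
    """Sensitivity, specificity and accuracy from per-nucleotide 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }
```

In such a sketch, thresholding the output of `period3_profile` would give per-nucleotide coding/non-coding labels, and sweeping that threshold while recording sensitivity against 1 − specificity would trace an ROC curve of the kind the abstract mentions.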
Authors
  • School of Electronics Engineering, Vellore Institute of Technology, Vellore, India
  • School of Electronics Engineering, Vellore Institute of Technology, Vellore 632014, India
References
  • [1] Dougherty ER. Genomic signal processing. IEEE Signal Process Mag 2012;29:124–9. http://dx.doi.org/10.1109/MSP.2012.2185868.
  • [2] Rao KD, Swamy MNS. Analysis of genomics and proteomics using DSP techniques. IEEE Trans Circuits Syst I Regul Pap 2008;55:370–8. http://dx.doi.org/10.1109/TCSI.2007.910541.
  • [3] Sahu SS, Panda G. Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach. Genom Proteom Bioinform 2011;9:45–55. http://dx.doi.org/10.1016/S1672-0229(11)60007-7.
  • [4] Singh B, Trincado JL, Tatlow PJ, Piccolo SR, Eyras E. Genome sequencing and RNA-motif analysis reveal novel damaging noncoding mutations in human tumors. Mol Cancer Res 2018;16:1112–24. http://dx.doi.org/10.1158/1541-7786.MCR-17-0601.
  • [5] Sun WH, Wang YZ, Xu Y, Yu XW. Genome-wide analysis of long non-coding RNAs in Pichia pastoris during stress by RNA sequencing. Genomics 2019;111:398–406. http://dx.doi.org/10.1016/j.ygeno.2018.02.016.
  • [6] Zhou W, Sherwood B, Ji Z, Xue Y, Du F, Bai J, et al. Genomewide prediction of DNase I hypersensitivity using gene expression. Nat Commun 2017;8. http://dx.doi.org/10.1038/s41467-017-01188-x.
  • [7] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res 2019. http://dx.doi.org/10.1101/gr.246462.118.
  • [8] Marhon SA, Kremer SC. Prediction of protein coding regions using a wide-range wavelet window method. IEEE/ACM Trans Comput Biol Bioinforma 2016;13:742–53. http://dx.doi.org/10.1109/TCBB.2015.2476789.
  • [9] GENSCAN tool available at http://genes.mit.edu/GENSCAN.
  • [10] Mathe C. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002;30:4103–17. http://dx.doi.org/10.1093/nar/gkf543.
  • [11] Silverman BD, Linsker R. A measure of DNA periodicity. J Theor Biol 1986;118:295–300. http://dx.doi.org/10.1016/S0022-5193(86)80060-1.
  • [12] Tsonis AA, Elsner JB, Tsonis PA. Periodicity in DNA coding sequences: implications in gene evolution. J Theor Biol 1991;151:323–31. http://dx.doi.org/10.1016/S0022-5193(05)80381-9.
  • [13] Shah K, Krishnamachari A. On the origin of three base periodicity in genomes. Biosystems 2012;107:142–4. http://dx.doi.org/10.1016/j.biosystems.2011.11.006.
  • [14] Anastassiou D. Frequency-domain analysis of biomolecular sequences. Bioinformatics 2000;16:1073–81. http://dx.doi.org/10.1093/bioinformatics/16.12.1073.
  • [15] Raman Kumar M, Naveen Kumar V. Review on DSP based dynamic gene encoding schemes for the detection of protein coding region. Proceeding First Annu Conf Comput Dev Electron Commun, Vellore Institute of Technology, Amaravati, India. CRC Press; 2019. p. 191.
  • [16] Yu N, Li Z, Yu Z. Survey on encoding schemes for genomic data representation and feature learning—from signal processing to machine learning. Big Data Min Anal 2018;1:191–210. http://dx.doi.org/10.26599/BDMA.2018.9020018.
  • [17] Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett 1992;68:3805–8. http://dx.doi.org/10.1103/PhysRevLett.68.3805.
  • [18] Ranawana R, Palade V. A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 2005;14:122–31. http://dx.doi.org/10.1007/s00521-004-0447-7.
  • [19] Demeler B, Zhou G. Neural network optimization for E. coli promoter prediction. Nucleic Acids Res 1991;19:1593–9. http://dx.doi.org/10.1093/nar/19.7.1593.
  • [20] Arniker SB, Kwan HK, Law NF, Lun DPK. DNA numerical representation and neural network based human promoter prediction system. Proc – 2011 Annu IEEE India Conf Eng Sustain Solut INDICON-2011, vol. 1. 2011. pp. 1–4. http://dx.doi.org/10.1109/INDCON.2011.6139326.
  • [21] Nair AS, Sreenadhan SP. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006;1:197–202.
  • [22] Tenneti SV, Vaidyanathan PP. IMUSIC: a family of MUSIC-like algorithms for integer period estimation. IEEE Trans Signal Process 2019;67:367–82. http://dx.doi.org/10.1109/TSP.2018.2879039.
  • [23] Chakravarthy N, Spanias A, Iasemidis LDKT. Autoregressive modeling and feature analysis of DNA sequences. EURASIP J Appl Signal Processing 2004;1:13–28. http://dx.doi.org/10.1155/S111086570430925X.
  • [24] Rosen GL. Biologically-inspired gradient source localization and DNA sequence analysis. Georg Inst Technol 2006. hdl.handle.net/1853/11628.
  • [25] Cristea PD. Genetic signal representation and analysis. Proc SPIE Conf Int Biomed Opt Symp (BIOS'02), vol. 4623. 2002. pp. 77–84. http://dx.doi.org/10.1117/12.491244.
  • [26] Das B, Turkoglu I. A novel numerical mapping method based on entropy for digitizing DNA sequences. Neural Comput Appl 2018;29:207–15. http://dx.doi.org/10.1007/s00521-017-2871-5.
  • [27] Das L, Nanda S, Das JK. An integrated approach for identification of exon locations using recursive gauss Newton tuned adaptive Kaiser window. Genomics 2018. http://dx.doi.org/10.1016/j.ygeno.2018008. 0-1.
  • [28] Dessouky AM, Taha TE, Dessouky MM, Eltholth AA, Hassan E, Abd El-Samie FE. Visual representation of DNA sequences for exon detection using non-parametric spectral estimation techniques. Nucleosides Nucleotides Nucleic Acids 2019;38:321–37. http://dx.doi.org/10.1080/15257770.2018.1536270.
  • [29] Kundal S, Lohiya R, Bansal H, Johri S, Sarwal V, Shah K. Computational prediction of replication sites in DNA sequences using complex number representation. arXiv:1909.13751 [q-bio.GN].
  • [30] RK M, Vaegae NK. Walsh code based numerical mapping method for the identification of protein coding regions in eukaryotes. Biomed Signal Process Control 2020;58:101859. http://dx.doi.org/10.1016/j.bspc.2020.101859.
  • [31] Zhang WF, Yan H. Exon prediction using empirical mode decomposition and Fourier transform of structural profiles of DNA sequences. Pattern Recogn 2012;45:947–55. http://dx.doi.org/10.1016/j.patcog.2011.08.016.
  • [32] Hota MK, Srivastava VK. A multirate DSP structure for the identification of protein-coding regions. Int J Biomath 2017;10:1750112. http://dx.doi.org/10.1142/S1793524517501121.
  • [33] Hota MK, Srivastava VK. Identification of protein coding regions using antinotch filters. Digit Signal Process Rev J 2012;22:869–77. http://dx.doi.org/10.1016/j.dsp.2012.06.005.
  • [34] Ramachandran P, Lu WS, Antoniou A. Filter-based methodology for the location of hot spots in proteins and exons in DNA. IEEE Trans Biomed Eng 2012;59:1598–609. http://dx.doi.org/10.1109/TBME.2012.2190512.
  • [35] Singha Roy S, Barman S. Polyphase filtering with variable mapping rule in protein coding region prediction. Microsyst Technol 2017;23:4111–21. http://dx.doi.org/10.1007/s00542-016-2884-5.
  • [36] Hota MK, Srivastava VK. Improvement in protein-coding region identification based on sliding window trigonometric fast transforms using singular value decomposition. Int J Data Min Bioinform 2011. http://dx.doi.org/10.1504/IJDMB.2011.038580.
  • [37] Shakya DK, Saxena R, Sharma SN. An adaptive window length strategy for eukaryotic CDS prediction. IEEE/ACM Trans Comput Biol Bioinform 2013;10:1241–52. http://dx.doi.org/10.1109/TCBB.201376.
  • [38] Ahmad M, Jung LT, Bhuiyan AA. A biological inspired fuzzy adaptive window median filter (FAWMF) for enhancing DNA signal processing. Comput Methods Programs Biomed 2017;149:11–7. http://dx.doi.org/10.1016/j.cmpb.2017.06.021.
  • [39] Singh AK, Srivastava VK. Performance evaluation of different window functions for STDFT based exon prediction technique taking paired numeric mapping scheme. 2019 6th Int Conf Signal Process Integr Networks. SPIN; 2019. p. 739–43. http://dx.doi.org/10.1109/SPIN.2019.8711741.
  • [40] Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics 1997;13:263–70. http://dx.doi.org/10.1093/bioinformatics/13.3.263.
  • [41] Sharma SD, Sharma SN, Saxena R. Identification of short exons disunited by a short intron in eukaryotic DNA regions. IEEE/ACM Trans Comput Biol Bioinform 2019;5963:1. http://dx.doi.org/10.1109/tcbb.2019.2900040.
  • [42] Zhang X, Shen Z, Zhang G, Shen Y, Chen M, Zhao J, et al. Short exon detection via wavelet transform modulus maxima. PLOS ONE 2016;11. http://dx.doi.org/10.1371/journal.pone.0163088.
  • [43] Mena-Chalco JP, Carrer H, Zana Y, Cesar RM. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM Trans Comput Biol Bioinform 2008;5:198–207. http://dx.doi.org/10.1109/TCBB.2007.70259.
  • [44] Abbasi O, Rostami A, Karimian G. Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform. BMC Bioinformatics 2011;12. http://dx.doi.org/10.1186/1471-2105-12-430.
  • [45] Inbamalar TM, Sivakumar R. Improved algorithm for analysis of DNA sequences using multiresolution transformation. Sci World J 2015. http://dx.doi.org/10.1155/2015/2015786497.
  • [46] Liu G, Luan Y. Identification of protein coding regions in the eukaryotic DNA sequences based on Marple algorithm and wavelet packets transform. Abstr Appl Anal 2014. http://dx.doi.org/10.1155/2014/2014402567.
  • [47] Li Z, Guan Y, Yuan X, Zheng P, Zhu H. Prediction of Sphingosine protein-coding regions with a self adaptive spectral rotation method. PLOS ONE 2019;14. http://dx.doi.org/10.1371/journal.pone.0214442.
  • [48] Talyan S, Andrade-Navarro MA, Muro EM. Identification of transcribed protein coding sequence remnants within lincRNAs. Nucleic Acids Res 2018;46:8720–9. http://dx.doi.org/10.1093/nar/gky608.
  • [49] Li J, Liu C. Coding or noncoding, the converging concepts of RNAs. Front Genet 2019;10:1-10. http://dx.doi.org/10.3389/fgene.2019.00496.
  • [50] Donelan H, O'Farrell T. Method for generating sets of orthogonal sequences. Electron Lett 1997;35:1537–8. http://dx.doi.org/10.1049/el:19991046.
  • [51] Stanley HE, Buldyrev SV, Goldberger AL, Goldberger ZD, Havlin S, Mantegna RN, Ossadnik SM, Peng CKMS. Statistical mechanics in biology: how ubiquitous are long-range correlations? Phys A Stat Mech Appl 1994;204:214–53. http://dx.doi.org/10.1016/0378-4371(94)90502-9.
  • [52] Marhon SA, Kremer SC, East SR, Canada NG. Protein coding region prediction based on the adaptive representation method. 24th Can Conf Electr Comput Eng (CCECE) IEEE. 2011. pp. 415–8. http://dx.doi.org/10.1109/CCECE.2011.6030484.
  • [53] HRM195 dataset. http://www.vision.ime.usp.br/jmena/MGWT/datasets/2010.
  • [54] Akhtar M, Epps J, Ambikairajah E. Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2008;2:310–21. http://dx.doi.org/10.1109/JSTSP.2008.923854.
  • [55] Yin C, Yau SST. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 2007;247:687–94. http://dx.doi.org/10.1016/j.jtbi.2007.03.038.
  • [56] Marhon S, Kremer SC. Theoretical justification of computing the 3-base periodicity using nucleotide distribution variance. Biosystems 2010;101:185–6. http://dx.doi.org/10.1016/j.biosystems.2010.07.001.
  • [57] Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett 2006;27:861–74. http://dx.doi.org/10.1016/j.patrec.2005010.
  • [58] Ahmad M, Jung LT, Bhuiyan AA. From DNA to protein: why genetic code context of nucleotides for DNA signal processing? A review. Biomed Signal Process Control 2017;34:44–63. http://dx.doi.org/10.1016/j.bspc.2017.01.004.
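Notes
Record prepared with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2020).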
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-79822025-fe94-49fb-ac3b-648a21ff47bf