Assembly of repetitive regions using next-generation sequencing data

Nowak, R. M.

doi:10.1016/j.bbe.2014.12.001

Abstract

High read depth can be used to assemble short sequence repeats. Existing genome assemblers fail in repetitive regions that are longer than the average read. I propose a new DNA assembly algorithm that uses the relative frequency of reads to reconstruct repetitive sequences correctly. A mathematical model for error-free input data gives the upper limits on the accuracy of the results as a function of read coverage. For high coverage, the estimation error depends linearly on the length of the repetitive sequence and is inversely proportional to the sequencing coverage. The model shows that the smaller the de Bruijn graph dimension, the more accurate the assembly of long repetitive regions. The algorithm requires high read depth, which next-generation sequencers provide, and it can use existing data. Tests on error-free reads, generated in silico from several model genomes, showed correctly reconstructed repetitive sequences in cases where existing assemblers fail. The C++ sources, Python scripts, and additional data are available at http://dnaasm.sourceforge.net.
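
The idea of using relative read frequency can be illustrated with a minimal Python sketch. It is not the algorithm from the paper or the dnaasm code. It assumes error-free reads tiled uniformly over a circular toy genome, and it assumes that most k-mers are single-copy, so the median k-mer count approximates the single-copy depth. The function names `count_kmers`, `estimate_copy_numbers`, and `tile_circular`, as well as the toy genome, are hypothetical.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer (a de Bruijn graph node) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_copy_numbers(counts):
    """Estimate how many times each k-mer occurs in the genome.

    Assumption (not from the paper): most k-mers are single-copy,
    so the median count approximates the single-copy depth.
    """
    single_copy_depth = median(counts.values())
    return {
        kmer: max(1, round(count / single_copy_depth))
        for kmer, count in counts.items()
    }


def tile_circular(genome, read_len, depth=1):
    """Hypothetical error-free read generator for a circular genome.

    Emits one read per start position, repeated `depth` times,
    so every genome position is covered uniformly.
    """
    doubled = genome + genome
    reads = [doubled[i:i + read_len] for i in range(len(genome))]
    return reads * depth


if __name__ == "__main__":
    # Toy genome: the repeat "ACGTAC" occurs twice between unique flanks.
    genome = "TTG" + "ACGTAC" + "CAGG" + "ACGTAC" + "GCTA"
    reads = tile_circular(genome, read_len=8, depth=3)
    copies = estimate_copy_numbers(count_kmers(reads, k=4))
    # Show only the k-mers estimated to occur more than once.
    print({kmer: n for kmer, n in copies.items() if n > 1})
```

With real data the coverage is not uniform, so the ratio of a k-mer count to the single-copy depth is only an estimate. The abstract states that, for high coverage, the estimation error grows linearly with the repeat length and is inversely proportional to the coverage.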

Keywords

genome assembler; repetitive sequences; mathematical model; next generation sequencing; de Bruijn graph parameters

repetitive sequence; mathematical model; next-generation sequencing

Publisher

Nałęcz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences
Elsevier

Journal

Biocybernetics and Biomedical Engineering

Year

2015

Volume

Vol. 35, no. 4

Pages

276–283

Physical description

Bibliography: 27 items, tables, charts.

Contributors

author

Nowak R. M.

r.m.nowak@elka.pw.edu.pl

Electronic Systems Institute, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

Bibliography

[1] Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26(10):1135–45.
[2] Pagani I, Liolios K, Jansson J, Chen I-MA, Smirnova T, Nosrat B, et al. The genomes online database (gold) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2012;40(D1):D571–9.
[3] Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A 2001;98(17):9748–53.
[4] Myers EW. The fragment assembly string graph. Bioinformatics 2005;21(Suppl. 2):ii79–85.
[5] Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95(6):315–27.
[6] Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS ONE 2011;6(3):e17915.
[7] Earl D, Bradnam K, John JS, Darling A, Lin D, Fass J, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011;21(12):2224–41.
[8] Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2013;2(1):1–31.
[9] Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 2012;22(3):557–67.
[10] Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinform 2010;11(1):21.
[11] Cox R, Mirkin SM. Characteristic enrichment of DNA repeats in different genomes. Proc Natl Acad Sci U S A 1997;94(10):5237–42.
[12] van Belkum A, Scherer S, van Alphen L, Verbrugh H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol Mol Biol Rev 1998;62(2):275–93.
[13] Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S, et al. Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res 2013;gkt1313.
[14] Xie C, Tammi MT. Cnv-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform 2009;10(1):80.
[15] Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 2009;19(9):1586–92.
[16] Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res 2008;18(2):324–30.
[17] Pevzner P, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Res 2004;14(9):1786–96.
[18] Cormen T, Leiserson C, Rivest R, Stein C. Introduction to algorithms. The MIT Press; 2001.
[19] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol İ. ABySS: a parallel assembler for short read sequence data. Genome Res 2009;19(6):1117–23.
[20] Chevreux B, Wetter T, Suhai S. Genome sequence assembly using trace signals and additional sequence information. German Conference on Bioinformatics. 1999. pp. 45–56.
[21] Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008;18(5):821–9.
[22] Ronen R, Boucher C, Chitsaz H, Pevzner P. Sequel: improving the accuracy of genome assemblies. Bioinformatics 2012;28(12):i188–96.
[23] Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, et al. Allpaths: de novo assembly of whole-genome shotgun microreads. Genome Res 2008;18(5):810–20.
[24] Piotrowski P, Nowak R. New tool to combine contigs by usage of paired-end tags. Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2013. International Society for Optics and Photonics; 2013. p. 890318.
[25] Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner P. Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J Comput Biol 2011;18(11):1625–34.
[26] Bresler M, Sheehan S, Chan AH, Song YS. Telescoper: de novo assembly of highly repetitive regions. Bioinformatics 2012;28(18):i311–7.
[27] Nowak RM. Polyglot programming the applications to analyze genetic data. BioMed Res Int 2014;2014:1–7.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-eb7055f1-9260-4ac8-9cbc-c70da803c45e