PL EN


Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników
Tytuł artykułu

Preprocessing and Storing High - Throughput Sequencing Data

Wybrane pełne teksty z tego czasopisma
Identyfikatory
Warianty tytułu
Języki publikacji
EN
Abstrakty
EN
DNA sequencing is a process of recognizing DNA sequences of genomes. The process consists in reading short sequences, that are subsequences of a genome, and merging them into longer sequences, preferably the whole genome. In the first phase even billions of short sequences are read at once. To simplify and speed up the second phase, we develop a pipeline of preprocessing the initial set of short sequences that is removing low quality reads and duplicated reads.We also propose a method for preliminary joining overlapping sequences, which resulted in decreasing the cardinality of initial sets to 13.9% and 27.8%. We also examine possible ways to store the huge amount of experimental data. We compare different compression methods, from which the best appeared to be DSRC, developed for special type of text files containing short sequences and their quality. We test the parameters for TCP data transferring to find the best transfer rate.
Twórcy
autor
  • Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
  • Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland
autor
  • Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland
  • Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
autor
  • Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
autor
  • Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
  • Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland
autor
  • Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland
autor
  • Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland
  • Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
  • Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland
Bibliografia
  • [1] T. Jiang, M. Li, DNA sequencing and string learning, Mathematical Systems Theory 29, 387-405 (1996).
  • [2] J.C. Dohm, C. Lottaz, T. Borodina, H. Himmelbauer, SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing, Genome Res 17, 1697-1706 (2007).
  • [3] D.W. Bryant, W.K. Wong Jr., T.C. Mockler, QSRA: a quality-value guided de novo short read assembler, BMC Bioinformatics 10, 69 (2009).
  • [4] R.L.Warren, G.G. Sutton, S.J. Jones, R.A. Holt, Assembling millions of short DNA sequences using SSAKE, Bioinformatics 23, 500-501 (2007).
  • [5] R. Li, W. Fan, G. Tian, H. Zhu, L. He, et al., The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317 (2010).
  • [6] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein, S.J. Jones, et al., ABySS: a parallel assembler for short read sequence data, Genome Res 19, 1117-1123 (2009).
  • [7] Li R, Zhu H, Ruan J, Qian W, Fang X, et al., De novo assembly of human genomes with massively parallel short read sequencing, Genome Res 20, 265-272 (2010).
  • [8] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res 18, 821-829 (2008).
  • [9] J. Blazewicz, M. Figlerowicz, P. Gawron, M. Kasprzak, E. Kirton, D. Platt, A. Swiercz, L. Szajkowski, Whole genome assembly from 454 sequencing output via modified DNA graph concept, Computational Biology and Chemistry 33, 224-230 (2009).
  • [10] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology 147, 195-197 (1981).
  • [11] http://www.bioinformatics.nl/tools/crab_fasta.html.
  • [12] P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer, P.M. Rice.The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res 38, 1767-1771 (2010).
  • [13] D. Merrill, A. Grimshaw, Revisiting sorting for GPGPU stream architectures, Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA, (2010).
  • [14] Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra).
  • [15] Escherichia coli str. K-12 substr. MG1655, complete genome (http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.2).
  • [16] WormBase, ftp://ftp.wormbase.org/pub/wormbase/releases/WS230/species/c_elegans/ c_elegans.WS230.genomic.fa.gz.
  • [17] K. Yook, T.W. Harris, T. Bieri, et al., WormBase 2012: More genomes, more data, new website, Nucleic Acids Research 40, D735-D741 (2012).
  • [18] B. Langmead, S. Salzberg, Fast gapped-read alignment with Bowtie 2, Nature Methods 9, 357-359 (2012).
  • [19] L. Ilie, F. Fazayeli, S. Ilie, HiTEC: accurate error correction in high-throughput sequencing data, Bioinformatics 27, 295-302 (2011).
  • [20] W.C. Kao, A.H. Chan, Y.S. Song, ECHO: a referencefree short-read error correction algorithm, Genome Res 21, 1181-1192 (2011).
  • [21] J.T. Simpson, R. Durbin, Efficient de novo assembly of large genomes using compressed data structures, Genome Res 22, 549-556 (2012).
  • [22] J. Ziv, A. Lempel, A Universal Algorithm for SequentialData Compression, IEEE Transactions on Information Theory 23, 337-343 (1977).
  • [23] J. Ziv, A. Lempel, Compression of individual sequences via variable rate coding, IEEE Transactions on Information Theory 24, 530-535 (1978).
  • [24] Deflate and inflate algorithms (http://www.gzip.org/algorithm.txt ).
  • [25] J-L Gailly, GNU gzip, 2013 (http://www.gnu.org/software/gzip/manual/gzip.pdf).
  • [26] P. Deutsch, J-L. Gailly, ZLIB Compressed Data Format Specification version 3.3, 1996 (http://tools.ietf.org/pdf/rfc1950.pdf).
  • [27] P. Deutsch, DEFLATE Compressed Data Format Specification version 1.3, 1996 (http://tools.ietf.org/pdf/rfc1951.pdf).
  • [28] P. Deutsch, GZIP file format specification version 4.3, 1996(http://tools.ietf.org/pdf/rfc1952.pdf).
  • [29] M. Burrows, D.J. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation. (1994).
  • [30] D.A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the I.R.E., 1098-1102 (1952).
  • [31] 7-zip file archiver (http://www.7-zip.org/).
  • [32] S. Deorowicz, S. Grabowski, Compression of DNA sequence reads in FASTQ format, Bioinformatics 27, 860-862 (2011).
  • [33] E. Grassi, F.D. Gregorio, I Molineris. KungFQ: A Simple and Powerful Approach to Compress fastq Files, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9, pp. 1837-1842 (2012).
  • [34] M. Brzezniak, N. Meyer, R. Mikołajczak, G. Jankowski, M. Jankowski. Popular Backup/Archival Service and its Application for the Archival of the Network Traffic in the PIONIER Academic Network, CMST Special Issue, 109-118 (2010).
  • [35] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster, The Globus Striped GridFTP Framework and Server, SC’05, ACM Press, (2005).
  • [36] J. Semke, M. Jamshid, M. Matthew, Automatic TCP buffer tuning. ACM SIGCOMM Computer Communication Review 28.4, 315-323 (1998).
  • [37] Enabling High Performance Data Transfers, at Pittsburg Supercomputing Center, (http://www.psc.edu/index.php/networking/641-tcp-tune#Linux).
  • [38] GT 5.0.2 GridFTP (http://www.globus.org/toolkit/docs/5.0/5.0.2/data/gridftp/).
Typ dokumentu
Bibliografia
Identyfikator YADDA
bwmeta1.element.baztech-a71d6f76-cc0a-4a25-a911-44bb0400548b
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.