Preprocessing and Storing High - Throughput Sequencing Data

Świercz, A.; Bosak, B.; Chłopkowski, M.; Hoffa, A.; Kasprzak, M.; Kurowski, M.; Piontek, T.; Błażewicz, J.

doi:10.12921/cmst.2014.20.01.9-20

Artykuł - szczegóły

Tytuł artykułu

Preprocessing and Storing High - Throughput Sequencing Data

Autorzy

Świercz A. , Bosak B. , Chłopkowski M. , Hoffa A. , Kasprzak M. , Kurowski M. , Piontek T. , Błażewicz J.

Wybrane pełne teksty z tego czasopisma

http://cmst.eu/

Identyfikatory

DOI

10.12921/cmst.2014.20.01.9-20

Warianty tytułu

Języki publikacji

Abstrakty

DNA sequencing is a process of recognizing DNA sequences of genomes. The process consists in reading short sequences, that are subsequences of a genome, and merging them into longer sequences, preferably the whole genome. In the first phase even billions of short sequences are read at once. To simplify and speed up the second phase, we develop a pipeline of preprocessing the initial set of short sequences that is removing low quality reads and duplicated reads.We also propose a method for preliminary joining overlapping sequences, which resulted in decreasing the cardinality of initial sets to 13.9% and 27.8%. We also examine possible ways to store the huge amount of experimental data. We compare different compression methods, from which the best appeared to be DSRC, developed for special type of text files containing short sequences and their quality. We test the parameters for TCP data transferring to find the best transfer rate.

Słowa kluczowe

DNA sequencing compression preprocessing transfer optimization

Wydawca

Institute of Bioorganic Chemistry Scientific Publishers OWN, Polish Academy of Sciences

Czasopismo

Computational Methods in Science and Technology

Rocznik

2014

Tom

Vol. 20, No. 1

Strony

9--20

Opis fizyczny

Bibliogr. 38 poz., rys.

Twórcy

autor

Świercz A.

aswiercz@cs.put.poznan.pl

Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland

autor

Bosak B.

Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland

autor

Chłopkowski M.

Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland

autor

Hoffa A.

Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland

autor

Kasprzak M.

Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland

autor

Kurowski M.

Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland

autor

Piontek T.

Poznań Supercomputing and Networking Center ul. Noskowskiego 10, Poznań, Poland

autor

Błażewicz J.

Institute of Computing Science Poznan University of Technology, ul. Piotrowo 2, Poznań, Poland
Institute of Bioorganic Chemistry Polish Academy of Sciences, ul. Noskowskiego 12/14, Poznań, Poland

Bibliografia

[1] T. Jiang, M. Li, DNA sequencing and string learning, Mathematical Systems Theory 29, 387-405 (1996).
[2] J.C. Dohm, C. Lottaz, T. Borodina, H. Himmelbauer, SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing, Genome Res 17, 1697-1706 (2007).
[3] D.W. Bryant, W.K. Wong Jr., T.C. Mockler, QSRA: a quality-value guided de novo short read assembler, BMC Bioinformatics 10, 69 (2009).
[4] R.L.Warren, G.G. Sutton, S.J. Jones, R.A. Holt, Assembling millions of short DNA sequences using SSAKE, Bioinformatics 23, 500-501 (2007).
[5] R. Li, W. Fan, G. Tian, H. Zhu, L. He, et al., The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317 (2010).
[6] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein, S.J. Jones, et al., ABySS: a parallel assembler for short read sequence data, Genome Res 19, 1117-1123 (2009).
[7] Li R, Zhu H, Ruan J, Qian W, Fang X, et al., De novo assembly of human genomes with massively parallel short read sequencing, Genome Res 20, 265-272 (2010).
[8] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res 18, 821-829 (2008).
[9] J. Blazewicz, M. Figlerowicz, P. Gawron, M. Kasprzak, E. Kirton, D. Platt, A. Swiercz, L. Szajkowski, Whole genome assembly from 454 sequencing output via modified DNA graph concept, Computational Biology and Chemistry 33, 224-230 (2009).
[10] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology 147, 195-197 (1981).
[11] http://www.bioinformatics.nl/tools/crab_fasta.html.
[12] P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer, P.M. Rice.The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res 38, 1767-1771 (2010).
[13] D. Merrill, A. Grimshaw, Revisiting sorting for GPGPU stream architectures, Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA, (2010).
[14] Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra).
[15] Escherichia coli str. K-12 substr. MG1655, complete genome (http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.2).
[16] WormBase, ftp://ftp.wormbase.org/pub/wormbase/releases/WS230/species/c_elegans/ c_elegans.WS230.genomic.fa.gz.
[17] K. Yook, T.W. Harris, T. Bieri, et al., WormBase 2012: More genomes, more data, new website, Nucleic Acids Research 40, D735-D741 (2012).
[18] B. Langmead, S. Salzberg, Fast gapped-read alignment with Bowtie 2, Nature Methods 9, 357-359 (2012).
[19] L. Ilie, F. Fazayeli, S. Ilie, HiTEC: accurate error correction in high-throughput sequencing data, Bioinformatics 27, 295-302 (2011).
[20] W.C. Kao, A.H. Chan, Y.S. Song, ECHO: a referencefree short-read error correction algorithm, Genome Res 21, 1181-1192 (2011).
[21] J.T. Simpson, R. Durbin, Efficient de novo assembly of large genomes using compressed data structures, Genome Res 22, 549-556 (2012).
[22] J. Ziv, A. Lempel, A Universal Algorithm for SequentialData Compression, IEEE Transactions on Information Theory 23, 337-343 (1977).
[23] J. Ziv, A. Lempel, Compression of individual sequences via variable rate coding, IEEE Transactions on Information Theory 24, 530-535 (1978).
[24] Deflate and inflate algorithms (http://www.gzip.org/algorithm.txt ).
[25] J-L Gailly, GNU gzip, 2013 (http://www.gnu.org/software/gzip/manual/gzip.pdf).
[26] P. Deutsch, J-L. Gailly, ZLIB Compressed Data Format Specification version 3.3, 1996 (http://tools.ietf.org/pdf/rfc1950.pdf).
[27] P. Deutsch, DEFLATE Compressed Data Format Specification version 1.3, 1996 (http://tools.ietf.org/pdf/rfc1951.pdf).
[28] P. Deutsch, GZIP file format specification version 4.3, 1996(http://tools.ietf.org/pdf/rfc1952.pdf).
[29] M. Burrows, D.J. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation. (1994).
[30] D.A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the I.R.E., 1098-1102 (1952).
[31] 7-zip file archiver (http://www.7-zip.org/).
[32] S. Deorowicz, S. Grabowski, Compression of DNA sequence reads in FASTQ format, Bioinformatics 27, 860-862 (2011).
[33] E. Grassi, F.D. Gregorio, I Molineris. KungFQ: A Simple and Powerful Approach to Compress fastq Files, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9, pp. 1837-1842 (2012).
[34] M. Brzezniak, N. Meyer, R. Mikołajczak, G. Jankowski, M. Jankowski. Popular Backup/Archival Service and its Application for the Archival of the Network Traffic in the PIONIER Academic Network, CMST Special Issue, 109-118 (2010).
[35] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster, The Globus Striped GridFTP Framework and Server, SC’05, ACM Press, (2005).
[36] J. Semke, M. Jamshid, M. Matthew, Automatic TCP buffer tuning. ACM SIGCOMM Computer Communication Review 28.4, 315-323 (1998).
[37] Enabling High Performance Data Transfers, at Pittsburg Supercomputing Center, (http://www.psc.edu/index.php/networking/641-tcp-tune#Linux).
[38] GT 5.0.2 GridFTP (http://www.globus.org/toolkit/docs/5.0/5.0.2/data/gridftp/).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-a71d6f76-cc0a-4a25-a911-44bb0400548b