The correctness of large scale analysis of genomic data

Wojciechowski, Paweł; Krause, Karol; Lukasiak, Piotr; Blazewicz, Jacek

doi:10.2478/fcds-2021-0024

Abstract

Implementing a large genomic project is a demanding task, not least from a computer science point of view. Besides collecting and sequencing many genome samples, a huge amount of data must be processed at every stage of its production and analysis. Efficient transfer and storage of the data are also important issues. Throughout such a project, work standards must be maintained and the quality of the results controlled, which can be difficult when part of the work is carried out externally. Here, we describe our experience with such data quality analysis on several levels: from the obvious check of the quality of the results obtained, through examining the consistency of the data at various stages of processing, to verifying, as far as possible, their compatibility with the data describing the sample.

Keywords

genomic data; large scale analysis; process pipelines

Publisher

Wydawnictwo Politechniki Poznańskiej

Journal

Foundations of Computing and Decision Sciences

Year

2021

Volume

Vol. 46, No. 4

Pages

423-436

Physical description

Bibliography: 24 items, figures.

Authors

author

Wojciechowski Paweł

Institute of Computing Science, Poznan University of Technology, Poland
Laboratory of Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences

author

Krause Karol

Institute of Computing Science, Poznan University of Technology, Poland

author

Lukasiak Piotr

Institute of Computing Science, Poznan University of Technology, Poland
Laboratory of Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences

author

Blazewicz Jacek

Institute of Computing Science, Poznan University of Technology, Poland
Laboratory of Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences

Bibliography

[1] Bai H., Guo X., Zhang D., et al. The genome of a Mongolian individual reveals the genetic imprints of Mongolians on modern human populations. Genome Biology and Evolution, 6(12):3122-3136, 2014.
[2] Brittain H., Scott R., and Thomas E. The rise of the genome and personalised medicine. Clinical Medicine, 17(6):545-551, 2017.
[3] Caulfield M., Davies J., Dennys M., et al. National genomic research library, 2020.
[4] Chan T., Golub G., and Leveque R. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3):242-247, 1983.
[5] Chen S., Zhou Y., Chen Y., et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17):i884-i890, 2018.
[6] Cho Y., Kim H., Kim H., et al. An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes. Nature Communications, 7:13637, 2016.
[7] Cibulskis K., McKenna A., Fennell T., et al. ContEst: estimating cross-contamination of human samples in next-generation sequencing data. Bioinformatics, 27(18):2601-2602, 2011.
[8] The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, 2015.
[9] Danecek P., Bonfield J., Liddle J., et al. Twelve years of SAMtools and BCFtools. GigaScience, 10(2), 2021.
[10] Durbin R., Altshuler D., Abecasis G., et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, 2010.
[11] Fiévet A., Bernard V., Tenreiro H., et al. ART-DeCo: easy tool for detection and characterization of cross-contamination of DNA samples in diagnostic next-generation sequencing analysis. European Journal of Human Genetics, 27(5), 2019.
[12] Fiorito G., Di Gaetano C., Guarrera S., et al. The Italian genome reflects the history of Europe and the Mediterranean basin. European Journal of Human Genetics, 24(7):1056-1062, 2016.
[13] Guo J., Wu Y., Zhu Z., et al. Global genetic differentiation of complex traits shaped by natural selection in humans. Nature Communications, 9(1):1865, 2018.
[14] Hehir-Kwa J., Marschall T., Kloosterman W., et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nature Communications, 7:12989, 2016.
[15] Kehr B., Helgadottir A., and Melsted P. Diversity in non-repetitive human sequences not found in the reference genome. Nature Genetics, 49(4):588-593, 2017.
[16] Li Q., Tian S., Yan B., et al. Building a Chinese pan-genome of 486 individuals. Communications Biology, 4(1):1016, 2021.
[17] McDermott U. Next-generation sequencing and empowering personalised cancer medicine. Drug Discovery Today, 20(12):1470-1475, 2015.
[18] Nagasaki M., Yasuda J., Katsuoka F., et al. Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nature Communications, 6(1):8018, 2015.
[19] Takayama J., Tadaka S., Yano K., et al. Construction and integration of three de novo Japanese human genome assemblies toward a population-specific reference. Nature Communications, 12(1):226, 2021.
[20] Tishkoff S. and Kidd K. Implications of biogeography of human populations for ’race’ and medicine. Nature Genetics, 36(11):S21-S27, 2004.
[21] Van der Auwera G. and O’Connor B. Genomics in the cloud: using Docker, GATK, and WDL in Terra. O’Reilly Media, Sebastopol, CA, first edition, 2020.
[22] Welford B. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419-420, 1962.
[23] Zhao S., Agafonov O., Azab A., et al. Accuracy and efficiency of germline variant calling pipelines for human genome data. Scientific Reports, 10(1):20222, 2020.
[24] Zimani A., Peterlin B., and Kovanda A. Increasing genomic literacy through national genomic projects. Frontiers in Genetics, 12:693253, 2021.

Notes

The research was carried out under grant no. POIR.04.02.00-30-A004/16.

Record prepared with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2021).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-50157436-b62d-46d4-b0c9-3a4e3299f56b