Efficient alternatives to PSI-BLAST

Startek, M.; Lasota, S.; Sykulski, M.; Bułak, A.; Noé, L.; Kucherov, G.; Gambin, A.

Article - details

Content

Full texts:

httpbpasts_czasopisma_pan_plimagesdatabpastswydaniano3september201213efficientalternativestopsiblast.pdf

Abstract

In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI-BLAST. Both can exploit knowledge of the amino acid composition specific to a given protein family: SeedBLAST uses a purpose-designed seed, while CTX-PSI-BLAST extends PSI-BLAST with a context-specific substitution model. The seeding technique has become central to the theory of sequence alignment. Several efficient tools apply seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate the superiority of SeedBLAST in terms of efficiency, especially for twilight-zone hits. The contextual substitution model has been shown to increase the sensitivity of protein alignment. In this paper we take the next step in the contextual alignment program: we present a contextual version of PSI-BLAST, the iterative version of the NCBI-BLAST tool. The experimental evaluation demonstrates significantly higher sensitivity than the ordinary PSI-BLAST algorithm.
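
The subset-seed idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each seed position is a partition of the amino acid alphabet and that two residues match at a position when they fall into the same group. The residue groups, the example seed, and the function names below are hypothetical and chosen only for illustration; they are not the seeds derived in the paper, and the naive window scan stands in for the DFA-based hit detection that the abstract describes.

```python
# Illustrative sketch only: the residue groups, the seed, and all function
# names here are hypothetical, not taken from the paper or from NCBI-BLAST.

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# A seed position is a partition of the alphabet, written as a list of
# residue groups. Coarser partitions tolerate more substitutions.
EXACT = [{aa} for aa in ALPHABET]                  # identity only
COARSE = [set("ILMV"), set("FWY"), set("KRH"), set("DE"),
          set("NQ"), set("ST"), set("AG"), {"C"}, {"P"}]
ANY = [set(ALPHABET)]                              # "don't care" position


def position_matches(partition, a, b):
    """True if residues a and b belong to the same group of the partition."""
    return any(a in group and b in group for group in partition)


def seed_hit(seed, s, t):
    """True if the two words s and t match the seed at every position."""
    if len(s) != len(seed) or len(t) != len(seed):
        return False
    return all(position_matches(p, a, b) for p, a, b in zip(seed, s, t))


def find_hits(seed, query, subject):
    """Naive scan: all (query offset, subject offset) pairs that form a hit."""
    k = len(seed)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in range(len(subject) - k + 1)
            if seed_hit(seed, query[i:i + k], subject[j:j + k])]


example_seed = [EXACT, COARSE, ANY, EXACT]
hits = find_hits(example_seed, "MALKW", "AIDWG")
```

With this example seed, the words "ALKW" and "AIDW" form a hit because L and I share a group at the second position and the third position accepts any pair. A real implementation would replace the quadratic scan with an index or automaton.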

Keywords

PSI-BLAST tool; sequence alignment; seeding technique

Publisher

Polska Akademia Nauk, Wydział IV Nauk Technicznych

Journal

Bulletin of the Polish Academy of Sciences. Technical Sciences

Year

2012

Volume

Vol. 60, no. 3

Pages

495-505

Physical description

Bibliography: 45 items, figures, tables.

Contributors

author

Startek M.

author

Lasota S.

author

Sykulski M.

author

Bułak A.

author

Noé L.

author

Kucherov G.

author

Gambin A.

Institute of Informatics, University of Warsaw, 2 Banacha St., 02-097 Warszawa, Poland

Bibliography

[1] T. Smith and M. Waterman, “The identification of common molecular subsequences”, J. Molecular Biology 147, 195-197 (1981).
[2] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search tool”, J. Molecular Biology 215, 403-410 (1990).
[3] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research 25, 3389-3402 (1997).
[4] G. Kucherov, L. Noé, and M. Roytberg, “A unifying framework for seed sensitivity and its application to subset seeds”, J. Bioinformatics and Computational Biology 4 (2), 553-570 (2006).
[5] A. Gambin, S. Lasota, R. Szklarczyk, J. Tiuryn, and J. Tyszkiewicz, “Contextual alignment of biological sequences”, Proc. ECCB’02, Bioinformatics 18, 116-127 (2002).
[6] B. Brejova, D.G. Brown, and T. Vinar, “Optimal spaced seeds for homologous coding regions”, J. Bioinformatics and Computational Biology 1 (4), 595-610 (2004).
[7] A.S. Shiryev, J.S. Papadopoulos, A.A. Schäffer, and R. Agarwala, “Improved BLAST searches using longer words for protein seeding”, Bioinformatics 23, 2949-2951 (2007).
[8] B. Ma, J. Tromp, and M. Li, “PatternHunter: faster and more sensitive homology search”, Bioinformatics (Oxford, England) 18, 440-445 (2002).
[9] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: highly sensitive and fast homology search”, J. Bioinformatics and Computational Biology 2 (3), 417-439 (2004).
[10] D. Kisman, M. Li, B. Ma, and L. Wang, “tPatternHunter: gapped, fast and sensitive translated homology search”, Bioinformatics (Oxford, England) 21, 542-544 (2005).
[11] L. Noé and G. Kucherov, “YASS: enhancing the sensitivity of DNA similarity search”, Nucl. Acids Res. 33, W540-543 (2005).
[12] J. Buhler, U. Keich, and Y. Sun, “Designing seeds for similarity search in genomic DNA”, J. Comput. Syst. Sci. 70 (3), 342-363 (2005).
[13] B. Brejová, D.G. Brown, and T. Vinar, “Vector seeds: an extension to spaced seeds”, J. Comput. Syst. Sci. 70 (3), 364-380 (2005).
[14] Y. Sun and J. Buhler, “Designing multiple simultaneous seeds for DNA similarity search”, RECOMB 1, 76-84 (2004).
[15] G. Kucherov, L. Noé, and M. Roytberg, “Multiseed lossless filtration”, IEEE/ACM Trans. Comput. Biol. Bioinformatics 2 (1), 51-61 (2005).
[16] M. Roytberg, A. Gambin, L. Noé, S. Lasota, E. Furletova, E. Szczurek, and G. Kucherov, “On subset seeds for protein alignment”, IEEE/ACM Trans. on Computational Biology and Bioinformatics 6 (3), 483-494 (2009).
[17] W. Li, B. Ma, and K. Zhang, “Amino acid classification and hash seeds for homology search”, BICoB 1, 44-51 (2009).
[18] S.M. Kiełbasa, R. Wan, K. Sato, P. Horton, and M.C. Frith, “Adaptive seeds tame genomic sequence comparison”, Genome Research 21 (3), 487-493 (2011).
[19] C.D. Livingstone and G.J. Barton, “Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation”, Computer Applications in the Biosciences: CABIOS 9, 745-756 (1993).
[20] T. Li, K. Fan, W. Wang, and J. Wang, “Reduction of protein sequence complexity by residue grouping”, Protein Engineering 16 (5), 323-330 (2003).
[21] L. Murphy, A. Wallqvist, and R. Levy, “Simplified amino acid alphabets for protein fold recognition and implications for folding”, Protein Engineering 13, 149-152 (2000).
[22] B. Rost, “Twilight zone of protein sequence alignments”, Protein Engineering Design and Selection 12 (2), 85-94 (1999).
[23] A. Gambin and J. Tyszkiewicz, “Substitution matrices for contextual alignment”, Journees Ouvertes Biologie Informatique Mathematique 1, 227-238 (2002).
[24] S. Henikoff and J. Henikoff, “Amino acid substitution matrices from protein blocks”, Proc. Natl. Acad. Sci. USA 89, 10915-10919 (1992).
[25] A. Gambin and P. Wojtalewicz, “CTX-BLAST: context sensitive version of protein blast”, Bioinformatics 23 (13), 1686-1688 (2007).
[26] I. Friedberg, T. Kaplan, and H. Margalit, “Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments”, Protein Science 9, 2278-2284 (2000).
[27] A. Gambin, S. Lasota, M. Startek, M. Sykulski, L. Noé, and G. Kucherov, “Subset seed extension to protein blast”, Bioinformatics 1, 149-158 (2011).
[28] B. Korte and D. Hausmann, “An analysis of the greedy heuristic for independence systems”, Ann. Discrete Math. 2, 65-74 (1978).
[29] S. Cheng and Y.-F. Xu, “Constrained independence system and triangulations of planar point sets”, Computing and Combinatorics 1, 41-50 (1995).
[30] B. Boeckmann, A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. Martin, K. Michoud, C. O’Donovan, and I. Phan, “The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003”, Nucl. Acids Res. 31 (1), 365-370 (2003).
[31] Y. Ponty, M. Termier, and A. Denise, “GenRGenS: software for generating random genomic sequences and structures”, Bioinformatics 22, 1534-1535 (2006).
[32] I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, “Efficient methods for generating optimal single and multiple spaced seeds”, BIBE ’04: Proc. 4th IEEE Symp. on Bioinformatics and Bioengineering 1, 411 (2004).
[33] B. Ma and H. Yao, “Seed optimization is no easier than optimal golomb ruler design”, APBC 1, 133-144 (2008).
[34] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, London, 1996.
[35] F. M. Liang, “Word hy-phen-a-tion by com-put-er”, Tech. Rep., Stanford University, Stanford, 1983.
[36] A. Gambin, J. Tiuryn, and J. Tyszkiewicz, “Alignment with context dependent scoring function”, J. Computational Biology 13 (1), 81-101 (2006).
[37] S. Altschul and W. Gish, “Local alignment statistics”, Methods Enzymol. 266, 460-480 (1996).
[38] S. Altschul, R. Bundschuh, R. Olsen, and T. Hwa, “The estimation of statistical parameters for local alignment score distributions”, Nucleic Acids Res. 29 (2), 351-361 (2001).
[39] A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. Eddy, S. Griffiths-Jones, K. Howe, M. Marshall, and E. Sonnhammer, “The pfam protein families database”, Nucl. Acids Res. 30 (1), 276-280 (2002).
[40] R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, S.J. Sammut, H. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L.L. Sonnhammer, and A. Bateman, “The pfam protein families database”, Nucl. Acids Res. 36 (1), D281-288 (2008).
[41] L. Oliveira, A.C.M. Paiva, and G. Vriend, “A common motif in g-protein-coupled seven transmembrane helix receptors”, J. Computer-Aided Molecular Design 7, 649-658 (1993).
[42] P. Peterlongo, L. Noé, D. Lavenier, G. Georges, J. Jacques, G. Kucherov, and M. Giraud, “Protein similarity search with subset seeds on a dedicated reconfigurable hardware”, Parallel Processing and Applied Mathematics 1, 1240-1248 (2008).
[43] V.H. Nguyen and D. Lavenier, “Speeding up subset seed algorithm for intensive protein sequence comparison”, RIVF 1, 57-63 (2008).
[44] T. Kahveci and A. Singh, “An efficient index structure for string databases”, Proc. 27th VLDB 1, 352-360 (2001).
[45] M. Cameron, H. Williams, and A. Cannane, “A deterministic finite automaton for faster protein hit detection in BLAST”, J. Comput. Biol. 13 (40), 965-978 (2006).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPG8-0096-0013