In an accompanying paper by Minervini et al., we address the problem of studying sequence-to-structure relationships in "never born proteins" (NBPs), i.e. protein sequences that have never been observed in nature. Studying the structural and functional properties of NBPs requires generating a large library of protein sequences that show no significant similarity to any known protein sequence. In this paper we describe the implementation of a simple command-line utility that generates random amino acid sequences and filters them against the NCBI non-redundant protein database, using as a threshold the E-value returned by the well-known sequence comparison program BLAST. The utility, named RandomBlast, is written in the C programming language for Windows operating systems. The structural implications of the random amino acid composition of NBPs are discussed in comparison with natural proteins of comparable length.
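The generate-and-filter loop described above can be sketched as follows. This is a minimal illustration in Python, not the RandomBlast code itself (which is written in C). The uniform residue composition, the sequence length of 70, the threshold of 1.0, and the `best_evalue` callable (a stand-in for an actual BLAST search against the database) are all assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the twenty natural amino acids


def random_sequence(length, rng):
    """Draw each residue uniformly at random (an assumed composition)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def generate_library(n, length, best_evalue, threshold, rng):
    """Collect n random sequences with no significant database hit.

    best_evalue is a hypothetical callable standing in for a BLAST search:
    it returns the lowest E-value found for the given sequence.
    """
    library = []
    while len(library) < n:
        seq = random_sequence(length, rng)
        # A best E-value above the threshold means no significant similarity.
        if best_evalue(seq) > threshold:
            library.append(seq)
    return library


def fake_best_evalue(seq):
    # Toy stand-in: pretend sequences starting with "M" have a significant hit.
    return 1e-10 if seq.startswith("M") else 10.0


rng = random.Random(0)  # fixed seed so the run is repeatable
library = generate_library(5, 70, fake_best_evalue, threshold=1.0, rng=rng)
```

In a real pipeline the toy `fake_best_evalue` would be replaced by a function that runs the search and parses the best E-value from its output.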
The number of known natural protein sequences, though quite large, is vanishingly small compared to the number of proteins theoretically possible with the twenty natural amino acids. Thus, there exists a huge number of protein sequences that have never been observed in nature, the so-called "never born proteins". Studying the structural and functional properties of never born proteins is a way to improve our knowledge of the fundamental properties that make existing protein sequences so unique. Furthermore, it is of great interest to understand whether the extant proteins are merely the result of contingency or the result of a selection process based on the peculiar physico-chemical properties of their sequences. Protein structure prediction tools combined with large computing resources make it possible to tackle this problem. In fact, the study of never born proteins requires the generation of a large library of protein sequences not present in nature and the prediction of their three-dimensional structure. This is not trivial when facing 10^5 to 10^7 protein sequences: on a single CPU it would take years to predict the structures of such a large library. On the other hand, this is an embarrassingly parallel problem in which the same computation (i.e. the prediction of the three-dimensional structure of a protein sequence) must be repeated independently many times (i.e. once for each of a large number of protein sequences). The use of grid infrastructures makes it feasible to approach this problem in an acceptable time frame. In this paper we describe the setup of a simulation environment within the EUChinaGRID infrastructure that allows user-friendly exploitation of grid resources for large-scale protein structure prediction.
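The embarrassingly parallel structure of the problem can be illustrated with a small sketch. It is not the EUChinaGRID environment: `predict_structure` is a hypothetical placeholder for a single prediction job, and a local thread pool stands in for the grid workers. The point it shows is that each sequence is processed independently, so the jobs can be distributed across workers without any communication between them.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_structure(sequence):
    """Hypothetical placeholder for one structure prediction job."""
    return {"sequence": sequence, "length": len(sequence)}


def predict_all(sequences, workers=4):
    # Each sequence is independent of the others, so the jobs can run
    # on any worker in any order; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict_structure, sequences))


results = predict_all(["ACDEFG", "HIKLMN", "PQRSTV"])
```

For CPU-bound predictions on a real library, the thread pool would be replaced by separate processes or by job submission to grid nodes; the independent, per-sequence decomposition stays the same.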