ISSN: 0974-276X
Review Article - (2011) Volume 4, Issue 12
Keywords: Protein structure prediction; SOPMA; TMHMM; Comparative modeling; MODELLER; Drug design; Docking
Most modern drug discovery projects start with protein target identification and verification to obtain a verified drug target. For structure-based drug design, the three-dimensional structure of the protein needs to be determined experimentally by either X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy [1]. While both methods are increasingly being applied in a high-throughput manner, structure determination is not yet a straightforward process. X-ray crystallography is limited by the difficulty of getting some proteins to form crystals, and NMR can only be applied to relatively small protein molecules.
Proteins are essential to biological processes. They are responsible for catalyzing and regulating biochemical reactions, transporting molecules, the chemistry of vision and of the photosynthetic conversion of light to growth, and they form the basis of structures such as skin, hair, and tendon. Protein function can be understood in terms of its structure. Indeed, the three-dimensional structure of a protein is closely related to its biological function. Proteins that perform similar functions tend to show a significant degree of structural homology [2].
The amino acid sequence of a protein is known as its primary structure, while local conformations in this sequence, namely alpha helices, beta sheets, and random coils, are known as secondary structures. The angles between adjacent amino acids, called the torsion angles [3], determine the twists and turns in the sequence that result in these secondary structures. The three-dimensional configuration of the primary structure is defined as the tertiary structure, describing the fold of the protein.
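As an illustration of what a torsion angle is numerically, the following minimal sketch computes the dihedral angle defined by four points in space. The function name and the test coordinates are illustrative assumptions, not taken from any cited tool. For the backbone angle phi the four atoms would be C(i-1), N(i), Cα(i), C(i), and for psi they would be N(i), Cα(i), C(i), N(i+1).

```python
import math


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (range -180..180) defined by four 3D points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p0, p1)          # points from p1 back to p0
    b1 = sub(p2, p1)          # central bond p1 -> p2
    b2 = sub(p3, p2)          # points from p2 on to p3
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)

    # Remove the components along the central bond, leaving the projections
    # of b0 and b2 onto the plane perpendicular to it.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))

    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: p3 is rotated a quarter turn about the central bond.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Using atan2 rather than acos keeps the sign of the angle, which is what distinguishes, for example, left-handed from right-handed helical regions.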
Each amino acid consists of a rigid plane formed by single nitrogen, carbon, alpha-carbon (Cα), oxygen, and hydrogen atoms, and a distinguishing side chain. The individual amino acids are distinguished from each other by a number of physicochemical properties that give rise to the three-dimensional structure [4].
ExPASy [5,6] is a proteomics server operated by the Swiss Institute of Bioinformatics. It is used to analyze protein sequences and structures as well as two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) data [7-9].
Primary structure analysis
Amino acid sequence analysis [10] provides important insight into the structure of proteins, which in turn greatly facilitates the understanding of their biochemical and cellular function. Efforts to use computational methods to predict protein structure based only on sequence information started 30 years ago [11]. However, only during the last decade have the introduction of new computational techniques such as protein fold recognition and the growth of sequence and structure databases, driven by modern high-throughput technologies, led to an increase in the success rate of prediction methods.
Sequence retrieval database searches
Sequence similarity searching is a crucial step in analyzing newly determined protein sequences. Typically, large sequence databases such as the non-redundant (nr) database at the NCBI [12] (a synthesis of the GenBank, EMBL and DDBJ databases) or genome sequences are scanned for DNA or amino acid sequences that are similar to a target sequence. Alignments of the target sequence are constructed for each database entry, typically using dynamic programming algorithms [13]. Scores derived from these alignments are used to identify statistically significant matches.
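The dynamic programming idea can be sketched with a global (Needleman-Wunsch style) alignment score. This is a minimal illustration only: the function name and the simple match, mismatch and linear gap values are assumptions chosen for readability, whereas real search programs use substitution matrices, affine gap penalties and heuristics for speed.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# Hypothetical toy sequences: one insertion relative to the other.
toy_score = global_alignment_score("AC", "AGC")
```

Each cell depends only on its three neighbours, so the full table is filled in time proportional to the product of the two sequence lengths.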
Traditionally, searches were carried out using programs for pairwise sequence comparison such as FASTA [14] or BLAST [15-17]. However, the relationship between sequences of homologous proteins cannot always be recognized by pairwise sequence comparisons. The most sensitive methods available today use the initial search for homologues to construct a multiple sequence alignment (MSA) [18], which provides insight into the positional constraints of the amino acid composition and allows the identification of conserved and variable regions in the family comprising the target and its presumed homologues.
Protein domain identification
After protein discovery, many questions arise about the protein's overall identity, its putative function and the identification of biologically significant sites [19]. To answer these questions, a number of databases and tools have been customized. Most proteins are composed of a finite number of evolutionarily conserved modules or domains. Protein domains are distinct units of three-dimensional protein structure, which often carry a discrete molecular function, such as the binding of a specific type of molecule. These domains vary in length from about 25 to 500 amino acids. Because the direct functional and structural determination of all the proteins in an organism is prohibitively costly and time consuming, and 3D structural information is relatively scarce, primary sequence analysis is preferred for identifying the majority of protein domain families [20].
A few thousand conserved domains, which cover more than two thirds of known protein sequences, have been identified and described in the literature. The PFAM [21,22] and SMART [23] databases are the largest collections of manually curated protein domain information. Each deposited domain family is extensively annotated in the form of textual descriptions, as well as cross-links to other resources and literature references.
Secondary structure prediction
The important concepts in secondary structure prediction [24] are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequences, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering [25].
The early methods of secondary structure prediction suffered from a lack of data. Predictions were performed on single sequences rather than families of homologous sequences, and there were relatively few known 3D structures from which to derive parameters. The most famous early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim. Although the authors originally claimed quite high accuracies (70-80%), under careful examination the methods were shown to be only between 56 and 60% accurate [26].
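Accuracy figures of this kind are usually per-residue three-state accuracies (often called Q3): the percentage of residues whose predicted state matches the observed one. A minimal sketch follows, assuming both predicted and observed states are given as equal-length strings over H (helix), E (strand) and C (coil). The function name and the example strings are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical six-residue example with one wrong state (position 5).
example_q3 = q3_accuracy("HHHEEC", "HHHECC")
```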
Recent improvements
The availability of large families of homologous sequences revolutionized secondary structure prediction. Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks [27] has led to accuracies well in excess of 70%. Though this seems a small percentage increase, these predictions are actually much more useful than those for a single sequence, since they tend to predict the core accurately. Moreover, the limit of 70-80% may be a function of secondary structure variation within homologous proteins.
SOPMA (Self Optimized Prediction Method from Alignment) [28] was employed to predict secondary structure features such as alpha helix, extended strand, beta turn and random coil, expressed as percentages, for all the sequences. These features were considered as input parameters for self-organizing maps for further analysis [29]. SOPMA correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (α-helix, β-sheet and coil). This tool works on the basis of a neural network method (PHD) [30]. The PHD algorithm first performs a database search for possible homologous proteins, then aligns and filters the sequences to decide on the most likely homologues, and finally feeds the sequences and alignment profile to a feed-forward neural network for secondary structure prediction [31].
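Turning a per-residue prediction into percentage features of the kind described above is a simple counting step. The sketch below assumes the prediction is available as a string of one-letter state codes (h, e, t, c); that encoding, the function name and the example string are assumptions for illustration, not a description of any server's actual output format.

```python
from collections import Counter

# Assumed one-letter codes for the four states named in the text.
STATE_NAMES = {
    "h": "alpha helix",
    "e": "extended strand",
    "t": "beta turn",
    "c": "random coil",
}


def state_percentages(states):
    """Percentage of residues in each secondary structure state."""
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    total = len(states)
    return {name: 100.0 * counts.get(code, 0) / total
            for code, name in STATE_NAMES.items()}


# Hypothetical ten-residue prediction.
features = state_percentages("hhhheettcc")
```

The resulting four numbers per sequence form a fixed-length feature vector, which is the kind of input a self-organizing map expects.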
Transmembrane region prediction
Different servers, namely TMHMM, SOSUI [32], HMMTOP and TMpred, were accessed to validate the TM region [33,34]. TMHMM, a new membrane protein topology prediction method, is based on a hidden Markov model [35].
Transmembrane topology predictions
In a study conducted by Kumar et al., the transmembrane topology of AHA1 was predicted from the amino acid sequence [36] by averaging the results of four different predictive algorithms: DAS [37], HMMTOP [38], TMHMM [39] and TMPRED [40]. The accuracy of the prediction was assessed by running the same algorithms on the 3b8c protein and comparing the results with the topology defined in the 3.6 Å structure (3b8c).
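One simple way to combine several topology predictions is a per-residue vote. The sketch below is not the averaging procedure of the cited study, whose details are not given here; it is a substituted consensus scheme in which a residue is called transmembrane when at least min_votes predictors place it in a segment. The segment coordinates are invented for illustration.

```python
def consensus_segments(predictions, length, min_votes):
    """Merge per-predictor TM segments (1-based, inclusive) by residue voting."""
    votes = [0] * (length + 1)  # index 0 unused so positions stay 1-based
    for segments in predictions:
        for start, end in segments:
            for pos in range(start, end + 1):
                votes[pos] += 1

    consensus, run_start = [], None
    for pos in range(1, length + 1):
        if votes[pos] >= min_votes:
            if run_start is None:
                run_start = pos          # a new consensus segment begins
        elif run_start is not None:
            consensus.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:            # segment runs to the last residue
        consensus.append((run_start, length))
    return consensus


# Hypothetical output of four predictors for a 30-residue sequence.
predicted = [[(5, 10)], [(6, 11)], [(4, 9)], [(20, 25)]]
majority = consensus_segments(predicted, length=30, min_votes=3)
```

Raising min_votes trades sensitivity for specificity: the segment reported by only one predictor disappears from the majority result.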
Tertiary structure
Knowing a protein's three-dimensional (tertiary) structure helps us to understand its function and provides a basis for planning experiments and drug design. The Brookhaven Protein Data Bank (PDB) is the repository for those structures. Files containing atom coordinates, which are suited for visualization by graphical molecule viewers like RasMol, can be obtained at this site. PDB is also searchable with a polypeptide sequence as a query, e.g. with the BLAST service located at NCBI.
The tertiary structure of a protein is built by the packing of its secondary structure elements to form discrete domains or autonomous folding units [41]. The two main approaches to the determination of protein 3D structure are ab initio prediction and comparative modeling.
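Coordinate files in the legacy PDB text format store one atom per line in fixed columns. The following minimal sketch reads the ATOM records from such text. The sample records are made up for illustration (they are not from a deposited entry) and are written field by field so that the column layout is visible; the function name is hypothetical, and only the fields up to the z coordinate are parsed.

```python
# Each record is assembled from its fixed-width fields:
# record name (6), serial (5), blank, atom name (4), altLoc (1), residue (3),
# blank, chain (1), residue number (4), insertion code + 3 blanks, x, y, z (8 each).
SAMPLE = "\n".join([
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" "    " "  11.000" "  12.500" "  -3.250",
    "ATOM  " "    2" " " " CA " " " "MET" " " "A" "   1" "    " "  12.100" "  13.400" "  -2.750",
    "HETATM" "    3" " " " O  " " " "HOH" " " "A" " 101" "    " "  20.000" "  21.000" "  22.000",
])


def parse_atoms(text):
    """Return a list of dicts, one per ATOM record, using fixed column slices."""
    atoms = []
    for line in text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK and all other record types
        atoms.append({
            "serial": int(line[6:11]),
            "name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21],
            "res_seq": int(line[22:26]),
            "x": float(line[30:38]),
            "y": float(line[38:46]),
            "z": float(line[46:54]),
        })
    return atoms


atoms = parse_atoms(SAMPLE)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch without a separating space.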
Comparative modeling
Homology or comparative protein structure modeling constructs a three-dimensional model of a given protein sequence based on its similarity to one or more known structures [42]. It is carried out in four sequential steps: finding known structures (templates) related to the sequence to be modeled (target), aligning the target sequence with the templates, building the model, and assessing the model [43]. Therefore, comparative modeling is only applicable when the target sequence is detectably related to a known protein structure.
3D structure generation by using MODELLER
MODELLER is a computer program for comparative modeling of protein three-dimensional structures. The user provides an alignment of the sequence to be modeled with known related structures, and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER [44] implements comparative protein structure modeling by satisfaction of spatial restraints. Homology modeling requires one or more sequences of known 3D structure with more than 35% similarity to the target.
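A threshold of this kind can be checked directly on a pairwise alignment. The sketch below computes percent identity, used here as a simpler stand-in for similarity (which would additionally need a substitution matrix). Several definitions of identity exist; this one, an assumption for illustration, counts identical residues over the columns where neither sequence has a gap. The function name and the aligned strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical target/template alignment with one gap and one mismatch.
identity = percent_identity("ACDE-FG", "ACDQAFG")
usable_template = identity > 35.0
```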
Template identification and sequence alignment
Template identification is an important step. It lays the foundation by identifying appropriate homologues of known structure, called templates, which are sufficiently similar to the target sequence to be modeled. Template sequences were selected by a simple search in which the target sequence was submitted to BLASTP, with default parameters, against the Brookhaven Protein Data Bank (PDB). The high-resolution structure whose sequence showed high identity, the lowest e-value and few gaps was selected as the template. To ensure high accuracy of the model, the target and template sequences are then aligned.
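The selection criteria listed above can be expressed as a sort key. The sketch below ranks candidate hits by highest identity, then lowest e-value, then fewest gaps, then best (numerically lowest) resolution. The order of precedence among the criteria is an assumption, and the hit records, identifiers and values are invented for illustration rather than taken from any real search.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    template_id: str     # hypothetical identifier, not a real PDB code
    identity: float      # percent sequence identity to the target
    evalue: float        # search e-value (lower is more significant)
    gaps: int            # number of gap positions in the alignment
    resolution: float    # structure resolution in angstroms (lower is better)


def pick_template(hits):
    """Return the hit that is best under the assumed order of criteria."""
    return min(hits, key=lambda h: (-h.identity, h.evalue, h.gaps, h.resolution))


HITS = [
    Hit("tmplA", 73.0, 1e-120, 2, 1.8),
    Hit("tmplB", 73.0, 1e-120, 2, 2.5),
    Hit("tmplC", 45.0, 1e-40, 10, 1.5),
]
best = pick_template(HITS)
```

Here tmplA and tmplB tie on the first three criteria, so resolution decides between them; tmplC has the best resolution but loses on identity.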
Model building and refinement
Although the theory behind building a protein homology model is complicated, using available programs is relatively easy. Several modeling programs [45] are available, using different methods to construct the 3D structures. In segment matching methods, the target is divided into short segments, and alignment is done over segments rather than over the entire protein. Satisfying spatial restraints is the most common method. It uses either distance geometry or optimization techniques to satisfy the spatial restraints. The method is implemented in the popular program MODELLER [46], which combines the spatial restraints with CHARMM [47] energy terms that ensure valid stereochemistry [48].
Validation
The best validation combines common sense, biological knowledge and results from analytical tools. Most refinement involves adjusting the alignment. PROCHECK [49,50] is used to calculate the main-chain torsion angles, i.e. the Ramachandran plot [51], for the predicted structures. Three models were predicted using different templates; among these, the one whose template showed a good resolution and R-factor was retained and evaluated with PROCHECK, which performed a full geometric analysis at a resolution of 1.5 Å. The validation of the structure models obtained from the three software tools was performed using PROCHECK [52].
Comparative modeling of rat cathepsin L
In a study by Sunil Kumar, Priya Ranjan and Supakar, the amino acid sequence of rat cathepsin L was retrieved from the sequence database of NCBI. It was ascertained that the three-dimensional structure of the protein was not available in the Protein Data Bank; hence a BLAST search was performed against the Brookhaven Protein Data Bank (PDB) with the default parameters to find suitable templates for homology modeling. Sequences were aligned, and the one that showed the maximum identity (73%) with a high score and low e-value was used as a reference structure to build a 3D model for rat cathepsin L. The rat cathepsin L structure was modeled by a comparative modeling procedure using 1CS8 as the template. The rat cathepsin L sequence was submitted to the Genesilico protein fold-recognition metaserver [53].
The fold-recognition servers Fugue and 3D PSSM reported 1CS8 as the best template with a highly significant score. The academic version of MODELLER 9v2 [54] was used for model building. Backbone coordinates of the core regions of the protein were transferred directly from the corresponding coordinates of 1CS8. Side-chain conformations for backbone residues were generated automatically by homology. Out of 20 models generated by MODELLER, the one with the best G-score of PROCHECK and with the best VERIFY3D [55,56] profile was subjected to energy minimization.
Comparative modeling of Viral Protein R (VpR)
In a study conducted by Seenivasagan et al., the protein sequence of VpR [57], which has 96 amino acids, was retrieved from the NCBI database (P0C1P5). The target sequence was searched for similar sequences using BLAST against the Protein Data Bank (PDB). The BLAST results yielded an NMR structure of the HIV-1 regulatory protein R (VpR) with 85% similarity to the target protein. The theoretical structure of VpR was generated using Modeller-9v1 for comparative modeling of protein structure [58]. It implements comparative structural modeling by satisfaction of spatial restraints [59].
Quaternary structure
In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking methods can be used to predict the structure of the complex.
Methods for the prediction of protein interactions
Ramon Aragues and his co-workers used four different methods for the prediction of protein-protein interactions [60]: (i) Gene fusion, in which two proteins are predicted to interact if their corresponding genes appear fused in another genome [61];
(ii) Phylogenetic profiles, in which similarity of phylogenetic profiles is interpreted as indicating that two proteins need to be simultaneously present to perform a given function together [62];
(iii) Distant conservation of sequence patterns and structure relationships, in which structural similarities among domains of known interacting proteins and conservation of pairs of sequence patches involved in protein–protein interfaces are used to predict putative protein interaction pairs [63]; and
(iv) Structural interologs, in which interactions are transferred between proteins with the same structural domains [64].
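Of the four approaches listed above, the phylogenetic profile method is the easiest to sketch in a few lines. Each protein is represented by a binary vector recording its presence (1) or absence (0) across a set of genomes, and pairs with similar vectors are proposed as functionally linked. The Jaccard index, the 0.7 cut-off, the function names and the profiles themselves are all assumptions for illustration; the cited work may use a different similarity measure.

```python
from itertools import combinations


def profile_similarity(a, b):
    """Jaccard index of two presence/absence profiles of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def similar_pairs(profiles, threshold):
    """Protein pairs whose profile similarity reaches the threshold."""
    pairs = []
    for p, q in combinations(sorted(profiles), 2):
        if profile_similarity(profiles[p], profiles[q]) >= threshold:
            pairs.append((p, q))
    return pairs


# Hypothetical profiles over six genomes.
PROFILES = {
    "protA": (1, 0, 1, 1, 0, 1),
    "protB": (1, 0, 1, 1, 0, 1),
    "protC": (0, 1, 0, 0, 1, 0),
    "protD": (1, 0, 1, 0, 0, 1),
}
candidates = similar_pairs(PROFILES, threshold=0.7)
```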
Drug development based on protein structure
The object of drug design is to find or develop a drug molecule, usually a small one, that binds tightly to the target protein, modulating its function or competing with natural substrates of the protein. Such a drug is best found on the basis of knowledge of the protein structure. If the spatial shape of the protein site to which the drug is supposed to bind is known, docking methods can be applied to select suitable lead compounds that have the potential to be refined into drugs.
Docking
Docking is a method that predicts the preferred orientation of one molecule relative to a second when they are bound to each other to form a stable complex. Knowledge of the preferred orientation may in turn be used to predict the strength of association, or binding affinity, between the two molecules [65]. Docking is frequently used to predict the binding orientation of small-molecule drug candidates to their protein targets in order to predict the affinity and activity of the small molecule [66].
The last few years have seen the development and implementation of a range of molecular docking algorithms [67] based on different search methods [68]. This approach has had several recent successes in drug discovery [69].
A number of powerful software programs, e.g. AutoDock [70,71], HEX [72,73], GOLD [74,75], FlexX, DOCK, Glide, Surflex, LigandFit, have been developed over the past several decades to carry out docking calculations, and good success in both binding mode and binding affinity prediction has often been achieved in selected test cases [76].
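Docking programs often report a predicted binding free energy, which can be related to a dissociation constant through the standard thermodynamic relation ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The sketch below applies that relation; it is a general conversion rather than the scoring function of any program named above, and the function names, the 298.15 K default and the example energy are assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g, temperature=298.15):
    """Dissociation constant (M) from a binding free energy in kcal/mol."""
    return math.exp(delta_g / (R_KCAL * temperature))


def delta_g_from_kd(kd, temperature=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in M."""
    return R_KCAL * temperature * math.log(kd)


# Hypothetical predicted binding energy of -10 kcal/mol.
kd = kd_from_delta_g(-10.0)
```

Because the relation is exponential, an error of about 1.4 kcal/mol in the predicted energy at room temperature corresponds to roughly a tenfold error in Kd, which is one reason affinity prediction is harder than binding-mode prediction.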
Computational methods for protein structure prediction are still under development, and methods like homology-based prediction become especially helpful where they can be used in concert with experimental techniques for determining protein structure and function. The use of computers and computational methods permeates all aspects of drug discovery today and forms the core of structure-based drug design. The availability of protein 3D structures, high-performance computing, data management software and the internet is facilitating access to the huge amount of data generated and transforming massive, complex biological data into workable knowledge in the modern drug discovery process. Computational tools offer the advantage of delivering new drug candidates more quickly and at a lower cost.