ISSN: 0974-276X
Research Article - (2010) Volume 3, Issue 8
Streptococcus mutans is the principal etiological agent of dental caries worldwide and is considered to be the most cariogenic of all the oral streptococci. We sought to expand our understanding of this organism at the molecular level through identification and function prediction of novel protein domains. Two automated approaches, based on HMM and BLAST, were used to develop a complete set of candidate domains, which were then analysed manually. A final set of 19 novel domains and families was identified. This enhances our understanding of both S. mutans and general bacterial molecular mechanisms, ranging from protein synthesis to catalytic action. Furthermore, we demonstrate that this type of in silico method can fairly rapidly generate new biological information from previously uncorrelated data and so increase the rate of discoveries in the laboratory.
Keywords: Oral cariogen, Family, Domain hunting, All-against-all BLAST, HMM, Prosite.
Streptococcus mutans is a Gram-positive, nonmotile, facultatively anaerobic oral pathogen and is considered to be the most cariogenic of all the oral streptococci (Loesche, 1986). S. mutans adheres to the tooth surface and produces a sticky polysaccharide called dextran that enables further colonisation by other microorganisms, forming dental plaque, which acts as a biofilm (Nyvad and Kilian, 1990). These microorganisms have to withstand changes in temperature, nutrition, osmotic pressure and pH (Carlsson et al., 1997), as well as exposure to the mucosal immune system, natural virulence factors, antibiotics, competitors and pathogenicity (Kado, 2009; Ochman et al., 2000). Lateral gene transfer (LGT), also called horizontal gene transfer (HGT) (Lawrence, 1997; Ochman et al., 2005), is a major route by which organisms acquire novel genes, and it has played an important role in how S. mutans has adapted to persist in the oral environment through resource acquisition, defence against host factors, and use of gene products that maintain its niche against microbial competitors. LGT thus creates unusually high similarities among organisms, particularly those that are closely related or share the same habitat. Studying it is helpful for understanding gene evolution and species diversification, and also for developing drugs that inhibit the transfer of resistance (Ochman et al., 2000). The S. mutans genome comprises 1963 open reading frames (ORFs), of which 65% were initially assigned putative functions through bioinformatics methods; more ORFs, or their orthologs, have since been identified by microarray techniques, phenotype studies and other approaches (Niu et al., 2008; Deng et al., 2009; Ajdic and Pham, 2007). Very few related proteins are present in current databases, and these homologs are all hypothetical proteins with no assigned function.
The plethora of proteins reflects both a diversity of novel protein families and an expansion within identified families when compared with other organisms, and this serves as a good platform in the search for novel protein domains. These methodologies clearly showed how completely sequenced genomes can be exploited to further understand the biology of an organism by predicting relationships between molecular structure, function, and evolution.
Protein domains
Once a protein has been discovered, many questions arise about its overall identity, its putative function and the identification of its biologically significant sites. To answer these questions, a number of databases and tools have been developed. Structural and functional features of proteins are determined using different methods that exploit characteristic sequence patterns, amino acid frequencies and other properties (Bateman and Birney, 2000). A newly discovered protein sequence may lack considerable identity, over its entire length, with sequences of known function, which makes functional prediction difficult. An important aspect of protein sequence characterisation is therefore the identification of domains and families based on primary sequence data (Gribskov et al., 1996). Functional domain and family databases help to answer the questions "What types of functional domains are present within this sequence?" and "What family does this protein belong to?" Even though some family and domain databases were developed for the annotation of genomic sequences, these tools are a good option for characterising proteins of unknown function (Bateman and Birney, 2000). A protein domain is a discrete portion of a protein sequence and structure that can evolve, function, and fold independently of the rest of the protein, and that possesses its own function. The independent evolutionary histories of domains found within the same protein led to the hypothesis that the domain is the basic unit of protein structure and function (Doolittle, 1995). Direct functional and structural determination of all the proteins in an organism is prohibitively costly and time consuming, and 3D structural information is relatively scarce; primary sequence analysis is therefore preferred for identifying the majority of protein domain families (Sonnhammer and Kahn, 1994).
The number of protein domain families recognised from sequence has been rising steadily over the years, which has led to the development of online domain and family databases such as SMART (Schultz et al., 1998) and Pfam (Schultz et al., 1998; Bateman and Birney, 2000). An organism's metabolic potential, as well as its other molecular systems, can be explored using sophisticated genomic tools aimed at understanding the key function of a particular protein in a fundamental biological process from its primary sequence alone. A more refined means of analysing proteins is the detection of their domain structures. A domain is a distinct, stable stretch of amino acid sequence, typically between 40 and 400 residues long. A number of evolutionarily related proteins may contain the same domain as a structural or functional unit (Bateman and Birney, 2000; Galperin and Koonin, 1998). Here we present some of the available tools and techniques for detecting possible functional domains and novel families among the protein sequences of Streptococcus mutans.
The domain hunting approach
One way to predict novel domains correctly is through inspection of high-resolution protein 3D structures; however, structural databases contain only a limited number of sequences with representative structures (Amer et al., 2008). Novel domains are usually detected by employing automated methods to rapidly generate an optimised set of targets, which are subsequently analysed manually. At one extreme, a researcher starts from a single protein sequence and hunts for partial matches against other sequences. These short matches can serve as templates for building new families. At the other extreme, fully automated methods are available that work on large protein sets to detect novel families (Yeats et al., 2003). We explored both approaches to investigate the S. mutans genome, combining rapid automatic detection of potential novel domains with careful manual analysis, in order to help elucidate putative biological mechanisms and to gain a thorough understanding of described systems within the oral-dwelling prokaryote Streptococcus mutans. First, a set of novel domains was detected using the recently completed genome sequence, and explanatory information was obtained through literature searching and other analytical tools. These predictions were then viewed within the framework of Streptococcus mutans biology. The outcomes provide functions for many proteins, leading to a number of testable hypotheses.
Protein sequences retrieval
The UniProt database (Copleya et al., 2002) was used to retrieve all sequences used in the domain hunting approach. SMART (Letunic et al., 2004), Prosite (Falquet et al., 2002), Pfam (Finn et al., 2006) and InterPro (Apweiler et al., 2001) were used to identify novel domains on the basis of sequence similarity.
Approach I
A set of 3000 potential and known protein sequences from Streptococcus mutans was used as the preliminary data. An initial alignment generated by CLUSTAL W (Thompson et al., 1994) was used to create profile-HMMs with the HMMER3 tool (Eddy, 2001). The resulting profile-HMMs were searched against the UniProt protein resource. A threshold of 0.01 was selected to detect homologs, and an alignment of these homologs was built with the hmmalign tool from the HMMER package. This alignment was then queried against the Pfam and Prosite databases to identify any similarities with known domains and families. The last step was a manual examination of each domain to extend its known relationships, to develop a better multiple sequence alignment and, where possible, to predict the function of the domain. This analysis used a wide variety of tools and methodologies.
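The homolog selection step can be illustrated with a minimal Python sketch. It assumes that the threshold of 0.01 is an E-value cutoff and that hits at exactly the cutoff are kept; neither detail is stated in the text. The `Hit` record, the function name `select_homologs` and the example hits are hypothetical and are not HMMER output.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One profile-HMM search hit (hypothetical record, not a HMMER format)."""
    target: str
    evalue: float


def select_homologs(hits: Iterable[Hit], max_evalue: float = 0.01) -> List[Hit]:
    """Keep hits whose E-value is at or below the cutoff, best hit first.

    Assumption: the 0.01 threshold in the text is an E-value and is inclusive.
    """
    kept = [h for h in hits if h.evalue <= max_evalue]
    return sorted(kept, key=lambda h: h.evalue)


# Made-up hits for illustration only.
example_hits = [
    Hit("protA", 1e-30),
    Hit("protB", 0.5),
    Hit("protC", 0.009),
    Hit("protD", 0.01),
]
selected = select_homologs(example_hits)
selected_names = [h.target for h in selected]
```

In a real run the retained sequences would then be passed to an alignment tool; here the list `selected_names` simply shows which hypothetical hits survive the cutoff.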
Approach II
A complementary approach was also tried for the detection of novel domains that may be of significance to the biology of S. mutans. An all-against-all BLAST search (Altschul et al., 1990) was performed and the proteins were grouped by single-linkage clustering. A cutoff threshold of 50 bits was used for clustering, which helped to avoid the clustering of unrelated proteins. Single proteins and all clusters that corresponded to entries in the Pfam database were then removed from the primary dataset. T-Coffee (Notredame et al., 2000) was used to align the clustered sequences. The aligned clusters were subsequently used as templates for an iterative search with HMMER 3, as in Approach I. The searches were iterated until convergence. Afterward the sequences were realigned with T-Coffee and a single further round of iteration was done. The iterative search process was then repeated until new family members were identified.
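The clustering step can be sketched as follows. This is a minimal illustration of single-linkage clustering with a union-find structure, assuming each BLAST hit is available as a (query, subject, bit score) triple and that a score of 50 bits or more links two proteins. The function name, the inclusive cutoff and the example data are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(
    protein_ids: Iterable[str],
    hits: Iterable[Tuple[str, str, float]],
    min_bits: float = 50.0,
) -> List[List[str]]:
    """Group proteins so that any pair linked by a hit >= min_bits shares a cluster."""
    parent: Dict[str, str] = {p: p for p in protein_ids}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, bits in hits:
        # Register proteins that appear only in the hit list.
        parent.setdefault(query, query)
        parent.setdefault(subject, subject)
        if query != subject and bits >= min_bits:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    groups: Dict[str, List[str]] = defaultdict(list)
    for protein in parent:
        groups[find(protein)].append(protein)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda c: (-len(c), c))


# Made-up all-against-all hits for illustration only.
example_ids = ["A", "B", "C", "D", "E"]
example_hits = [
    ("A", "A", 300.0),  # self hit, ignored
    ("A", "B", 120.0),
    ("B", "C", 55.0),
    ("C", "D", 30.0),   # below the cutoff, no link
    ("D", "E", 49.9),   # below the cutoff, no link
]
clusters = single_linkage_clusters(example_ids, example_hits)
# Singletons are dropped, mirroring the removal of single proteins described above.
multi_member = [c for c in clusters if len(c) > 1]
```

Single linkage means that A and C end up in the same cluster even without a direct hit between them, because both are linked to B. This is why a reasonably strict bit-score cutoff matters for keeping unrelated proteins apart.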
Predictions of function
On the basis of information in the literature and/or co-occurrence with previously well-known domains, some functional characteristics can be predicted for the newly discovered domains and families. The predicted functions, ranging from protein synthesis to drug resistance, represent a variety of cellular and molecular functions.
From an initial set of 150 potential domain targets, 25 targets were removed at the single-linkage clustering step because they lay within existing Pfam families or Prosite domains, most of them relating to the same set of overlapping families. A final set of 19 targets was identified as domains novel to S. mutans. Table 1 lists and briefly describes all novel domains identified by the domain hunting approaches.
Pfam/Prosite Acc. No. | Family/Domain Name | Type | Function |
---|---|---|---|
PF00472 | RF-1 | Domain | peptidyl-tRNA hydrolase activity |
PF00702 | Hydrolase | Family | catalytic activity |
PF01368 | DHH | Family | phosphoesterase function |
PF03462 | PCRF | Domain | protein synthesis |
PF04327 | DUF464 | Family | unknown function |
PF00480 | ROK | Family | unknown function |
PF00005 | ABC transporter | Family | translocation of compounds across membranes |
PF00013 | KH | Domain | RNA binding |
PF00293 | NUDIX | Domain | removing an oxidatively damaged form of guanine |
PF00308 | Bacterial dnaA | Family | initiating and regulating chromosomal replication |
PF00344 | Eubacterial secY | Family | protein transport |
PF00391 | PEP-utilising enzyme | Motif | transferase activity |
PF00467 | KOW | Motif | rRNA tertiary structure |
PF00595 | PDZ | Domain | targeting signalling molecules to sub-membranous sites |
PF00627 | UBA/TS-N | Domain | ubiquitination pathway |
PS51353 | ArsC | Family | converts arsenate to arsenite |
PS01125 | ROK | Family | transcriptional repressors |
PF00254 | FKBP-type peptidyl-prolyl cis-trans isomerase | Domain | peptidyl-prolyl cis-trans isomerase activity |
PS50847 | LPxTG | Motif | catalytic action |
Table 1: List of all domains identified by Approach I and II, as well as their probable function.
Description of some significant domains
LPxTG motif (PS50847): The LPXTG motif in a protein serves as a platform for the catalytic action of the proteolytic enzyme sortase, resulting in a transpeptidation reaction. The enzyme cleaves the targeted bond between threonine and glycine, exposing the carboxyl terminus of the threonine residue, which in turn binds to the amino terminus of the pentaglycine bridge in the peptidoglycan, causing crosslinking through covalent interactions. The hydrophobic LPXTG motif is present as a conserved sequence in a 35-residue sorting signal, along with a tail of positively charged residues (Mazmanian et al., 1999). Figure 1 shows the structure of the sorting signal, comprising a hydrophobic LPXTG motif and its positively charged tail.
This motif is generally found in the surface proteins of Gram-positive cocci, which possess an N-terminal signal peptide and a C-terminal sorting signal. The sorting signal is the specific substrate for sortase, which cleaves the LPXTG motif and attaches the protein to the peptidoglycan through the transpeptidation reaction (Marraffini et al., 2006; Navarre and Schneewind, 1999). This activity of the sortase enzyme, together with the encoding and accessibility of the motif in pathogens, is crucial for the establishment of an infectious disease. For example, S. aureus anchors to the host cell by means of the transpeptidation reaction carried out by sortase. According to one study, mutations in the sortase A and sortase B genes of S. aureus resulted in abortive infections, owing to failure of cell wall anchorage and surface display of proteins bearing sorting signals with LPXTG or NPQTN motifs (Mazmanian et al., 2001).
Structural analysis of LPXTG family proteins revealed their modular architecture (Figure 2) and suggests that they may have evolved through the acquisition of distinct domain-sized polypeptides by duplication and homologous recombination. Such domains have also been described in various other species, for example the B repeats of sasA and sasG, the SD and SX repeats of Sdr proteins, the conserved 212-residue domain of sasG and the signal peptide containing the (Y/F)SIRK motif, which is taken as evidence of horizontal transfer (Fiona et al., 2003). A characteristic feature of proteins from the LPXTG family is the presence of an N-terminal secretory signal sequence. These signal sequences were found using the SIGNALP prediction algorithm. When aligned with the signal sequence of S. aureus, 15 sequences were identified that contain the (Y/F)SIRK motif as a conserved sequence. Local alignment tools confirmed that this motif is common in sortase substrates of Gram-positive cocci (Tettelin et al., 2001).
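A simple pattern search illustrates how candidate sorting signals of this kind can be flagged in a protein sequence. The sketch below is an assumption-laden toy, not the Prosite PS50847 profile: it treats the motif as the literal pattern L-P-any residue-T-G, requires it to start within the last 35 residues, and asks for at least one lysine or arginine after it. The function names and the example sequences are hypothetical.

```python
import re
from typing import List, Tuple

# L-P-x-T-G where x is any single residue letter (simplified pattern,
# not the profile used by Prosite entry PS50847).
LPXTG = re.compile(r"LP[A-Z]TG")


def find_lpxtg(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for every LPXTG occurrence."""
    return [(m.start(), m.group()) for m in LPXTG.finditer(sequence.upper())]


def candidate_sorting_signals(sequence: str, window: int = 35) -> List[Tuple[int, str]]:
    """Keep motifs that start in the last `window` residues and are followed
    by at least one positively charged residue (K or R).

    Assumption: the 35-residue window is measured from the C-terminus.
    """
    seq = sequence.upper()
    candidates = []
    for start, motif in find_lpxtg(seq):
        tail = seq[start + len(motif):]
        in_window = start >= len(seq) - window
        has_positive_tail = any(res in "KR" for res in tail)
        if in_window and has_positive_tail:
            candidates.append((start, motif))
    return candidates


# Toy sequences, invented for illustration.
toy_surface_protein = "MKKT" + "A" * 30 + "LPETG" + "EAAVLLAGLLFA" + "KRRK"
toy_internal_motif = "MLPSTGAAAA" + "A" * 50

surface_hits = candidate_sorting_signals(toy_surface_protein)
internal_hits = candidate_sorting_signals(toy_internal_motif)
```

The second toy sequence contains the pattern near its N-terminus, so `find_lpxtg` reports it but `candidate_sorting_signals` does not. A short five-residue pattern matches by chance quite often, which is why positional context is checked as well.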
ArsC family (PS51353): Detoxification of arsenate, arsenite and antimonite is a chromosomally encoded resistance mechanism in many bacterial species (Carlin et al., 1995). Resistance is achieved through an efflux mechanism. Reduced glutathione (GSH) acts as a cofactor for ArsC, an arsenate reductase of about 150 residues, in the conversion of arsenate to arsenite. A redox-active cysteine is conserved in the active site of ArsC (Figure 3) (Liu and Rosen, 1997). Arsenate reductase and low molecular weight bovine protein tyrosine phosphatase show significant structural similarity in spite of their low sequence identity. The similarity is particularly high in their active sites. In vitro analysis confirmed that this structural homology is functionally relevant, since arsenate reductase displays phosphatase activity (Figure 4).
Figure 3: Sequence logo of family proteins containing the ArsC domain. Sequences from O. tritici were aligned with ArsC homologues from E. coli pR773 (AAA21096) and Staphylococcus aureus pI258 (AAA25638).
ArsC family proteins, such as the Spx proteins, are also expressed in Gram-positive bacteria, where they act as transcription factors that regulate the transcription of multiple genes under disulfide stress (Zuber, 2004). The structure of the ArsC protein belongs to the thioredoxin superfamily fold, characterised by α-helices wrapped around a β-sheet core. The loop between the first β-strand and the first helix encloses the active-site cysteine residue. This structure is conserved in Spx proteins and other homologs (Martin et al., 2001), which suggests horizontal transfer of this conserved domain.
ROK family (PS01125): This is a family of bacterial proteins that groups transcriptional repressors, uncharacterised ORFs and sugar kinases, and for this reason it is known as ROK (Repressor, ORF, Kinase). At present it consists of the xylose operon repressor (gene xylR) from Bacillus subtilis, Lactobacillus pentosus and Staphylococcus xylosus, the N-acetylglucosamine repressor (gene nagC) from Escherichia coli and glucokinase (gene glk) from Streptomyces coelicolor.
The repressor proteins of this family (xylR and nagC) possess an N-terminal region that contains a helix-turn-helix DNA-binding motif. The domain common to all these proteins consists of about 300 residues (Titgemeyer et al., 1994). The sequence logo demonstrates conservation of glycine residues at many positions (Figure 5). The presence of the ROK domain in a wide variety of organisms, from bacteria to humans, indicates its conservation (Figure 6).
Figure 5: Sequence logo from multiple sequence alignment of ROK family proteins.
DUF464 Family (PF04327): This family is an interesting case, and has previously been described as a family of uncharacterised archaeal proteins with 38 sequences in Pfam (Shin et al., 2005). A protein BLAST search (Altschul et al., 1997) was performed on the NCBI site with the default parameters to determine whether the selected DUF464 family member has sequence similarity to other proteins. The BLAST results showed 27% identity with a ribosomal protein of Leptotrichia hofstadii F0254. In Gene Ontology, its molecular function is given as ribonucleoprotein and its cellular component as ribosome, both inferred from electronic annotation. This association with the ribosome suggests that the protein may be a potential antibiotic target against cariogenic pathogens. The GOR server, hosted on the ExPASy server, was used to predict the secondary structure of the DUF464 family member. The prediction showed an unusual number of extended and random coil regions (Figure 7).
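To make the identity figure concrete, the following sketch computes percent identity for a pairwise alignment. It assumes identity is counted as identical columns divided by the total alignment length, gap columns included, which is how BLAST reports its "Identities" ratio. The function name and the two aligned strings are invented for illustration and are unrelated to the DUF464 alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, including
    columns that contain a gap in either sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper())
        if res_a == res_b and res_a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Invented pairwise alignment: 11 columns, 7 identical, 2 with a gap.
row_a = "MKT-AYIAKQR"
row_b = "MRTLAY-AKHR"
identity = percent_identity(row_a, row_b)
```

Other definitions divide by the shorter sequence length or by the number of ungapped columns, so identity values from different tools are not always directly comparable.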
Figure 7: Secondary structure prediction of DUF464 Family member protein using GOR server.
FKBP-type peptidyl-prolyl cis-trans isomerase (PF00254): FKBP (Tropschug et al., 1990) is a domain that typically occurs in the major high-affinity binding protein for the immunosuppressive drug FK506, found mostly in vertebrates. In humans, FKBP12 is notable for binding the immunosuppressant tacrolimus (originally designated FK506), which is used to treat patients after organ transplantation and patients suffering from autoimmune disorders (Wang et al., 1994). Both the FKBP-tacrolimus complex and the ciclosporin-cyclophilin complex inhibit a phosphatase called calcineurin, thus blocking the T-lymphocyte signal transduction pathway. Slow protein-folding reactions are accelerated by a prolyl cis/trans isomerase isolated from porcine kidney, which is identical to cyclophilin, a protein that is probably the cellular receptor for the immunosuppressant cyclosporin A. FKBP likewise exhibits peptidyl-prolyl cis-trans isomerase activity (PPIase or rotamase) (Stein, 1991). An FKBP was also found to act, along with the cyclophilins, as a modulator of an intracellular calcium release channel. The presence of such an immunosuppressant-binding domain in S. mutans suggests that it may help the pathogen cope with the therapeutic agents used against it. The FKBP domain is found in a wide variety of organisms, from prokaryotes to eukaryotes, showing its significant conservation (Figure 8).
Comparative genomics is still a growing field, and it is hoped that these methods will give a better categorisation of the domains and families of protein sequences and so advance our understanding of the biology of S. mutans. Manual investigation of every single protein is an extremely time-consuming activity, so annotating all protein families in this way would not be feasible. We have presented a combined domain hunting approach that concentrates on the domain families likely to be of most interest. Our approach discovered common domains, e.g. the ROK domain, which is observed in a wide variety of species. The majority of the domains identified appear to have essential biological activities; on average, they are present in a smaller number of proteins than previously described domains. The FKBP-type peptidyl-prolyl cis-trans isomerase domain, which is also present in various other pathogens and is associated with immunosuppression, was also found in S. mutans. A hypothesis can therefore be made that this domain allows S. mutans to develop as a biofilm on tooth surfaces by suppressing the mucosal immune barrier. The domains found in this study suggest that highly conserved domains are not acquired through lateral gene transfer but are of ancient origin. Such investigations are helpful for phylogenetic analyses that may demonstrate a single origin of functional domains, and they may also contribute significantly to the study of evolution. These discoveries provide a basis for future drug development and for new approaches to the prevention and treatment of dental caries.