ISSN: 0974-276X
Research Article - (2009) Volume 2, Issue 2
As we usher in a new age of data-driven research, we face the enormous challenge of deriving information from the heaps of data available. The amount of data being generated is overwhelming, and this calls for exploring novel and effective methods for clustering and classification of such data. The CAM kinase family is known to contain many enzymes involved in important physiological processes. In the present study, 13 important physicochemical parameters were calculated in silico for 56 sequences of the CAM kinase family. Self Organizing Maps (SOM) were employed for classifying and clustering similar sequences and for visualizing the high-dimensional data space, as they are known for their capability to preserve the topological relationships between features. SOM yielded 4 clusters which were distinct from each other and marked by characteristic features.
Keywords: Kohonen map, Self Organizing Maps (SOM), CAM Kinase, Bioinformatics, In silico, Clustering.
The urge not only to describe and explore the complex phenomena of life but also to seek answers to what lies beyond the current understanding of life processes at the molecular level continues to be a major inspiration in modern biology. The human mind is an advanced neural cognitive system, a fact exemplified and reinforced by its learning and decision-making ability; hence, the human endeavor to automate learning and decision making by devising and employing machine learning techniques should not come as a surprise. In the present data-driven, information-starved world, the gold rush to produce enormous volumes of data would be of no avail if not backed by powerful and comprehensive machine learning methods for analysis. The exponential rise in challenges posed to biologists has given new impetus to the development of new and efficient algorithms and methods for analysis of such data, and to the exploration of existing ones in biological contexts. The proliferation of low-cost technology, the astonishing growth in computing power and the interdisciplinary nature of this field have led to a revolution of sorts in recent times. The literature abounds with examples of the application of machine learning to biological systems (Tarca et al., 2007). Though both supervised and unsupervised learning methods are being employed in bioinformatics analyses, unsupervised learning methods are attracting more interest, as they eliminate the need for labeling and for predefined knowledge of classes, and are valuable for gaining an understanding of the basic nature of the data.
The Self Organizing Map (also known as the Kohonen Map) is an unsupervised learning algorithm (Kohonen, 2001) used for clustering and for reducing the dimensions of complex data without losing the 'essence' of the data; it organizes data by similarity, placing similar entities geometrically close to each other. SOMs have been applied in diverse fields such as assessment of water quality (Walley et al., 2000), classification of communities (Chon et al., 1996; Arab et al., 2004; Tison et al., 2005), gene expression studies (Tamayo et al., 1999), disease diagnosis (Chen et al., 2000; Hoshi et al., 2006), medical imaging (Chuang et al., 2007), biochemical profiling (Kaartinen et al., 1998) and epidemiology (Murty and Arora, 2007). Self organizing maps have earlier been used in classification of protein families (Andrade et al., 1997), secondary structure determination (Unneberg et al., 2001) and pattern recognition in proteins (Hanke et al., 1996). Owing to its use for multidimensional data visualization, SOM has become a method of choice in bioinformatics studies (Hsu et al., 2003). Previously, data mining techniques have been employed for clustering and classification of Internal Transcribed Spacer sequences in mosquito species (Banerjee et al., 2008, 2009).
The interplay of the various inherent sequence and structural features of biological molecules is complex and intriguing. Slight variations in physicochemical properties are common even among members of the same protein family. Data mining techniques like SOM can be employed to aid the knowledge discovery process in such instances.
The Ca2+/calmodulin-dependent kinases (CaMK) belong to a family of structurally related serine/threonine-specific protein kinases, which are activated in response to elevation of intracellular Ca2+, and include CaMKI, CaMKII, CaMKIV and the CaMK kinases (CaMKKs). They are known to regulate diverse biological events mediated by intracellular calcium, such as muscle contraction, neurotransmitter release and gene expression (Eto et al., 1999; Nairn et al., 1985; Edelman et al., 1987; Soderling, 1996; Braun et al., 1995). This study is an attempt to cluster CaMK sequences from different species on the basis of their physicochemical properties by applying Kohonen maps.
Sequence Collection and Pre-processing
CAM kinase protein sequences were retrieved from SWISS-PROT, a public domain protein database (Bairoch and Apweiler, 2000). During the sequence retrieval process, the keyword 'Calcium/calmodulin-dependent protein kinase' was used, which yielded 68 sequences. Sequences representing putative, partial, precursor and fragment entries of CAM kinase proteins were excluded from the study. Hence, 56 unique proteins were retained and considered for this study. The selected CAM kinase protein sequences were retrieved in FASTA format and used for further analysis.
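The exclusion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the simple keyword match on the FASTA header, and the two example records (with made-up accessions) are hypothetical and are not the procedure or data used in the study.

```python
# Hypothetical sketch: drop FASTA records whose header marks them as
# putative, partial, precursor or fragment entries.
EXCLUDE_TERMS = ("putative", "partial", "precursor", "fragment")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header):
    """Return True when the header contains none of the excluded terms."""
    lowered = header.lower()
    return not any(term in lowered for term in EXCLUDE_TERMS)


def filter_records(text):
    """Return the (header, sequence) pairs that pass the exclusion rule."""
    return [(h, s) for h, s in parse_fasta(text) if keep_record(h)]


# Made-up example records; the accessions and sequences are placeholders.
example = """>sp|X00001|Hypothetical CaM kinase
MKTAYIAK
QRQISFVK
>sp|X00002|Putative CaM kinase (Fragment)
MKT
"""
kept = filter_records(example)
```

A keyword match on the header is the simplest possible rule; in practice the annotation fields of the database entry would be a more reliable source for this decision.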
Reconstruction of Phylogeny
All 56 sequences were considered for reconstruction of phylogeny. PHYLIP (Felsenstein, 1982) was used for this purpose. CLUSTALW (Thompson et al., 1994) was employed for the initial multiple sequence alignment. The alignment output was used as input for the Seqboot and Protpars programs, and finally the Consense program was used to obtain the best tree by the maximum parsimony method (Felsenstein, 1983), which was visualized with TREEVIEW (Page, 1996) (Fig. 7 in Supplement).
Features Identified as Parameters for SOM
Physicochemical Characterization
Determining the physicochemical properties of proteins by traditional experimental methods is expensive, time consuming and cumbersome. ProtParam is a program for computing various physical and chemical properties of a protein sequence, which may be useful in guiding experiment design. Physicochemical properties such as length, molecular weight, isoelectric point, number of negatively charged amino acids, number of positively charged amino acids, extinction coefficient (assuming all cysteine residues appear as half cystines), extinction coefficient (assuming that no cysteine appears as half cystine), instability index, aliphatic index and GRAVY were calculated using ProtParam (http://expasy.org/tools/protscale.html) (Gasteiger et al., 2005) for these sequences (Table 1 in Supplement). The amino acid composition of a protein sequence can reveal its nature; hence, amino acid composition was also computed (data not shown).
Secondary Structure Prediction
SOPMA (Self Optimized Prediction Method from Alignment) (Geourjon and Deléage, 1995) was employed to predict secondary structure features, namely alpha helix, extended strand, beta turn and random coil, as percentages for all the sequences (Table 2 in Supplement). These features, together with the physicochemical properties above (but excluding amino acid composition), were used as input parameters for the self organizing maps.
Data Mining – Self Organizing Maps
In SOM, the neurons are organized in a lattice, typically a one- or two-dimensional array, which is placed in the input space and spans the input distribution. With a two-dimensional SOM network, it is feasible to obtain a map of the input space in which proximity between units or clusters on the map represents closeness of the input data. Each processing unit in the SOM lattice is associated with a weight vector of the same dimension as the input data. Using the weights of each processing unit as a set of coordinates, the lattice can be positioned in the input space. Throughout the learning stage, the weights of the units change their position and "move" towards the input points. This movement gradually slows down, and the network is almost "frozen" in the input space at the end of the learning stage. On completion of the learning stage, each input can be associated with its nearest network unit, and for visualization each input is assigned to the corresponding cell on the map. Cells that evidently contain analogous entities can be considered a cluster on the map. These clusters are generated during the learning phase without any prior information. The main applications of the SOM are the visualization of high-dimensional data in two dimensions and the construction of abstractions akin to those of other clustering techniques.
Steps Involved in the Algorithm
1. Initialization: Randomly initialize a weight vector $W_i$ for each neuron $i$:
$$W_i = [w_{i1}, w_{i2}, \ldots, w_{in}]$$
where $n$ denotes the dimension of the input data.
2. Sampling: Select an input vector $X = [x_1, x_2, \ldots, x_n]$.
3. Similarity matching: Find the winning neuron $j$ whose weight vector best matches the input vector:
$$j(t) = \arg\min_i \lVert X - W_i \rVert$$
4. Updating: Update the weight vector of the winning neuron so that it moves still closer to the input vector. Also update the weight vectors of neighbouring neurons; the further the neighbour, the smaller the change:
$$W_i(t+1) = W_i(t) + \alpha(t)\, h_{ij}(t)\, [X(t) - W_i(t)]$$
$$h_{ij}(t) = \exp\left(-\frac{\lVert r_j - r_i \rVert^2}{2\,\sigma(t)^2}\right)$$
where $\alpha(t)$ is the learning rate, which decreases with time $t$, with $0 < \alpha(t) \le 1$; $\lVert r_j - r_i \rVert^2$ is the squared distance between the winning neuron $j$ and neuron $i$; and $\sigma(t)$ is the neighbourhood radius, which decreases with time $t$.
5. Continuation: Repeat steps 2 to 4 until there is no change in the weight vectors or up to a certain number of iterations. For each input vector, find the best matching weight vector and allot the input vector to the corresponding neuron/cluster.
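The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the linear decay of the learning rate and of the neighbourhood radius, the default values (`alpha0`, `sigma0`, `n_iter`), the function names and the toy input are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(x, weights):
    """Step 3: index of the neuron whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((xk - wk) ** 2 for xk, wk in zip(x, w))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))


def train_som(data, rows=2, cols=2, n_iter=2000, alpha0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; returns one weight vector per neuron."""
    rng = random.Random(seed)
    n = len(data[0])
    # Lattice coordinates r_i of every neuron.
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Step 1: random initial weight vector for each neuron.
    weights = [[rng.random() for _ in range(n)] for _ in grid]
    for t in range(n_iter):
        decay = 1.0 - t / n_iter           # assumed linear decay schedule
        alpha = alpha0 * decay             # learning rate alpha(t)
        sigma = max(sigma0 * decay, 1e-3)  # neighbourhood radius sigma(t)
        x = rng.choice(data)               # Step 2: sample an input vector
        j = best_matching_unit(x, weights)
        # Step 4: move every neuron towards x, scaled by h_ij(t).
        for i, w in enumerate(weights):
            d2 = (grid[i][0] - grid[j][0]) ** 2 + (grid[i][1] - grid[j][1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            for k in range(n):
                w[k] += alpha * h * (x[k] - w[k])
    return weights


# Toy input (hypothetical): two groups of already-normalized 2-D points.
toy_data = [[0.0, 0.0]] * 5 + [[1.0, 1.0]] * 5
toy_weights = train_som(toy_data)
toy_labels = [best_matching_unit(x, toy_weights) for x in toy_data]
```

Here step 5 is realized with a fixed iteration count rather than a convergence check, and the final list comprehension allots each input to its best matching neuron.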
Data Normalization
Data was normalized linearly such that value in each category ranged between 0 and 1. This is done to get unbiased results while ensuring equal importance to all parameters while clustering.
Normalization Formula=(Original data value- Minimum Data value) / (Maximum data value - Minimum Data value)
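As a small sketch of this formula, the function below rescales each column (parameter) of a table to the range 0 to 1. The function name, the handling of a constant column (mapped to 0.0) and the middle example row are assumptions made for illustration; only the end values 335 and 926 (sequence length) and 4.83 and 9.11 (pI) come from the ranges reported in this article.

```python
def min_max_normalize(rows):
    """Linearly rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # (value - min) / (max - min); a constant column maps to 0.0
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Columns: sequence length, isoelectric point. The middle row is hypothetical.
example_rows = [[335, 4.83], [630.5, 6.97], [926, 9.11]]
scaled = min_max_normalize(example_rows)
```

Without this step, a parameter with large raw values such as molecular weight would dominate the Euclidean distance used to pick the winning neuron.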
The length of the considered sequences varied from 335 to 926 residues and the molecular weight ranged from 38163.7 to 105122.7. The sequences at the higher extreme of molecular weight were peripheral plasma membrane proteins from Homo sapiens, Mus musculus and Rattus norvegicus. All the sequences possess more negatively charged than positively charged residues except Q10KY3, Q96NX5, Q91VB2, Q7TNJ7, Q9P7I2, P11730, Q07250, Q13554, Q13555 and Q6DGS3, while Q923T9 and Q2HJF7 contain equal numbers of negatively and positively charged residues.
The pH at which a protein carries no net charge and exists as a zwitterion is termed the isoelectric point (pI). The pI values of all considered CAM kinase protein sequences were in the range of 4.83 to 9.11; 13 proteins (understandably those without an excess of negatively charged amino acids, with the exception of Q00168) are basic and the rest are acidic. The instability index, which gives a clue about the stability of a protein in vitro, can be calculated using the following formula:
$$II = \frac{10}{L} \sum_{i=1}^{L-1} DIWV(x_i x_{i+1})$$
where $L$ denotes the length of the sequence and $DIWV(x_i x_{i+1})$ is the instability weight value for the dipeptide starting at position $i$.
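The formula can be sketched as a short function that takes the dipeptide weight table as an argument. The published table of instability weight values is not reproduced here; the two weights in `toy_diwv` are placeholders chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
def instability_index(sequence, diwv):
    """II = (10 / L) * sum of DIWV over all consecutive dipeptides."""
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")
    # Dipeptides start at positions 1 .. L-1 (0-based: 0 .. L-2).
    total = sum(diwv[sequence[i:i + 2]] for i in range(length - 1))
    return (10.0 / length) * total


def is_unstable(ii, threshold=40.0):
    """A value above 40 is read as indicating an unstable protein."""
    return ii > threshold


# Placeholder weights, NOT the published DIWV values.
toy_diwv = {"AK": 1.0, "KA": 2.0}
# Dipeptides of "AKAK": AK, KA, AK -> 1 + 2 + 1 = 4; II = (10 / 4) * 4 = 10
toy_ii = instability_index("AKAK", toy_diwv)
```

With the real weight table supplied as `diwv`, the same function would apply to full-length sequences; a dipeptide missing from the table raises a `KeyError` rather than being silently skipped.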
The instability index is particularly useful for comparing the metabolic stabilities of proteins. A value > 40 indicates an unstable protein; accordingly, all the considered sequences were classified as unstable except Q14012 (37.09), Q9P7I2 (38), Q16566 (31.64) and O42844 (36.67). The aliphatic index (AI), defined as the relative volume of a protein occupied by aliphatic side chains, is regarded as a positive factor for the increase of thermal stability of globular proteins (Ikai, 1980). It can be calculated by the formula:
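$$\text{Aliphatic index} = X(\text{Ala}) + a \cdot X(\text{Val}) + b \cdot X(\text{Leu}) + b \cdot X(\text{Ile})$$
where $X(\text{Ala})$, $X(\text{Val})$, $X(\text{Ile})$ and $X(\text{Leu})$ are the amino acid compositional fractions expressed as mole percent (100 × mole fraction), and the coefficients $a = 2.9$ and $b = 3.9$ are the volumes of the valine and leucine/isoleucine side chains relative to that of alanine.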
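A minimal sketch of this calculation is given below. It assumes that the compositional fractions are expressed as mole percent and that the coefficients are a = 2.9 and b = 3.9, the values used by ProtParam; the function name and the test sequences are illustrative only.

```python
def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue):
        # 100 x mole fraction of the given one-letter residue code
        return 100.0 * seq.count(residue) / len(seq)

    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))


# "AVLI": each residue is 25 mol%, so 25 + 2.9*25 + 3.9*50 = 292.5
toy_ai = aliphatic_index("AVLI")
```

Real proteins contain many non-aliphatic residues, which is why the values reported in this study are far lower than for this artificial four-residue example.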
The aliphatic index ranged from 76.24 to 96.31. From the molar extinction coefficients of tyrosine, tryptophan and cystine (cysteine does not absorb appreciably at wavelengths > 260 nm, while cystine does) at a given wavelength, the extinction coefficient of the native protein in water can be computed using the following equation:
E (Prot) = N (Tyr)*Ext (Tyr) + N (Trp)*Ext (Trp) + N (Cystine)*Ext (Cystine)
where N is the number of residues of each type and, for proteins in water measured at 280 nm, Ext(Tyr) = 1490, Ext(Trp) = 5500 and Ext(Cystine) = 125 (in M⁻¹ cm⁻¹).
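The equation can be sketched as follows, covering both variants reported in this study (all cysteines paired as cystines, or none). Counting one cystine for every two cysteine residues is an assumption of the sketch, and the function name, flag name and test sequence are hypothetical.

```python
# Molar extinction coefficients at 280 nm in water (M^-1 cm^-1), from the text.
EXT_TYR = 1490
EXT_TRP = 5500
EXT_CYSTINE = 125


def extinction_coefficient(sequence, cystines_formed=True):
    """E(Prot) = N(Tyr)*Ext(Tyr) + N(Trp)*Ext(Trp) + N(Cystine)*Ext(Cystine)."""
    seq = sequence.upper()
    # Assumption: one cystine per pair of cysteines; an odd cysteine is unpaired.
    n_cystine = seq.count("C") // 2 if cystines_formed else 0
    return (seq.count("Y") * EXT_TYR
            + seq.count("W") * EXT_TRP
            + n_cystine * EXT_CYSTINE)


# "WYCC": 5500 + 1490 + 125 = 7115 with cystines, 6990 without.
with_cystines = extinction_coefficient("WYCC")
without_cystines = extinction_coefficient("WYCC", cystines_formed=False)
```

The small difference between the two variants reflects how little cystine contributes at 280 nm compared with tryptophan and tyrosine.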
The extinction coefficients of the considered sequences at 280 nm range from 30410 to 98180 M⁻¹ cm⁻¹, assuming all cysteine residues appear as half cystines. The high extinction coefficients of some sequences indicate a high content of Cys, Trp and Tyr. Extinction coefficients are useful for determining the protein concentration required for quantitative study of protein-protein and protein-ligand interactions in solution.
The Grand Average of Hydropathy (GRAVY) value for a peptide or protein is calculated as the sum of the hydropathy values of all its amino acids divided by the number of residues in the sequence (Kyte and Doolittle, 1982). The GRAVY indices ranged from -0.571 to -0.214; such low values indicate the possibility of better interaction with water.
The secondary structure indicates whether a given amino acid lies in a helix, strand or coil. Secondary structure features as predicted using SOPMA are represented in Table 4. In the majority of the sequences, alpha helices were predominant, followed by random coils, extended strands and beta turns, while for some sequences (accession numbers: Q91YS8, Q63450, Q96NX5, Q91VB2, Q7TNJ7, Q13554, Q13555, Q923T9, O42844, Q8N5S9, Q8VBY2, P97756, Q96RR4, Q8C078, O88831) random coils outnumbered the other secondary structure features. For Calcium/calmodulin-dependent protein kinase type II beta chain (Protein ID: P28652) from Mus musculus, random coils were equal in proportion to alpha helices.
The normalized data were clustered using SOM on a 2x2 grid (shown in Figure 1). Unsupervised learning was carried out on the data with a learning constant of 0.01 for 10,000 iterations, after which the data were clustered based on the neighbourhood distance.
In short:
Total no of sequences selected for study =56
Total number of input parameters =13
Total iterations per sequence to form a neuron = 100000
Total iterations to form 4 grid (2X2) structure = 5600000
Successful or winning neurons = 4
Unsuccessful neuron = 0
Thus, all 4 neurons were successful and the data were assembled into 4 clusters. The pie chart (Fig. 2) shows the distribution of sequences among the clusters.
Cluster (1, 1): This cluster contains 6 sequences, which are exclusively Calcium/calmodulin-dependent protein kinase kinase sequences from Homo sapiens, Mus musculus and Rattus norvegicus; it is thus characterized by very similar trends that make it distinct from all other clusters. This cluster is also marked by the lowest values of GRAVY and isoelectric point. At the same time, it shows a distinctly high range of values for instability index and random coils and a uniformly low range of extinction coefficients.
Cluster (1, 2): 5 sequences lie in this cluster. Q91YS8 and Q8IU85, although similar in length and type of amino acids, varied in isoelectric point, GRAVY, alpha helix and random coils. Q96NX5, Q91VB2 and Q7TNJ7, which belong to Calcium/calmodulin-dependent protein kinase type 1G, were clustered together and showed similar profiles, though differing slightly in instability index, GRAVY and beta turn. This cluster comprises the shortest sequences, in which random coils exceeded alpha helices. It is marked by a uniformly high range of extended strands.
Cluster (2, 1): This cluster consists of 26 sequences. Except for 4 sequences (Q24210, O14936, O70589 and Q62915), this cluster comprises sequences of low molecular weight and length. O14936, O70589 and Q62915, which are peripheral plasma membrane proteins, showed nearly identical profiles in the SOM cluster and were placed in neighboring cells. Calcium/calmodulin-dependent serine/threonine-protein kinase sequences also showed a similar range of values and were placed in neighboring positions. Sequences belonging to Calcium/calmodulin-dependent protein kinase type II alpha chain also lay close together in the cluster with similar profiles, and differed markedly from the next sequence, a Calcium/calmodulin-dependent protein kinase type II delta chain sequence from Xenopus laevis.
Cluster (2, 2): The 19 sequences assembled in this cluster are Calcium/calmodulin-dependent protein kinase type II sequences, except Q10KY3, which is described as Calcium/calmodulin-dependent serine/threonine-protein kinase 1. All these sequences are longer and of high molecular weight. In general, alpha helices outnumbered random coils in these sequences. Sequences belonging to Calcium/calmodulin-dependent protein kinase type II beta chain were positioned close together in this cluster and showed similar profiles for all the parameters, except for Q13554. 3 sequences belonging to Calcium/calmodulin-dependent protein kinase type II gamma chain were also clustered together with slight variation. The gradient in parameter values can be attributed to the fact that this cluster is an assemblage of various types of sequences belonging to Calcium/calmodulin-dependent protein kinase type II α, β, δ and γ chains from various species. Even trivial differences at the sequence level and in the type of chain are clearly reflected in the case of sequences from Danio rerio.
A considerable amount of raw sequence data pertaining to protein superfamilies and families exists in public domain databases. Conventional methods for defining a protein family rely on signatures, motifs and structural or functional domain information. The method presented in this report allows us to think in a different direction, towards further sub-classification of these large data sets. This approach may provide a cue for more sophisticated classification and clustering, enabling the categorization of new subclasses or classes, which may in turn aid in generating new criteria for tapping into this wealth of information.
Bioinformatics analyses have been employed by researchers to obtain substantial information about biological macromolecules in the shortest time span while eliminating, to a certain extent, the need for time-consuming and expensive experiments. With the exponential rise in the amount of data being generated, one cannot overlook the need to explore new methods for clustering and classification of such data. Recently, there have been attempts to employ data mining approaches to biologically relevant problems (Banerjee et al., 2007; Banerjee et al., 2008). Artificial Neural Networks (ANN) like Self Organizing Maps have an innate capacity to learn and can recognize patterns in data without prior information (Lampinen and Oja, 1992). SOM is a highly effective data clustering tool for visualizing complex data by reducing dimensions. SOMs have been successfully exploited in bioinformatics in chromosome structural studies (Kyan et al., 2001), motif discovery (Mahony et al., 2006; Arrigo et al., 1991), identification of genome signatures (Abe et al., 2002), codon usage diversity (Kanaya et al., 2001; Wang et al., 2001), gene prediction (Mahony et al., 2004), identification of transcription binding sites (Mahony et al., 2005), sequence analysis (Oja et al., 2005), nucleic acid classification (Naenna et al., 2003) and gene expression analysis (Ressom et al., 2003; Covell et al., 2003).
In this study, physicochemical properties were calculated for 56 CAM kinase sequences using in silico tools. SOMs were employed to segregate the data according to variation in these properties and to group the sequences into separate clusters according to the trends observed. Owing to their simplicity, SOMs appear well suited to clustering and visualizing such sequence data for easy interpretation.
Authors thank Dr. J.S.Yadav, Director, IICT for his continuous support and encouragement. AKB thanks CSIR for Senior Research Fellowship. NA thanks DST for Research Associate fellowship. Authors thank anonymous reviewers for their valuable suggestions for improvement of manuscript.