ISSN: 0974-276X
Research Article - (2009) Volume 2, Issue 5
Cluster analysis is one of the most popular techniques applied in microarray data studies. Thousands of genes can be analyzed within minutes if cluster analysis is embedded in a computational tool. With such modern technologies, it has now become easier to find practical manifestations of microarray data in the fields of pharmacogenomics, cancer genetics and biological network construction. With this project work, we have developed a cluster identifying tool, i.e. CIT which is based on two different clustering methodologies namely; Biclustering and Hierarchical Clustering. We intend to embrace new possibilities in CIT in future, like; dendogram view, interactive outputs etc.
Keywords: Gene expression data, Microarray, Dendogram, Clustering, Hierarchical clustering, Biclustering.
HC: Hierarchical Clustering
 CC: Cheng and Church
 SAMBA: Statistical-Algorithmic Method for Bicluster Analysis
 AHC: Agglomerative Hierarchical Clustering
Rapid advances in genome-scale sequencing has led to immense increase in the amount of biological information .Simply visualizing this kind of data which is widely called gene expression data or simply expression data is challenging and extracting biologically relevant knowledge is harder still (Eisen et al., 1998).
By knowing groups of genes that are expressed in a similar fashion through a biological process, biologists are able to infer gene function and gene regulation mechanisms [Quackenbush, 2001; Slonim, 2002]. Since these data consist of expression profiles of thousands of genes, their analysis cannot be carried out manually, making necessary the application of computational methods which are included under the domain of microarray data analysis techniques.
Microarray Data Analysis
Microarrays and high-throughput sequencing methods can be used to measure the expression of thousands of genes in a biological sample in a few days. A natural follow-up to such experiments is organizing and inferring useful information from this data [Risques et al., 2008]. Microarray technology is although a powerful technique but it relies heavily on the availability of computational methods which help in the array design, microarray image analysis, storage of microarray data and lastly the comparison of expression profiles to achieve functional interpretation of groups of genes (which were studied in the initial experiment) [Tamames et al., 2002]. We are presenting a cluster analysis tool named CIT which can perform gene expression data analysis.
Implementation
CIT (Cluster Identification Tool) has been made to perform cluster analysis on genes based on two different methods, namely; Biclustering and Hierarchical Clustering. A brief overview of both algorithms and how they are implemented in this tool is as under:
Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. The type of clustering that we have used is called Agglomerative Hierarchical Clustering (AHC). AHC, agglomerative approach is the one where each entity/gene is taken as a single cluster and at each step the cluster is expanded.
Biclustering algorithms do not belong to traditional datamining techniques. Simple clustering methods can be applied to either the rows or the columns of the data matrix. Contrarily a more focused version of clustering is ‘biclustering’; where simultaneous row and column clustering takes place. A bicluster (or a module) is a subset of the genes exhibiting consistent patterns over a subset of the conditions.
The algorithm used in CIT to perform biclustering is the Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) [Tanay et al., 2002]. SAMBA is incorporated in the cluster analysis tool called Expander [Shamir et al., 2005]. Using a statistical model for the data, normalization is done by translating the gene expression matrix to a weighted bipartite graph.
The objective of making the proposed cluster analysis tool is to outline the behaviour of genes in biological processes. In addition, the need for making cluster analysis tools is due to the large amounts of data generated by whole-genome expression profiling, aided by the advent of microarray technology, which needs to be interpreted to construct biological networks [Hughes et al., 2000; Zhu et al., 2007]. The way clustering is performed in CIT is illustrated in the following steps. (Refer to Figure.1).
Data Normalization
The tool will take a microarray dataset as its input and pre-process the loaded dataset, if required, and then analyze this data through the selected algorithm. Normalizing the data is also called as data pre-processing. Cluster analysis tools frequently incorporate options to pre-process the data.
This helps in bringing the data in a standardized range. The normalization applied in CIT is ‘Statistical Normalization with mean 0 and variance 1’.
Selection of Clustering Method
After the normalization or pre-processing step the user will specify algorithm that he wants to apply on the normalized dataset. After the analysis has been performed by either AHC or SAMBA the results will be displayed in the form of charts, graphs or tables to outline the clusters.
Visualization of Clusters
Initially when the dataset has been loaded the gene expression matrix can be viewed as a heat map (Refer to Figure 2) which is in the form a coloured map mimicking the pattern of fluorescence from a microarray chip. Bright red signifies up-regulation of expression while green indicates down regulation in expression. After the clustering has been performed, the Bicluster is shown as a heat map and a line graph while the AHC clusters are shown in the form of bar charts and line graphs (Refer to Figure. 3).
Clustering is the classification of objects into different groups, the grouping of gene expression data is usually carried out with cluster analysis. Traditionally clustering techniques are divided in two categories, namely hierarchical and partitional. Biclustering constructs a subset of genes exhibiting consistent pattern over a subset of conditions. Using the techniques significant biclusters and clusters are generated in an unsupervised manner.
This cluster analysis tool, CIT, has the ability to pre-process the data if required by user. The pre- processing method implemented by CIT is statistical normalization with mean 0 and variance 1, which is a commonly used efficient data normalization technique. The tool uses two diverse approaches to perform clustering. Simple clustering is carried out with the popular Agglomerative hierarchical clustering (AHC) algorithm. CIT can also perform biclustering with SAMBA (Statistical- Algorithmic Method for Bicluster Analysis), SAMBA is a relatively newer method in the field of Biclustering as compared to CC algorithm [Cheng and Church, 2000] which is commonly used for biclustering. SAMBA has improved performance and can handle datasets with thousands of conditions profiled over large no. of genes.
So far CIT is compatible to run in Windows Operating System only, it cannot read data from file formats that are other than the .txt format and does not generate dendogram in the output of AHC. In the future we intend to update our tool with innovations like, a tree view of AHC and multiple functionalities with interactive outputs.
System Requirements and Availability
For access to CIT, Contact us at: 123@gmail.com
Operating System: Windows XP and higher
Programming Language: Java
Runtime Environment: JRE 5 and higher