ISSN: 0974-276X
Research Article - (2009) Volume 2, Issue 1
High- throughput technologies, as DNA microarrays, generate a huge amount of data, which are difficult to interpret. Biologists require easy-to-use tools to analyze them and explore new scientific hypotheses. We propose a visual data mining tool to extract a new type of useful genomic information buried in gene lists generated by differential expression studies. We compare the genomic distribution that is observed within the gene list with the expected distribution, which is estimated from public genomic databases. An algorithm of research and a statistical test give reliable and optimal results. GExMap helps identify genomic regions that are enriched in genes whose differential expression is of potential interest for target diseases. This software is freely available20 and easily customized. Since sources are frequently updated, it offers tools for updates at any moment. GExMap is usable by any commercially and publicly available microarray platform. Furthermore, GExMap helps in interpretation by showing an ordered list for each gene ontology. Presently, GExMap can also be used to analyse gene lists not only from the Homo sapiens genome but also the Mus musculus and the Rattus norvegicus genomes.
Microarray technology allows parallel and simultaneous monitoring of the whole genome. It is used to detect differentially expressed genes under different conditions. General studies of microarray gene expression generate lists of differentially expressed genes and molecular signatures of diseases. In some cases, the list of genes might be used as a powerful diagnostic and/or prognostic tool for the disease in question-7.
Since its beginnings with cDNA microarrays8-9, targeting only parts of the genome, microarray technology has been improved in accuracy10-12, 21 and now targets the whole genome.
Microarray data analysis produces large lists of genes that are difficult to interpret. Although still imperfect, several programs are able to extract gene functions and bibliographic associations from ontology databases18, while others are able to document13-14 or propose biological pathways15-16. Few programs, however, can analyze the genomic distributions buried in the data. GExMap has been designed to help fill this gap, by providing solutions to specific requests, and mapping and analyzing several types of microarray data17.
GExMap is an R software package implementing visual and statistical tools to study and map genomic distribution of differentially expressed genes, in order to facilitate gene list interpretation. With visual representation, the genomic distribution observed in a gene list can be compared with the expected genomic distribution. The results obtained are also statistically tested to unravel genomic regions of potential interest. Moreover, GExMap produces a user-friendly documentation, widely annotated source code and a “visual technical manual” available online20. GExMap can also
produce scored lists and graphical illustrations of Gene Ontology18 identifiers to facilitate gene list interpretation and highlight specific molecular function, biological processes or cellular component.
GExMap produces one individual genomic map per chromosome, complete with statistical information. By default, chromosome maps are not created for chromosomes that contain fewer than five of the listed genes. In those cases, the genomic density comparisons will not be relevant. On the other hand, this parameter could be easily modified by the user, taking into account the increasing false-positive rate of potentially interesting regions.
To facilitate correspondence between identifiers of several types of microarray and genome databases, we have chosen the ENSEMBL genome as the reference. Once the identifier type is recognized, the appropriate correspondence file is automatically loaded. Then, all identifiers are replaced by ENSEMBL annotation and appropriate genomic information. The ENSEMBL genome is presently composed of around 33 000 identifiers corresponding to all known genes and
pseudogenes. The microarray probes do not always correspond to one known gene or to one ENSEMBL identifier. Meaning that probes with no correspondences cannot be allocated to genomic locations. This problem will be solved when the genome is fully and scrupulously annotated. Furthermore the unprocessed identifiers are listed and reported in the final report. The main GExMap data file will be frequently updated and freely available to limit this problem. We have chosen not to take into account identifiers without ENSEMBL correspondence for the subsequent steps.
There is virtually no limit to the size of the tested gene list. In terms of contents, we would recommend the gene list be generated by any rigorous mathematical, statistical or experimental approach designed to identify of differentially expressed genes. If any arbitrary selection has been performed, the gene list may import a significant bias for the results of further tests.
We define a genomic map associated with a gene list as the information about the genomic location of each gene of the list, in the corresponding genome draft.
GExMap produces the genomic map of a gene list, aggregates genomic locations into genomic units, calculates the observed gene frequency by unit, and then performs the statistical comparison of the observed genomic distribution to the one expected in the full genome. The graphical and computational unit can be chosen by the user. Default scaling has been set to 1 million bp. The hazard curve represents the expected genomic distribution which could be found when the genome-wide number of genes is scaled to the size of the tested gene list. Two hazard computations are available. The first one only takes into account ENSEMBL genes located in units where at least one gene of the tested list is located. The second type of hazard computation takes into account ENSEMBL genes of all genome (fig.1). The hazard frequency is used to test the hypothesis that the expected and observed genomic distributions are not homogeneous, and the latter must vary in relation to the disease studied. The total number of genes is broken into two regulation curves showing the number of up- and downregulated genes per unit.
Figure 1: Hazard estimations. A: For the specific hazard estimation method, only units encompassing at least one gene from the gene list are taken into account for the hazard estimation. Units are not filtered for the global hazard estimation method. B: Scaling factor computation, by the ratio of the sums of the ENSEMBL genes and the tested list genes. C: Hazard estimation is computed by scaling the ENSEMBL genome density to the scale of the tested gene list.
When the tested gene list curve is higher than the hazard curve, then one can expect that the cluster of genes encompassed in the genomic unit could be of interest. GExMap has been set to detect these clusters automatically. On the other hand, in some particular cases, it could be interesting to detect the regions of absence of genes. For this reason, this parameter can be customized thought the use of a variable. A statistical test is performed to validate each detected cluster, and the result is depicted graphically. Significant regions are highlighted on the map by a specific green line. The complete analysis process can be summarized into 6 simple steps (Fig. 2).
Figure 2: Local Analysis and optimization in six steps. 1: The first step is to construct a matrix of genomic localisation for the tested gene list. 2: Then the corresponding hazard distribution is computed.2: A global test is performed to detect potential regions of interest at a large scale. Then each base unit are tested if they are located in large regions of interest and if there are more genes than expected (real number > hazard estimation). 4: A step of optimization test if extension of statistically significant regions of interest brings better results. 5: The final test is done after optimization. 6: Closer significant regions of interest can be merged if the resulting larger region improves statistical significance.
Data Sources
The program loads the preformatted source file “Data.Rdata” containing all ENSEMBL identifiers and their genomic locations. The algorithm is designed to implement several processing modes. The user can define the scale of the resulting graphics, the path of the source file, and the results folder. At the moment, there is only one default mode mapping differentially expressed gene lists. Other modes, as CGH and SNP mapping, are under development. The mode is automatically implemented by the program through the analysis of the source file’s column names.
Data Identification
The appropriate file with correspondences between identifiers and ENSEMBL genome is loaded. The replicates are filtered and their numbers reported in a report text file. A table is then created with unique ENSEMBL identifiers for each selected gene, with their genomic locations. Once no existing correspondence can be found, the identifier will increment a “none mapped” list in the results folder. After filtering of the gene list from replicates and unassigned identifiers, a data matrix is created. Genomic locations are calculated by taking the center of each gene (start + end / 2). Genes are then classified by chromosomes and by chromosomal location.
Hazard Estimation
The matrix used to calculate the genomic distribution is computed chromosome by chromosome. The observed number of genes from the tested list and the expected number of genes from the ENSEMBL genome are summarized per scale unit and stored in a matrix. At this stage the hazard computation can start. The hazard computation scales the ENSEMBL genomic distribution to the size of gene list generated by microarray analysis. For each unit, hazard is computed as follows:
Specific hazard of a unit = Number of ENSEMBL genes of the current unit * (Total number of tested genes / N ENSEMBL genes).
N ENSEMBL genes is computed by summarizing all genes of ENSEMBL genome for the first step and, in the second step, taking into account only ENSEMBL genes which are localized in units that contains at least one gene from the tested list (Fig. 3).
Figure 3: Selection of regions to test. First, regions of interest are detected by comparison of observed and expected distributions (red squares). Second, the regions are 5’ and 3’ extended by one unit to allow a sufficient sample size to be statistically tested. A: Comparison of observed and expected distribution at the chromosome scale by two global tests: CHIgoodness of fit, Wilcoxon. B: Selection of the chromosomes according to the statistical results (global tests) or the user choice. C: Comparison of observed and expected distribution unit by unit throughout each selected chromosome with binomial local test. D: Local pval correction according to the user choice. E: Selection of the units showing observed effectives greater than expected according to the user choice. F: Selection of units of interest. G: Extension of regions of interest.
Statistical Tests and Progressive Optimization
First of all, two global statistical tests are performed to select chromosomes showing potential interest (fig. 3A). The CHI2-Goodness of fit test and the Wilcoxon paired sample test are the two global statistical tests used to compare expected with observed gene distributions, chromosome by chromosome.
Then a local test is performed for chromosomes validated by at least one global test (this step is customizable by the user) (fig. 3B). The binomial test is used to detect units in which observed gene population is statistically different from expected (fig. 3C to D). These units showing potential interest are then filtered, according to their significance and the difference between expected and observed distributions (this second selection is customizable by the user) before being extend to regions of potential interest (fig. 3E to G). A last step of optimization is performed to adjust statistically selected regions to seek the largest significant regions of interest (Fig. 4).
Figure 4: Local statistical test and progressive optimization. First of all, the full region of interest is statistically tested (blue). When the test invalidates the region a step of progressive optimization takes place. Progressive optimization consists in statistically testing the two potential reduced regions (green and orange). The reduced region allowing the best statistical result is used in the next steps. The optimization steps and test are performed as long as the result of the test is not significant (pval > 0.05) and the size of the tested region is longer than three units.
To preserve clarity, only significant results are graphically reported. All statistical results, even those that are nonsignificant, are reported in the final result file.
To facilitate the interpretation of the results, the total effective curve is separated into curves of up- (red) and a curve of down regulated (blue) genes (fig.5). In an optional part the annotated genes are sorted and analyzed by Gene Ontology (fig.6).
Figure 5: Genomic map and statistical results. One genetic map is generated for each chromosome. The map shows the density curves of the ENSEMBL genome (green), the tested gene list (orange) and the estimated hazard (black). The tested gene list curve is divided into up- (red) and down regulated (blue) curves drawn in mirror.
All the data produced and formatted by GExMap are saved in a text file. These data allow the user to identify genes located in a region of interest, or to verify data used by the statistical tests.
GExMap provides helpful new types of information to interpret data generated by microarray studies. The software combines easy use and easy interpretation of results. Source code is freely available, optimized and annotated to facilitate customization, as implementation of different types of statistical tests or customized detection of interesting regions. All results are separately generated to preserve clarity of graphic representations. We have tested the software with a list of 3855 genes generated from microarray analysis19. GExMap allowed us to focus on a limited number of interesting regions which are now under biological investigation. Further optimizations are planned to complete the generated information. We work on integration of new statistical tests, as a t test adapted to small effective samples with a resampling pre-processing step. To improve compatibility with most of database identifiers and microarray types, a tool dedicated to the generation of correspondence files from text data files will be soon available.
Additional functions for comparisons between microarray gene lists or genomic association studies and to map and analyze several types of genetic markers, genetic association studies, and genetic maps of all kinds are also under development.
GExMap is publicly accessible from the URL http://gexmap.site.voila.fr/. Questions and comments are welcomed through the site.