ISSN: 0974-276X
Research Article - (2008) Volume 1, Issue 4
Here we describe DataBiNS-Viz, a visualization and exploration environment for non-synonymous coding single nucleotide polymorphism (nsSNP) data gathered by the BioMoby-based DataBiNS workflow. DataBiNS-Viz enables execution of the DataBiNS workflow on proteins described by KEGG, PubMed, or OMIM identifiers, followed by manual exploration of the integrated structure/function and pathway data for those proteins, with a particular focus on nsSNP data in context. The tool can be freely accessed at http://bioinfo.icapture.ubc.ca:8090/DataBiNS (please use the Firefox or Safari web browsers). Examples of the retrieved data are given under the “Help on inputs” option. Detailed documentation can be accessed at http://bioinfo.icapture.ubc.ca/mywiki/DataBiNS.
Keywords: Bioinformatics; Web services; Data mining; Visualization; Genomics; Single nucleotide polymorphisms
Single nucleotide polymorphisms (SNPs) are single base mutations in a genomic sequence that occur at a frequency greater than 1% in a defined population. Codons are sets of three DNA bases in a gene sequence that code for a particular amino acid. Non-synonymous SNPs (nsSNPs) are SNPs that occur within codons and that change the encoded amino acid, sometimes ultimately affecting the protein that is constructed from the gene blueprint. nsSNPs are of great interest to researchers as they may be key to identifying and understanding various human disease susceptibilities, as well as disease and non-disease phenotypes in many other species.
In silico analysis of the potential biological impact of nsSNPs requires integration of data and knowledge from various Web-based resources, both databases and analytical tools. Manual retrieval and integration of this information is error-prone and tedious. This provided the motivation for the original DataBiNS data-mining workflow (Song et al., 2007) for the BioMoby (Wilkinson and Links, 2002) and Taverna (Oinn et al., 2004) environments, which retrieved and integrated data relating to nsSNPs and the biological pathways affected by them. DataBiNS consumes Kyoto Encyclopedia of Genes and Genomes [KEGG] Pathway Identifiers (Kanehisa et al., 2006), and retrieves a list of publications, gene ontology annotations and nsSNP information for each gene involved in the pathway. Although the public DataBiNS workflow successfully retrieved and integrated these data, the lack of a visualization tool for the output significantly limited its utility. We report here important extensions to the original DataBiNS workflow and environment, including retrieval of additional nsSNP data, such as the mapping of SNPs to their altered amino acids on a 3D protein structure, as well as easy-to-navigate Web-based visualizations of the global DataBiNS output.
To facilitate interoperability between the various Web resources, the workflow extensions we report here continue to be provided through the BioMoby Web Services framework. Rather than being limited to a single KEGG identifier, the new services allow for different types of identifiers to be used to initialize the workflow, including:
1. KEGG gene (http://www.genome.jp/kegg/)
2. PubMed (http://www.ncbi.nlm.nih.gov/pubmed/)
3. OMIM - Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
4. UniGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene)
5. UniProt (http://www.pir.uniprot.org/)
6. GenBank (http://www.ncbi.nlm.nih.gov/Genbank/)
7. NCBI-GI
To search on multiple KEGG genes simultaneously, separate the identifiers with commas (e.g., hsa:7097, hsa:7098).
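As an illustration of the comma-separated input convention, the following minimal Python sketch splits such a string into individual identifiers. The function name `parse_identifiers` is hypothetical, and the handling of stray whitespace, empty entries, and repeated identifiers is an assumption made for this example; the actual input handling in DataBiNS-Viz is not described here.

```python
from typing import List


def parse_identifiers(raw: str) -> List[str]:
    """Split a comma-separated input string into clean identifiers.

    Assumptions (not taken from the DataBiNS-Viz source): surrounding
    whitespace is ignored, empty entries are dropped, and a repeated
    identifier is kept only once, in first-seen order.
    """
    seen = set()
    identifiers = []
    for token in raw.split(","):
        ident = token.strip()
        if ident and ident not in seen:
            seen.add(ident)
            identifiers.append(ident)
    return identifiers


print(parse_identifiers("hsa:7097, hsa:7098"))
# prints ['hsa:7097', 'hsa:7098']
```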
Once the workflow has been initialized, it first visits KEGG, PDB (http://www.rcsb.org/pdb/home/home.do), SwissProt (http://www.expasy.ch/sprot/), and Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) to find the gene id(s) corresponding to the input identifier. Once these are retrieved, the LS-SNP database (Karchin et al., 2005) is queried with the corresponding SwissProt id to find all the nsSNPs for the gene. The PDB id is then used to query the coliSNP database (http://yayoi.kansai.jaea.go.jp/colisnp/) (Kono et al., 2008) to retrieve the 3D structure of the protein (if available). The various SNPs associated with this gene are already mapped onto this protein structure (within coliSNP), providing an efficient way to analyze the location of SNPs on the protein. The SNP information is supplemented with frequency pie charts for each SNP id from the HapMap (http://www.hapmap.org/) database (Thorisson et al., 2005). Detailed annotations about the gene are retrieved from the Gene Ontology (http://www.geneontology.org/) website, and finally the most recent publications relevant to the gene are retrieved from PubMed.
Rather than being limited to the default Taverna nested-folder browsing, or export of the data from Taverna as an Excel spreadsheet, both of which are problematic for manual exploration of these complex data networks, we have created a task-specific Web-based visualization and exploration environment for DataBiNS. The application is built using the Java Platform, Enterprise Edition (J2EE) and is accessed by end-users through an intuitive Web page. The user simply enters an identifier of interest (e.g., KEGG PATHWAY, OMIM, etc.), and then presses the “Execute Workflow” button. In the backend, the Taverna workflow execution engine is triggered to execute the modified DataBiNS workflow. The results of the workflow are then cached on the server to allow rapid browsing, and are browsable via the Web interface (Figure 1). There is an option on the front page to re-execute the workflow, in which case the tool ignores any saved results and retrieves new, possibly updated data.
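The cache-then-browse behaviour, together with the re-execute option, can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the real application is a J2EE system driving the Taverna engine, whereas here `ResultCache`, `force_refresh`, and `fake_workflow` are hypothetical names and the workflow is a stand-in function.

```python
from typing import Callable, Dict, List


class ResultCache:
    """Keep workflow results per identifier; re-run only on request."""

    def __init__(self, run_workflow: Callable[[str], dict]) -> None:
        # run_workflow stands in for the call that triggers the workflow engine.
        self._run_workflow = run_workflow
        self._store: Dict[str, dict] = {}

    def get(self, identifier: str, force_refresh: bool = False) -> dict:
        # Run the workflow on a cache miss, or when the user asks to
        # ignore saved results and fetch possibly updated data.
        if force_refresh or identifier not in self._store:
            self._store[identifier] = self._run_workflow(identifier)
        return self._store[identifier]


executions: List[str] = []


def fake_workflow(identifier: str) -> dict:
    # Hypothetical stand-in that only records how often it was executed.
    executions.append(identifier)
    return {"input": identifier, "run": len(executions)}


cache = ResultCache(fake_workflow)
first = cache.get("hsa:7097")                         # executes the workflow
second = cache.get("hsa:7097")                        # served from the cache
third = cache.get("hsa:7097", force_refresh=True)     # re-executes
print(first["run"], second["run"], third["run"])
# prints 1 1 2
```

Keying the cache on the input identifier keeps repeated browsing cheap, while the explicit refresh flag mirrors the front-page option to discard saved results.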
In addition to displaying all the results using standard Web technologies, two navigation tools have been added to the Web application to help with the study of the data. First, a “search publication abstract” option allows users to quickly search the retrieved publications for keywords. If a retrieved publication contains the keyword, it is highlighted, allowing the user to focus on that publication. Second, the PubCloud application has been integrated into the Web application. The user can select a group of the retrieved publications and use the PubCloud keyword tag-cloud visualization system to find possible correlations between the publications.
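The keyword search over abstracts can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the `Publication` record, the function `publications_to_highlight`, case-insensitive substring matching, and the placeholder identifiers and abstracts are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Publication:
    pub_id: str      # placeholder identifier, not a real PubMed id
    abstract: str


def publications_to_highlight(pubs: Iterable[Publication], keyword: str) -> List[str]:
    """Return ids of publications whose abstract contains the keyword.

    Assumption: matching is a case-insensitive substring test.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [p.pub_id for p in pubs if needle in p.abstract.lower()]


retrieved = [
    Publication("pub-A", "A non-synonymous SNP alters receptor signalling."),
    Publication("pub-B", "Pathway annotation for an unrelated gene family."),
    Publication("pub-C", "Receptor structure and ligand binding."),
]
print(publications_to_highlight(retrieved, "receptor"))
# prints ['pub-A', 'pub-C']
```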
In a significant advance over prior exploration/browsing environments, the Web interface intuitively associates multiple inputs with their respective outputs. The Web application displays all data about each gene in a discrete section of the browser window; a given results page can contain several genes, each with its associated information clearly organized and displayed. This approach eliminates the need to backtrack through the results to correlate inputs with outputs, as was required in earlier versions of DataBiNS, thus allowing users to quickly analyze and use the retrieved data.
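The idea of associating each output with its originating input can be illustrated by regrouping a flat list of results into one section per gene. The record layout `(gene id, kind, value)`, the function `group_by_gene`, and the placeholder values are hypothetical; this sketch says nothing about how the Web application stores its data internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (input gene id, kind of result, value): an assumed record layout
Record = Tuple[str, str, str]


def group_by_gene(records: Iterable[Record]) -> Dict[str, Dict[str, List[str]]]:
    """Collect every result under the input gene that produced it."""
    grouped = defaultdict(lambda: defaultdict(list))
    for gene_id, kind, value in records:
        grouped[gene_id][kind].append(value)
    # Convert to plain dicts so the result is easy to print or serialize.
    return {gene: dict(kinds) for gene, kinds in grouped.items()}


flat_results = [
    ("hsa:7097", "nsSNP", "snp-placeholder-1"),
    ("hsa:7098", "nsSNP", "snp-placeholder-2"),
    ("hsa:7097", "publication", "pub-placeholder-1"),
    ("hsa:7097", "nsSNP", "snp-placeholder-3"),
]
sections = group_by_gene(flat_results)
for gene, kinds in sections.items():
    print(gene, kinds)
```

Each key of `sections` then corresponds to one discrete section of a results page, so no backtracking is needed to see which input a given value belongs to.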
Though the framework we have developed to display the data is specific to the DataBiNS workflow, it can be generalized to accept any Taverna-based workflow, displaying the results as a browsable Web page and facilitating their exploration. The modular nature of workflows allows new BioMoby services to be developed and added to DataBiNS in order to expand the information retrieved and displayed.
One lingering question is always the validity of the data being retrieved. The workflow is designed to retrieve information from a specified group of web resources. The validity of the information obtained from the web resources is not checked by the workflow and thus there is currently no way to verify the integrity of the information across the different resources, without a great deal of manual data inspection. Future developments may lead to automation of such processes with electronic flags highlighting inconsistencies in data between different web resources.
This research was supported by the National Sanitarium Association (Canada), AllerGen NCE, and the Michael Smith Foundation for Health Research. EK is supported by an award to MDW from Genome Alberta, in part through Genome Canada.