ISSN: 2379-1764
Mini Review - (2015) Volume 3, Issue 3
Determining the structure of a protein given its sequence is a challenging problem. Deep learning is a rapidly evolving field which excels at problems where there are complex relationships between input features and desired outputs. Deep neural networks have become popular for solving problems in protein science. Various deep neural network architectures have been proposed, including deep feedforward neural networks, recurrent neural networks and, more recently, neural Turing machines and memory networks. This article provides a short review of deep learning applied to protein prediction problems.
Keywords: Deep neural networks, Recurrent neural networks, Protein structure prediction
There is a complex dependency between a protein’s sequence and its structure, and determining the structure given a sequence is one of the greatest challenges in computational biology [1]. In recent years, Deep Neural Networks (DNNs) and related deep architectures have become popular tools for machine learning, and DNNs are currently state-of-the-art in many problem domains, including speech recognition, image recognition and natural language processing [2]. Conventional machine learning techniques such as support vector machines, random forests and neural networks with a single hidden layer are limited in the complexity of the functions they can efficiently learn, and they often require carefully designed features before patterns can be classified. DNNs have recently been shown to outperform these conventional methods in some areas because they can learn intermediate representations, with each layer of the network learning a slightly more abstract representation than the previous layer [3,4]. With enough layers, very complex patterns can be learned. This ability to learn features automatically is helpful for proteins, as the ideal mid- or high-level features for protein structure prediction problems are not known.
DNNs are neural networks with multiple hidden layers (usually more than two) which can efficiently learn complex mappings between features and labels. DNNs excel at problems where there may be very complex relationships between the inputs and the labels, and where large amounts of training data are available. These characteristics have made DNNs popular for solving problems in protein science. The basic DNN consists of layers of hidden units connected by trainable weights. The weights are trained using the backpropagation algorithm to minimise the error between the network output and the true output on a training set. Various architectures have been tailored to specific problems: the simplest is the multi-layer feedforward neural network, convolutional neural networks are used for images, and recurrent neural networks are used for sequence problems.
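The description of hidden layers, trainable weights and backpropagation can be made concrete with a small sketch. The code below is an illustrative toy, not a method from any of the cited predictors: the function names, the sigmoid activation, the squared-error loss, the layer sizes and the XOR training data are all assumptions chosen to keep the example short.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    # One row per unit: n_in weights followed by a single bias term.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, x):
    # Returns the activations of every layer, starting with the input itself.
    activations = [list(x)]
    for layer in layers:
        prev = activations[-1]
        # zip stops at len(prev), so the trailing bias is added separately.
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(unit, prev)) + unit[-1]) for unit in layer]
        )
    return activations


def backprop_step(layers, x, target, lr):
    acts = forward(layers, x)
    # Error signal at the output for squared error with sigmoid units.
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(acts[-1], target)]
    for i in reversed(range(len(layers))):
        layer, prev = layers[i], acts[i]
        if i > 0:
            # Propagate the error to the layer below before updating weights.
            below = [
                sum(unit[j] * d for unit, d in zip(layer, deltas)) * prev[j] * (1.0 - prev[j])
                for j in range(len(prev))
            ]
        for unit, d in zip(layer, deltas):
            for j, a in enumerate(prev):
                unit[j] -= lr * d * a
            unit[-1] -= lr * d
        if i > 0:
            deltas = below


def mean_squared_error(layers, data):
    return sum((forward(layers, x)[-1][0] - t[0]) ** 2 for x, t in data) / len(data)


def train_xor(epochs=3000, lr=0.5, seed=0):
    # Hypothetical toy task: XOR, with two hidden layers of four units each.
    rng = random.Random(seed)
    data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    layers = [init_layer(2, 4, rng), init_layer(4, 4, rng), init_layer(4, 1, rng)]
    before = mean_squared_error(layers, data)
    for _ in range(epochs):
        for x, t in data:
            backprop_step(layers, x, t, lr)
    return before, mean_squared_error(layers, data)
```

Calling `train_xor()` returns the mean squared error before and after training, which gives a quick check that the weight updates reduce the error on the training set.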
Having a large amount of training data is a requirement for DNNs, as more trainable parameters usually require more training data to learn reliably. When designing neural networks, input features need to be normalised, especially when the features are heterogeneous. Feature selection is also beneficial for achieving good classification and regression performance [5,6].
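As one possible illustration of feature normalisation, the sketch below rescales each feature column to zero mean and unit variance (z-score normalisation). The choice of z-scores and the function name `zscore_columns` are assumptions; the source does not say which normalisation scheme is used.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Rescale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

With this scaling, a feature measured in the hundreds and a feature measured in fractions end up on a comparable scale, so neither dominates the weighted sums in the first layer. In practice the means and standard deviations would be computed on the training set only and then reused for test data.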
Many current state-of-the-art protein predictors are based on feedforward DNNs that use a fixed-width window of amino acids, centered on the predicted residue. The window is moved over the protein so that predictions can be made for each residue.
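The sliding-window idea can be sketched as follows. This is a minimal illustration under stated assumptions: the one-hot encoding, the `-` padding symbol for positions beyond the chain ends, and the function names are hypothetical choices made for brevity, and published predictors typically use richer per-residue inputs than a one-hot encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    # 21 slots: the 20 standard amino acids plus one for padding or unknown residues.
    vec = [0.0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1.0
    return vec


def window_features(sequence, half_width):
    """Build one fixed-length feature vector per residue from a centred window."""
    # Pad both ends so that windows centred on terminal residues are full width.
    padded = "-" * half_width + sequence + "-" * half_width
    width = 2 * half_width + 1
    features = []
    for i in range(len(sequence)):
        window = padded[i:i + width]
        features.append([value for residue in window for value in one_hot(residue)])
    return features
```

Each residue yields a vector of length `(2 * half_width + 1) * 21`, which can be fed to a feedforward network to predict a property of the central residue.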
PSIPRED was an early protein secondary structure predictor based on a neural network with a single hidden layer [7]. PSIPRED achieved accuracies of around 80% when predicting three secondary structure states: helix, coil and sheet. Later predictors, including SPINE-X, Scorpion, DNSS and SPIDER-2, are based on deeper neural networks and increase the secondary structure prediction accuracy to around 82% [8-11]. In addition to three-state secondary structure, deep neural networks have been used to predict other protein properties, including Accessible Surface Area (ASA), phi and psi angles, theta and tau angles, and disorder [8,11-16].
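The accuracy figures quoted above are per-residue accuracies over three states, a measure commonly called Q3. A minimal sketch of that calculation is given below; the function name and the single-letter state codes (H for helix, E for sheet, C for coil) are assumptions made for illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)
```

For example, a prediction of `HHHEECCC` against an observed `HHHEEECC` differs at one of eight positions.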
Beyond standard feedforward DNN architectures, there are Recurrent Neural Networks (RNNs), which are tailored to sequence prediction problems. RNNs were developed to handle time series such as speech signals. These networks pass information from one time step to the next, so context from earlier in the sequence can be used later in the sequence. Bidirectional Recurrent Neural Networks (BRNNs) were later introduced to use information along the entire sequence [17]. RNNs can be considered very deep neural networks, since information may be passed through many time steps. Early RNNs had difficulty learning when they were required to remember information over long time periods. Long Short-Term Memory (LSTM) RNNs were proposed to circumvent these problems and have become widely used for sequence prediction tasks [18].
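The way a recurrent network carries context along a sequence, and how a bidirectional network sees the whole sequence, can be sketched with a forward pass only. The code below is a simplified illustration with assumed names, a tanh activation and random untrained weights; it is not an LSTM and does not include training.

```python
import math
import random


def random_params(n_in, n_hidden, rng):
    # Hypothetical helper: input weights, recurrent weights and biases.
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-1.0, 1.0) for _ in range(n_hidden)] for _ in range(n_hidden)]
    bias = [0.0] * n_hidden
    return w_in, w_rec, bias


def rnn_pass(inputs, w_in, w_rec, bias):
    """Run a simple recurrent layer and return the hidden state at every step."""
    hidden = [0.0] * len(bias)
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[k], x))
                + sum(w * h for w, h in zip(w_rec[k], hidden))
                + bias[k]
            )
            for k in range(len(bias))
        ]
        states.append(hidden)
    return states


def bidirectional_pass(inputs, fwd_params, bwd_params):
    """Concatenate left-to-right and right-to-left states at each position."""
    forward_states = rnn_pass(inputs, *fwd_params)
    # Process the reversed sequence, then restore the original order.
    backward_states = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward_states, backward_states)]
```

In the forward-only pass, the state at the first position cannot depend on later inputs, whereas in the bidirectional pass every position receives information from both ends of the sequence.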
RNNs have been applied to secondary structure prediction with some success [19-25]. The recurrent connections in the RNN remove the need for large context windows, as the surrounding context is provided by the network itself. RNNs have also been applied to protein disorder prediction [26]. Basic RNNs handle 1-dimensional input and output sequences of arbitrary length, but they can be modified to handle 2-dimensional (and higher) inputs and outputs of arbitrary size. These 2-D RNNs have been applied to protein contact map prediction, in which a prediction is made for every pair of residues in a protein, and to the prediction of disulfide bridges [27-32]. The latest area of research in neural network architectures adds memory to RNNs, giving so-called neural Turing machines and memory networks [33-35]. These networks can be trained to solve problems that basic RNNs are incapable of solving, e.g., learning to sort new unseen data given examples of sorted and unsorted data. These architectures have not yet been applied to protein prediction problems, and it remains to be seen whether they will be able to succeed where simpler architectures have not.
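To show what "a prediction for every pair of residues" looks like as a target, the sketch below builds a binary contact map from 3-D coordinates. It illustrates only the 2-dimensional output, not a 2-D RNN. The use of one coordinate per residue, the 8 Å distance threshold and the function name are assumptions for illustration; the source does not define the contact criterion.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an N x N matrix with 1 where two distinct residues are in contact."""
    n = len(coords)
    return [
        [
            # A residue is not counted as being in contact with itself.
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]
```

The result is symmetric and grows quadratically with sequence length, which is why architectures that handle 2-dimensional outputs of arbitrary size are attractive for this task.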
This article has attempted to give a short non-exhaustive overview of the applications of DNNs to protein structure prediction problems. Deep learning is a rapidly evolving field which excels at problems where there are complex relationships between input features and desired outputs, problems that simpler classifiers are incapable of solving. The main strength of deep learning is the ability to easily take advantage of increases in the amount of data and computational power. One of the catalysts for the success of deep learning for speech and image recognition problems was the emergence of large datasets and sufficient computational power to process them. As more protein data becomes available we hope that deep learning can provide similar improvements to protein structure prediction problems. New deep learning architectures will only accelerate this progress.