April II project: Exploring fold space with Probabilistic Inductive Logic Programs

Overview

The purpose of this project is to attempt to learn features of protein structures using an area of machine learning known as probabilistic inductive logic programming.

In the real world of protein bioinformatics, a major unsolved problem is protein structure prediction from amino acid sequence. To date, the two main methods developed to tackle this problem are:

Fold recognition/threading methods

Ab initio folding simulation

Both of these techniques are limited in their accuracy: they generate reasonably accurate solutions some of the time, and badly inaccurate solutions the rest of the time. A major current focus in protein structure prediction is the generation of more accurate energy functions for discriminating between good and bad computer-generated models of proteins.

The most successful energy functions currently in common use are based on a maximum likelihood approach and log-odds probability measures. In this project we are investigating whether a very different approach based on learning logical rules and relations may be better suited to the task of discriminating good from bad protein models.
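
To make the log-odds idea concrete, here is a minimal sketch of a pairwise log-odds score. It is a generic illustration, not the energy function used in this project or in any particular package: the function name `log_odds_scores`, the input format (a list of residue-type pairs observed in contact) and the reference state (independent residue-type frequencies) are all assumptions made for the example.

```python
import math
from collections import Counter

def log_odds_scores(observed_pairs):
    """Score each unordered residue-type pair as log(observed / expected).

    observed_pairs: list of (type_a, type_b) tuples, one per observed contact.
    The expected frequency assumes the two types pair up independently
    (an assumed reference state, chosen here only for illustration).
    """
    # Count unordered pairs, so ("L", "A") and ("A", "L") are the same key.
    pair_counts = Counter(tuple(sorted(p)) for p in observed_pairs)
    # Count how often each residue type takes part in any contact.
    single_counts = Counter()
    for a, b in observed_pairs:
        single_counts[a] += 1
        single_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_singles = sum(single_counts.values())

    scores = {}
    for (a, b), n in pair_counts.items():
        p_obs = n / total_pairs
        p_a = single_counts[a] / total_singles
        p_b = single_counts[b] / total_singles
        # An unordered pair of two different types can arise in two orders.
        p_exp = p_a * p_b * (1 if a == b else 2)
        scores[(a, b)] = math.log(p_obs / p_exp)
    return scores

pairs = [("A", "L"), ("A", "L"), ("A", "L"), ("A", "K"), ("E", "K"), ("E", "K")]
scores = log_odds_scores(pairs)
```

A positive score means the pair is seen in contact more often than the reference state predicts, a negative score less often. Only pairs that occur at least once are scored in this sketch; a practical potential would also need to handle unobserved pairs, for example with pseudocounts.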

Representation of Protein structures as graphs

Our approach to applying Inductive and Stochastic Logic Programming to the problem of protein structures is based on the fact that a protein structure can be completely defined, up to a mirror image, by the pairwise distances between all residues/amino acids in the protein. This set of pairwise distances forms a matrix known as a distance matrix or contact map.

This contact map can be thought of as a graph with amino acids as vertices and their pairwise distances as weights on the edges. Strictly speaking, this is a fully connected graph. However, for our purposes we can restrict the graph to those connections (or distances) below some threshold. This is because we are only interested in pairs of amino acids that are physically interacting.
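
As a rough illustration of this thresholding step, the sketch below builds the restricted graph from one 3D coordinate per residue. The function name `contact_graph`, the use of a single representative point per residue and the 8.0 Angstrom default cutoff are assumptions for the example; the threshold actually used in the project is not stated here.

```python
import math

def contact_graph(coords, threshold=8.0):
    """Return {(i, j): distance} for residue pairs closer than the threshold.

    coords: list of (x, y, z) tuples, one representative point per residue.
    threshold: distance cutoff (8.0 is an assumed value, not the project's).
    """
    edges = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])  # Euclidean distance
            if d < threshold:
                # Keep the edge, weighted by the pairwise distance.
                edges[(i, j)] = d
    return edges

# Three residues: the first two are close, the third is far from both.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
edges = contact_graph(coords)
```

Here only the pair (0, 1) survives, with a weight of 5.0. Setting the threshold to infinity recovers the fully connected graph described above.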

This representation of protein structures is inherently relational and therefore ideally suited to logic-based machine learning.

What are we trying to learn?

Protein structures are determined by their amino acid sequence. This is because interactions between pairs, triplets, quadruplets, etc. of different amino acid side chains in the protein, both near and far apart in sequence, stabilise the protein structure by hydrogen bonding, electrostatic and van der Waals interactions. In addition, the interaction of an amino acid side chain with the surrounding solvent leads to characteristic patterns of polar or charged amino acids on the surface and hydrophobic, oily residues packed in the centre.

We will attempt to learn rules that describe the preferred environment of a particular amino acid type. That is, given an amino acid such as alanine, what patterns of neighbouring amino acids are found around it in native and near-native protein structures, compared to the patterns found in the incorrectly modelled proteins produced by our various modelling protocols?
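
The notion of an amino acid's "environment" can be sketched as a simple count of which residue types sit next to a chosen type in the contact graph. This is only a counting illustration of the raw data involved, not the rule learning itself, which would express richer relational patterns. The function name `neighbour_profile`, the one-letter sequence string and the list of index pairs are assumptions for the example.

```python
from collections import Counter

def neighbour_profile(sequence, edges, residue_type):
    """Count the residue types in contact with every residue of one type.

    sequence: one-letter amino acid string, indexed from 0.
    edges: iterable of (i, j) index pairs that are in contact.
    residue_type: one-letter code of the amino acid of interest.
    """
    profile = Counter()
    for i, j in edges:
        # Each edge is checked from both ends, so a contact between two
        # residues of the chosen type contributes two counts.
        if sequence[i] == residue_type:
            profile[sequence[j]] += 1
        if sequence[j] == residue_type:
            profile[sequence[i]] += 1
    return profile

sequence = "ALKA"
edges = [(0, 1), (1, 2), (1, 3), (0, 3)]
profile = neighbour_profile(sequence, edges, "A")
```

Comparing such profiles between correct and incorrect models gives a first impression of the differences a learned rule would need to capture.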

Pre-pilot study: 1 amino acid type studied using ILP

We have applied our current fold recognition software to a training set of 105 protein sequences for which we have the experimentally derived structure. The results of this fold recognition benchmark are a set of 3-dimensional models, some correct (within a tolerance) and some incorrect.

We have used this dataset as our primary source of background knowledge and written a preliminary ILP program using Progol.
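
One plausible way to hand such models to a logic-based learner is to write each one out as ground facts. The sketch below does this for a single model. The predicate names `residue/3` and `contact/3`, the model identifier and the function `to_facts` are hypothetical and chosen for illustration; they are not taken from the project's actual Progol program.

```python
def to_facts(model_id, sequence, edges):
    """Render one model as Prolog-style ground facts (hypothetical predicates).

    model_id: lowercase identifier for the model, e.g. "m1".
    sequence: one-letter amino acid string, indexed from 0.
    edges: iterable of (i, j) index pairs that are in contact.
    """
    # residue(Model, Position, Type): lowercase so the type is a Prolog atom.
    facts = [
        f"residue({model_id}, {i}, {aa.lower()})."
        for i, aa in enumerate(sequence)
    ]
    # contact(Model, PositionA, PositionB): one fact per retained edge.
    facts += [f"contact({model_id}, {i}, {j})." for i, j in sorted(edges)]
    return facts

facts = to_facts("m1", "AL", [(0, 1)])
print("\n".join(facts))
```

Each model would additionally need a label marking it as a correct or incorrect example, so that the learner can search for rules separating the two sets.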

Downloads, data, programs...

The entire gzipped tarball containing everything can be downloaded