The aims of this research project are (1) to derive new principles governing the formation of protein folds, such as common substructures and the relationship between local sequence and tertiary structure; (2) to challenge a new program for data mining using ILP (Progol) by applying it to a complex real-world (biological) problem; and (3) to develop a Web server for dissemination of the methodology.
The first phase of the project involved constructing a deductive database in Prolog that included information about the protein three-dimensional fold type (from the SCOP database [1]), the secondary structure and its packing (from PROMOTIF [2]), the sequence, and the percentage sequence identity between the proteins.
The second phase applied Inductive Logic Programming (ILP), as
implemented in the program Progol [3], to learning
rules governing protein topology. Rules were learnt that
discriminate a particular fold type (e.g. a TIM barrel) from other
folds of the same structural class (e.g. non-TIM barrels formed from
alternating secondary structures). A subset of the deductive
database was used as background information from which the rules were
developed. The features included the total number of secondary
structures and information about the individual components (length,
hydrophobicity, location along the sequence, and length of coil to the
next secondary structure).
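As an illustration of how such features might be represented and how a candidate rule separates one fold from the rest of its class, here is a minimal Python sketch. It is not the project's Prolog database or a rule learnt by Progol: the class names, the `hypothetical_rule` function, its thresholds, and the toy domains are all invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SecondaryStructure:
    kind: str              # "helix" or "strand"
    length: int            # number of residues in the element
    hydrophobicity: float  # mean hydrophobicity (scale is an assumption)
    start: int             # location of the first residue along the sequence
    coil_to_next: int      # length of coil to the next secondary structure


@dataclass
class Domain:
    name: str
    elements: List[SecondaryStructure]


def hypothetical_rule(domain: Domain) -> bool:
    """Invented rule: at least 8 strands, the first element being a strand
    followed by a short coil. The thresholds are illustrative only."""
    if not domain.elements:
        return False
    n_strands = sum(1 for e in domain.elements if e.kind == "strand")
    first = domain.elements[0]
    return n_strands >= 8 and first.kind == "strand" and first.coil_to_next <= 5


def coverage(rule: Callable[[Domain], bool],
             positives: List[Domain],
             negatives: List[Domain]) -> Tuple[int, int]:
    """Count positive examples covered and negative examples wrongly covered."""
    true_pos = sum(1 for d in positives if rule(d))
    false_pos = sum(1 for d in negatives if rule(d))
    return true_pos, false_pos


def toy_domain(name: str, n_pairs: int) -> Domain:
    """Build a made-up domain of alternating strand/helix pairs."""
    elements = []
    position = 1
    for _ in range(n_pairs):
        elements.append(SecondaryStructure("strand", 5, 0.6, position, 3))
        position += 8
        elements.append(SecondaryStructure("helix", 12, 0.2, position, 4))
        position += 16
    return Domain(name, elements)


positives = [toy_domain("toy_barrel", 8)]     # made-up positive example
negatives = [toy_domain("toy_non_barrel", 5)]  # made-up negative example
covered = coverage(hypothetical_rule, positives, negatives)
```

In this toy setting a good rule covers many positives and few negatives; an ILP system searches for such rules automatically rather than having them written by hand.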
Rules were learnt for the five most populated folds of each of the
four structural classes. The cross-validated accuracy (correct
assignments / total number of examples tested, as a percentage),
averaged over the 20 trials, is 74 ± 10 %. Inspection of the rules
showed that several provided expert-type insight into key features of
the fold, for example (1) the presence of a Pro residue at the B/C
corner of the globin fold (noted by Lesk & Chothia, 1980) and (2) the
presence of a long strand at the N-terminus of several OB-folds. The
next step is to screen the learnt rules to establish which of them
yield expert-type insights.
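The averaged accuracy described above can be sketched as follows. The sketch computes the percentage of correct assignments per trial and then the mean over trials; it also reports a sample standard deviation, which is an assumption about how a spread across trials would be summarised. The function names and the example counts are invented and are not the project's data.

```python
from statistics import mean, stdev
from typing import List, Tuple


def accuracy(n_correct: int, n_tested: int) -> float:
    """Percentage of correct assignments among the examples tested."""
    return 100.0 * n_correct / n_tested


def summarise(trials: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-trial accuracy.

    Each trial is a (n_correct, n_tested) pair, e.g. one fold's
    cross-validation result.
    """
    accuracies = [accuracy(c, n) for c, n in trials]
    return mean(accuracies), stdev(accuracies)


# Made-up counts for three trials, purely to show the calculation.
example_trials = [(8, 10), (6, 10), (7, 10)]
example_mean, example_sd = summarise(example_trials)
```

Averaging per-trial percentages weights every fold equally regardless of how many examples it has; pooling all counts before dividing would instead weight folds by size.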
A Web interface has been constructed that allows the user to browse through the SCOP hierarchy and see the rules attached to each fold. The user can visualise the content of a rule graphically: applying a rule to a particular structure produces a RasMol script that highlights the secondary structure elements involved.
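A script of this kind could be generated along the following lines. This is a minimal sketch, not the server's actual code: the function name, the choice of colours, and the residue ranges are assumptions, and only the basic RasMol `select` and `colour` commands are used.

```python
from typing import List, Tuple


def rasmol_highlight_script(elements: List[Tuple[int, int]],
                            highlight_colour: str = "red") -> str:
    """Build a RasMol script that colours the whole structure white and
    then colours each (first_residue, last_residue) range in the
    highlight colour."""
    lines = ["select all", "colour white"]
    for first, last in elements:
        if first > last:
            raise ValueError(f"invalid residue range: {first}-{last}")
        lines.append(f"select {first}-{last}")
        lines.append(f"colour {highlight_colour}")
    # Restore the full selection so later commands act on everything.
    lines.append("select all")
    return "\n".join(lines) + "\n"


# Hypothetical residue ranges for two secondary structure elements
# matched by a rule.
example_script = rasmol_highlight_script([(5, 12), (40, 58)])
```

The resulting text can be saved to a file and loaded in RasMol with its `script` command after the structure itself has been loaded.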
We have therefore shown that a computer program can learn expert-type rules that characterise a protein fold. The application of the rules to structure prediction now needs to be evaluated.