
Overview

The aims of this research project are (1) to derive new principles governing the formation of protein folds, such as common substructures and the relationship between local sequence and tertiary structure; (2) to challenge a new program for data mining using inductive logic programming (Progol) by applying it to a complex real-world biological problem; and (3) to develop a Web server for dissemination of the methodology.

The first phase of the project involved constructing a deductive database in Prolog that included information about the three-dimensional fold type of each protein (from the SCOP database [1]), the secondary structure and its packing (from PROMOTIF [2]), the sequence, and the percentage sequence identity between the proteins.

The second phase was to apply Inductive Logic Programming (ILP), as implemented in the program Progol [3], to learn rules governing protein topology. Rules were learnt that would discriminate a particular fold type (e.g. a TIM barrel) from the other folds of the same structural class (e.g. non-TIM barrels formed from alternating $\alpha/\beta$ secondary structures). A subset of the deductive database was used as background information from which the rules were developed. The features included the total number of secondary structures and information about the individual components (length, hydrophobicity, location along the sequence, and length of coil to the next secondary structure).
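To make the feature set concrete, the following Python sketch shows one way the per-element features listed above could be represented and tested by a rule. It is only an illustration: the project itself stores this information as Prolog facts and learns rules with Progol, and the class names, the hydrophobicity scale, the helper make_toy_protein and the rule toy_rule are all hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SecondaryStructure:
    kind: str              # "helix" or "strand"
    position: int          # ordinal location along the sequence (1 = first)
    length: int            # number of residues in the element
    hydrophobicity: float  # illustrative score; the scale is an assumption
    coil_to_next: int      # length of coil to the next element


@dataclass(frozen=True)
class Protein:
    name: str
    elements: List[SecondaryStructure]

    @property
    def n_elements(self) -> int:
        # Total number of secondary structures, one of the listed features.
        return len(self.elements)


def toy_rule(protein: Protein) -> bool:
    """Hypothetical rule for illustration only; it was not learnt by Progol.

    Fires when the protein has at least eight secondary structures and the
    first one is a strand of six or more residues.
    """
    if protein.n_elements < 8:
        return False
    first = min(protein.elements, key=lambda e: e.position)
    return first.kind == "strand" and first.length >= 6


def make_toy_protein(name: str, n: int, first_length: int) -> Protein:
    """Build an invented protein with alternating strand/helix elements."""
    elements = [
        SecondaryStructure(
            kind="strand" if i % 2 == 0 else "helix",
            position=i + 1,
            length=first_length if i == 0 else 8,
            hydrophobicity=0.5,
            coil_to_next=3,
        )
        for i in range(n)
    ]
    return Protein(name=name, elements=elements)
```

Under these assumptions, toy_rule(make_toy_protein("example", 16, 7)) holds, whereas a protein with only six elements, or one whose first strand is shorter than six residues, is rejected.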

Rules were learnt for the five most populated folds of each of the four structural classes. The cross-validated accuracy (number of correct assignments / total number of examples tested, as a percentage), averaged over the 20 trials, is 74 $\pm$ 10%. Inspection of the rules showed that several provided expert-type insight into key features of the fold, for example (1) the presence of a Pro residue at the B/C corner of the globin fold (noted by Lesk & Chothia, 1980) and (2) the presence of a long strand at the N-terminus of several OB-folds. The next step is to screen the learnt rules to establish which of them yield expert-type insights.
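The accuracy figure can be read as a mean and a spread over per-trial percentages. The Python sketch below shows that arithmetic under two assumptions that the text does not state: that the $\pm$ value is a sample standard deviation, and that each trial contributes one accuracy value. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy(correct: int, tested: int) -> float:
    """Percentage of correct assignments among the examples tested."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return 100.0 * correct / tested


def summarise(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-trial accuracies.

    Assumption: the reported spread is a sample standard deviation.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two trials")
    return mean(accuracies), stdev(accuracies)
```

For instance, accuracy(37, 50) gives 74.0 for a single trial; passing the per-trial values to summarise would yield the pair reported as mean $\pm$ spread.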

A Web interface has been constructed that allows the user to browse the SCOP hierarchy and see the rules attached to each fold. The user can visualise the content of a rule graphically: applying a rule to a particular structure produces a RasMol script that highlights the secondary structure elements involved.
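As a rough illustration of the last step, the Python sketch below generates a small RasMol script that colours given residue ranges. The actual scripts produced by the server are not shown in this overview, so the choice of commands (select, colour, wireframe, cartoon), the colours, and the function name rasmol_script are assumptions.

```python
from typing import Iterable, Tuple


def rasmol_script(segments: Iterable[Tuple[int, int]],
                  highlight: str = "red") -> str:
    """Return a RasMol script highlighting residue ranges (start, end).

    Assumption: each secondary structure element matched by a rule is
    given as an inclusive residue range in the structure's numbering.
    """
    lines = [
        "select all",
        "wireframe off",
        "cartoon",          # draw the whole chain as a cartoon
        "colour white",     # neutral background colour for the chain
    ]
    for start, end in segments:
        if start > end:
            raise ValueError(f"invalid range: {start}-{end}")
        lines.append(f"select {start}-{end}")
        lines.append(f"colour {highlight}")
    lines.append("select all")  # restore the default selection
    return "\n".join(lines) + "\n"
```

Calling rasmol_script([(12, 20), (45, 52)]) returns text that can be saved to a file and loaded in RasMol with its script command.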

We have therefore shown that a computer program can learn expert-type rules that characterise a protein fold. The application of the rules to structure prediction now needs to be evaluated.


Marcel Turcotte
1999-10-20