
Summary

The Bioinformatics Challenge - The expansion in size and complexity of many primary and secondary biological databases has generated a pressing need for automated methods of gleaning scientific principles from the collected information. This problem is well suited to the emerging technologies of data mining, and the richness of the biological information, together with the requirement that the rules obtained be understandable, provides a stringent test for any algorithm. The benefits of this approach need to be made available to the wider community, and a timely way to do so is to develop a Web server.

Proposed Approach - We propose to investigate this general problem through a specific case study: the use of a new program for knowledge mining by inductive logic programming, PROGOL (Muggleton, 1995), to derive new principles governing protein topology from the rapidly expanding database of protein structures (Orengo et al., 1994). Specifically, we propose to (1) establish a database of protein topology and function encoded in PROLOG; (2) perform knowledge mining using PROGOL to obtain new structural principles; (3) identify required improvements in PROGOL; (4) evaluate the utility of the learnt rules for understanding and predicting protein architecture; (5) disseminate the approach to the biological and computer science communities via a Web server.
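As a rough illustration of steps (1), (2) and (4), the sketch below stores a few toy protein descriptions as facts and counts how many examples a candidate rule covers. It is only a sketch under stated assumptions: it is written in Python rather than PROLOG, it does not reproduce PROGOL or its search, and the protein names, fold labels, the `Protein` record, and the `candidate_rule` and `coverage` functions are all hypothetical. It also scores the rule against negative examples for simplicity, whereas the proposal is concerned with learning from positive-only examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """One toy database entry (hypothetical schema)."""
    name: str
    fold: str          # hypothetical fold label
    elements: tuple    # secondary-structure elements, N- to C-terminus


# Invented toy data, not taken from any real structure database.
PROTEINS = (
    Protein("p1", "four_helix_bundle", ("helix", "helix", "helix", "helix")),
    Protein("p2", "four_helix_bundle", ("helix", "helix", "helix", "helix")),
    Protein("p3", "other", ("strand", "helix", "strand", "helix")),
    Protein("p4", "other", ("helix", "helix", "strand")),
)


def candidate_rule(protein):
    """Candidate rule body: exactly four elements, all of them helices."""
    return len(protein.elements) == 4 and all(
        element == "helix" for element in protein.elements
    )


def coverage(rule, proteins, target_fold):
    """Return (positives covered, negatives covered) for a candidate rule."""
    positives = [p for p in proteins if p.fold == target_fold]
    negatives = [p for p in proteins if p.fold != target_fold]
    covered_positives = sum(1 for p in positives if rule(p))
    covered_negatives = sum(1 for p in negatives if rule(p))
    return covered_positives, covered_negatives


if __name__ == "__main__":
    pos, neg = coverage(candidate_rule, PROTEINS, "four_helix_bundle")
    print(f"covers {pos} positive and {neg} negative examples")
```

A rule that covers many positive examples and few or no negative ones is the kind of compact, readable statement that the evaluation in step (4) would then examine for biological sense.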

Work Programme - The work will be a collaboration between the Imperial Cancer Research Fund (Dr Sternberg), Oxford University Computing Laboratory (Dr Muggleton), Glaxo Bioinformatics (Drs Saqi and Sayle) and SmithKline Beecham Bioinformatics (Dr Rawlings). It is rooted in existing interactions: those of Sternberg and Muggleton in applying machine learning to structural biology (e.g. King et al., 1992; King et al., 1995; Muggleton et al., 1992; Sternberg et al., 1994); those of Rawlings and Sternberg in the development of, and knowledge mining from, a protein topology database (King et al., 1994); and structure prediction by threading (ICRF / Glaxo). The timescales are: Year 1 - Pilot study on the all-α subset of protein structures to identify any problems in the approach; by the year end, rules should have been learnt and evaluated. Year 2 - Improvements to methodology; scaling up to include more protein data. Year 3 - Final run on the entire protein data; evaluation of rules; dissemination of the approach by development of a Web server.

The Benefits - (1) New principles of protein architecture that further fundamental understanding and structure prediction will be rigorously derived from a database that is becoming unwieldy for experts. (2) Developments required in PROGOL will be identified that should improve its capacity for knowledge mining from positive-only examples. (3) The approach will be made available to the community via a Web server. (4) More generally, the system encodes a high level of complexity, and the developed technology should be applicable to other topics, not only in bioinformatics but also in many scientific disciplines, including drug design.


Marcel Turcotte
1999-10-20