Next: Results Up: Mining Knowledge of Protein Previous: Overview

Methods

**Figure 1:** Learning protein folds concept using Progol
$\includegraphics[width=\textwidth]{ilp-web.eps}$

Figure 1 summarizes the steps involved in a learning task. Rules are sought to explain a target concept, here protein folds as defined in the SCOP classification. Positive and negative examples are identified, for example, domain d1hlb__ adopts a Globin fold, which we represent as fold('Globin', d1hlb__). The background knowledge provides additional information about the domains which is used in constructing new rules; for example we know that the second helix of domain d1hlb__ contains the amino acid proline.

Inductive Logic Programming describes a research area formed at the intersection of Logic Programming and Machine Learning. An ILP system takes examples E of a concept, such as protein fold, together with background knowledge B, such as secondary structure relationships, and learn a hypothesis H which explains E in terms of B; $B \wedge H \models E$ . The examples, background knowledge and hypothesis are represented as logic programs, i.e. sets of definite clauses having the general form head :- body; ``if body then head''. For example, the following hypothesis has been learnt for the Globin fold,

fold('Globin-like', D) :-
    adjacent(D, H1, H2, 1, h, h),
    has_pro(H2).

The body is a conjunction of predicates from the background knowledge, in the example above, adjacent(D, H1, H2, 1, h, h) is true if helices H1 and H2 are adjacent in domain D and has_pro(H2) is true if H2 contains a proline.

**Figure 2:** Step 1. Construction of the bottom clause.
$\fbox{\usebox{\savepar}}$

**Figure 3:** Step 2. General to specific search.
$\fbox{\usebox{\savepar}}$

The learning procedure involves two steps. First, a positive example is randomly selected and a set of relevant predicates is calculated, see Figure 2, this process is known as mode-directed inverse entailment [4]. The second step consists in constructing a hypothesis which maximize a measure of expression and compression, see Figure 3. The best such hypothesis is selected and the examples it covers are removed. The procedure resumes with step one and continues until no examples remain.

Next: Results Up: Mining Knowledge of Protein Previous: Overview

Marcel Turcotte
1999-10-20