ALGORITHM FOR AUTOMATIC DOMAIN ASSIGNMENT


The algorithm is based on dividing the chain segment to minimise the density of inter-domain contacts in the distance plot . Consider a chain segment from residue a to b ( for a complete chain of N residues a=1, b= N). For residues i and j , the entry in the distance matrix Dij is 1 if the carbon a- carbon a distance is less than a cut off distance d, or 0 otherwise. In a graphical representation of the matrix a point is plotted at ij if Dij =1. Figure 2 shows an actual example of a domain contact map for alcohol dehydrogenase (8adh, Eklund et al., 1976). As the distance matrix is symmetric only half is represented in Figure 2.

Consider that the chain is divided into two domain a to k and k+1 to b. Then the number of inter-domain contacts Ck, is given by:

The maximum number of possible inter-domain contacts, Mk is given by:

Mk = (k - a + 1) (b - k) The density of contact Sk is defined as Ck divided by the maximum and is:

The value of Sk is calculated for k = a + 15 to k = b - 15 . A potential cut to the chain is made at the minimum value of Sk. To determine whether the chain should be subdivided, the average (Aab) of Sk is then calculated by:

b

Aa,b = S Sk / (b - a + 1)

k = a

and a division made only if

Sk / Aa,b F

where F is a cut-off value to be determined empirically.

This yields a series of continuous chain segments each of which is a potential domain. However a discontinuous domains could be formed from two or more such continuous segments. These segments are clustered into domains by the following approach. The value of Sk is calculated between every pair of continuous segments and divided by A1,N . These values are stored in a matrix X as a percentage. The maximum entry in X is examined and if greater than F, then the corresponding two segments (say p and q) are assigned to the same domain. The values in X are updated to include the assignment of segments p and q to the same domain. This procedure is repeated until no further clustering of segments is allowed. In addition, any segment of less than 32 residues was merged with the next section along the chain that is longer than 31 residues. A segment of less than 32 residues at the C-terminus was merged with the nearest preceeding segment of more than 31 residues. As a result of this merging procedure, the algorithm will only identify domains longer than 31 residues.

A systematic method was used to measure the accuracy of program's assignments compared to the authors' assignments (Figure 3). The accuracy score Sc is (Ncorrect/Ntotal) x 100%. Ncorrect is the number of residues identified in the authors' assignment as belonging to a particular domain that were correctly assigned to that domain by the program. A linking region in the authors' assignment was always taken as being correctly identified by the program. Ntotal is the maximum value that could be obtained for Ncorrect. A domain assignment to a protein was only considered correct if the number of domains found by the program was equal to the number of author-assigned domains and the accuracy Sc 95%.

Runs of the program were performed for d ranging between 6.5 and 13 in steps of 0.5 and for F from 40 to 65% in steps of 5% to obtain the optimum values for these two adjustable parameters assessed by the accuracy . The values selected are d = 11 and F = 55%.