Suhail A Islam, Jingchu Luo and Michael J E Sternberg
Identification and analysis of domains in proteins
Protein Engineering Vol. 8 no.6 pp513-525, 1995

With the increase in the number of proteins of determined structure (Hobohm et al., 1992; Orengo et al., 1993; Hobohm & Sander, 1994), there is a pressing requirement for a consistent and automatic computer algorithm to identify domains from coordinates. We have developed an algorithm, based on the approach of Sander (1981), that was originally used to identify the location of domains in 284 non-redundant chains (Hobohm et al., 1992). This provides the basis for an analysis of the structural features of protein domains.

Wetlaufer (1973) surveyed 18 protein structures and highlighted that in many of them the chain folds into distinct structural regions called domains. Wetlaufer introduced the classification of domains into continuous, being formed from a single chain segment, or discontinuous, composed of two or more chain sections. Richardson (1981), in her review, identified about 100 domains. Her assignment was based on the concepts that the domain would be independently stable and/or could undergo rigid-body-like movements with respect to the entire protein. However, for most structures the domain assignment was "by analogy: whether the whole subunit or its part more closely resembled single-domain proteins". This definition yielded a conservative assignment, as several proteins with distinct lobes (e.g. hen lysozyme and subtilisin) were classed as single-domain proteins. Here we follow Richardson's (1981) concept of a domain.

We consider that the Richardson concept has many uses in the analysis and prediction of protein structure. There are now several instances where structurally similar domains occur in different proteins in the absence of marked sequence similarity (Chothia, 1992). Possibly the most notable of such domains is the TIM (β/α)8 barrel (Farber, 1993). These structural similarities have stimulated the development of algorithms to predict structure from sequence based on the identification of a domain fold by the inverse or threading approach (Bowie & Eisenberg, 1993; Jones et al., 1992). In addition, with the increase in the number of known structures, computer algorithms are required to establish if a newly determined fold is novel (Grindley et al., 1993; Holm & Sander, 1993; Orengo et al., 1993; Yee & Dill, 1993). The development of these algorithms requires databases of protein domains of known structure rather than entire polypeptide chains.

Several workers have developed algorithms to identify domains from coordinates. Liljas and Rossmann (1974) highlighted that visual inspection of a distance plot of inter-residue contacts can be used to locate domains, as there is a large number of contacts between residues forming a domain and few contacts between domains. Sander (1981) proposed an approach to quantify this observation, relating domain identification to rigid-body motions. Samraoui and Sternberg (Samraoui, 1985) showed that Sander's approach can be applied to several proteins, and they identified domains that concurred with Richardson's assignment. Wodak and Janin (1981) evaluated the interface area between two parts of the chain for every possible dissection and accepted a division at minima above a cut-off. Recently, Holm and Sander (1994) extended the earlier approach of Sander (1981), based on inter-domain dynamics, to identify folding units in 330 representative protein structures.
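The contact-based observation above (many contacts within a domain, few between domains) can be illustrated with a minimal sketch. This is not the algorithm of Sander (1981) or of the present paper. It assumes a single cut that divides the chain into two continuous segments, an 8 Å distance cut-off for a contact, and one representative coordinate per residue. The function names and the toy coordinates are hypothetical.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Symmetric boolean matrix: True where two residues lie within `cutoff`.

    `coords` holds one (x, y, z) point per residue; the 8.0 cut-off is an
    illustrative assumption, not a value taken from the paper.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts

def split_scores(contacts, min_size=2):
    """For each cut position k, count contacts within and between the
    segments [0, k) and [k, n). Returns {k: (intra, inter)}."""
    n = len(contacts)
    scores = {}
    for k in range(min_size, n - min_size + 1):
        intra = inter = 0
        for i in range(n):
            for j in range(i + 1, n):
                if contacts[i][j]:
                    if (i < k) == (j < k):
                        intra += 1   # both residues on the same side of the cut
                    else:
                        inter += 1   # contact crosses the cut
        scores[k] = (intra, inter)
    return scores

def best_split(contacts, min_size=2):
    """Cut position with the fewest contacts crossing the cut."""
    scores = split_scores(contacts, min_size)
    return min(scores, key=lambda k: scores[k][1])

# Toy chain: two compact groups of four residues, 50 units apart.
toy_coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (3, 3, 0),
              (50, 0, 0), (53, 0, 0), (50, 3, 0), (53, 3, 0)]
toy_contacts = contact_map(toy_coords)
print(best_split(toy_contacts))  # 4 for this toy input
```

A single cut can only find continuous domains. Handling discontinuous domains would require testing dissections with more than one chain break, which this sketch does not attempt.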

Other approaches subject the protein to a series of dissections and generate a tree-like decomposition of the chain into several layers of continuous, compact units. For example, Rose (1979) identified a disclosing plane that cut the protein chain into compact continuous segments. Subsequently, Zehfus (1994) reported an algorithm, extending earlier work by Zehfus and Rose (1986) that identified compact structures, and located discontinuous domains in four globular proteins.

An alternative approach to dissection is to consider that the elements of the fold are the secondary structures. A measure of the interaction between each pair of secondary structures is obtained from evaluating the inter-residue distances. Then the component secondary structures are clustered according to the inter-secondary structure distances. This approach has been implemented in the program QUANTA (Molecular Simulations Inc., Waltham, Mass., USA) and by Sowdhamini and Blundell (1994).
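The clustering idea can be sketched as follows. This is an illustration under stated assumptions, not the QUANTA implementation or the method of Sowdhamini and Blundell (1994). It assumes the interaction measure is the mean distance over all residue pairs of two secondary structures, and it uses single-linkage grouping with a hypothetical 10 Å threshold. The function names and the toy elements are invented for the example.

```python
import math
from itertools import combinations

def sse_distance(a, b):
    """Mean distance over all residue pairs of two secondary structures.

    Using the mean is an assumption; other measures (e.g. the minimum
    inter-residue distance) would fit the same scheme.
    """
    total = sum(math.dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))

def cluster_sses(sses, threshold=10.0):
    """Single-linkage grouping: two secondary structures share a cluster
    if a chain of pairwise distances <= threshold connects them.

    `sses` maps a name to a list of (x, y, z) residue coordinates.
    """
    parent = {name: name for name in sses}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sses, 2):
        if sse_distance(sses[a], sses[b]) <= threshold:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for name in sses:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

# Toy input: two pairs of elements, the pairs 40 units apart.
toy_sses = {
    "H1": [(0, 0, 0), (1.5, 0, 0)],
    "E1": [(0, 5, 0), (1.5, 5, 0)],
    "H2": [(40, 0, 0), (41.5, 0, 0)],
    "E2": [(40, 5, 0), (41.5, 5, 0)],
}
print(cluster_sses(toy_sses))
```

For the toy input, the two elements near the origin form one cluster and the two distant elements form another, so each cluster plays the role of a candidate domain.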

A problem with many of these approaches is that the results have not been systematically compared to the assignment made by visual inspection. Accordingly, our aims are:

i) To develop an automatic algorithm to identify domains. The algorithm should not require any human judgement.

ii) To evaluate the accuracy of this algorithm against a set of authors' definitions derived as consistently as possible.

iii) To quantify structural features of the segregation of the chain into domains.