A protein structure based annotation of genomes

Ph.D. thesis, Arne Muller, 2002 (arne.muller@gmail.com), Cancer Research UK, University College London & Imperial College, London


A strategy for protein structure- and function-based annotation of genomes was developed, evaluated and applied to the proteins of several genomes, including the human genome.
First, the performance of the widely used homology-based sequence comparison program PSI-BLAST in detecting distant homologous relationships (<20% sequence identity) was evaluated. The benchmark is based on two sets of sequences from the Structural Classification Of Proteins (SCOP) database for which the homologous relationships are known. About 40% of the test proteome can be annotated via remote homologies. Common sources of error are identified. PSI-BLAST is applied to assign homologues of known structure and function to proteins of M. genitalium and M. tuberculosis. From the benchmark, the number of missed assignments and the potential extent of new structural and functional families were estimated.
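
The kind of benchmark described above can be illustrated with a minimal sketch. It assumes that each test sequence carries a known superfamily label and that the search program returns a list of sequence pairs it considers homologous. The function name, the identifiers and the labels are hypothetical, and the scoring shown here (coverage of true pairs plus a count of false pairs) is only one plausible scheme, not necessarily the protocol used in the thesis.

```python
from itertools import combinations


def evaluate_hits(superfamily_of, predicted_pairs):
    """Score predicted homologous pairs against known superfamily labels.

    superfamily_of:  dict mapping sequence id -> superfamily label (hypothetical data)
    predicted_pairs: iterable of (id_a, id_b) pairs reported by the search program
    Returns (coverage, number_of_false_positives).
    """
    # All unordered pairs that share a superfamily are treated as true homologues.
    true_pairs = {
        (a, b)
        for a, b in combinations(sorted(superfamily_of), 2)
        if superfamily_of[a] == superfamily_of[b]
    }
    # Normalise predictions to sorted tuples and drop self-hits.
    predicted = {tuple(sorted(p)) for p in predicted_pairs if p[0] != p[1]}

    true_positives = predicted & true_pairs
    false_positives = predicted - true_pairs
    coverage = len(true_positives) / len(true_pairs) if true_pairs else 0.0
    return coverage, len(false_positives)


# Toy example: five domains in two superfamilies, three predicted pairs.
labels = {"d1": "sf1", "d2": "sf1", "d3": "sf1", "d4": "sf2", "d5": "sf2"}
hits = [("d1", "d2"), ("d3", "d2"), ("d1", "d4")]
coverage, n_false = evaluate_hits(labels, hits)
```

In this toy case two of the four true pairs are recovered and one predicted pair links different superfamilies.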

An automated proteome annotation system was developed to perform large scale annotations based on analyses such as PSI-BLAST. Computationally intensive analyses can be distributed across several computers. The system is based on a relational database serving as a back-end and a software interface as a front-end. Relational storage of results from different analyses permits straightforward evaluation of results and the comparison of annotations across genomes.
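
As an illustration of why relational storage makes cross-genome comparison straightforward, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, e-value threshold and demo rows are all assumptions made for this example; the abstract does not describe the actual schema or database engine of the system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE protein (
            id       INTEGER PRIMARY KEY,
            proteome TEXT NOT NULL,
            name     TEXT NOT NULL
        );
        CREATE TABLE hit (
            protein_id INTEGER NOT NULL REFERENCES protein(id),
            analysis   TEXT NOT NULL,   -- e.g. 'psiblast'
            target     TEXT NOT NULL,   -- e.g. a superfamily label
            evalue     REAL NOT NULL
        );
    """)
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
        (1, "proteome_A", "a1"), (2, "proteome_A", "a2"),
        (3, "proteome_B", "b1"), (4, "proteome_B", "b2"),
    ])
    conn.executemany("INSERT INTO hit VALUES (?, ?, ?, ?)", [
        (1, "psiblast", "superfamily_x", 1e-10),
        (1, "psiblast", "superfamily_y", 1e-4),
        (2, "psiblast", "superfamily_x", 1e-8),
        (3, "psiblast", "superfamily_x", 1e-6),
        (4, "psiblast", "superfamily_z", 5.0),   # too weak to count
    ])
    return conn


def annotated_fraction(conn, analysis, max_evalue):
    """Fraction of proteins per proteome with at least one acceptable hit."""
    rows = conn.execute("""
        SELECT p.proteome,
               COUNT(DISTINCT h.protein_id) * 1.0 / COUNT(DISTINCT p.id)
        FROM protein AS p
        LEFT JOIN hit AS h
               ON h.protein_id = p.id
              AND h.analysis = ?
              AND h.evalue <= ?
        GROUP BY p.proteome
        ORDER BY p.proteome
    """, (analysis, max_evalue)).fetchall()
    return dict(rows)


conn = build_demo_db()
fractions = annotated_fraction(conn, "psiblast", 1e-3)
```

Because every analysis writes to the same tables, comparing annotation levels across proteomes reduces to a single grouped query rather than per-genome parsing of result files.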

The above annotation system was applied to fourteen proteomes, including the human proteome. The extent and reliability of structural and functional annotation in these proteomes were evaluated and compared. About 40% of the human proteome can be assigned to protein folds. For 77% of the proteome there is some functional information, but only 26% of the proteome can be assigned to the standard sequence motifs that characterise function. There are substantial differences in the composition of membrane proteins between the proteomes in terms of their globular domains. Commonly occurring structural superfamilies are identified and compared across the proteomes. The frequencies of these superfamilies lead to the estimate that 98% of the human proteome evolved by domain duplication, with four of the ten most duplicated superfamilies potentially specific to multi-cellular organisms. Occurrence of domains in repeats is more common in metazoa than in single-cellular organisms. Superfamily pairs co-occurring in the same protein sequence were analysed and compared across the proteomes. Structural superfamilies over- and under-represented in human disease genes were identified.
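
One simple way to turn superfamily frequencies into a duplication estimate is sketched below. It assumes that each superfamily descends from a single ancestral domain, so every further copy counts as a duplication event. This formula is an assumption made for illustration and may differ from the calculation used in the thesis; the function name and the counts are hypothetical.

```python
def duplication_fraction(domain_counts):
    """Estimate the fraction of domains that arose by duplication.

    domain_counts: dict mapping superfamily label -> number of assigned domains.
    Assumption: one founding domain per superfamily, all others are duplicates.
    """
    total_domains = sum(domain_counts.values())
    n_superfamilies = sum(1 for n in domain_counts.values() if n > 0)
    if total_domains == 0:
        return 0.0
    return (total_domains - n_superfamilies) / total_domains


# Toy counts: 50 assigned domains spread over 3 superfamilies.
toy_counts = {"sf1": 48, "sf2": 1, "sf3": 1}
fraction = duplication_fraction(toy_counts)  # (50 - 3) / 50
```

Under this assumption the estimate rises as a few superfamilies account for a growing share of all assigned domains.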

Download the thesis (217 pages, 4.8 MB) in PDF format (hyperlinks in colour or black & white).
