Ph.D. thesis, Arne Muller, 2002 (arne.muller@gmail.com), Cancer Research UK, University College London & Imperial College, London
Abstract
A strategy for annotating genomes on the basis of protein structure and function was developed, evaluated and applied to the proteins of several genomes, including the human genome.
First, the ability of the widely used homology-based sequence comparison program PSI-BLAST to detect distant homologous relationships (<20% sequence identity) was evaluated. The benchmark is based on two sets of sequences from the Structural Classification Of Proteins (SCOP) database for which the homologous relationships are known. About 40% of the test proteome can be annotated via remote homologies. Common sources of error were identified. PSI-BLAST was then applied to assign homologues of known structure and function to the proteins of M. genitalium and M. tuberculosis. From the benchmark, the number of missed assignments and the potential extent of new structural and functional families were estimated.
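The benchmark idea described above can be illustrated with a minimal sketch. It assumes that every pair of sequences sharing a SCOP superfamily label counts as a true homologous relationship, and it scores a list of reported pairs against that reference. The function name, the sequence identifiers and the toy data are hypothetical and are not taken from the thesis; the restriction to pairs below 20% sequence identity is left out for brevity.

```python
from itertools import combinations


def evaluate_detection(superfamily_of, detected_pairs):
    """Score reported pairs against known homologous relationships.

    superfamily_of: dict mapping sequence id -> superfamily label
    detected_pairs: iterable of (id_a, id_b) pairs reported by a search program
    Returns (sensitivity, number of false positives).
    """
    # Assumption: two sequences are homologous if they share a superfamily.
    true_pairs = {
        frozenset(pair)
        for pair in combinations(sorted(superfamily_of), 2)
        if superfamily_of[pair[0]] == superfamily_of[pair[1]]
    }
    # Ignore self-hits and treat (a, b) and (b, a) as the same pair.
    reported = {frozenset(p) for p in detected_pairs if p[0] != p[1]}
    true_positives = reported & true_pairs
    false_positives = reported - true_pairs
    sensitivity = len(true_positives) / len(true_pairs) if true_pairs else 0.0
    return sensitivity, len(false_positives)


# Toy data: five domains in two superfamilies, three reported pairs.
labels = {"d1": "sfA", "d2": "sfA", "d3": "sfA", "d4": "sfB", "d5": "sfB"}
hits = [("d1", "d2"), ("d4", "d5"), ("d3", "d4")]
sensitivity, n_false = evaluate_detection(labels, hits)
print(f"sensitivity={sensitivity:.2f}, false positives={n_false}")
```

In this toy case two of the four true pairs are recovered and one reported pair links different superfamilies. Applied to a real benchmark, the same kind of count indicates how many assignments a search is likely to miss.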
An automated proteome annotation system was developed to perform large-scale annotations based on analyses such as PSI-BLAST. Computationally intensive analyses can be distributed across several computers. The system consists of a relational database as the back-end and a software interface as the front-end. Relational storage of results from different analyses permits straightforward evaluation of the results and comparison of annotations across genomes.
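As a rough illustration of why relational storage makes cross-genome comparison straightforward, the sketch below uses Python's built-in sqlite3 module. The table names, column names, identifiers and E-value threshold are assumptions made for this example only; the abstract does not describe the actual schema or database engine.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    protein_id TEXT PRIMARY KEY,
    proteome   TEXT NOT NULL
);
CREATE TABLE hit (
    protein_id TEXT NOT NULL REFERENCES protein(protein_id),
    analysis   TEXT NOT NULL,   -- name of the analysis, e.g. 'psiblast'
    target     TEXT NOT NULL,   -- matched domain or motif identifier
    evalue     REAL
);
""")

# Toy rows; all identifiers and scores are invented.
conn.executemany("INSERT INTO protein VALUES (?, ?)", [
    ("p1", "human"), ("p2", "human"), ("p3", "human"), ("p4", "human"),
    ("p5", "mgen"), ("p6", "mgen"),
])
conn.executemany("INSERT INTO hit VALUES (?, ?, ?, ?)", [
    ("p1", "psiblast", "domA", 1e-10),
    ("p1", "psiblast", "domB", 1e-5),
    ("p2", "psiblast", "domA", 0.5),
    ("p5", "psiblast", "domC", 1e-20),
    ("p3", "motif", "motifX", 1e-3),
])


def annotated_fraction(conn, analysis, max_evalue):
    """Fraction of proteins per proteome with at least one accepted hit."""
    rows = conn.execute("""
        SELECT p.proteome,
               COUNT(DISTINCT h.protein_id) * 1.0 / COUNT(DISTINCT p.protein_id)
        FROM protein AS p
        LEFT JOIN hit AS h
          ON h.protein_id = p.protein_id
         AND h.analysis = ?
         AND h.evalue <= ?
        GROUP BY p.proteome
        ORDER BY p.proteome
    """, (analysis, max_evalue))
    return dict(rows.fetchall())


coverage = annotated_fraction(conn, "psiblast", 1e-3)
print(coverage)
```

Because every analysis writes to the same hit table, the same query can be rerun for a different analysis name or threshold without changing any parsing code.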
The annotation system was applied to fourteen proteomes, including the human proteome. The extent and reliability of structural and functional annotation in these proteomes were evaluated and compared. About 40% of the human proteome can be assigned to protein folds. For 77% of the proteome there is some functional information, but only 26% of the proteome can be assigned to the standard sequence motifs that characterise function. There are substantial differences in the composition of membrane proteins between the proteomes in terms of their globular domains. Commonly occurring structural superfamilies were identified and compared across the proteomes. The frequencies of these superfamilies lead to the estimate that 98% of the human proteome evolved by domain duplication, with four of the ten most duplicated superfamilies potentially specific to multi-cellular organisms. Occurrence of domains in repeats is more common in metazoa than in single-celled organisms. Superfamily pairs co-occurring in the same protein sequence were analysed and compared across the proteomes. Structural superfamilies over- and under-represented in human disease genes were identified.
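The abstract does not state how the duplication estimate is derived from superfamily frequencies. One simple way to obtain a figure of this kind, shown here purely as an assumption, is to treat one domain per superfamily as the ancestral copy and count every further member as the product of duplication. The function name and the toy data are hypothetical.

```python
from collections import Counter


def duplicated_fraction(domain_assignments):
    """Fraction of assigned domains attributed to duplication.

    Assumption: each superfamily descends from a single ancestral domain,
    so a superfamily with n members accounts for n - 1 duplications.
    """
    counts = Counter(domain_assignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


# Toy data: ten assigned domains drawn from three superfamilies.
assignments = ["sfA"] * 6 + ["sfB"] * 3 + ["sfC"]
fraction = duplicated_fraction(assignments)
print(f"{fraction:.0%} of domains attributed to duplication")
```

Under this assumption the estimate approaches 100% whenever the number of assigned domains is much larger than the number of distinct superfamilies, which is the situation in large proteomes.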
Download the thesis (217 pages, 4.8 MB) in pdf-format (hyperlinks in colour or black & white).