Ph.D. thesis, Arne Muller, 2002: A protein structure based annotation of genomes

A protein structure based annotation of genomes

Ph.D. thesis, Arne Muller, 2002 (arne.muller@gmail.com), Cancer Research UK, University College London & Imperial College, London

Abstract

A strategy for protein structure and function based annotation of genomes was developed, evaluated and applied to the proteins of several genomes including the human genome.

First, the performance of the widely used homology-based sequence comparison program PSI-BLAST in detecting distant homologous relationships (<20% sequence identity) was evaluated. The benchmark is based on two sets of sequences from the Structural Classification of Proteins (SCOP) database for which the homologous relationships are known. About 40% of the test proteome can be annotated via remote homologies. Common sources of error were identified. PSI-BLAST was then applied to assign homologues of known structure and function to proteins of M. genitalium and M. tuberculosis. From the benchmark, the number of missed assignments and the potential extent of new structural and functional families were estimated.

An automated proteome annotation system was developed to perform large scale annotations based on analyses such as PSI-BLAST. Computationally intensive analyses can be distributed across several computers. The system is based on a relational database serving as a back-end and a software interface as a front-end. Relational storage of results from different analyses permits straightforward evaluation of results and the comparison of annotations across genomes.
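The idea of relational storage enabling cross-genome comparison can be illustrated with a minimal sketch. It is not the schema used in the thesis: the table names (`protein`, `hit`), the column names, the helper functions and the use of SQLite are all assumptions chosen for illustration. Results from any analysis are stored as rows keyed by protein, so one query yields the annotated fraction of every proteome.

```python
import sqlite3

# Hypothetical two-table schema: one row per protein, one row per analysis hit.
SCHEMA = """
CREATE TABLE protein (
    protein_id TEXT PRIMARY KEY,
    proteome   TEXT NOT NULL
);
CREATE TABLE hit (
    protein_id TEXT NOT NULL REFERENCES protein(protein_id),
    analysis   TEXT NOT NULL,   -- e.g. 'psiblast'
    target     TEXT NOT NULL,   -- identifier of the matched domain or motif
    evalue     REAL NOT NULL
);
"""


def build_db(proteins, hits):
    """Create an in-memory database and load proteins and analysis hits."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO protein VALUES (?, ?)", proteins)
    conn.executemany("INSERT INTO hit VALUES (?, ?, ?, ?)", hits)
    conn.commit()
    return conn


def annotated_fraction(conn, analysis, max_evalue):
    """Return {proteome: fraction of proteins with at least one accepted hit}."""
    query = """
        SELECT p.proteome,
               COUNT(DISTINCT h.protein_id) * 1.0 / COUNT(DISTINCT p.protein_id)
        FROM protein AS p
        LEFT JOIN hit AS h
               ON h.protein_id = p.protein_id
              AND h.analysis = ?
              AND h.evalue <= ?
        GROUP BY p.proteome
        ORDER BY p.proteome
    """
    return dict(conn.execute(query, (analysis, max_evalue)).fetchall())


if __name__ == "__main__":
    # Made-up example data, not results from the thesis.
    proteins = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("b2", "B")]
    hits = [("a1", "psiblast", "d1", 1e-10), ("b1", "psiblast", "d1", 1e-8)]
    print(annotated_fraction(build_db(proteins, hits), "psiblast", 1e-3))
```

Because the hit filter sits in the join condition rather than in a WHERE clause, proteomes without any accepted hit still appear in the result with a fraction of zero.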

The above annotation system was applied to fourteen proteomes including the human proteome. The extent and reliability of structural and functional annotation in these proteomes were evaluated and compared. About 40% of the human proteome can be assigned to protein folds. For 77% of the proteome there is some functional information, but only 26% of the proteome can be assigned to the standard sequence motifs that characterise function. There are substantial differences in the composition of membrane proteins between the proteomes in terms of their globular domains. Commonly occurring structural superfamilies are identified and compared across the proteomes. The frequencies of these superfamilies lead to the estimate that 98% of the human proteome evolved by domain duplication, with four of the ten most duplicated superfamilies potentially specific for multi-cellular organisms. Occurrence of domains in repeats is more common in metazoa than in single-cellular organisms. Superfamily pairs co-occurring in the same protein sequence were analysed and compared across the proteomes. Structural superfamilies over- and under-represented in human disease genes were identified.
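A duplication estimate of this kind can be derived from superfamily frequencies along the following lines. The sketch assumes a commonly used convention, which may differ from the calculation in the thesis: one domain per superfamily is treated as the founder, and every further domain of that superfamily is counted as a product of duplication. The function name and the input format are hypothetical.

```python
from collections import Counter


def duplication_fraction(domain_superfamilies):
    """Fraction of assigned domains attributed to duplication.

    domain_superfamilies: iterable with one superfamily label per assigned
    domain. Assumed convention: (domains - superfamilies) / domains.
    """
    counts = Counter(domain_superfamilies)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no domain assignments given")
    return (total - len(counts)) / total


if __name__ == "__main__":
    # Made-up labels: four domains in two superfamilies.
    print(duplication_fraction(["sf1", "sf1", "sf1", "sf2"]))
```

Under this convention the estimate approaches 100% whenever the number of assigned domains greatly exceeds the number of distinct superfamilies.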

Download the thesis (217 pages, 4.8 MB) in pdf-format (hyperlinks in colour or black & white).
