Arne's Homepage

Arne Muller

Graduate student in Mike Sternberg's group at Cancer Research UK & Imperial College, London, between 1998 and 2002.

I'm now working in Bioinformatics/Systems Toxicology at Novartis. Email contact: forename.surname [AT] gmail com

Why I was here ...

In 1998 I got my Diploma in Biology (at the Georg-August University of Goettingen, Germany). My special interest was proteins and their three-dimensional structure. With some programming skills in C and some development on RasMol (thanks Roger!) I jumped into protein-structure-based bioinformatics. Why? Hmm ..., because I realized that I can be a software developer and a biologist at the same time - and I don't have to work in a wet-lab anymore ;-) . Anyway, after four years of bioinformatics I have to admit things are not much easier just because there is no protein to purify or cells to grow, so here we go!

Project descriptions:

I was working on structural genome annotation.

My latest project in Mike Sternberg's lab (2004) involved the development and maintenance of a database for structural and functional genome annotation. The generated annotation is compared across fully sequenced genomes. In particular, the protein domain family compositions of genomes are compared in several different ways and contexts. The goals of the project were:

A) General evolutionary insights, e.g. how protein families and superfamilies have evolved.

B) How domain families have been recruited and are used in a new functional context (e.g. domain combinations, evolution of repeats within proteins, globular domains in transmembrane proteins, domains from disease genes).

C) Provide access to a broad database of structural and functional annotation.

D) Provide a research platform for projects within our lab. The database is accessed through a high-level object-oriented Perl API (Perl is the language consensus in our group ;-), allowing e.g. fast retrieval of precalculated homologous sequences, alignments, and other features.

The analysis pipeline currently focuses on protein sequences, for which we perform several analysis steps: identification of transmembrane regions, coiled-coils, low complexity regions, Prosite patterns, PFAM and SCOP domains, repeats and homologous sequences, and secondary structure prediction. Structural information (fold classification) is assigned to sequences of the genomes via homology (using Blast, PSI-Blast and our in-house software 3D-PSSM).

The database is accessible via the web as 3D-GENOMICS.

My Ph.D. thesis "A protein structure based annotation of genomes" describes the above and other projects in detail and is available on the web.

Older projects:

I have worked on a project that deals with PSI-BLAST in genome annotation. We have developed a benchmark that evaluates the performance of PSI-BLAST in terms of coverage and errors in genome annotation. Another part of the work is to identify ORFs with homologues of known structure for the genomes of Mycoplasma genitalium and Mycobacterium tuberculosis. Results and data of the project can be found here ....

RasMol. Between 1996 and 1998 I did some development on Roger Sayle's molecular viewer RasMol, which was published as RasMol2.6b2x1 (eXtended RasMol). These changes have been taken up by OpenRasMol, a project coordinated by Herbert Bernstein to integrate, maintain and develop the different derivatives of RasMol that have been around.

Software tools downloadable from this site.

For parts of my work I use PYTHON as the programming language. Python is a high-level object-oriented scripting language. You can download the following software package: a parser for BLAST/PSI-BLAST written in Python (see below).

The parser reads in an output file from a BLAST/PSI-BLAST/tBlastN run and represents it as a data structure, so that you can access the individual bits of information of the BLAST results. Please note that the parser is still not perfect and was developed exclusively for my own needs. The README of the package and the source code itself provide some documentation. The software may be part of the BIOPYTHON project and can be used under the terms of the BIOPYTHON license (also included in the README file).
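To illustrate what "representing BLAST output as a data structure" can look like, here is a minimal sketch. It is not the API of the package described above: the names `Hit`, `parse_evalue` and `parse_summary` are made up for this example, and it assumes the classic plain-text BLAST report in which each line of the one-line summary section ends with a bit score and an e-value. It only reads that summary section, not the alignments.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """One line of the BLAST one-line summary section (hypothetical layout)."""
    identifier: str
    description: str
    bit_score: float
    e_value: float


def parse_evalue(text: str) -> float:
    # Older BLAST versions print very small e-values without a mantissa,
    # e.g. "e-100"; prepend "1" so that float() accepts the value.
    return float("1" + text if text.startswith("e") else text)


def parse_summary(lines: Iterable[str]) -> List[Hit]:
    """Collect the hits listed under 'Sequences producing significant alignments'."""
    hits: List[Hit] = []
    in_summary = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Sequences producing significant alignments"):
            in_summary = True
            continue
        if not in_summary:
            continue
        if line.startswith(">"):
            break  # the alignment section starts here; the summary is over
        # The last two whitespace-separated columns are bit score and e-value.
        fields = line.rsplit(None, 2)
        if len(fields) != 3:
            continue  # blank or unexpected line
        head, score, evalue = fields
        try:
            bit_score = float(score)
            e_value = parse_evalue(evalue)
        except ValueError:
            continue  # not a hit line
        identifier, _, description = head.strip().partition(" ")
        hits.append(Hit(identifier, description.strip(), bit_score, e_value))
    return hits
```

With a report file opened as `handle`, `parse_summary(handle)` returns a list of `Hit` objects whose fields can then be inspected one by one.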

NEW in version 1.2 (major changes):

in blastflt.py: clustering of redundant hits (choose your definition of what's redundant)
in blastflt.py: choose threshold for e-value and percent sequence identity to keep or skip hits
in blast.py: support of tblastn and blastn and output that contains results of several blast runs
new module 'misc.py' with some new class independent routines and class 'AlignmentContainer'
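The filtering and clustering ideas listed above can be sketched as follows. This is an illustration under assumptions, not the code in blastflt.py: the `Hit` fields, the default thresholds and the overlap-based redundancy test are all invented for the example. The redundancy test is passed in as a function, which is one way to let the caller "choose the definition of what's redundant".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    """Hypothetical hit record; field names are assumptions for this sketch."""
    name: str
    e_value: float
    identity: float  # percent sequence identity
    start: int       # first aligned query position (1-based, inclusive)
    end: int         # last aligned query position (inclusive)


def filter_hits(hits: List[Hit], max_e_value: float = 1e-3,
                min_identity: float = 30.0) -> List[Hit]:
    """Keep hits at or below the e-value threshold and at or above the identity threshold."""
    return [h for h in hits
            if h.e_value <= max_e_value and h.identity >= min_identity]


def overlap_redundant(a: Hit, b: Hit, min_overlap: float = 0.8) -> bool:
    """One possible definition: two hits are redundant if they cover mostly
    the same query region (overlap relative to the shorter hit)."""
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return overlap / shorter >= min_overlap


def cluster_hits(hits: List[Hit],
                 is_redundant: Callable[[Hit, Hit], bool]) -> List[List[Hit]]:
    """Greedy clustering: visit hits from best to worst e-value and attach each
    one to the first cluster whose representative (its best hit) it is
    redundant with; otherwise start a new cluster."""
    clusters: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: h.e_value):
        for cluster in clusters:
            if is_redundant(cluster[0], hit):
                cluster.append(hit)
                break
        else:
            clusters.append([hit])
    return clusters
```

A typical call would be `cluster_hits(filter_hits(hits), overlap_redundant)`; swapping in a different two-argument function changes what counts as redundant without touching the clustering loop.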

The file is in 'tar.gz' format (845925 bytes) and includes the following files:

README                 this file
fio.py                 source for buffered file input/output
misc.py                other (useful) routines and classes
blast.py               source for the blast parser
blastflt.py            source for an example program using fio.py and blast.py
Rv3829c.blast          example BLAST version 2.10.0 output
Rv3829c_blast.flt      Rv3829c.blast processed with blastflt.py
Rv3829c.psiblast       example PSI-BLAST version 2.10.0 output
Rv3829c_psiblast.flt   Rv3829c.psiblast processed with blastflt.py

Download the blast parser:

version 1.2.4 (current version, 07/2003)
version 1.2.3
version 1.2
version 1.1.1
version 1.1
version 1.0

PLEASE REPORT BUGS AND SUGGESTIONS TO: arne d o t muller @ g mail.com

TeXMed - a BibTeX interface for PubMed

For most of my scientific writing I use LaTeX. Unfortunately, the PubMed literature database at NCBI does not provide any export filter for the BibTeX format that LaTeX uses for managing a bibliography. I have therefore written a simple web-based interface to PubMed that lets you query PubMed as if you were directly on the PubMed server. You can select articles and store them in a "shopping basket", and once you think "that's about it" you can export these articles in one go in BibTeX format.
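The export step boils down to turning one article record into a BibTeX entry. The sketch below shows that conversion only; it is not TeXMed's code, it does not talk to PubMed, and the record layout (a plain dictionary with `authors`, `title`, `journal`, `year`, `volume`, `pages` and `pmid` keys) as well as the citation-key scheme are assumptions made for the example.

```python
import re
from typing import Dict, Any


def to_bibtex(record: Dict[str, Any]) -> str:
    """Format one article record (hypothetical dict layout) as an @article entry."""
    # Citation key: surname of the first author followed by the year.
    surname = record["authors"][0].split(",")[0].strip()
    key = f"{surname}{record['year']}".replace(" ", "")

    fields = [
        ("author", " and ".join(record["authors"])),
        ("title", record["title"]),
        ("journal", record["journal"]),
        ("year", str(record["year"])),
        ("volume", record.get("volume", "")),
        # BibTeX convention: page ranges use a double hyphen.
        ("pages", re.sub(r"-+", "--", record.get("pages", ""))),
        ("note", f"PMID: {record['pmid']}"),
    ]
    # Skip empty fields and wrap every value in braces.
    body = ",\n".join(f"  {name} = {{{value}}}"
                      for name, value in fields if value)
    return f"@article{{{key},\n{body}\n}}"


# Placeholder record, not a real publication.
example = {
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "An example title",
    "journal": "Journal of Examples",
    "year": 2003,
    "volume": "1",
    "pages": "10-20",
    "pmid": "00000000",
}
print(to_bibtex(example))
```

Special characters in titles or author names (accents, `&`, `%`) would need LaTeX escaping before export; the sketch leaves that out.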

Enter TeXMed here ...

Another PubMed to BibTeX site: http://www.pmbrowser.info/

Tutorial/Review: "An introduction into protein-sequence based annotation"

This is a sort of tutorial or review-like document based on the introduction of my Ph.D. thesis. It covers subjects like protein sequence databases, annotation procedures, sequence comparisons and sequence database searches, the use of protein structure in protein annotation and modelling, and the sequence-structure-function relationship.

Download the PDF file (52 pages excluding references, size is 3.7 MB) or download the LaTeX sources (tar.gz file, size is 2.5 MB)