MaxCluster Home | FAQ | Performance | Contact | Download

MaxCluster

A tool for Protein Structure Comparison and Clustering

Frequently Asked Questions

Why did you name your program MaxCluster?
Why did you develop MaxCluster?
Why should I use it?
Why are the TM scores lower for sequence-independant alignments than for sequence-dependant alignments of the same protein?
Which is the best clustering method?

Why did you name your program MaxCluster? [faq]

MaxCluster is an unimaginative fusion of MaxSub and Clustering.

Why did you develop MaxCluster? [faq]

I originally wrote MaxCluster to perform all-verses-all MaxSub calculations on large datasets. This was previously done in the lab by using the maxsub program of Siew et al. (2000). The reason for doing such a computation is that it enables discovery of which proteins contain the most structure in common with the rest of the dataset. This information can be used by clustering routines to pick representative - and hopefully good - models from a pool of structures.

As a standalone program maxsub is very fast and produces good structure alignments. However it only processes two structures. This results in a large amount of time spent reading PDB files during an all-verses-all calculation. For example a dataset of 1,000 proteins requires 499,500 comparisons: that's nearly a million PDB files to be read! By reading all the proteins once into memory, I achieved a huge speed gain. What used to take 20 minutes within a perl script can be done within a minute by MaxCluster.

Why should I use it? [faq]

There are many excellent programs for comparison of proteins. However they are best for comparison of two structures. MaxCluster has been written to deal with large lists of models and produce simple summary output.

A key feature of the program is the ability to compare a list of models against a reference structure, sort the models by score and output a superposition of the top n models againsts the reference.

Why are the TM scores lower for sequence-independant alignments than for sequence-dependant alignments of the same protein? [faq]

Short answer

The sequence-independant functionality is experimental. For more robust sequence-independant alignments I recommend:

Mammoth (Ortiz, et al., 2002)
TM-align (Zhang and Skolnick, 2005)
FATCAT [webserver only] (Ye and Godzik, 2004)

Long answer

The sequence-independant functionality was added it to allow quick superposition of model structures where the residue sequence numbering for the same coordinates was incorrect. This happens when using several different tools to process models as they often renumber the sequence in different ways. In this case, the current sequence-independant alignment works perfectly.

Which is the best clustering method? [faq]

MaxCluster is used in our lab to identify good models from a pool of thousands generated by our in-house de novo protein structure prediction system. I benchmarked the selection of good models using a diverse set of 30 proteins based on the rosetta all-atom decoy set (Tsai et al., 2003). I tested the ability of each clustering method to select (a) one model or (b) five models from the dataset. In the later case the best model of the five was used in the assessment. The methods were assessed using the ranks of the selected models (not the scores), where models were ranked using either TM-score or RMSD to native. The ranks of selected models were averaged across the 30 decoy sets. The purpose was to find the clustering method that most consistently picked good models, i.e. had the best average rank. Clustering methods were tested using decoy sets of 500, 1000 and 2000 models and TM-score, RMSD and URMSD as the distance metrics.

The benchmarking showed the following:

There are small but consistent differences between the clustering methods.
The TM-score distance metric produces the best clustering results.
3D-Jury is the best method for selecting 1 model from a population.
Restricted Nearest Neighbour (RNN) is the best clustering method for selecting 5 models from a population.
The 3D-Jury method is robust to changes in dataset size.
Clustering methods require close monitoring of their parameters to achieve optimum results on different dataset sizes.
The TM-score is a computationally expensive method compared to the RMSD distance metric, approximately 50-60 times slower for protein lengths of 50-100 residues.

In summary, for an automated system I would use TM-score RNN clustering to choose a small set of models from a population or 3D-Jury to choose one model. If the population size is over 2000 models, I would use RMSD as the distance metric as speed is more important. However for manual work I would draw a consensus from several methods used with different clustering thresholds and distance metrics.

References [faq]

Ortiz, A. R., Strauss, C. E., and Olmea, O. (2002). MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci, 11, 2606-21.
Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D. (2000). MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776-85.
Tsai, J., Bonneau, R., Morozov, A.V., Kuhlman, B., Rohl, C.A., and Baker, D. (2003). An improved protein decoy set for testing energy functions for protein structure prediction. Proteins, 53, 76-87.
Ye, Y. and Godzik, A. (2004). FATCAT: a web server for flexible structure comparison and structure similarity searching. Nucleic Acids Res, 32, W582-W585.
Zhang, Y. and Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res, 33, 2302-2309.

MaxCluster Home | FAQ | Performance | Contact | Download