MaxCluster is an unimaginative fusion of MaxSub and Clustering.
I originally wrote MaxCluster to perform all-versus-all MaxSub calculations on large datasets. This was previously done in the lab using the maxsub program of Siew et al. (2000). The reason for doing such a computation is that it reveals which proteins share the most structure with the rest of the dataset. This information can be used by clustering routines to pick representative - and hopefully good - models from a pool of structures.
As a standalone program maxsub is very fast and produces good structure alignments. However, it only processes two structures at a time, which means a large amount of time is spent reading PDB files during an all-versus-all calculation. For example, a dataset of 1,000 proteins requires 499,500 comparisons: that's nearly a million PDB files to be read! By reading all the proteins into memory once, I achieved a huge speed gain. What used to take 20 minutes within a Perl script can be done within a minute by MaxCluster.
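The arithmetic behind those numbers is a short sketch (the function name is illustrative, not part of MaxCluster): an all-versus-all run over n structures performs n(n-1)/2 unique pairwise comparisons, and a standalone tool that reads both PDB files per comparison performs twice that many file reads.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique unordered pairs among n structures."""
    return n * (n - 1) // 2

n = 1000
print(pairwise_comparisons(n))      # 499500 comparisons
print(2 * pairwise_comparisons(n))  # 999000 PDB file reads if each
                                    # comparison re-reads both files
```

Reading each of the 1,000 files once into memory reduces the I/O from 999,000 reads to 1,000.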
There are many excellent programs for comparison of proteins. However they are best for comparison of two structures. MaxCluster has been written to deal with large lists of models and produce simple summary output.
A key feature of the program is the ability to compare a list of models against a reference structure, sort the models by score and output a superposition of the top n models against the reference.
The sequence-independent functionality is experimental. For more robust sequence-independent alignments I recommend:
The sequence-independent functionality was added to allow quick superposition of model structures where the residue sequence numbering for the same coordinates was incorrect. This happens when using several different tools to process models, as they often renumber the sequence in different ways. In this case, the current sequence-independent alignment works perfectly.
MaxCluster is used in our lab to identify good models from a pool of thousands generated by our in-house de novo protein structure prediction system. I benchmarked the selection of good models using a diverse set of 30 proteins based on the Rosetta all-atom decoy set (Tsai et al., 2003). I tested the ability of each clustering method to select (a) one model or (b) five models from the dataset. In the latter case the best model of the five was used in the assessment. The methods were assessed using the ranks of the selected models (not the scores), where models were ranked by either TM-score or RMSD to native. The ranks of the selected models were averaged across the 30 decoy sets. The purpose was to find the clustering method that most consistently picked good models, i.e. the one with the best average rank. Clustering methods were tested using decoy sets of 500, 1000 and 2000 models, with TM-score, RMSD and URMSD as the distance metrics.
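The assessment described above can be sketched as follows (a minimal illustration of the scoring scheme, not MaxCluster's actual code; the function name is hypothetical). For each decoy set the clustering method selects k models, each with a true quality rank (1 = best by TM-score or RMSD to native); the method's score for that set is the best rank among its selections, and scores are averaged across all decoy sets.

```python
def average_best_rank(selections_per_set):
    """selections_per_set: one list per decoy set, holding the true
    quality ranks of the models that the clustering method selected.
    Returns the average, across decoy sets, of the best (lowest) rank."""
    best_ranks = [min(ranks) for ranks in selections_per_set]
    return sum(best_ranks) / len(best_ranks)

# Toy example: three decoy sets, five selections each.
score = average_best_rank([[12, 3, 40, 7, 9],
                           [1, 22, 15, 8, 30],
                           [5, 5, 18, 2, 11]])
print(score)  # (3 + 1 + 2) / 3 = 2.0
```

A lower average means the method more consistently places a genuinely good model among its selections.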
The benchmarking showed the following:
In summary, for an automated system I would use TM-score NN clustering to choose a small set of models from a population, or 3D-Jury to choose one model. If the population size is over 2000 models, I would use RMSD as the distance metric, since speed becomes more important. However, for manual work I would draw a consensus from several methods run with different clustering thresholds and distance metrics.