MaxCluster Home | FAQ | Performance | Contact | Download

MaxCluster

A tool for Protein Structure Comparison and Clustering

Summary

The performance of the MaxCluster program was compared to the maxsub and the tmscore programs. Run-time analysis of the programs indicates that by pre-loading the reference structure into memory, the MaxCluster program achieves an approximately 8-fold speed increase during list processing. This makes it approximately 4 and 9 times faster than maxsub and tmscore respectively.

Analysis of the output scores for each program indicates that the MaxCluster program produces alignments of comparable quality to the maxsub program in default mode and the tmscore program when run using the TM-score mode.

The pre-loading feature, combined with its comparative structure alignment performance, makes MaxCluster the program of choice for processing large protein datasets.

Performance Analysis

The MaxCluster program can perform sequence-dependent comparisons of large lists of model proteins. The largest subset of residues that can be superposed below a distance threshold is found using a heuristic search algorithm. This subset is scored using the MaxSub score (Siew et al., 2000) and the TM-score (Zhang and Skolnick, 2004).
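The two scores can be illustrated with a minimal Python sketch. It follows the published definitions of the MaxSub score and the TM-score as commonly stated, and it assumes the residue-pair distances have already been measured after superposition; it does not perform the heuristic search itself and is not the code used by any of the programs compared here. The function names are hypothetical.

```python
def maxsub_score(distances, n_residues, threshold=3.5):
    """MaxSub-style score: sum 1 / (1 + (d_i / d)^2) over the residue pairs
    superposed within the threshold d, normalised by the reference length."""
    in_subset = [d for d in distances if d <= threshold]
    return sum(1.0 / (1.0 + (d / threshold) ** 2) for d in in_subset) / n_residues


def tm_d0(target_length):
    """Length-dependent distance scale of the TM-score.
    Assumption: very short chains (15 residues or fewer) are out of scope here."""
    if target_length <= 15:
        raise ValueError("d0 is not defined by this sketch for such short chains")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for a fixed superposition: every aligned pair contributes,
    weighted by 1 / (1 + (d_i / d0)^2), normalised by the target length."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

In this form the difference between the two scores is visible: the MaxSub score uses a fixed threshold and ignores pairs beyond it, whereas the TM-score uses a length-dependent scale and counts every aligned pair.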

The performance of the MaxCluster program was compared to the maxsub program of Siew et al. (2000) and the tmscore program of Zhang and Skolnick (2004).

Test Data

The testing dataset was based on the Rosetta all-atom small protein decoy set (Tsai et al., 2003). The original dataset contained 78 small proteins. To eliminate structural redundancy, an all-versus-all comparison was performed using the mammoth program of Ortiz et al. (2002). For any pair of structures with a mammoth Z-score greater than 4.5, one protein was manually selected for removal on the basis of model quality. This produced a testing dataset of 27 proteins with 50,478 decoy models:

PDB         AA        Type      Class    Domain         Fold       Decoys
1uxd        43        2.1Å      a        d1uxd__        a.35       1897
1uba        45        1.7Å      a        d1dv0a_        a.5        1900
1gab        53        NMR       a        d1gab__        a.8        1899
1bw6        56        NMR       a        d1bw6a_        a.4        1901
1a32        65        NMR       a        d1a32__        a.16       1611
2ezh        65        1.8Å      a        d2ezh__        a.4        1894
1am3        70        2.45Å     a        d1a8o__        a.28       1899
1pou        71        1.7Å      a        d1pou__        a.35       1899
1kjs        74        NMR       a        d1kjs__        a.50       1894
1hyp        75        NMR       a        d1hyp__        a.52       1894
1nkl        78        1.8Å      a        d1nkl__        a.64       1899
1nre        81        1.5Å      a        d1nre__        a.13       1894
1cei        85        NMR       a        d1cei__        a.28       1898
5pti        58        1.25Å     ab       d5pti__        g.8        1854
1tif        59        NMR       ab       d1tif__        d.15       1850
2ptl        60        NMR       ab       d2ptl__        d.15       1836
1aa3        63        1.54Å     ab       d1aa3__        d.48       1866
1orc        64        NMR       ab       d1orc__        a.35       1884
1msi        66        1.8Å      ab       d1msi__        b.85       1895
1ctf        68        2.02Å     ab       d1ctf__        d.45       1923
1afi        72        NMR       ab       d1afi__        d.58       1825
5icb        75        1.6Å      ab       d1ig5a_        a.39       1871
2fow        76        1.8Å      ab       d2fow__        a.4        1835
1vcc        77        NMR       ab       d1vcc__        d.121      1858
1vif        60        NMR       b        d1vif__        b.34       1897
1tuc        61        NMR       b        d1tuc__        b.34       1895
1csp        67        1Å        b        d1csp__        b.40       1810

Where:
PDB	=	The protein PDB code
AA	=	Number of amino acids
Type	=	Structure determination method. Resolutions are provided for X-ray structures
Class	=	The structural class of the protein (visual inspection)
Domain	=	SCOP domain (updated PDB structures have been used where appropriate)
Fold	=	SCOP Fold
Decoys	=	Number of decoys
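The redundancy filtering used to build this dataset was done by hand, but the idea can be sketched in Python. The sketch below is a hypothetical greedy approximation: it assumes a table of pairwise Z-scores and a numeric quality value per protein (higher is better), neither of which is provided with this analysis, and it should not be read as the procedure actually followed.

```python
def filter_redundant(z_scores, quality, threshold=4.5):
    """Greedy redundancy filter.

    z_scores: dict mapping a (name_a, name_b) tuple to a pairwise Z-score.
    quality:  dict mapping each name to a quality value (hypothetical measure).
    Returns the set of names kept after removing one member of every pair
    whose Z-score exceeds the threshold.
    """
    kept = set(quality)
    # Visit the most similar pairs first.
    for (a, b), z in sorted(z_scores.items(), key=lambda item: -item[1]):
        if z <= threshold:
            break  # all remaining pairs are below the threshold
        if a in kept and b in kept:
            # Drop the member with the lower quality value.
            kept.discard(min((a, b), key=lambda name: quality[name]))
    return kept
```

For example, with three proteins A, B and C where only the A-B and B-C pairs exceed the threshold and B has lower quality than A, removing B resolves both pairs and A and C are kept.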

Method

For each of the protein sequences in the dataset, each decoy was compared to the native structure using either maxsub, tmscore or MaxCluster. maxsub and MaxCluster were run with a distance threshold of 3.5Å. tmscore was run with default parameters. MaxCluster was run using the -noalign option to suppress output of the superposition file. In addition the MaxCluster program was run in batch mode with the complete decoy set for each protein being passed as a list.

The complete run for each protein sequence was timed using the GNU time program, version 1.7. Six runs per protein were performed in succession and the time of the first run was discarded. This was done to avoid the effects of caching, as structures were read from the local hard disc. Timings were performed on an AMD Athlon XP 2600+ with 1GB of RAM running Red Hat Enterprise Linux WS release 4, kernel 2.6.9-42.0.3.EL. MaxCluster and tmscore were compiled using GCC 3.4.6 with the -O3 flag. maxsub was obtained as a Linux binary from http://fischerlab.cse.buffalo.edu/maxsub/.
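The warm-up protocol described above can be sketched as follows. This is a minimal Python illustration, not the script used for the benchmark; the function name is hypothetical and any timings passed to it are the caller's own.

```python
import statistics


def mean_after_warmup(run_times, warmup=1):
    """Average the run times after discarding the first `warmup` runs,
    so that every retained run starts from the same cached state."""
    if len(run_times) <= warmup:
        raise ValueError("need at least one run after the warm-up runs")
    return statistics.fmean(run_times[warmup:])
```

With six timings per protein and `warmup=1`, the function averages the last five, which corresponds to the five retained runs reported in the Speed section.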

Speed

The timings for the five retained runs of each program are provided in the file times.csv. The timings have been averaged across the five runs and are provided in the file times.av.csv. The averages of the total run-time, in seconds, are shown below:

PDB       TMscore   MaxSub    MaxCluster          MaxClusterTM
          Pair      Pair      Pair      List      Pair      List
1uxd      27.2      12.2      25.4      3.46      24.7      3.67
1uba      26.6      11.5      24.8      3.21      24.4      3.58
1gab      32.0      14.2      28.1      3.86      28.0      4.27
1bw6      40.4      17.5      42.0      4.73      42.4      5.32
1a32      34.7      16.3      33.5      4.57      33.7      5.35
2ezh      47.1      18.6      50.6      4.91      50.9      5.60
1am3      35.0      16.9      30.9      4.66      31.0      5.23
1pou      51.6      20.4      58.4      5.39      58.7      6.24
1kjs      57.3      22.1      59.7      5.79      60.2      7.15
1hyp      45.4      19.0      39.6      5.04      39.8      6.07
1nkl      55.9      21.1      58.0      5.51      58.4      6.33
1nre      48.1      18.9      57.0      5.25      57.3      5.93
1cei      60.0      23.0      52.3      6.18      53.0      7.27
5pti      32.9      12.7      33.9      3.56      33.9      3.98
1tif      37.5      16.2      32.2      4.33      32.3      5.01
2ptl      40.7      16.4      37.5      4.41      37.8      5.11
1aa3      37.5      15.5      37.2      4.14      37.4      4.61
1orc      32.7      14.4      27.3      4.05      27.5      4.50
1msi      33.0      13.6      30.2      3.89      30.0      4.45
1ctf      43.3      19.1      33.2      5.20      33.8      6.28
1afi      51.4      20.8      50.2      5.60      50.4      6.46
5icb      48.4      20.5      40.6      5.38      41.1      6.40
2fow      46.1      17.8      44.1      4.79      44.7      5.89
1vcc      45.3      19.0      46.3      5.35      46.7      6.22
1vif      24.9      11.6      23.2      3.46      22.9      3.79
1tuc      35.2      15.5      31.9      4.23      32.0      4.84
1csp      35.8      15.5      30.0      4.18      30.4      4.86

Where:
PDB	=	The protein PDB code
MaxClusterTM	=	MaxCluster run using the TM-score search engine
Pair	=	Pairwise comparison
List	=	List comparison

The timing data show that the programs are able to compute the pairwise comparisons for approximately 1,800 small protein structures in a time of 25-60 seconds for tmscore, 11-23 seconds for maxsub and 23-60 seconds for MaxCluster. However, the MaxCluster program run in list mode can perform the same computation in 3.2-6.2 seconds, an approximate 8-fold speed increase. This indicates that the main bottleneck to run-time is reading the PDB structures from the disc.

The average times for maxsub and MaxCluster have been expressed relative to the timings for the tmscore program (times.rel.csv):

PDB       MaxSub     MaxCluster            MaxClusterTM
          Pair       Pair       List       Pair       List
1uxd      0.462      0.932      0.129      0.920      0.147
1uba      0.436      0.932      0.130      0.917      0.135
1gab      0.449      0.934      0.127      0.908      0.157
1bw6      0.397      0.853      0.111      0.859      0.166
1a32      0.416      0.859      0.118      0.856      0.164
2ezh      0.432      0.912      0.118      0.912      0.170
1am3      0.436      0.915      0.123      0.921      0.158
1pou      0.447      0.890      0.119      0.893      0.180
1kjs      0.443      0.911      0.119      0.914      0.204
1hyp      0.440      0.915      0.120      0.918      0.172
1nkl      0.453      0.927      0.121      0.941      0.177
1nre      0.435      0.893      0.118      0.901      0.158
1cei      0.437      0.904      0.122      0.904      0.194
5pti      0.418      0.921      0.115      0.926      0.0985
1tif      0.430      0.921      0.116      0.929      0.123
2ptl      0.411      0.915      0.111      0.919      0.118
1aa3      0.411      0.896      0.108      0.907      0.102
1orc      0.416      0.925      0.111      0.934      0.0991
1msi      0.412      0.957      0.113      0.970      0.0965
1ctf      0.403      0.983      0.111      0.992      0.133
1afi      0.397      1.04       0.111      1.05       0.134
5icb      0.421      1.05       0.111      1.05       0.132
2fow      0.399      1.02       0.105      1.03       0.115
1vcc      0.403      1.10       0.107      1.11       0.121
1vif      0.377      1.04       0.100      1.04       0.0678
1tuc      0.386      1.02       0.101      1.02       0.0845
1csp      0.383      0.995      0.103      1.00       0.0810
Av        0.420      0.947      0.115      0.950      0.137
SD        0.0225     0.0625     0.00802    0.0647     0.0364

Where:
PDB	=	The protein PDB code
MaxClusterTM	=	MaxCluster run using the TM-score search engine
Pair	=	Pairwise comparison
List	=	List comparison
Av	=	Average
SD	=	Standard Deviation

Relative timings show that the maxsub program is approximately twice as fast as the tmscore or MaxCluster program when run in a pairwise structure comparison test. However when run in list mode the MaxCluster program shows an approximate 4-fold increase in speed over maxsub and a 9-fold increase in speed over tmscore.
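The fold increases quoted here follow directly from the Av row of the relative timing table. The short Python sketch below reproduces the arithmetic; only the three averages are taken from the table, and the names are illustrative.

```python
# Average run-times relative to tmscore (tmscore = 1.0), from the Av row.
AV_RELATIVE = {
    "tmscore_pair": 1.0,
    "maxsub_pair": 0.420,
    "maxcluster_pair": 0.947,
    "maxcluster_list": 0.115,
}


def fold_speedup(slower, faster):
    """How many times faster `faster` is than `slower`."""
    return AV_RELATIVE[slower] / AV_RELATIVE[faster]


over_maxsub = fold_speedup("maxsub_pair", "maxcluster_list")       # about 3.7
over_tmscore = fold_speedup("tmscore_pair", "maxcluster_list")     # about 8.7
over_own_pair = fold_speedup("maxcluster_pair", "maxcluster_list") # about 8.2
```

The ratios of roughly 3.7, 8.7 and 8.2 correspond to the approximate 4-fold, 9-fold and 8-fold figures used in the text.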

The MaxCluster program provides a search that can maximise the subset size or the TM-score. The timing data above show that the MaxSub search is marginally faster, because it does not require computation of the TM-score during each search step. However, most of the CPU time in the search algorithm is used to transform xyz coordinates and recompute pairwise distances. Hence the speed increase is approximately 10%.

Performance

Each of the tested programs contains its own search algorithm to find the best subset of residues shared by two structures. The maxsub program uses a search algorithm that attempts to maximise the number of residues that can be superposed below a given distance. This is assessed using the MaxSub score. The tmscore program uses a modified version of this score, the Template Modelling (TM) score, as the target function for the search. Thus the size of the subset is not necessarily maximised during the tmscore search. However, the tmscore program provides the MaxSub score in the final output. The MaxCluster program provides a search that can maximise the subset size or the TM-score. In this trial both variants of the search were used.

A comparison was made between the MaxSub and TM-score output produced by each program. The complete score listing can be found in file scores.csv.gz. Note that the maxsub program sets the MaxSub score to zero if the structure alignment is of poor quality (low MaxSub score, low number of residues in the subset). These decoys have been ignored during the analysis of MaxSub scores.

The MaxSub scores of the tmscore and MaxCluster programs were compared relative to the score output by the maxsub program. In addition, the number of pairs in the MaxSub subset was compared between the MaxCluster program and the maxsub program. The TM-score of the MaxCluster program was compared relative to the output of the tmscore program. Scores were expressed as fractions and the relative score averaged for each protein in the dataset. A paired Student's t-test was performed to assess whether the differences between the scores were significant. The following table shows the average relative score and the t-test p-value for each protein in the dataset:

PDB      tm_ms               mc_ms                mc_p               mt_ms                mt_p               mc_tm               mt_tm
          Diff         p      Diff         p      Diff         p      Diff         p      Diff         p      Diff         p      Diff         p
1uxd      1.01  1.62e-12     0.999   7.07e-5     0.997  1.02e-10     0.976  1.62e-12      0.95  1.62e-12     0.974  5.55e-17         1         0
1uba      1.01  1.11e-16     0.998  1.42e-10     0.992  3.33e-16     0.962         0      0.93         0     0.969  1.62e-12         1  1.62e-12
1gab      1.02         0     0.995         0     0.988         0     0.954         0     0.912         0     0.974  5.55e-17         1         0
1bw6      1.02  1.62e-12     0.996  1.62e-12     0.992  1.62e-12      0.97  1.62e-12      0.94  1.62e-12     0.985         0         1    0.0789
1a32      1.02  1.62e-12     0.999   0.00422     0.998  0.000316     0.988  1.62e-12     0.962  1.62e-12     0.986         0     0.999  1.12e-14
2ezh      1.02         0     0.997   5.34e-5     0.993  1.83e-10     0.976  1.11e-16     0.945         0     0.984  1.62e-12     0.998  1.62e-12
1am3      1.02  5.55e-17     0.996  1.61e-15     0.992  1.11e-16     0.971  1.11e-16     0.944         0     0.976         0         1     0.228
1pou      1.03  1.11e-16         1     0.124     0.997   0.00069     0.983  1.67e-16     0.961         0     0.985         0     0.998         0
1kjs      1.02         0     0.998   1.66e-5     0.995   1.71e-8     0.974         0     0.951         0     0.985  1.62e-12     0.998  1.62e-12
1hyp      1.04         0     0.998     0.142         1      0.42     0.967         0     0.952         0     0.974  1.62e-12     0.992  1.62e-12
1nkl      1.03  1.62e-12     0.998    0.0037     0.998   0.00139     0.972  1.62e-12     0.948  1.62e-12     0.982  5.55e-17     0.998  3.33e-16
1nre      1.02         0     0.996   5.44e-8     0.993  5.85e-11     0.978         0     0.952         0     0.984  1.62e-12     0.998  1.62e-12
1cei      1.03         0     0.998    0.0576     0.998     0.146     0.971         0     0.959         0     0.976  1.62e-12     0.995  1.62e-12
5pti      1.02  1.63e-12     0.996    0.0498     0.993    0.0374     0.958  1.62e-12     0.928  1.62e-12     0.963  1.62e-12     0.998  3.24e-10
1tif      1.02         0     0.994  1.11e-16     0.987  1.67e-16     0.968  5.55e-17     0.932         0     0.986  1.62e-12     0.999  1.77e-12
2ptl      1.03  1.62e-12     0.995   9.93e-9     0.992  1.23e-10     0.962  1.62e-12     0.921  1.62e-12      0.98  1.62e-12     0.999   4.39e-7
1aa3      1.01         0     0.997  3.16e-10     0.992  5.55e-17     0.973         0     0.945  1.11e-16     0.987  1.62e-12         1     0.275
1orc      1.02  1.62e-12     0.997   1.83e-5     0.996  0.000359     0.971  1.62e-12      0.94  1.62e-12     0.978  1.62e-12     0.999   6.35e-6
1msi      1.03   8.98e-7         1     0.322         1     0.322     0.959   4.71e-9     0.929  6.87e-10     0.975         0     0.996  5.55e-17
1ctf      1.02         0     0.996  2.15e-10     0.995   4.58e-8     0.978  5.55e-17     0.947         0     0.986         0     0.999   2.07e-9
1afi      1.02         0     0.995  2.33e-15     0.987  1.11e-16     0.974         0     0.945         0     0.984  5.55e-17     0.998         0
5icb      1.03         0     0.998  0.000444     0.994   4.56e-7     0.971         0     0.948  5.55e-17      0.98         0     0.997  2.22e-16
2fow      1.02         0     0.997   7.85e-8     0.994   2.96e-9     0.973         0     0.944         0     0.983  1.11e-16     0.998  3.89e-16
1vcc      1.03  1.11e-16         1      0.36         1     0.462     0.974  1.67e-16     0.959         0     0.968  1.62e-12     0.991  1.62e-12
1vif      1.01  5.55e-17     0.999   0.00467     0.993  1.67e-16     0.981         0     0.953         0     0.963  5.55e-17     0.999      0.03
1tuc      1.02  1.62e-12     0.996   9.71e-9     0.989  1.62e-12      0.97  1.62e-12     0.938  1.62e-12     0.983         0     0.999  0.000459
1csp      1.02         0     0.997  0.000651     0.993    2.8e-6     0.963         0     0.933         0     0.977  1.62e-12     0.996  1.62e-12
Av        1.02               0.997               0.994               0.971               0.943               0.979               0.998
SD      0.0073             0.00184             0.00418             0.00774              0.0124             0.00709             0.00246

Where:
PDB	=	The protein PDB code
Diff	=	Relative difference
p	=	t-test p-value
tm_ms	=	tmscore versus maxsub score
mc_ms	=	MaxCluster versus maxsub
mc_p	=	MaxCluster versus maxsub pairs
mt_ms	=	MaxCluster TM mode versus maxsub
mt_p	=	MaxCluster TM mode versus maxsub pairs
mc_tm	=	MaxCluster versus tmscore
mt_tm	=	MaxCluster TM mode versus tmscore

The table was produced using the Perl program assess_scores.pl.
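The analysis behind each Diff column can be sketched in Python. This is an illustration of the described procedure, not a translation of assess_scores.pl, whose internals are not shown here. It assumes that the relative score is the test score divided by the reference score, that decoys with a reference score of zero are skipped, and it returns the paired t statistic and degrees of freedom rather than a p-value (obtaining the p-value needs a t-distribution routine from a statistics library).

```python
import math
import statistics


def relative_score_and_paired_t(test_scores, ref_scores):
    """Return (mean relative score, paired t statistic, degrees of freedom).

    Pairs whose reference score is zero are ignored, mirroring the
    exclusion of decoys for which maxsub reports a score of zero.
    """
    pairs = [(t, r) for t, r in zip(test_scores, ref_scores) if r > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two usable score pairs")
    mean_relative = statistics.fmean(t / r for t, r in pairs)
    diffs = [t - r for t, r in pairs]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))
    return mean_relative, t_stat, len(diffs) - 1
```

A mean relative score below 1 together with a large negative t statistic would correspond to a Diff value below 1 with a small p-value in the table.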

The comparisons show that there are small but significant differences between the scores produced by each program. The tmscore program produces alignments with MaxSub scores approximately 2% higher than the maxsub program. This is attributed to the different distance thresholds used within their search engines. Both programs use a heuristic search to maximise the number of atoms in the subset below a set threshold. However, the tmscore search is performed using a higher distance threshold than the 3.5Å used for the maxsub search. The tmscore search therefore identifies a transformation by considering more of the protein structure than the maxsub search does, and consequently may subtly alter the orientation of the subset within 3.5Å to increase the MaxSub score.

In comparison, the MaxCluster program produces alignment scores approximately 0.3% lower than the maxsub program. In addition, the MaxSub size found by MaxCluster is approximately 0.6% smaller. This may be a reflection of MaxCluster's search algorithm, which uses fewer starting seeds and is thus a less exhaustive search. Interestingly, when run using a TM-score search the program produces alignments with a MaxSub score approximately 3% lower than the maxsub program. This contrasts with the tmscore program, which produces better scoring alignments. This may again be a reflection of the different distance thresholds used within the internal search engine.

A comparison between the MaxCluster and tmscore output shows that the TM-scores of MaxCluster are approximately 2% lower. This is due to the use of a search algorithm that maximises the subset within a set distance rather than the TM-score. However when run using the TM mode the MaxCluster program produces results only 0.2% worse than tmscore. Thus the modification to the search engine is producing better alignments.

These results show that the MaxCluster program produces very similar alignments to the maxsub program. This is not unexpected given that both programs implement a search algorithm to find the maximum number of pairs that can be superposed below a given input threshold. In contrast the tmscore program searches for an alignment with the best TM-score using its own distance threshold. Interestingly this results in an alignment with a higher MaxSub score. However since no information is output from the tmscore program regarding the number of residue pairs below the MaxSub scoring threshold of 3.5Å, an analysis of the size of the MaxSub set cannot be made.

Conclusion

The run-time analysis of the maxsub, tmscore and MaxCluster programs indicates that the maxsub program is approximately twice as fast as the other two programs for pairwise comparisons. However, by pre-loading the reference structure into memory, the MaxCluster program achieves an 8-fold speed increase during list processing. This makes it 4 and 9 times faster than maxsub and tmscore respectively. Furthermore, pre-loading the model list will produce an even greater speed gain when performing all-versus-all comparisons for clustering.

Analysis of the output scores for each program indicates that there are marginal differences between the programs. The MaxCluster program produces alignments of comparable quality to the maxsub program and the tmscore program when run using the TM-score search algorithm.

The pre-loading feature, combined with its comparative structure alignment performance, makes MaxCluster the program of choice for processing large protein datasets.

References

Ortiz, A. R., Strauss, C. E., and Olmea, O. (2002). MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci, 11, 2606-21.
Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D. (2000). MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776-85.
Tsai, J., Bonneau, R., Morozov, A. V., Kuhlman, B., Rohl, C. A., and Baker, D. (2003). An improved protein decoy set for testing energy functions for protein structure prediction. Proteins, 53, 76-87.
Zhang, Y. and Skolnick, J. (2004). Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702-710.
