The performance of the MaxCluster program was compared to that of the maxsub and tmscore programs. Run-time analysis indicates that, by pre-loading the reference structure into memory, the MaxCluster program achieves an approximately 8-fold speed increase during list processing. This makes it approximately 4 and 9 times faster than maxsub and tmscore, respectively.
Analysis of the output scores for each program indicates that the MaxCluster program produces alignments of comparable quality to the maxsub program in default mode and to the tmscore program when run using the TM-score mode.
The pre-loading feature, combined with its comparable structure alignment performance, makes MaxCluster the program of choice for processing large protein datasets.
The MaxCluster program can perform sequence-dependent comparisons of large lists of model proteins. The largest subset of residues that can be superposed below a distance threshold is found using a heuristic search algorithm. This subset is scored using the MaxSub score (Siew et al., 2000) and the TM-score (Zhang and Skolnick, 2004).
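As a rough illustration of the two scores named above, the following Python sketch evaluates them from a list of per-residue distances measured after superposition. It uses the published forms of the scores: each residue contributes 1/(1+(d_i/d)^2), with a fixed d for MaxSub and a length-dependent d0 for the TM-score. The function names, the choice of normalising length and the 0.5Å floor on d0 are assumptions made for this example and are not taken from the MaxCluster source code.

```python
def maxsub_score(distances, n_residues, d=3.5):
    """MaxSub-style score from post-superposition distances (in Angstroms).

    Only residues within the threshold d belong to the subset and contribute.
    n_residues is the normalising length (assumed: the reference length).
    """
    total = sum(1.0 / (1.0 + (di / d) ** 2) for di in distances if di <= d)
    return total / n_residues


def tm_d0(n_residues):
    """Length-dependent scale d0 = 1.24 * cbrt(L - 15) - 1.8.

    The 0.5 floor for very short chains is an assumption of this sketch.
    """
    if n_residues <= 15:
        return 0.5
    return max(0.5, 1.24 * (n_residues - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, n_residues):
    """TM-score-style sum over all aligned residues, normalised by length."""
    d0 = tm_d0(n_residues)
    return sum(1.0 / (1.0 + (di / d0) ** 2) for di in distances) / n_residues
```

Both functions return 1.0 for a perfect superposition of every residue. Finding the superposition that maximises either value is the job of the heuristic search and is not attempted here.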
The performance of the MaxCluster program was compared to the maxsub program of Siew et al. (2000) and the tmscore program of Zhang and Skolnick (2004).
The testing dataset was based on the Rosetta all-atom small protein decoy set (Tsai et al., 2003). The original dataset contained 78 small proteins. To eliminate structural redundancy, an all-versus-all comparison was performed using the mammoth program of Ortiz et al. (2002). For any pair of structures with a mammoth Z-score greater than 4.5, one protein was manually selected for removal on the basis of model quality. This produced a testing dataset of 27 proteins with 50,478 decoy models:
PDB AA Type Class Domain Fold Decoys
1uxd 43 2.1Å a d1uxd__ a.35 1897
1uba 45 1.7Å a d1dv0a_ a.5 1900
1gab 53 NMR a d1gab__ a.8 1899
1bw6 56 NMR a d1bw6a_ a.4 1901
1a32 65 NMR a d1a32__ a.16 1611
2ezh 65 1.8Å a d2ezh__ a.4 1894
1am3 70 2.45Å a d1a8o__ a.28 1899
1pou 71 1.7Å a d1pou__ a.35 1899
1kjs 74 NMR a d1kjs__ a.50 1894
1hyp 75 NMR a d1hyp__ a.52 1894
1nkl 78 1.8Å a d1nkl__ a.64 1899
1nre 81 1.5Å a d1nre__ a.13 1894
1cei 85 NMR a d1cei__ a.28 1898
5pti 58 1.25Å ab d5pti__ g.8 1854
1tif 59 NMR ab d1tif__ d.15 1850
2ptl 60 NMR ab d2ptl__ d.15 1836
1aa3 63 1.54Å ab d1aa3__ d.48 1866
1orc 64 NMR ab d1orc__ a.35 1884
1msi 66 1.8Å ab d1msi__ b.85 1895
1ctf 68 2.02Å ab d1ctf__ d.45 1923
1afi 72 NMR ab d1afi__ d.58 1825
5icb 75 1.6Å ab d1ig5a_ a.39 1871
2fow 76 1.8Å ab d2fow__ a.4 1835
1vcc 77 NMR ab d1vcc__ d.121 1858
1vif 60 NMR b d1vif__ b.34 1897
1tuc 61 NMR b d1tuc__ b.34 1895
1csp 67 1Å b d1csp__ b.40 1810
For each protein in the dataset, every decoy was compared to the native structure using maxsub, tmscore and MaxCluster in turn. The maxsub and MaxCluster programs were run with a distance threshold of 3.5Å, and tmscore was run with default parameters. MaxCluster was run using the -noalign option to suppress output of the superposition file. In addition, the MaxCluster program was run in batch mode, with the complete decoy set for each protein passed as a list.
The complete run for each protein sequence was timed using the GNU time program, version 1.7. Six runs per protein were performed in succession and the time of the first run was discarded. This was done to avoid the effects of caching, as structures were read from the local hard disc. Timings were performed on an AMD Athlon XP 2600+ with 1GB of RAM running Red Hat Enterprise Linux WS release 4, kernel 2.6.9-42.0.3.EL. MaxCluster and tmscore were compiled using GCC 3.4.6 with the -O3 flag. maxsub was obtained as a Linux binary from http://fischerlab.cse.buffalo.edu/maxsub/.
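The timing protocol described above (six successive runs, first run discarded, remaining five averaged) can be sketched in Python as follows. This is only an illustration of the protocol: the original measurements were made with GNU time, not with this script, and the function names and the use of wall-clock timing via time.perf_counter are assumptions of the example.

```python
import statistics
import subprocess
import time


def time_command(argv):
    """Return the wall-clock time in seconds taken to run one command."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def summarise(times, discard=1):
    """Drop the first `discard` warm-up runs and average the remainder."""
    kept = list(times[discard:])
    if not kept:
        raise ValueError("no runs left after discarding warm-up runs")
    return kept, statistics.mean(kept)


def benchmark(argv, runs=6, discard=1):
    """Run a command `runs` times in succession and summarise the timings."""
    return summarise([time_command(argv) for _ in range(runs)], discard)
```

Calling benchmark with the argument list of a comparison command would return the five retained timings and their mean, which correspond to the per-run and averaged values reported for each protein.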
The timings for the five retained runs of each program are provided in the file times.csv. The timings averaged across the five runs are provided in the file times.av.csv. The averages of the total run-time, in seconds, are shown below:
PDB TMscore-Pair MaxSub-Pair MaxCluster-Pair MaxCluster-List MaxClusterTM-Pair MaxClusterTM-List
1uxd 27.2 12.2 25.4 3.46 24.7 3.67
1uba 26.6 11.5 24.8 3.21 24.4 3.58
1gab 32.0 14.2 28.1 3.86 28.0 4.27
1bw6 40.4 17.5 42.0 4.73 42.4 5.32
1a32 34.7 16.3 33.5 4.57 33.7 5.35
2ezh 47.1 18.6 50.6 4.91 50.9 5.60
1am3 35.0 16.9 30.9 4.66 31.0 5.23
1pou 51.6 20.4 58.4 5.39 58.7 6.24
1kjs 57.3 22.1 59.7 5.79 60.2 7.15
1hyp 45.4 19.0 39.6 5.04 39.8 6.07
1nkl 55.9 21.1 58.0 5.51 58.4 6.33
1nre 48.1 18.9 57.0 5.25 57.3 5.93
1cei 60.0 23.0 52.3 6.18 53.0 7.27
5pti 32.9 12.7 33.9 3.56 33.9 3.98
1tif 37.5 16.2 32.2 4.33 32.3 5.01
2ptl 40.7 16.4 37.5 4.41 37.8 5.11
1aa3 37.5 15.5 37.2 4.14 37.4 4.61
1orc 32.7 14.4 27.3 4.05 27.5 4.50
1msi 33.0 13.6 30.2 3.89 30.0 4.45
1ctf 43.3 19.1 33.2 5.20 33.8 6.28
1afi 51.4 20.8 50.2 5.60 50.4 6.46
5icb 48.4 20.5 40.6 5.38 41.1 6.40
2fow 46.1 17.8 44.1 4.79 44.7 5.89
1vcc 45.3 19.0 46.3 5.35 46.7 6.22
1vif 24.9 11.6 23.2 3.46 22.9 3.79
1tuc 35.2 15.5 31.9 4.23 32.0 4.84
1csp 35.8 15.5 30.0 4.18 30.4 4.86
The timing data show that the programs compute the pairwise comparisons for approximately 1,800 small protein structures in 25-60 seconds for tmscore, 11-23 seconds for maxsub and 23-60 seconds for MaxCluster. However, the MaxCluster program run in list mode performs the same computation in 3.5-6 seconds, an approximately 8-fold speed increase. This indicates that the main bottleneck to run-time is reading the PDB structures from the disc.
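The difference between the two modes can be pictured with a small Python sketch that counts structure reads. It is a toy model, not MaxCluster's implementation: the loader, the scoring function and all names are hypothetical stand-ins, and it captures only the repeated read of the reference structure, not other per-invocation costs.

```python
from collections import Counter


def compare_pairwise(reference_path, model_paths, load, score):
    """Pairwise mode: the reference is read again for every comparison."""
    results = []
    for model_path in model_paths:
        reference = load(reference_path)
        results.append(score(reference, load(model_path)))
    return results


def compare_list(reference_path, model_paths, load, score):
    """List mode: the reference is read once and kept in memory."""
    reference = load(reference_path)
    return [score(reference, load(path)) for path in model_paths]


def make_counting_loader(structures, reads):
    """Hypothetical loader that records how often each path is read."""
    def load(path):
        reads[path] += 1
        return structures[path]
    return load


def toy_score(reference, model):
    """Hypothetical stand-in score: fraction of identical positions."""
    matches = sum(1 for a, b in zip(reference, model) if a == b)
    return matches / len(reference)
```

For N models the pairwise function performs 2N reads and the list function N+1, while both return identical scores.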
The average times for maxsub and MaxCluster have been expressed relative to the timings for the tmscore program (times.rel.csv):
PDB MaxSub-Pair MaxCluster-Pair MaxCluster-List MaxClusterTM-Pair MaxClusterTM-List
1uxd 0.462 0.932 0.129 0.920 0.147
1uba 0.436 0.932 0.130 0.917 0.135
1gab 0.449 0.934 0.127 0.908 0.157
1bw6 0.397 0.853 0.111 0.859 0.166
1a32 0.416 0.859 0.118 0.856 0.164
2ezh 0.432 0.912 0.118 0.912 0.170
1am3 0.436 0.915 0.123 0.921 0.158
1pou 0.447 0.890 0.119 0.893 0.180
1kjs 0.443 0.911 0.119 0.914 0.204
1hyp 0.440 0.915 0.120 0.918 0.172
1nkl 0.453 0.927 0.121 0.941 0.177
1nre 0.435 0.893 0.118 0.901 0.158
1cei 0.437 0.904 0.122 0.904 0.194
5pti 0.418 0.921 0.115 0.926 0.0985
1tif 0.430 0.921 0.116 0.929 0.123
2ptl 0.411 0.915 0.111 0.919 0.118
1aa3 0.411 0.896 0.108 0.907 0.102
1orc 0.416 0.925 0.111 0.934 0.0991
1msi 0.412 0.957 0.113 0.970 0.0965
1ctf 0.403 0.983 0.111 0.992 0.133
1afi 0.397 1.04 0.111 1.05 0.134
5icb 0.421 1.05 0.111 1.05 0.132
2fow 0.399 1.02 0.105 1.03 0.115
1vcc 0.403 1.10 0.107 1.11 0.121
1vif 0.377 1.04 0.100 1.04 0.0678
1tuc 0.386 1.02 0.101 1.02 0.0845
1csp 0.383 0.995 0.103 1.00 0.0810
Av 0.420 0.947 0.115 0.950 0.137
SD 0.0225 0.0625 0.00802 0.0647 0.0364
Relative timings show that the maxsub program is approximately twice as fast as the tmscore or MaxCluster programs when run in a pairwise structure comparison test. However, when run in list mode, the MaxCluster program shows an approximately 4-fold increase in speed over maxsub and a 9-fold increase in speed over tmscore.
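The fold increases quoted above follow directly from the average (Av) row of the relative timing table. The short Python sketch below reproduces the arithmetic; the dictionary keys and the function name are labels invented for this example.

```python
# Average run-times relative to tmscore, taken from the Av row of the table.
RELATIVE_TO_TMSCORE = {
    "tmscore_pair": 1.0,
    "maxsub_pair": 0.420,
    "maxcluster_pair": 0.947,
    "maxcluster_list": 0.115,
}


def fold_speedup(slower, faster, relative=RELATIVE_TO_TMSCORE):
    """Return how many times faster `faster` is than `slower`."""
    return relative[slower] / relative[faster]


maxsub_vs_list = fold_speedup("maxsub_pair", "maxcluster_list")
tmscore_vs_list = fold_speedup("tmscore_pair", "maxcluster_list")
pair_vs_list = fold_speedup("maxcluster_pair", "maxcluster_list")
```

The three ratios evaluate to roughly 3.7, 8.7 and 8.2, which round to the 4-fold, 9-fold and 8-fold figures used in the text.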
The MaxCluster program provides a search that can maximise either the subset size or the TM-score. The timing data above show that the MaxSub search is marginally faster, because it does not require computation of the TM-score during each search step. However, most of the CPU time in the search algorithm is spent transforming xyz coordinates and recomputing pairwise distances. Hence the speed increase is only approximately 10%.
Each of the tested programs contains its own search algorithm to find the best subset of residues shared by two structures. The maxsub program uses a search algorithm that attempts to maximise the number of residues that can be superposed below a given distance. This is assessed using the MaxSub score. The tmscore program uses a modified version of this score, the Template Modelling (TM) score, as the target function for the search. Thus the size of the subset is not necessarily maximised during the tmscore search. However, the tmscore program provides the MaxSub score in the final output. The MaxCluster program provides a search that can maximise either the subset size or the TM-score. In this trial, both variants of the search were used.
A comparison was made between the MaxSub and TM-score output produced by each program. The complete score listing can be found in file scores.csv.gz. Note that the maxsub program sets the MaxSub score to zero if the structure alignment is of poor quality (low MaxSub score, low number of residues in the subset). These decoys have been ignored during the analysis of MaxSub scores.
The MaxSub scores of the tmscore and MaxCluster programs were compared relative to the score output by the maxsub program. In addition, the number of residue pairs in the MaxSub subset was compared between the MaxCluster program and the maxsub program. The TM-score of the MaxCluster program was compared relative to the output of the tmscore program. Scores were expressed as fractions of the reference score and the relative score was averaged for each protein in the dataset. A paired Student's t-test was performed to assess whether the differences between the scores were significant. The following table shows the average relative score (Diff) and the t-test p-value (p) for each protein in the dataset. Column prefixes: tm = tmscore, mc = MaxCluster, mt = MaxCluster run with the TM-score search. Column suffixes: ms = MaxSub score relative to maxsub, p = number of pairs relative to maxsub, tm = TM-score relative to tmscore.
PDB tm_ms.Diff tm_ms.p mc_ms.Diff mc_ms.p mc_p.Diff mc_p.p mt_ms.Diff mt_ms.p mt_p.Diff mt_p.p mc_tm.Diff mc_tm.p mt_tm.Diff mt_tm.p
1uxd 1.01 1.62e-12 0.999 7.07e-5 0.997 1.02e-10 0.976 1.62e-12 0.95 1.62e-12 0.974 5.55e-17 1 0
1uba 1.01 1.11e-16 0.998 1.42e-10 0.992 3.33e-16 0.962 0 0.93 0 0.969 1.62e-12 1 1.62e-12
1gab 1.02 0 0.995 0 0.988 0 0.954 0 0.912 0 0.974 5.55e-17 1 0
1bw6 1.02 1.62e-12 0.996 1.62e-12 0.992 1.62e-12 0.97 1.62e-12 0.94 1.62e-12 0.985 0 1 0.0789
1a32 1.02 1.62e-12 0.999 0.00422 0.998 0.000316 0.988 1.62e-12 0.962 1.62e-12 0.986 0 0.999 1.12e-14
2ezh 1.02 0 0.997 5.34e-5 0.993 1.83e-10 0.976 1.11e-16 0.945 0 0.984 1.62e-12 0.998 1.62e-12
1am3 1.02 5.55e-17 0.996 1.61e-15 0.992 1.11e-16 0.971 1.11e-16 0.944 0 0.976 0 1 0.228
1pou 1.03 1.11e-16 1 0.124 0.997 0.00069 0.983 1.67e-16 0.961 0 0.985 0 0.998 0
1kjs 1.02 0 0.998 1.66e-5 0.995 1.71e-8 0.974 0 0.951 0 0.985 1.62e-12 0.998 1.62e-12
1hyp 1.04 0 0.998 0.142 1 0.42 0.967 0 0.952 0 0.974 1.62e-12 0.992 1.62e-12
1nkl 1.03 1.62e-12 0.998 0.0037 0.998 0.00139 0.972 1.62e-12 0.948 1.62e-12 0.982 5.55e-17 0.998 3.33e-16
1nre 1.02 0 0.996 5.44e-8 0.993 5.85e-11 0.978 0 0.952 0 0.984 1.62e-12 0.998 1.62e-12
1cei 1.03 0 0.998 0.0576 0.998 0.146 0.971 0 0.959 0 0.976 1.62e-12 0.995 1.62e-12
5pti 1.02 1.63e-12 0.996 0.0498 0.993 0.0374 0.958 1.62e-12 0.928 1.62e-12 0.963 1.62e-12 0.998 3.24e-10
1tif 1.02 0 0.994 1.11e-16 0.987 1.67e-16 0.968 5.55e-17 0.932 0 0.986 1.62e-12 0.999 1.77e-12
2ptl 1.03 1.62e-12 0.995 9.93e-9 0.992 1.23e-10 0.962 1.62e-12 0.921 1.62e-12 0.98 1.62e-12 0.999 4.39e-7
1aa3 1.01 0 0.997 3.16e-10 0.992 5.55e-17 0.973 0 0.945 1.11e-16 0.987 1.62e-12 1 0.275
1orc 1.02 1.62e-12 0.997 1.83e-5 0.996 0.000359 0.971 1.62e-12 0.94 1.62e-12 0.978 1.62e-12 0.999 6.35e-6
1msi 1.03 8.98e-7 1 0.322 1 0.322 0.959 4.71e-9 0.929 6.87e-10 0.975 0 0.996 5.55e-17
1ctf 1.02 0 0.996 2.15e-10 0.995 4.58e-8 0.978 5.55e-17 0.947 0 0.986 0 0.999 2.07e-9
1afi 1.02 0 0.995 2.33e-15 0.987 1.11e-16 0.974 0 0.945 0 0.984 5.55e-17 0.998 0
5icb 1.03 0 0.998 0.000444 0.994 4.56e-7 0.971 0 0.948 5.55e-17 0.98 0 0.997 2.22e-16
2fow 1.02 0 0.997 7.85e-8 0.994 2.96e-9 0.973 0 0.944 0 0.983 1.11e-16 0.998 3.89e-16
1vcc 1.03 1.11e-16 1 0.36 1 0.462 0.974 1.67e-16 0.959 0 0.968 1.62e-12 0.991 1.62e-12
1vif 1.01 5.55e-17 0.999 0.00467 0.993 1.67e-16 0.981 0 0.953 0 0.963 5.55e-17 0.999 0.03
1tuc 1.02 1.62e-12 0.996 9.71e-9 0.989 1.62e-12 0.97 1.62e-12 0.938 1.62e-12 0.983 0 0.999 0.000459
1csp 1.02 0 0.997 0.000651 0.993 2.8e-6 0.963 0 0.933 0 0.977 1.62e-12 0.996 1.62e-12
Av 1.02 - 0.997 - 0.994 - 0.971 - 0.943 - 0.979 - 0.998 -
SD 0.0073 - 0.00184 - 0.00418 - 0.00774 - 0.0124 - 0.00709 - 0.00246 -
The table was produced using the Perl program assess_scores.pl.
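The calculation behind each Diff and p column can be sketched in Python as follows. This is not the assess_scores.pl script, whose contents are not reproduced here. It is a minimal illustration that assumes the t-test is applied to the paired raw scores, skips decoys whose reference score is zero, and approximates the Student's t distribution with a normal distribution, which is reasonable only because each protein has well over a thousand decoys.

```python
import math
import statistics


def relative_scores(reference, other):
    """Per-decoy ratios other/reference, ignoring zero reference scores."""
    return [o / r for r, o in zip(reference, other) if r != 0]


def paired_t(reference, other):
    """Paired t statistic and degrees of freedom for the score differences."""
    diffs = [o - r for r, o in zip(reference, other) if r != 0]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two usable pairs")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation to Student's t."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))
```

The mean of relative_scores for one protein corresponds to a Diff entry, and two_sided_p_normal applied to the t statistic gives an approximate counterpart of the matching p entry.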
The comparisons show that there are small but significant differences between the scores produced by each program. The tmscore program produces alignments with MaxSub scores approximately 2% higher than those of the maxsub program. This is attributed to the different distance thresholds used within their search engines. Both programs use a heuristic search to maximise the number of atoms in the subset below a set threshold. However, the tmscore search is performed using a higher distance threshold than the 3.5Å used for the maxsub search. It therefore identifies a transformation by considering more of the protein structure than maxsub does, which may subtly alter the orientation of the subset within 3.5Å and so increase the MaxSub score.
In comparison, the MaxCluster program produces alignment scores approximately 0.3% lower than the maxsub program. In addition, the MaxSub subset found by MaxCluster is approximately 0.6% smaller. This may reflect MaxCluster's search algorithm, which uses fewer starting seeds and is thus less exhaustive. Interestingly, when run using a TM-score search, MaxCluster produces alignments with a MaxSub score approximately 3% lower than maxsub. This contrasts with the tmscore program, which produces better scoring alignments, and may again reflect the different distance thresholds used within the internal search engines.
A comparison between the MaxCluster and tmscore output shows that the TM-scores of MaxCluster are approximately 2% lower when it is run in default mode. This is due to the use of a search algorithm that maximises the subset within a set distance rather than the TM-score. However, when run using the TM mode, the MaxCluster program produces results only 0.2% worse than tmscore. Thus the TM-score modification to the search engine produces alignments with better TM-scores.
These results show that the MaxCluster program produces very similar alignments to the maxsub program. This is not unexpected given that both programs implement a search algorithm to find the maximum number of pairs that can be superposed below a given input threshold. In contrast the tmscore program searches for an alignment with the best TM-score using its own distance threshold. Interestingly this results in an alignment with a higher MaxSub score. However since no information is output from the tmscore program regarding the number of residue pairs below the MaxSub scoring threshold of 3.5Å, an analysis of the size of the MaxSub set cannot be made.
The run-time analysis of the maxsub, tmscore and MaxCluster programs indicates that the maxsub program is approximately twice as fast as the other two programs for pairwise comparisons. However, by pre-loading the reference structure into memory, the MaxCluster program achieves an 8-fold speed increase during list processing. This makes it 4 and 9 times faster than maxsub and tmscore, respectively. Furthermore, pre-loading the model list will produce an even greater speed gain when performing all-versus-all comparisons for clustering.
Analysis of the output scores for each program indicates that there are marginal differences between the programs. The MaxCluster program produces alignments of comparable quality to the maxsub program and, when run using the TM-score search algorithm, to the tmscore program.
The pre-loading feature, combined with its comparable structure alignment performance, makes MaxCluster the program of choice for processing large protein datasets.