MaxCluster Home | FAQ | Performance | Contact | Download

MaxCluster

A tool for Protein Structure Comparison and Clustering

Introduction
Features
Structure Comparison
Clustering
Example Output
Example Alignments
- Pairwise Structural Superposition
- Multiple Structural Superposition
Options
File Formats
Performance
Download
Contact
References

Introduction [toc]

MaxCluster is a command-line tool for the comparison of protein structures. It provides a simple interface for a large number of common structure comparison tasks. A key feature of the program is the ability to process thousands of structures, either against a single reference protein or in an all-verses-all comparison.

Features [toc]

Sequence-dependent or sequence-independent alignment
RMSD and URMSD comparison
Relative RMSD and Relative URMSD comparison
MaxSub and TM-score search algorithm
Global Distance Test (GDT) scoring
Single structure comparison
Single structure verses multiple structure processing
All-verses-all multiple structure comparison
Single, average and maximum linkage hierachical clustering
Nearest-Neighbour clustering algorithm
3D-Jury ranking method
Accepts .bz2 and .gz compressed input PDB files
Reads and writes all-verses-all comparison scores
Produces rasmol script alignment files
Adjustable output logging level

Structure Comparison [toc]

MaxCluster computes several comparison scores between a set of matched residue pairs from two protein structures. Protein records must use the PDB file format and only ATOM records are read by the program.

Sequence Dependant Alignment

Default matching is performed using the protein sequence. The matching is made using the residue position and insertion code fields of the PDB ATOM records. Default matching is done using only the residue C_α atoms. Optionally matching can be performed using the backbone heavy atoms (N, C_α, C, O and C_β).

Sequence Independant Alignment

A sequence-independent alignment is computed using an implementation of the local backbone similarity matching method of the mammoth program[1]. The backbone is converted into a set of i to i+1 unit vectors between consecutive C_α atoms. The unit-vectors are used to build a similarity matrix between the backbone of one protein and the other. For each backbone position, seven consecutive unit-vectors are joined to create a heptamer. These are compared to the heptamers from the other protein backbone using a URMSD fit to create a similarity score between zero (no similarity) and ten (identical):

URMSD_reference = sqrt( 2.0 - (2.84 / sqrt(7.0)) )

if (urmsd > URMSD_reference)
    score = 0.0
else
    score = 10.0 * ((URMSD_reference - urmsd) / URMSD_reference)

This process is repeated for all consecutive heptamers thus creating an all-verses-all similarity matrix. Dynamic programming is used to find the optimal path through this local similarity matrix to provide a residue equivalency map. Dynamic programming is performed with a gap opening penalty of -7.0 and a gap extension penalty of -0.45.

RMSD and Relative RMSD

The Root Mean Squared Deviation (RMSD) is calculated as:

              N
rmsd = sqrt( Sum[ di * di ] / N )
              i
// Where:
//   di  = Distance between matched pair i
//   N   = Number of matched pairs

The RMSD is calculated using the superposition between matched pairs that results in the lowest RMSD score. This superposition is computed using the Kabsch rotation matrix[2,3].

The Relative RMSD (RRMSD[4]) is computed using the observed RMSD divided by the average RMSD of two proteins of equivalent radius of gyrations:

rrmsd = rmsd / AverageRMSD(RgA, RgB, pairs)

// Where:
//   RgA   = Radius of Gyration of protein A
//   RgB   = Radius of Gyration of protein B
//   pairs = Number of matched pairs

// To calculate AverageRMSD:

// Calculate the AverageAlignedCorrelationCoefficient, C(N),
// for an RMSD superposition of random proteins over length
// pairs.
Cn = 0.42 +
    ( (0.05 * -(pairs-1)) * exp(-(pairs-1) / 4.7) ) +
    (  0.63 * exp(-(pairs-1) / 37.0) )

// Now calculate the expected rmsd given the C(N)
AverageRMSD = sqrt(RgA*RgA + RgB*RgB - 2.0*Cn*RgA*RgB)

The RRMSD can be assigned a z-score[5] using the standard deviation of RRMSD values observed between random proteins of equivalent size:

z = (rrmsd - 1.0) / StdDevRRMSD(pairs)

// To calculate standard deviation of RRMSD:
StdDevRRMSD =  0.09 +
    1.16 * exp(-(pairs-1) / 1.6) +
    0.25 * exp(-(pairs-1) / 36.0)

URMSD and Relative URMSD

The Unit-vector RMSD (URMSD[6]) is calculated using unit-vectors between consecutive C_α atoms. Each vector is aligned at the origin thus creating a set of points on a unit sphere for each structure. These are then aligned to optimise the RMSD between the two sets of points. Since the maximum distance between any two corresponding points is 2 the URMSD has a range between 0 and sqrt(2). In contract to the RMSD, the URMSD is robust to structural deviations far from the core of the model since any point can only contribute a maximum of 2 to the sum total.

The relative URMSD is computed using the observed URMSD divided by the expected UMRSD between an equivalent sized set of random unit-vectors:

rurmsd = urmsd / sqrt(2.0 - 2.84/sqrt(pairs))

MaxSub and TM-score

The MaxCluster program provides a search algorithm to identify the maximal subset (MaxSub) of paired residues that can be superimposed within a given distance threshold, d. This subset is assessed using the MaxSub score[7] and Template Modelling (TM) score[8].

The search algorithm is a modification of the MaxSub search of Siew et al. (2000). The search begins with an initial seed of contiguous residues. These are then superposed and the structures rotated using their superposition. Pairwise distances are calculated and any residues falling within a set distance threshold are added to the seed. This process is iterated four times using a distance threshold of ¼d, ½d, ¾d and d where d is the distance threshold. The process is then iterated at a distance of d until no more residues can be added to the seed.

The search algorithm uses initial seeds of length 4, ¼N, ½N and N residues where N is the number of residues in the protein. Seeds of length L are chosen at 1 .. L, 1+½L .. L+½L, . . . , N-L .. N. Note that incrementing the start position by half the seed length introduces a dramatic speed gain due to the reduction in the total number of seeds that must be processed. The seed with the largest MaxSub or highest TM-score is used for the final superposition.

The MaxSub score is calculated as:

          1    M
MaxSub = ---  Sum [ 1 / { 1 + (di^2 / d^2) } ]
          N    i
// Where:
//   di  = Distance between identical residues i
//   d   = Distance threshold
//   M   = The number of residues in the MaxSub
//   N   = The number of residues in the experimental structure

The MaxSub score has a range between 0 and 1, where 1 is an identical pair of structures. To avoid insignificant alignment scores, the original MaxSub program set the score to zero if the number of pairs in the MaxSub was below 25 or the number of pairs in the MaxSub was below 40 and the MaxSub score was below 0.125. This feature has been removed in the MaxCluster program. However a command line option exists to allow this functionality (-noscore flag).

The original MaxSub score used a distance threshold of 3.5 Ångstroms. In addition, the score is calculated using only the pairs within the MaxSub although it is normalised by the length of the experiment structure. Consequently (a) the score will be different depending on the order of the proteins on the command line, (b) the score between random proteins is size-dependant and (c) the score will be the same for two different model proteins with identical structure within the MaxSub but different structure outside the MaxSub. These issues were by Zhang and Skolnick (2004) with the Template Modelling (TM) score.

The TM-score for the superposition is calculated as:

            1    N
TM-score = ---  Sum [ 1 / { 1 + (di^2 / d^2) } ]
            L    i
// Where:
//   di  = Distance between identical residues i
//   d   = Distance threshold
//   N   = The number of residue pairs
//   L   = The number of residues in the experimental structure

The TM-score has a range between 0 and 1 and is comparable across proteins of varying lengths due to the use of a size dependant distance threshold. The distance threshold is calculated as:

d = 1.24 x cube_root(N-15) - 1.8
// Where:
//   N   = The number of residue pairs

The expected TM-score value for a random pair of proteins is 0.17. The MaxSub and TM-scores have the advantage over RMSD that short sections of proteins that do not align correctly have less influence on the final score reducing the chance for skewed assessment.

Sequence Independant Iterative Optimisation

When performing a sequence independant alignment, the initial pairwise mapping between two proteins may not result in the best structural alignment. The mapping may have to be adjusted following initial alignment so that more suitable residues are paired. MaxCluster impliments an iterative optimisation process using the method of TM-align[9]. Using an alignment between two structures, the distance between residues can be used to create a scoring matrix:

if (dij > dmax)
    matrix(i, j) = 0.0;
else
    matrix(i, j) = (1.0 / (1.0 + (dij^2 / d^2)))

// Where:
//   dij  = Distance between residue i (structure A) and residue j (structure B)
//   d    = Distance threshold
//   dmax = The maximum distance allowed for comparable residues

dmax = 1.5 * N^0.3 + 3.5

// Where:
//   N    = The number of residues in the smallest structure

Given the scoring matrix, dynamic programming is used to remap the residues between structures A and B. A Smith-Waterman algorithm is employed with a gap opening penalty of -0.6 and gap extension penalty of 0.

With a new mapping between the two structures, the MaxSub search is repeated to find the optimal superposition. The process of MaxSub search and remapping can be iterated until convergence and the final mapping and alignment score are output. The iterative process can be followed by increasing the logging level of MaxCluster allowing all the mapping stages to be displayed.

Global Distance Test (GDT) score

The MaxCluster program can assess a structural alignment using the Global Distance Test (GDT) score[10]. This scoring system is provided as part of the LGA package of Zemla (2003) and is the preferred method of structural assessment used in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment. The score is computed as:

GDT score = 100 * (C1 + C2 + C3 + C4) / 4N

// Where:
//   C1   = Count of number of residues superposed below (threshold/4)
//   C2   = Count of number of residues superposed below (threshold/2)
//   C3   = Count of number of residues superposed below (threshold
//   C4   = Count of number of residues superposed below (2*threshold)
//   N    = Total number of residues

// GDT_TS (Total Score)   : threshold = 4
// GDT_HA (High Accuracy) : threshold = 2

The GDT score has a value of 0-100. A random superposition between unrelated structures will have a score of approximately 10-20.

MaxCluster uses the superposition generated using the MaxSub search engine to calculate the GDT score. By default the distance threshold used for the GDT score is 4 Ångstroms (GDT_TS score). However this can be altered on the command line.

Clustering [toc]

The MaxCluster program provides several different methods for selecting representative structures from a set of proteins: Hierarchical clustering, Nearest Neighbour clustering and 3D-Jury. The clustering methods require as input an all-verses-all matrix of comparison scores. This can be generated using the MaxCluster program or can be provided using an alternative comparison program. In this case the program provides the ability to adjust the clustering thresholds to suit the scale of the comparison score.

It should be noted that the MaxSub and TM score have a value of 0-1 where 1 is a perfect match. The clustering routines require the distance metric to specify zero for a perfect match and positive distances thereafter. In order to use the same clustering routines the MaxCluster program subtracts the MaxSub/TM score from 1. This should be remebered when setting the clustering thresholds on the command line where a low initial value of 0.2 means that structures must have a MaxSub of 0.8 to be part of the same cluster.

Hierarchical Clustering

Hierarchical clustering is performed by MaxCluster using the C clustering library[11].

Clustering begins by joining the two closest data points to create a node. This joining process is repeated, combining the closest items or nodes in pairwise fashion, until all items are members of the same node. A tree structure is then formed by tracing all joining events used to combine items. This tree can be split into n clusters where n is equal to or less than the number of original data points. This is done by ignoring the top n-1 linking events in the tree. The number of clusters, n, is determined using a threshold specifying the maximum allowed distance between two nodes for a joining event. This can be preconfigured or determined dynamically where the threshold is identified at run-time by iteratively incrementing from a starting value until the largest cluster reaches a desired size or a maximum threshold is reached.

Three types of hierarchical clustering are available within the MaxCluster program. They only differ in the method used to calculate distance between nodes:

Single Linkage (SL) defines the distance as the shortest distance between any two members of the two nodes to be combined.
Maximum Linkage (ML) defines the distance as the longest distance between any two members of the the two nodes to be combined. This is also known as Complete Linkage clustering.
Average Linkage (AL) defines the distance using the average of all pairwise distances between members of the two nodes.

Note: Cluster centroids can be used to determine the distance between nodes during the clustering process. This is named centroid linkage clustering but confusingly can also be known as average linkage clustering. This is not available within the MaxCluster program.

Clustering results in n clusters, each of which may contain a variable number of structures. Representatives are then selected for each cluster. This is done by calculating the protein with the lowest average distance to all other members of the cluster. This is labelled as the cluster centroid and the average distance defines the spread (or error) of the cluster. Cluster centroids are ranked by their cluster size, and in the case of equal size, by the cluster spread.

Due to their nature, there are speed differences between the different clustering methods. Each method requires comparison of all the pairwise distances between datapoints to determine the distance between nodes. This must be done for all pairwise node combinations for the closest nodes to be identified for the next joining event. However Maximum Linkage clustering only requires that a single distance between node members is greater than the current closest distance for the pair to be eliminated from the search for the next joining event. For this reason ML clustering is faster than SL or AL clustering, although the exact speed gains will be dependant on the data being clustered.

Nearest-Neighbour Clustering

The Nearest Neighbour (NN) clustering algorithm in MaxCluster is based on the method of Shortle et al. (1998)[12].

Nearest Neighbour clustering constructs clusters using the distance between a single data point, the centroid, and all the others. Two proteins are considered part of the same cluster if they are closer than a cut-off threshold, that is they are near-neighbours. For each protein in the dataset, the number of neighbours closer than the cut-off are included in that protein's cluster. Thus in contrast to hierarchical clustering, proteins may be assigned to multiple clusters. The threshold is empirically derived in an iterative procedure identical to the hierarchical clustering search that specifies a minimum initial cluster size. The threshold is then fixed for the remaining procedure. The item with the most neighbours is picked as the cluster centre and the cluster is removed from the pool. This involves removing the centroid and its closest n neighbours. When n is less than the cluster total this is referred to as Restricted Nearest Neighbour (RNN) clustering. The process is repeated until clusters contain less than a minimum number of members.

3D-Jury

The previous clustering methods are generic in nature, requiring only a measure of distance between data points to be clustered. 3D-Jury is an alternative clustering method specific to protein model selection[13]. The method uses the MaxSub score to compare protein models. The original method employs a MaxSub search using a distance threshold of 3.5 Ångstroms and structures are considered matched if they share a minimum of 20 pairs of residues at a MaxSub score above 0.2. For each protein in the dataset, matched structures are identified and the number of pairs in their MaxSub is added to the total count for that protein. The actual value of the MaxSub score is discarded. Structures are then ranked using the total count of MaxSub pairs. In this manner the protein that shares the most structure with all the other structures in the dataset will become the top ranked model. It should be noted that there are no clusters of similar models. 3D-Jury therefore provides a method of identifying the most frequent structure found within a model set but not a means of identifying different, frequently occurring structures, for example two different folds within a population.

The original 3D-Jury method employed the MaxSub score for the algorithm. However MaxCluster provides the functionality to use either the TM-score or the MaxSub score for the 3D-Jury method.

Example Output [toc]

Single Structure Comparison - MaxSub

[alex@oak] % maxcluster 5znf.pdb pdbs2/model.147.pdb
 _______________________________________________________________________________

                                     MaxCluster

                     Protein structure comparison and clustering

 _______________________________________________________________________________

Iter 1: Prediction size=  30
Iter 1: Experiment size=  30
Iter 1: Pairs=  12, RMSD= 0.531, MAXSUB=0.345. Len=  30. gRMSD= 4.238, TM=0.435

Alignment   ' = mapped, | = maxsub

Prediction  KTYQCQYCEY RSADSSNLKT HIKTKHSKEK
            '''''''''' ''''|||||| ||||||''''
Experiment  KTYQCQYCEY RSADSSNLKT HIKTKHSKEK

 _______________________________________________________________________________

Where:
Pairs	=	Number of pairs in the MaxSub
RMSD	=	RMSD of the MaxSub atoms
MAXSUB	=	MaxSub score
Len	=	Number of matched pairs
gRMSD	=	Global RMSD using the MaxSub superposition
TM	=	TM-score

Single Structure Comparison - RMSD

[alex@oak] % maxcluster 5znf.pdb pdbs2/model.147.pdb -rmsd
RMSD= 3.674 (Pairs=  30, rRMSD=0.512 ( -2.42)), URMSD= 0.718 (rURMSD=0.590)

Where:
RMSD	=	RMSD of superposition
Pairs	=	Number of matched pairs
rRMSD	=	Relative RMSD (z-score in brackets)
URMSD	=	URMSD of superposition
rURMSD	=	Relative URMSD (z-score in brackets)

List Processing - MaxSub

[alex@oak] % maxcluster -e 5znf.pdb -l model3.list
INFO  : Reading PDB list file 'model2.list'
INFO  : Successfully read 10 / 10 PDBs from list file 'model2.list'
INFO  : Successfully read 10 Chain structures
INFO  : Processing PDB 5znf.pdb vs. List model3.list
INFO  : CPU time = 0.06 seconds
INFO  :  10. 5znf.pdb vs. pdbs2/model.165.pdb   Pairs=  13, RMSD= 0.662, MAXSUB=0.342, TM=0.371, MSI=0.342
INFO  :   9. 5znf.pdb vs. pdbs2/model.128.pdb   Pairs=  12, RMSD= 0.498, MAXSUB=0.352, TM=0.371, MSI=0.352
INFO  :   8. 5znf.pdb vs. pdbs2/model.119.pdb   Pairs=  12, RMSD= 0.520, MAXSUB=0.345, TM=0.374, MSI=0.345
INFO  :   7. 5znf.pdb vs. pdbs2/model.179.pdb   Pairs=  11, RMSD= 0.403, MAXSUB=0.333, TM=0.376, MSI=0.333
INFO  :   6. 5znf.pdb vs. pdbs2/model.118.pdb   Pairs=  11, RMSD= 0.342, MAXSUB=0.342, TM=0.379, MSI=0.342
INFO  :   5. 5znf.pdb vs. pdbs2/model.144.pdb   Pairs=  12, RMSD= 0.456, MAXSUB=0.356, TM=0.390, MSI=0.356
INFO  :   4. 5znf.pdb vs. pdbs2/model.150.pdb   Pairs=  12, RMSD= 0.457, MAXSUB=0.358, TM=0.393, MSI=0.358
INFO  :   3. 5znf.pdb vs. pdbs2/model.191.pdb   Pairs=  12, RMSD= 0.532, MAXSUB=0.343, TM=0.394, MSI=0.343
INFO  :   2. 5znf.pdb vs. pdbs2/model.125.pdb   Pairs=  12, RMSD= 0.318, MAXSUB=0.376, TM=0.397, MSI=0.376
INFO  :   1. 5znf.pdb vs. pdbs2/model.147.pdb   Pairs=  12, RMSD= 0.531, MAXSUB=0.345, TM=0.435, MSI=0.345

Where:
Pairs	=	Number of pairs in the MaxSub
RMSD	=	RMSD of the MaxSub atoms
MAXSUB	=	MaxSub score
TM	=	TM-score
MSI	=	MaxSub Iteration score (see experimental features)

List Processing - RMSD

[alex@oak] % maxcluster -e 5znf.pdb -l model3.list -rmsd
INFO  : Reading PDB list file 'model3.list'
INFO  : Successfully read 10 / 10 PDBs from list file 'model3.list'
INFO  : Successfully read 10 PDB structures
INFO  : Processing PDB 5znf.pdb vs. List model3.list
INFO  : CPU time = 0.00 seconds
INFO  : 10. 5znf.pdb vs. pdbs2/model.125.pdb   RMSD= 8.286 (Pairs=  30, rRMSD=1.183 (  0.91)), URMSD= 1.070 (rURMSD=0.879)
INFO  :  9. 5znf.pdb vs. pdbs2/model.144.pdb   RMSD= 7.924 (Pairs=  30, rRMSD=1.130 (  0.64)), URMSD= 1.081 (rURMSD=0.888)
INFO  :  8. 5znf.pdb vs. pdbs2/model.118.pdb   RMSD= 7.846 (Pairs=  30, rRMSD=1.126 (  0.63)), URMSD= 1.116 (rURMSD=0.917)
INFO  :  7. 5znf.pdb vs. pdbs2/model.165.pdb   RMSD= 6.666 (Pairs=  30, rRMSD=0.931 ( -0.34)), URMSD= 0.997 (rURMSD=0.819)
INFO  :  6. 5znf.pdb vs. pdbs2/model.150.pdb   RMSD= 5.809 (Pairs=  30, rRMSD=0.840 ( -0.79)), URMSD= 0.918 (rURMSD=0.754)
INFO  :  5. 5znf.pdb vs. pdbs2/model.179.pdb   RMSD= 5.485 (Pairs=  30, rRMSD=0.813 ( -0.93)), URMSD= 0.883 (rURMSD=0.726)
INFO  :  4. 5znf.pdb vs. pdbs2/model.119.pdb   RMSD= 5.114 (Pairs=  30, rRMSD=0.745 ( -1.27)), URMSD= 0.892 (rURMSD=0.733)
INFO  :  3. 5znf.pdb vs. pdbs2/model.191.pdb   RMSD= 4.997 (Pairs=  30, rRMSD=0.742 ( -1.28)), URMSD= 0.841 (rURMSD=0.691)
INFO  :  2. 5znf.pdb vs. pdbs2/model.128.pdb   RMSD= 4.646 (Pairs=  30, rRMSD=0.680 ( -1.58)), URMSD= 0.779 (rURMSD=0.640)
INFO  :  1. 5znf.pdb vs. pdbs2/model.147.pdb   RMSD= 3.674 (Pairs=  30, rRMSD=0.529 ( -2.34)), URMSD= 0.718 (rURMSD=0.590)

Where:
RMSD	=	RMSD of superposition
Pairs	=	Number of matched pairs
rRMSD	=	Relative RMSD (z-score in brackets)
URMSD	=	URMSD of superposition
rURMSD	=	Relative URMSD

Multiple Structure Comparison - MaxSub

[alex@oak] % maxcluster -l model3.list -C 1 -P 10
INFO  : Reading PDB list file 'model3.list'
INFO  : Successfully read 10 / 10 PDBs from list file 'model3.list'
INFO  : Successfully read 10 Chain structures
INFO  : Processed 40 of 45 MAXSUBs
INFO  : CPU time = 0.02 seconds
INFO  : ======================================
INFO  : 3D-Jury (Threshold: >    10 pairs @ > 0.200)
INFO  : ======================================
INFO  : Rank     Model    Pairs       File
INFO  :     1 :        1      151       pdbs2/model.165.pdb
INFO  :     2 :        7      145       pdbs2/model.150.pdb
INFO  :     3 :        6      144       pdbs2/model.144.pdb
INFO  :     4 :        9      142       pdbs2/model.125.pdb
INFO  :     5 :        4      137       pdbs2/model.179.pdb
INFO  :     6 :        8      137       pdbs2/model.191.pdb
INFO  :     7 :       10      137       pdbs2/model.147.pdb
INFO  :     8 :        3      135       pdbs2/model.119.pdb
INFO  :     9 :        5      131       pdbs2/model.118.pdb
INFO  :    10 :        2      129       pdbs2/model.128.pdb
INFO  : ======================================
INFO  : Pairwise single linkage clustering
INFO  : ======================================
INFO  : Hierarchical Tree
INFO  : ======================================
INFO  : Node     Item 1   Item 2      Distance
INFO  :     0 :        6        1        0.476  pdbs2/model.144.pdb  pdbs2/model.165.pdb
INFO  :    -1 :        7        4        0.484  pdbs2/model.150.pdb  pdbs2/model.179.pdb
INFO  :    -2 :        9        2        0.576  pdbs2/model.125.pdb  pdbs2/model.128.pdb
INFO  :    -3 :       -2        0        0.598
INFO  :    -4 :       10       -3        0.615  pdbs2/model.147.pdb
INFO  :    -5 :       -1       -4        0.618
INFO  :    -6 :        8       -5        0.620  pdbs2/model.191.pdb
INFO  :    -7 :        3       -6        0.626  pdbs2/model.119.pdb
INFO  :    -8 :        5       -7        0.629  pdbs2/model.118.pdb
INFO  : ======================================
INFO  : 1 Clusters @ Threshold  0.800 (0.8)
INFO  : ======================================
INFO  : Item     Cluster
INFO  :     1 :        1                        pdbs2/model.165.pdb
INFO  :     2 :        1                        pdbs2/model.128.pdb
INFO  :     3 :        1                        pdbs2/model.119.pdb
INFO  :     4 :        1                        pdbs2/model.179.pdb
INFO  :     5 :        1                        pdbs2/model.118.pdb
INFO  :     6 :        1                        pdbs2/model.144.pdb
INFO  :     7 :        1                        pdbs2/model.150.pdb
INFO  :     8 :        1                        pdbs2/model.191.pdb
INFO  :     9 :        1                        pdbs2/model.125.pdb
INFO  :    10 :        1                        pdbs2/model.147.pdb
INFO  : ======================================
INFO  : Centroids
INFO  : ======================================
INFO  : Cluster  Centroid  Size        Spread
INFO  :     1 :        1       10        0.566  pdbs2/model.165.pdb
INFO  : ======================================

3D-Jury output:
Rank	=	Jury rank
Model	=	Original model input number
Pairs	=	Cumulative MaxSub pair total
File	=	PDB file

Hierarchical tree:
Node	=	Node ID
Item 1	=	First item joined to form node
Item 2	=	Second item joined to form node
Distance	=	Distance between items

Cluster assignment:
Item	=	Item ID
Cluster	=	Assigned cluster (zero is no cluster)

Centroids:
Cluster	=	Cluster ID
Centroid	=	Centroid ID
Size	=	Cluster size
Spread	=	Cluster spread

[alex@oak] % maxcluster -l model3.list -C 4 -is 3 -rmsd
INFO  : Reading PDB list file 'model3.list'
INFO  : Successfully read 10 / 10 PDBs from list file 'model3.list'
INFO  : Successfully read 10 PDB structures
INFO  : All vs. All RMSD
INFO  : Processed 40 of 45 RMSDs
INFO  : CPU time = 0.00 seconds
INFO  : ======================================
INFO  : Nearest Neighbour clustering
INFO  : ======================================
INFO  : Centroids
INFO  : ======================================
INFO  : Cluster  Centroid  Size        Spread
INFO  :     1 :        1        3        2.203  pdbs2/model.165.pdb
INFO  : ======================================
INFO  : 1 Clusters @ Threshold  4.000 (Assigned 3 / 10)
INFO  : ======================================
INFO  : Item     Cluster
INFO  :     2 :        0                        pdbs2/model.128.pdb
INFO  :     4 :        0                        pdbs2/model.179.pdb
INFO  :     5 :        0                        pdbs2/model.118.pdb
INFO  :     6 :        0                        pdbs2/model.144.pdb
INFO  :     8 :        0                        pdbs2/model.191.pdb
INFO  :     9 :        0                        pdbs2/model.125.pdb
INFO  :    10 :        0                        pdbs2/model.147.pdb
INFO  :     1 :        1                        pdbs2/model.165.pdb
INFO  :     3 :        1                        pdbs2/model.119.pdb
INFO  :     7 :        1                        pdbs2/model.150.pdb
INFO  : ======================================

Centroids:
Cluster	=	Cluster ID
Centroid	=	Centroid ID
Size	=	Cluster size
Spread	=	Cluster spread

Cluster assignment:
Item	=	Item ID
Cluster	=	Assigned cluster (zero is no cluster)

Example Alignments [toc]

The following images were produced using the superposition output of MaxCluster. The output is a script suitable for viewing in rasmol[14]. Although C_α traces are shown, the output script contains all the original input coordinates.

Pairwise Structural Superposition

[alex@oak] % rasmol -script align.rsm

Pairwise Structural Superposition

Superposition of the zinc-finger protein 5znf with a model produced using a de novo modelling protocol. The image shows the experiment in blue and the prediction in green. Residues in the MaxSub are shown as thick tubes in a darker shade.

Multiple Structural Superposition

[alex@oak] % rasmol -script super.rsm

Multiple Structural Superposition

Superposition of the zinc-finger protein 5znf with 3 models produced using a de novo modelling protocol. The image shows the experiment in white using thick tubes. The predictions are shown with thin tubes using a rainbow color scheme from N-terminus (blue) to C-terminus (red).

Options [toc]

Pairwise comparisons:

% maxcluster -e experiment.pdb -p prediction.pdb

// Quick command line - the experiment PDB is assumed to be second on the command line
% maxcluster prediction.pdb experiment.pdb

// Quick command line with distance threshold of 3.5
% maxcluster prediction.pdb experiment.pdb 3.5

List processing:

% maxcluster -e pdb1 -l list

All-verses-all comparisons:

% maxcluster -l list

In the following options n refers to an unsigned integer and f refers to a floating-point number. Argument values in brackets [] are compulsory, argument values in parentheses {} are optional.

General

-h	Print help
-help	Print extended help
-V	Print version
-version	Print version
-L [n]	Logging level
-log [file]	Log file Redirects stdout to log file

Structure Comparison

-e [file]	Experiment PDB file
-p [file]	Prediction PDB file
-l [file]	PDB file list
-d [f]	The distance cut-off for MaxSub search Defaults to auto-calibrate using TM-score size-dependant distance threshold
-N [n]	The normalisation length for TM-score Default is length of experiment
-rmsd	Perform RMSD fit
-gdt {f}	Perform Global Distance Test (GDT) at ¼x, ½x, 1x and 2x distance cut-off The GDT score is performed using the MaxSub superposition obtained with the search distance threshold specified by the -d option The GDT score threshold is optionally specified as a parameter (default = 4)
-av	Print average for all-vs-all comparison
-nosort	Do not perform sort in list mode
-TSF [f]	Set the MaxSub threshold scale factor (TSF) for auto-calibration The MaxSub search uses a size-dependant distance for comparable atoms. This can be too low when comparing small models. Use the TSF to increase the distance threshold. Default = 1
-rankMS	Rank MaxSub results by MaxSub Default is TM-score unless iterations > 1 (see MaxSubDom) This option will change the clustering from TM-score to MaxSub score
-noscore	Do not produce MaxSub for poor alignments (mimics maxsub program)
-bb	Perform RMSD fit using backbone atoms (N, C_α, C, O and C_β)
-urmsd	Generate rotation matrix for output superposition using URMSD fit
-maxRMSD	Find MaxSub set with RMSD below threshold The default is to find maximum set with all pairwise distances below the threshold. Using this option the RMSD of the entire set is below the threshold. This allows for a larger subset because distances may exceed the threshold. This is a slower search.
-TM	Find MaxSub set with highest TM-score This is a slower search as the TM-score must be calculated for all superpositions. In contrast the MaxSub search only requires a count of residues below a set distance.
-seed [n]	Create seed using ( mapped pairs / 2^i ) where { i=0 .. n } (default = 2)
-all	Print all atoms for -super option (default is C-alpha)
-super [n]	Print superposition file for chain vs. list (top n models) Default = 0
-nourmsd	Disable URMSD in RMSD mode (faster processing)
-noalign	Disable superposition output to align.rsm This feature is enabled by default for convenience. Disable this for speed purposes.
-lws [n]	Line wrap size for alignment output (default = 60)

Sequence-independant mode:
-in	Sequence independant mode
-matrix	Print the pairwise scoring matrices produced for sequence-independant alignment to file Files are named align.matrix.[i].[n] where: [i] is the MaxSub iteration (see MaxSubDom) [n] is the MaxSub optimisation iteration
-opt [n]	Mamimum number of iterative optimisations (default = 20)
-open [f]	Gap opening penalty for iterative optimisation (default = -0.6)
-aff [f]	Gap affine penalty for iterative optimisation (default = 0)

Clustering

-C [n]	Cluster method: 0 - No clustering 1 - Single linkage 2 - Average linkage 3 - Maximum linkage 4 - Neighbour pairs (min size) 5 - Neighbour pairs (absolute size) (default = 5)
-T [f]	Initial clustering threshold
-Tm [f]	Maximum clustering threshold Use when dynamically configuring clustering threshold
-a [f]	Clustering threshold adjustment The clustering threshold is adjusted by this amount until either the maximum threshold or the intial cluster size is reached.
-is [n]	Initial cluster size Clustering threshold is adjusted until the initial cluster reaches this size
-ms [n]	Minimum cluster size Clustering continues until no more clusters of this size can be formed

3D-Jury:
-s [f]	3D-Jury score threshold
-P [n]	3D-Jury pair threshold

Output:
-R [file]	Print distance results to file in a format readable by -M / -F option
-Rl [file]	Print lite distance results to file readable by -M option (MaxSub mode only)
-Ru [file]	Print UMRSD distance results to file readable by -F option (RMSD mode only)

Input:
-F [file]	Cluster distance file
-M [file]	MaxSub file

Experimental features

MaxCluster is a program under development. Consequently there are several experimental features available. These have been left in the program as they provide additional functionality that may be of use.

Contact Maps

Contact maps are used to specify the residues in a protein structure that are touching. Given a pair of structures for the same sequence, the contact maps can be compared for similarity. Contacts are identified if the C_β (C_α for Glycine) atoms of two residues are within a set distance. Residue contacts closer than a set sequence separation are ignored.

MaxCluster has the ability to compare the contact map of a reference structure against a single or a list of model structures using the -contact option in RMSD mode. The output shows the percentage of native contacts found in the model (i.e. the number of contacts in the model that are also found in the native):

[alex@oak] % maxcluster -e 5znf.pdb -p pdbs2/model.119.pdb -rmsd -contact
RMSD= 5.114 (Pairs=  30, rRMSD=0.721 ( -1.38)), URMSD= 0.892 (rURMSD=0.733), NC= 62.50% (5/8)

Options:

-contact	Perform % native contacts calculation
-dist [f]	Distance for CB-CB contact (default = 7.5)
-gap [n]	Minimum sequence separation for contact residues (default = 5)

MaxSubDom

MaxSubDom (MaxSub Domain) is an iterated version of MaxSub. It was invented by Daniel Fischer (coinventor of MaxSub, Siew et al. (2000)) as a measure to assess multiple-domain structures.

When modelling multiple domain proteins, it is often possible to predict both domains to high accuracy but the orientation of the domains may be incorrect. Thus an algorithm that aligns the entire protein will superpose neither domain correctly and an algorithm that align subsets (e.g. MaxSub) will superpose one domain and the other will be totally misaligned. This problem can be circumnavigated by running MaxSub, extracting the set of atoms in the subset and running MaxSub again on teh remaining atoms.

MaxCluster provides the ability to iterate MaxSub searches in a MaxSubDom mode. Following each iteration the atoms in the MaxSub are removed from the structures and the search repeated. This continues for a set number of iterations or until the MaxSub fails certain criteria. The result is scored using the cumulative total of the MaxSub scores for each iteration. Alternatively the result can be scored using MaxCluster's own MaxSub Iterations (MSI) score:

MSI = 0
for 1 .. N
    MSI += MaxSub / i

// where
//   MaxSub  = MaxSub score of iteration i
//   N       = Total number of iterations

The MSI score is equivalent to the area under the cumulative total of the successive MaxSubs. This favours models where a larger MaxSub score is achieved in early iterations. The following diagram illustrates the difference between the two scores:

             Iteration               MSI                   MaxSubDom
                                    0.475                    0.55
             1     2
            0.4   0.15         (0.4   +  0.55) / 2         0.4 + 0.15
     1-|
       |
       |                                ------               ------
MaxSub |    ___                ______   |    |               |    |
 Score |                       |    |   |    |               |    |
       |                       |    |   |    |               |    |
       |          ---          |    |   |    |               |    |
     0-|                       |    |   |    |               |    |


             Iteration               MSI                   MaxSubDom
                                    0.425                    0.55
             1     2
            0.3   0.25         (0.3   +  0.55) / 2         0.3 + 0.25
     1-|
       |
       |                                ------               ------
MaxSub |                                |    |               |    |
 Score |    ___                ______   |    |               |    |
       |          ---          |    |   |    |               |    |
       |                       |    |   |    |               |    |
     0-|                       |    |   |    |               |    |

Both cases receive the same total using the MaxSubDom score. However the case with the higher MaxSub score for the first alignment scores higher using the MSI score. It should be noted that:

The MSI score will not give a score of 1 if two domains in the model have the correct structure but are incorrectly orientated
Under the current system there is no penalty for the degree of translation required to align the two MaxSub orientations

When comparing two structures, the MaxCluster program provides the ability to output the coordinates of the model superposed on the reference structure for each of the iteration MaxSubs.

Options:

-O [n]	MaxSubDom output options: 0 - Output MaxSub atoms (default) 1 - Output all atoms spanning MaxSub 2 - Output all atoms
-mS [f]	Minimum score for MaxSub iteration (default = 0)
-mP [f]	Minimum pairs for MaxSub iteration (default = 4)
-i [n]	MaxSubDom iterations (default = 1)

File Formats [toc]

MaxCluster provides the facility to write/read the all-verses-all distance matrix to/from file. This allows repeated clustering analysis to be done on the same data. The file format used by the program is very simple fixed-width format. It consists of:

A size record specifying the matrix size
chain records specifying the PDB file names
distance or transformation records specifying the RMSD/MaxSub results

The size and chain records are the same for RMSD or MaxSub results files. The results records are different for the RMSD, MaxSub or MaxSub lite output files. Comments are allowed in the files by starting the line with a # character.

Results files are written using the -R option. Because the results files are different for MaxSub or RMSD analysis there are two loading options. MaxSub results file must be loaded with the -M option and distance matrix files (RMSD/URMSD) must be loaded with the -F option.

The records are explained below.

Size Record

SIZE : <N>

Where:
N    = The size of the distance matrix

e.g.
SIZE : 10

Chain Record

PDB  : <ID> <Filename>

Where:

ID       = The ID of the PDB file
Filename = The name of the PDB file

e.g.
PDB  :      1   pdbs2/model.165.pdb

Distance Record

Output when using the -rmsd option to use RMSD distance metric. The same record format is used by the URMSD results file (-Ru).

DIST : <ID> <ID> <Distance>

Where:

ID       = The ID of point 1 (must be between 1 and Size)
ID       = The ID of point 2 (must be between 1 and Size)
Distance = The distance between point 1 and 2 (must be zero or above)

e.g.
DIST : 1 2 2.3

MaxSub Transformation Record

Contain the MaxSub scores are transformation matrix for the superposition. This record contains part of the information necessary to perform the 3D-Shotgun method for building consensus coordinate models. This functionality is now depreciated. To omit the transformation matrix use the MaxSub Lite output (-Rl option).

TRANS: <ID> <ID> <MaxSub> <Pairs> <CoM1 x y z> <CoM2 x y z> <rotmat ...>

Where:

ID           = The ID of molecule 1 (must be between 1 and Size)
ID           = The ID of molecule 2 (must be between 1 and Size)
MaxSub       = The MaxSub score between molecule 1 and 2 (must be 0 - 1)
MaxSub       = The MaxSub score between molecule 2 and 1 (must be 0 - 1)
Pairs        = The MaxSub pairs between molecule 1 and 2 (must be above zero)
<CoM1 x y z> = Center of mass of molecule 1 - three coords
<CoM2 x y z> = Center of mass of molecule 2 - three coords
<rotmat ...> = Rotation matrix to transform molecule 2 onto 1
               Use as:
               xtmp = coord[0] - CoM2[0];
               ytmp = coord[1] - CoM2[1];
               ztmp = coord[2] - CoM2[2];
               coord[0] = xtmp*R[0] + ytmp*R[1] + ztmp*R[2] + CoM1[0];
               coord[1] = xtmp*R[3] + ytmp*R[4] + ztmp*R[5] + CoM1[1];
               coord[2] = xtmp*R[6] + ytmp*R[7] + ztmp*R[8] + CoM1[2];
}
e.g.

1 2 0.9 0.9 9 30.1 25.7 81.9 30.5 26.6 23.8 0.281 -0.679 0.679 -0.648 -0.655 -0.387 0.708 -0.331 -0.624

MaxSub Positions Record

Contain the residue pairs that are in the MaxSub. This record contains part of the information necessary to perform the 3D-Shotgun method for building consensus coordinate models. This functionality is now depreciated. To omit the positions record use the MaxSub Lite output (-Rl option).

POS  : <ID> <ID> <Positions ...>

Where:

ID           = The ID of molecule 1 (must be between 1 and Size)
ID           = The ID of molecule 2 (must be between 1 and Size)
Positions    = Zero-indexed residue pairs that are found in the MaxSub
               (When PDB files are read by the program they are renumbered
               for MaxSub processing from 0 onwards)

e.g.
POS  : 1 2   2:2 3:3 4:4 6:6 7:7 9:9

MaxSub Lite Record

Output when using the '-Rl' option.

TRANS: <ID> <ID> <MaxSub> <Pairs>

Where:

ID           = The ID of molecule 1 (must be between 1 and Size)
ID           = The ID of molecule 2 (must be between 1 and Size)
MaxSub       = The MaxSub score between molecule 1 and 2 (must be 0 - 1)
MaxSub       = The MaxSub score between molecule 2 and 1 (must be 0 - 1)
Pairs        = The MaxSub pairs between molecule 1 and 2 (must be above zero)

e.g.

1 2 0.9 0.9 9

Example Distance Matrix File

###################################
# Maxcluster Distance Matrix File
###################################
SIZE : 2
##########################
# Chain records
# PDB  : >ID< >Filename<
##########################
PDB  :      1   pdbs2/model.165.pdb
PDB  :      2   pdbs2/model.128.pdb
###################################
# Distance records
# DIST : >ID1< >ID2< >Distance<
###################################
DIST :      1      2     7.0162

Example MaxSub Transformation File

##########################
# Maxcluster MaxSub File
##########################
# Size record
SIZE : 2
##########################
# Chain records
# PDB  : <ID> <Filename>
##########################
PDB  :      1   pdbs2/model.165.pdb
PDB  :      2   pdbs2/model.128.pdb
##########################
# Transformation records
# TRANS: <ID1> <ID2> <MaxSub 1-2> <MaxSub 2-1> <Pairs> <CoM 1> <CoM 2> <RotMat>
##########################
TRANS:      1      1       1.0000     1.0000   30     21.291   77.870    1.015     21.291   77.870    1.015      1.000    0.000 0.000    0.000    1.000    0.000    0.000    0.000    1.000
TRANS:      1      2       0.2953     0.2953   10     25.717   77.873   -1.284     65.534  -26.830   77.395     -0.507    0.782 0.362   -0.181    0.314   -0.932   -0.843   -0.538   -0.018
TRANS:      2      2       1.0000     1.0000   30     62.002  -29.267   75.081     62.002  -29.267   75.081      1.000    0.000 0.000    0.000    1.000    0.000    0.000    0.000    1.000
##########################
# Position records
# POS  : <ID> <ID> <Positions ...>
##########################
POS  :      1      2    14:14 15:15 16:16 17:17 18:18 19:19 20:20 21:21 22:22 23:23

Example MaxSub Transformation File Lite

##########################
# Maxcluster MaxSub File
##########################
# Size record
SIZE : 2
##########################
# Chain records
# PDB: >ID< >Filename<
##########################
PDB:      1   pdbs2/model.165.pdb
PDB:      2   pdbs2/model.128.pdb
##########################
# Maxsub records
# MS : >ID1< >ID2< >1-2< >2-1< >Pairs<
##########################
MS :      1          1  1.000   1.000     30
MS :      1          2  0.295   0.295     10
MS :      2          2  1.000   1.000     30

Performance [toc]

The performance of the MaxCluster program was compared to the maxsub and the tmscore programs. Run-time analysis of the programs indicates that by pre-loading the reference structure into memory, the MaxCluster program achieves a ten-fold speed increase during list processing. This makes it 4 and 9 times faster than maxsub and tmscore respectively.

Analysis of the output scores for each program indicate that the MaxCluster program produces alignments of comparable quality to the maxsub program in default mode and the tmscore program when run using the TM-score mode.

The pre-loading feature, combined with its comparitive structure alignment performance, make MaxCluster the program of choice for processing large protein datasets.

More details can be found in the Performance section.

Download [toc]

MaxCluster is available as a pre-compiled binary from the download area.

Contact [toc]

MaxCluster was developed by Alex Herbert at the Structural Bioinformatics Group, Imperial College, London.
e-mail: alex.herbert03 [at] alumni.imperial.ac.uk

If you have a question, see if it has been asked before in the Frequently Asked Questions.

References [toc]

Ortiz, A. R., Strauss, C. E., and Olmea, O. (2002). MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci, 11, 2606-21.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32, 922-923.
Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 34, 827-828.
Betancourt, M.R. and Skolnick J. (2001) Universal similarity measure for comparing protein structures. Biopolymers, 59, 305-309.
Kihara, D. and Skolnick, J. (2003) The PDB is a Covering Set of Small Protein Structures. J Mol Biol, 334, 793-802.
Kedem, K., Chew, L. P., and Elber, R. (1999). Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins, 37, 554-64.
Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D. (2000). MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776-85.
Zhang, Y. and Skolnick, J. (2004). Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702-710.
Zhang, Y. and Skolnick, J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res, 33, 2302-2309.
Zemla, A. (2003) LGA: A method for finding 3D similarities in protein structures Nucleic Acids Res, 31, 3370-4.
de Hoon, M. J. L., Imoto, S., Nolan, J., and Miyano, S. (2004). Open source clustering software. Bioinformatics, 20, 1453-1454.
Shortle, D., Simons, K. T., and Baker, D. (1998). Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci U S A, 95, 11158-62.
Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L. (2003). 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics, 19, 1015-8.
Sayle, R. A. and Milner-White, E. J. (1995). RASMOL: biomolecular graphics for all. Trends Biochem Sci, 20, 374.