To download, use "git clone http://www.sbg.bio.ic.ac.uk/~sm2708/~sm2708/software/git/benchmarking.git/"
The purpose of this set of tools is the reliable benchmarking of protein structure prediction methods. This is done by taking a set of sequences from the PDB, building an HMM for each input and then clustering the PDB files into sets that are "unreachable" via an HMM-HMM search. These clusters may then be used as training, test and validation sets.
There are several tools included in this benchmarking suite for each step of the workflow. In short, it is necessary to generate an HMM for each input sequence, generate a network of e-values between each HMM and then cluster the HMMs. Also included are some diagnostic tools for evaluating how effective the clustering method is at partitioning the network of HMM-HMM matches.
The following steps will show how to generate the clusters.
- Generate HMMs
The script generate_hmms.pl will, unsurprisingly, generate an HMM for each input sequence. See the documentation for generate_hmms.pl for a full list of options. Since you'll probably be running lots of jobs in parallel, you'll probably want to use the --queue option, which allows multiple processes to safely read from a single file containing a list of PDB codes.
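The queue mechanism can be sketched as follows. This is purely an illustration of the idea, not the actual implementation inside generate_hmms.pl: several worker processes take exclusive turns popping the first entry off a shared queue file. The function name pop_queue_entry is invented for this sketch.

```python
# Illustrative sketch only: shows how several workers can safely pop one
# entry at a time from a shared queue file using an exclusive file lock.
import fcntl

def pop_queue_entry(path):
    """Lock the queue file, remove its first line, and return it (or None)."""
    with open(path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # exclusive lock: one worker at a time
        lines = fh.read().splitlines()
        if not lines:
            return None                  # queue exhausted
        head, rest = lines[0], lines[1:]
        fh.seek(0)
        fh.truncate()
        fh.write("\n".join(rest) + ("\n" if rest else ""))
        return head                      # lock released when the file closes
```

Each worker simply calls pop_queue_entry in a loop until it returns None, so no PDB code is processed twice.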
Sample invocation, assuming that hhblits is installed in $hh_suite/bin and psipred is installed at $psipred:

./generate_hmms.pl --prefix $hh_suite \
    --psipred-bin $psipred/bin \
    --psipred-data $psipred/data \
    --threads 4 --hmm-out hmms \
    --sequence-db nr20_12Aug11 \
    --queue pdb_queue
Assuming that this command runs successfully, the directory hmms should now be filled with .a3m, .hhm and .hhr files. The A3M files are sequence alignments, HHM files are representations of the HMM generated and HHR files are reports from hhblits detailing the process.
- Generate network
After the HMMs have been generated, a network detailing HMM matches must be generated. Each node in this network will be an HMM and each edge will connect to each other HMM that can be reached via an hhsearch against the database of all HMMs. The distance assigned to each edge will be the e-value, as reported by hhsearch.
- 1. Create hhsearch library
An hhsearch library simply consists of several HMMs concatenated together. We shall be inventive and call our library hmms.lib
cat hmms/*.hhm > hmms.lib
- 2. Create list of HMMs
To actually generate the network of HMMs, we need a list of all HMMs to scan against hmms.lib. You may need to use absolute paths in this file, depending on your setup.
ls hmms/*.hhm > hmm_list
- 3. Run generate_network.pl
The following commands will fill the directory reports with reports from hhsearch, and output the network information into the file network.
cp hmm_list hmm_queue

./generate_network.pl --prefix $hh_suite \
    --threads 1 \
    --repdir reports \
    --queue hmm_queue \
    --db hmms.lib \
    --output network
Notice that we first made a copy of hmm_list. This is because generate_network.pl will destroy the file passed as the --queue option.

If generate_network.pl is being run in parallel, each instance will only process a subset of HMMs. In this case, you should add the --noprocess flag, which will cause generate_network.pl to only generate the reports. Then, in a single process, re-run generate_network.pl without the --noprocess flag. Unless the --clobber flag is specified, generate_network.pl will not overwrite the reports, so it should take very little time to generate the output file.
The output file will have the following (tab-separated) format:
HMMA	HMMB	distance
HMMA	HMMC	distance
...
The first two records specify two nodes in the network of HMMs, and the third record gives the distance between the two nodes.
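Given that layout, reading the network into memory is straightforward. A minimal sketch in Python, assuming the tab-separated three-column format described above (the function name read_network is invented here):

```python
# Sketch: read the tab-separated network file (HMM-A, HMM-B, distance)
# into an adjacency map keyed by node name.
from collections import defaultdict

def read_network(path):
    edges = defaultdict(dict)
    with open(path) as fh:
        for line in fh:
            a, b, dist = line.rstrip("\n").split("\t")
            edges[a][b] = float(dist)   # distance is the hhsearch e-value
    return edges
```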
- Cluster network
The network can be clustered using any standard method that can accept a sparse distance matrix. I have chosen to use the tool mcl for Markov clustering. Markov clustering is able to cope well with sparse distance matrices, unlike methods such as k-means or hierarchical clustering.
mcl network --abc -o clusters
The --abc option to mcl tells it that the input is in "A B C" tab-separated format. This is the same format as produced by generate_network.pl.
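mcl writes one cluster per line, with the member labels tab-separated. A minimal sketch for reading that output (the function name read_clusters is invented here):

```python
# Sketch: parse mcl's label output, which places one cluster per line
# with tab-separated member labels, into a list of sets.
def read_clusters(path):
    with open(path) as fh:
        return [set(line.rstrip("\n").split("\t"))
                for line in fh if line.strip()]
```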
Once the network has been clustered, the effectiveness of the clustering must be evaluated. We wish to be sure that each cluster is well isolated from every other cluster. The tool provided to do this is check_clusters.pl. For details, including file formats, see the documentation for check_clusters.pl. A brief overview is given here.
The check_clusters.pl program performs an all-versus-all search. It takes as input the file containing the cluster definitions (in our case, as produced by mcl). The file is processed cluster-by-cluster: for each cluster, an HMM library is produced consisting of all HMMs in that cluster. Then, all HMMs not in that cluster are searched against that database and their e-values recorded.
This quickly becomes infeasible because it results in an excessive number of searches. However, a good estimate of the overall partitioning can be obtained by random sampling. Here, the GNU tool shuf is used to select a random subset of clusters. The checks may then be run in parallel.
shuf clusters | head -n 100 > clusters_short

cp clusters_short clusters_short_q

./check_clusters.pl --hmm hmms --clusters clusters_short \
    --queue clusters_short_q --out cluster_similarities
The file cluster_similarities will now contain a line for each cluster. Each line will begin with the name of the first HMM in that cluster, and the remaining tab-separated records will contain the e-value from a search against that cluster.
From the file cluster_similarities, many things may be done. For example, to generate a long list of e-values that can be plotted into a histogram by R, use the following perl one-liner:
perl -F'\t' -na -E 'shift @F; say join("\n", @F)' cluster_similarities
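For those who prefer Python, here is an equivalent sketch of the same flattening (the function name flatten_evalues is invented here):

```python
# Sketch equivalent to the Perl one-liner above: drop the leading HMM
# name from each line of cluster_similarities and print every remaining
# e-value on its own line, ready for plotting as a histogram.
import sys

def flatten_evalues(path, out=sys.stdout):
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            for evalue in fields[1:]:     # fields[0] is the HMM name
                print(evalue, file=out)
```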
- Cluster Size Versus Hits
The tool size_vs_hits.pl is supplied to produce a scatter plot with the size of a cluster against the number of foreign (from outside the cluster) HMMs that can reach the cluster with a hit under a specified e-value cutoff.
./size_vs_hits.pl cluster_similarities clusters 0.01
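The computation that size_vs_hits.pl performs can be sketched as follows, under the file formats described above: the first HMM named on a line of cluster_similarities identifies its cluster, and a foreign hit is any e-value below the cutoff. This is an illustration, not the script itself, and the function name size_vs_hits is invented here.

```python
# Sketch: pair each cluster's size with the number of foreign HMMs
# hitting it under the e-value cutoff.
def size_vs_hits(similarities_path, clusters_path, cutoff):
    # Map each cluster's first HMM name to the cluster size.
    sizes = {}
    with open(clusters_path) as fh:
        for line in fh:
            members = line.rstrip("\n").split("\t")
            if members and members[0]:
                sizes[members[0]] = len(members)
    points = []
    with open(similarities_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            name = fields[0]
            hits = sum(1 for e in fields[1:] if float(e) < cutoff)
            points.append((sizes.get(name), hits))
    return points
```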
The script colour_graph.pl can generate a DOT graph from a network file. If supplied the cluster file as well, each node in the graph will be assigned a colour depending on its cluster. The cluster colours will be chosen randomly.
./colour_graph.pl network clusters
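The general idea of colour_graph.pl can be sketched like this; again, this is an illustration rather than the script's actual implementation, and the function name colour_graph is invented. Each cluster is assigned a random colour, and the distances are carried on the edges as lengths.

```python
# Sketch: emit a DOT graph from the network edge list, filling each
# node with a colour chosen at random per cluster.
import random

def colour_graph(network_path, clusters_path, out):
    # Assign one random colour to every member of each cluster.
    colour_of = {}
    with open(clusters_path) as fh:
        for line in fh:
            colour = "#%06x" % random.randrange(0x1000000)
            for node in line.rstrip("\n").split("\t"):
                colour_of[node] = colour
    out.write("graph hmms {\n")
    for node, colour in colour_of.items():
        out.write('  "%s" [style=filled, fillcolor="%s"];\n' % (node, colour))
    with open(network_path) as fh:
        for line in fh:
            a, b, dist = line.rstrip("\n").split("\t")
            out.write('  "%s" -- "%s" [len=%s];\n' % (a, b, dist))
    out.write("}\n")
```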
For processing large graphs, the sfdp(1) program is relatively quick. You may wish to scale the length of the edges to avoid overlaps; do this by applying a multiplier to "$3" in the scripts above.