To download, use "git clone http://www.sbg.bio.ic.ac.uk/~sm2708/~sm2708/software/git/benchmarking.git/"
The purpose of this set of tools is the reliable benchmarking of protein structure prediction methods. This is done by taking a set of sequences from the PDB, building an HMM for each input and then clustering the PDB files into sets that are "unreachable" via an HMM-HMM search. These clusters may then be used as training, test and validation sets.
There are several tools included in this benchmarking suite for each step of the workflow. In short, it is necessary to generate an HMM for each input sequence, generate a network of e-values between each HMM and then cluster the HMMs. Also included are some diagnostic tools for evaluating how effective the clustering method is at partitioning the network of HMM-HMM matches.
The following steps will show how to generate the clusters.
- Generate HMMs
The script generate_hmms.pl will, unsurprisingly, generate an HMM for each input sequence. See the documentation for generate_hmms.pl for a full list of options. Since you'll probably be running lots of jobs in parallel, you'll probably want to use the --queue option, which allows multiple processes to safely read from a single file containing a list of PDB codes.
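The queue mechanism can be sketched as follows. This is purely an illustration of the idea, not the actual implementation inside generate_hmms.pl: several worker processes take exclusive turns popping the first entry off a shared queue file. The function name pop_queue_entry is invented for this sketch.

```python
# Illustrative sketch only: shows how several workers can safely pop one
# entry at a time from a shared queue file using an exclusive file lock.
import fcntl

def pop_queue_entry(path):
    """Lock the queue file, remove its first line, and return it (or None)."""
    with open(path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # exclusive lock: one worker at a time
        lines = fh.read().splitlines()
        if not lines:
            return None                  # queue exhausted
        head, rest = lines[0], lines[1:]
        fh.seek(0)
        fh.truncate()
        fh.write("\n".join(rest) + ("\n" if rest else ""))
        return head                      # lock released when the file closes
```

Each worker simply calls pop_queue_entry in a loop until it returns None, so no PDB code is processed twice.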
Sample invocation, assuming that hhblits is installed in $hh_suite/bin and psipred is installed at $psipred:

./generate_hmms.pl --prefix $hh_suite \
    --psipred-bin $psipred/bin \
    --psipred-data $psipred/data \
    --threads 4 --hmm-out hmms \
    --sequence-db nr20_12Aug11 \
    --queue pdb_queue
Assuming that this command runs successfully, the directory hmms should now be filled with .a3m, .hhm and .hhr files. The A3M files are sequence alignments, HHM files are representations of the HMM generated and HHR files are reports from hhblits detailing the process.
- Generate network
After the HMMs have been generated, a network detailing HMM matches must be generated. Each node in this network will be an HMM and each edge will connect to each other HMM that can be reached via an hhsearch against the database of all HMMs. The distance assigned to each edge will be the e-value, as reported by hhsearch.
- 1. Create hhsearch library
An hhsearch library simply consists of several HMMs concatenated together. We shall be inventive and call our library hmms.lib
cat hmms/*.hhm > hmms.lib
- 2. Create list of HMMs
To actually generate the network of HMMs, we need a list of all HMMs to scan against hmms.lib. You may need to use absolute paths in this file, depending on your setup.
ls hmms/*.hhm > hmm_list
- 3. Run generate_network.pl
The following commands will fill the directory reports with reports from hhsearch, and output the network information into the file network.
cp hmm_list hmm_queue

./generate_network.pl --prefix $hh_suite \
    --threads 1 \
    --repdir reports \
    --queue hmm_queue \
    --db hmms.lib \
    --output network
Notice that we first made a copy of hmm_list. This is because generate_network.pl will destroy the file passed as the --queue option.

If generate_network.pl is being run in parallel, each instance will only process a subset of HMMs. In this case, you should add the --noprocess flag, which will cause generate_network.pl to only generate the reports. Then, in a single process, re-run generate_network.pl without the --noprocess flag. Unless the --clobber flag is specified, generate_network.pl will not overwrite the reports, so it should take very little time to generate the output file.
The output file will have the following (tab-separated) format:
HMMA	HMMB	distance
HMMA	HMMC	distance
...
The first two records specify two nodes in the network of HMMs, and the third record gives the distance between the two nodes.
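Given that layout, reading the network into memory is straightforward. A minimal sketch in Python, assuming the tab-separated three-column format described above (the function name read_network is invented here):

```python
# Sketch: read the tab-separated network file (HMM-A, HMM-B, distance)
# into an adjacency map keyed by node name.
from collections import defaultdict

def read_network(path):
    edges = defaultdict(dict)
    with open(path) as fh:
        for line in fh:
            a, b, dist = line.rstrip("\n").split("\t")
            edges[a][b] = float(dist)   # distance is the hhsearch e-value
    return edges
```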
- Cluster network
The network can be clustered using any standard method that can accept a sparse distance matrix. I have chosen to use the tool mcl for Markov clustering. Markov clustering is able to cope well with sparse distance matrices, unlike methods such as k-means or hierarchical clustering.
mcl network --abc -o clusters
The --abc option to mcl tells it that the input is in "A B C" tab-separated format. This is the same format as produced by generate_network.pl.
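mcl writes one cluster per line, with the member labels tab-separated. A minimal sketch for reading that output (the function name read_clusters is invented here):

```python
# Sketch: parse mcl's label output, which places one cluster per line
# with tab-separated member labels, into a list of sets.
def read_clusters(path):
    with open(path) as fh:
        return [set(line.rstrip("\n").split("\t"))
                for line in fh if line.strip()]
```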
Once the network has been clustered, the effectiveness of the clustering must be evaluated. We wish to be sure that each cluster is well isolated from every other cluster. The tool provided to do this is check_clusters.pl. For details, including file formats, see the documentation for check_clusters.pl. A brief overview is given here.
The check_clusters.pl program performs an all-versus-all search. It takes as input the file containing the cluster definitions (in our case, as produced by mcl). The file is processed cluster-by-cluster: for each cluster, an HMM library is produced consisting of all HMMs in that cluster. Then, all HMMs not in that cluster are searched against that database and their e-values recorded.
This quickly becomes infeasible because it results in an excessive number of searches. However, a good estimate of the overall partitioning can be obtained by random sampling. Here, the GNU tool shuf is used to select a random subset of clusters. The checks may then be run in parallel.
shuf clusters | head -n 100 > clusters_short

cp clusters_short clusters_short_q

./check_clusters.pl --hmm hmms --clusters clusters_short \
    --queue clusters_short_q --out cluster_similarities
The file cluster_similarities will now contain a line for each cluster. Each line will begin with the name of the first HMM in that cluster, and the remaining tab-separated records will contain the e-value from a search against that cluster.
From the file cluster_similarities, many things may be done. For example, to generate a long list of e-values that can be plotted into a histogram by R, use the following perl one-liner:
perl -F'\t' -na -E 'shift @F; say join("\n", @F)' cluster_similarities
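For those who prefer Python, here is an equivalent sketch of the same flattening (the function name flatten_evalues is invented here):

```python
# Sketch equivalent to the Perl one-liner above: drop the leading HMM
# name from each line of cluster_similarities and print every remaining
# e-value on its own line, ready for plotting as a histogram.
import sys

def flatten_evalues(path, out=sys.stdout):
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            for evalue in fields[1:]:     # fields[0] is the HMM name
                print(evalue, file=out)
```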
- Cluster Size Versus Hits
The tool size_vs_hits.pl is supplied to produce a scatter plot with the size of a cluster against the number of foreign (from outside the cluster) HMMs that can reach the cluster with a hit under a specified e-value cutoff.
./size_vs_hits.pl cluster_similarities clusters 0.01
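The computation that size_vs_hits.pl performs can be sketched as follows, under the file formats described above: the first HMM named on a line of cluster_similarities identifies its cluster, and a foreign hit is any e-value below the cutoff. This is an illustration, not the script itself, and the function name size_vs_hits is invented here.

```python
# Sketch: pair each cluster's size with the number of foreign HMMs
# hitting it under the e-value cutoff.
def size_vs_hits(similarities_path, clusters_path, cutoff):
    # Map each cluster's first HMM name to the cluster size.
    sizes = {}
    with open(clusters_path) as fh:
        for line in fh:
            members = line.rstrip("\n").split("\t")
            if members and members[0]:
                sizes[members[0]] = len(members)
    points = []
    with open(similarities_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            name = fields[0]
            hits = sum(1 for e in fields[1:] if float(e) < cutoff)
            points.append((sizes.get(name), hits))
    return points
```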
The script colour_graph.pl can generate a DOT graph from a network file. If supplied the cluster file as well, each node in the graph will be assigned a colour depending on its cluster. The cluster colours will be chosen randomly.
./colour_graph.pl network clusters
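The general idea of colour_graph.pl can be sketched like this; again, this is an illustration rather than the script's actual implementation, and the function name colour_graph is invented. Each cluster is assigned a random colour, and the distances are carried on the edges as lengths.

```python
# Sketch: emit a DOT graph from the network edge list, filling each
# node with a colour chosen at random per cluster.
import random

def colour_graph(network_path, clusters_path, out):
    # Assign one random colour to every member of each cluster.
    colour_of = {}
    with open(clusters_path) as fh:
        for line in fh:
            colour = "#%06x" % random.randrange(0x1000000)
            for node in line.rstrip("\n").split("\t"):
                colour_of[node] = colour
    out.write("graph hmms {\n")
    for node, colour in colour_of.items():
        out.write('  "%s" [style=filled, fillcolor="%s"];\n' % (node, colour))
    with open(network_path) as fh:
        for line in fh:
            a, b, dist = line.rstrip("\n").split("\t")
            out.write('  "%s" -- "%s" [len=%s];\n' % (a, b, dist))
    out.write("}\n")
```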
For processing large graphs, the sfdp(1) program is relatively quick. You may wish to scale the length of the edges to avoid overlaps; do this by applying a multiplier to "$3" in the scripts above.