3D-PSSM Help Page

1. Introduction

The 3D-PSSM server is designed to take a PROTEIN sequence of interest to you and attempt to predict its 3-dimensional structure and its probable function. We have a library of known protein structures onto each of which your sequence is "threaded" and scored for compatibility. We use a variety of scoring components: 1D-PSSMs (sequence profiles built from relatively close homologues), 3D-PSSMs (more general profiles containing more remote homologues - see Methods), matching of secondary structure elements, and propensities of the residues in your query sequence to occupy varying levels of solvent accessibility.

2. Submitting your query

If you select "Recognise a Fold" from the home page menu, you are presented with a submission form. You need to:

Enter your E-mail address (PLEASE MAKE SURE IT IS CORRECT)
Type in an optional description of your query sequence
Paste, type in, or upload your query sequence
Click on the "Submit" button at the bottom of the page

Your query sequence must be LESS THAN 800 RESIDUES. I am sorry for this limitation, but as the time taken for a 3D-PSSM job increases rapidly with query sequence length, this is currently necessary to allow reasonably equal use of the server by each user. As there are almost no domains in our fold library longer than 800 residues, a query sequence longer than 800 residues EITHER does not constitute a single domain that could be found in our library (hence making the search futile) OR contains more than one domain. If you have a long query sequence and you are not certain whether it is a single domain, we advise splitting it into smaller chunks (using your best guess as to the location of domain boundaries). A mechanism to automatically determine domain boundaries is currently under investigation.

The format of your query sequence should adhere to one of the following:

Single Sequence

>optional description
SETVPPAPAASAAPEKPLAGKKAKKPAKAAAASKKKPAGPS
VSELIVQAASSS
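
As a rough illustration, the length limit described above can be checked locally before you submit. The sketch below is not part of the server. The function names are made up for this example, and the cut positions passed to `split_at` stand in for your own best guess at domain boundaries.

```python
MAX_RESIDUES = 800  # limit stated on this page: queries must be shorter than this


def read_single_fasta(text):
    """Return (description, sequence) from a single-sequence FASTA string."""
    description = ""
    parts = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            description = line[1:].strip()
        else:
            parts.append(line.upper())
    return description, "".join(parts)


def split_at(sequence, boundaries):
    """Split a sequence at guessed domain boundaries (0-based cut positions)."""
    cuts = [0] + sorted(boundaries) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


example = """>optional description
SETVPPAPAASAAPEKPLAGKKAKKPAKAAAASKKKPAGPS
VSELIVQAASSS
"""
desc, seq = read_single_fasta(example)
within_limit = len(seq) < MAX_RESIDUES
```

If `within_limit` is false, each chunk returned by `split_at` can be submitted as a separate query.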

Pre-made 3D-PSSM Probe

>test_seq
>test_DSC_SS
*
DC 0.882
WC 0.673
SC 0.671
FC 0.592
LH 0.877
GH 0.877
EH 0.877
FH 0.877
LH 0.877
EH 0.905
EH 0.893
VH 0.745
HH 0.621
KH 0.422
HC 0.457
SC 0.457
TC 0.457
VC 0.700
IC 0.890
GC 0.821
*
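
The probe file shown above can be read with a few lines of code. This sketch assumes, from the example alone, that each data line holds a one-letter residue code immediately followed by a secondary structure state (C, H or E) and then a confidence value, and that lines containing only `*` delimit the data. The page does not document the format beyond the example, so treat this reading as a guess.

```python
def parse_probe(text):
    """Parse a probe file into (headers, rows).

    Assumed layout (inferred from the example, not from a specification):
    '>' lines are headers, '*' lines are delimiters, and every other line
    is '<residue><state> <confidence>'.
    """
    headers = []
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "*":
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            continue
        code, confidence = line.split()
        rows.append((code[0], code[1], float(confidence)))
    return headers, rows


def probe_sequence(rows):
    """Recover the plain amino acid sequence from parsed rows."""
    return "".join(residue for residue, _state, _conf in rows)
```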

CLUSTAL Multiple Sequence Alignment

Note that this is in the beta-testing phase. If you have repeated problems with submitting CLUSTAL alignments, let me know (E-mail Lawrence Kelley). Also note that the first sequence in the CLUSTAL alignment is the one treated as the query sequence (in the case below this is sequence AAF01449).


CLUSTAL W (1.8) multiple sequence alignment


AAF01449            -----------------------------------MEVAYRFSQPHLEWNSYGHWRSSIA
P.aeruginosa        -------------------------------------------MPSNALWLRADQLSSVS
S25660              MRLLRFCCVLDHLICFTSPVNTFLRYNAFTLCNGEFGMSHPALTQLRALRYCK-EIPALD
S.paratyphi         -------------------------------------MSHPALTQLRALRYFD-AIPALE
S.enteritdis        -------------------------------------MSHPALTQLRALRYFD-AIPALE
S.typhi             -------------------------------------MSHPALTQLRALRYFD-AIPALE
S.typhimurium       -------------------------------------MSHPALTXXXALRYFD-AIPALE
K.pneumoniae        -------------------------------------MSHPALTRLRALRYFA-VMPSLP
Y.pestis            -------------------------------------MFIGDASILKPIQWCATEHPELP
AAF41147            ---------------------------------------MEHLFGKWLPDLPAAISDGIS
CAB84215            ---------------------------------------MEHLFEEWLPDLPADVSDGIG
N.gonorrheae        ---------------------------------------MEHLFGKWLPDLPAPVSDGID
S.putrefaciens      ----------------------------------MNVTSLSFPYGESIQWFCADNTKNLP
                                                                              : 

AAF01449            LAGFGRPWVYARSVISHCDVEGSDSALLQLGNIPLGSLLFGEN-------------PYKR
P.aeruginosa        LHGHDRPWVFARSVAARSALEGSGFDLALLGTRSLGELLFSDS-------------AFER
S25660              LCADGEPWLAGRTVVPVSTLSGPELALQKLGKTPLGRYLFTSS-------------TLTR
S.paratyphi         LCADGEPWLAGRTVVPESTLCGPEQVLQHLGKTPLGRYLFTSS-------------TLTR
S.enteritdis        ------------------------------------------------------------
S.typhi             LCADGEPWLAGRTVVPESTLCGPEQVLQHLGKTPLGRYLFTSS-------------TLTR
S.typhimurium       LCADGEPWLAGRTVVPESTLCEPEQVLQHLGKTPLGRYLFTSS-------------TLTR
K.pneumoniae        LNADGEPWLAGRTVARESTLCGPELALQQLGQTPLGRYLFTSS-------------TLTR
Y.pestis            LFGDNVPWLLGRTVIPEETLSGPDRALVDLGTLPLGRYLFSGD-------------ALTR
AAF41147            LKLDRIPVVEARSEC--RIGSAFWQNILDCGTRPLGERLFQAD------------LEGAR
CAB84215            LKLDGIPVVAARSEC--RIGSAFWQNILDCGTRPLGERLFQAD------------LEGAR
N.gonorrheae        LKLDGTAVVQARSAC--SVGSAFWQNILDCGTRPLGERLFQAD------------LEGAR
S.putrefaciens      LCLDDVPWVFARTLIPQSLLSTRQADFLGLGTRPLGELLFSQDSFVPGRIEIARFATNSR
                                                                                

AAF01449            SEIEVCRYPDACNASSRPA
P.aeruginosa        GPIEVCRYPAAGLPAEVRA
S25660              DFIEIGRD-----------
S.paratyphi         DFIEIGRD-----------
S.enteritdis        -------------------
S.typhi             DFIEIGRD-----------
S.typhimurium       DFIEIGRD-----------
K.pneumoniae        DFIEIGRD-----------
Y.pestis            DYIHVGRQ-----------
AAF41147            SAFEFAVA------GEGCG
CAB84215            SAFEFAVF------GEGCG
N.gonorrheae        SAFEFAVS------SEGCG
S.putrefaciens      LAHLAQSL------AQNVE
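
Because the first sequence in the alignment is taken as the query, it can be worth checking which sequence that is, and how long it is without gaps, before submitting. The helper below is a hypothetical sketch, not server code. It assumes the usual CLUSTAL W layout in which each row is a name followed by whitespace and a sequence segment, and conservation lines begin with whitespace.

```python
def clustal_query(text):
    """Return (name, ungapped sequence) of the first sequence in a CLUSTAL alignment."""
    first_name = None
    chunks = []
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and conservation lines.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        if first_name is None:
            first_name = name
        if name == first_name:
            chunks.append(segment)
    return first_name, "".join(chunks).replace("-", "")
```

For the alignment above this would report AAF01449 as the query, and the ungapped length can then be compared with the 800-residue limit.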

(Further help: see section 8, Advanced Options.)

3. Understanding your results

Please note there is another help file about INTERPRETING YOUR RESULTS which may be of use.

E-mail

The results of scanning your query sequence against our fold library will be returned to you by E-mail, usually within 10-20 minutes.

This mail contains the scores and alignments for the 20 most probable matches in our library. In addition, there is a PSI-Pred secondary structure prediction, and MOST IMPORTANTLY two hyperlinks. One is a link to a more interactive and more informative HTML version of your results, and the other is a link to a file containing these HTML results in a tarred, gzipped format.

As an example of such HTML results, look here for some novel assignments made to the Mycoplasma genitalium genome.

HTML - Enhanced Presentation of Results

The HTML version of the results is separated into 2 frames. The top frame contains information on the matches found, the confidence with which the match is made (E-value), a button to view the multiple sequence alignment for your query sequence (resulting from a scan of the protein sequence database using PSI-Blast), and links to specific alignments and rudimentary models.

The second frame, in the bottom of the window, is the secondary structure prediction (by PSI-Pred), colour-coded to illustrate the confidence with which each residue's secondary structure class is predicted.

4. Examples

View some pre-submitted results

5. Citing Us

Please cite us when using 3D-PSSM results in your work.

(1)CAFASP-1: Critical Assessment of Fully Automated Structure Prediction Methods
Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., Maccallum, R.M., Pawowski, K., Rost, B., Rychlewski, L. and Sternberg, M.J.
Proteins: Structure, Function and Genetics, Suppl 3:209-217 (1999)

(2) Recognition of Remote Protein Homologies Using Three-Dimensional Information to Generate a Position Specific Scoring Matrix in the program 3D-PSSM
L.A. Kelley, R. Maccallum and M.J.E. Sternberg
RECOMB 99, Proceedings of the Third Annual Conference on Computational Molecular Biology
Pages 218-225
Editors: Sorin Istrail, Pavel Pevzner, Michael Waterman
Publisher: The Association for Computing Machinery, New York, New York 10036
April 1999

6. Methods
1D-profile generation

i) Start with the sequence of the domain from the master protein (A0) of known structure in a superfamily.

ii) Search this master sequence against NRPROT using 20 iterations of PSI-BLAST with an expectation for including a sequence in the iteration (H) of 0.0005 and a theoretical expectation value (ET) of a hit < 0.0005. Note: PSI-Blast may find and subsequently lose homologous sequences during the iteration process. For this reason, all intermediate sequences, that is all sequences found between the first and last iteration, are stored and recombined at the end of the scan. In addition, we protect against "drifting" of the PSI-Blast PSSM by monitoring the loss of closely homologous sequences from one iteration to the next. Parameters are dynamically altered if such drifting is detected, and this prevents PSI-Blast from iteratively incorporating more and more erroneous sequences.
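
The bookkeeping in step ii can be pictured as follows. This is only a sketch under stated assumptions: the page does not say how close homologues are identified, what loss triggers a parameter change, or which parameters are altered, so the functions below simply pool identifiers across iterations and report what fraction of a caller-supplied set was lost between two iterations. All names are hypothetical.

```python
def pool_iterations(hits_per_iteration):
    """Union of sequence identifiers found in any iteration, in first-seen order."""
    pooled = []
    seen = set()
    for hits in hits_per_iteration:
        for seq_id in hits:
            if seq_id not in seen:
                seen.add(seq_id)
                pooled.append(seq_id)
    return pooled


def lost_fraction(previous, current):
    """Fraction of the previous iteration's identifiers missing from the current one.

    The caller decides which identifiers count as close homologues.
    """
    previous = set(previous)
    if not previous:
        return 0.0
    return len(previous - set(current)) / len(previous)
```

A rising `lost_fraction` from one iteration to the next would be one possible signal of the drifting described above.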

iii) The alignment generated by PSI-Blast is used explicitly. In cases where many (>200) sequences are retrieved, a variety of criteria are used to reduce the alignment to something more manageable. Sequences are removed that: (a) contain ‘X’ characters, (b) overlap less than 75% of the query, (c) are >80% identical to other sequences in the alignment.
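
The three removal criteria in step iii can be sketched as a simple filter. The definitions of overlap and identity used here are assumptions, since the page does not spell them out: overlap is the fraction of query residue columns in which the sequence also has a residue, and identity is computed over columns where both sequences have a residue. Redundant sequences are dropped greedily in input order, which is also an assumption.

```python
def overlap(query_row, row):
    """Fraction of query residue columns that are also occupied in row."""
    query_columns = [i for i, c in enumerate(query_row) if c != "-"]
    covered = sum(1 for i in query_columns if row[i] != "-")
    return covered / len(query_columns)


def identity(a, b):
    """Fraction of identical residues over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def reduce_alignment(query_row, rows, min_overlap=0.75, max_identity=0.80):
    """Apply criteria (a), (b) and (c) to equal-length gapped alignment rows."""
    kept = []
    for row in rows:
        if "X" in row.upper():                       # (a) contains 'X'
            continue
        if overlap(query_row, row) < min_overlap:    # (b) overlaps < 75% of query
            continue
        if any(identity(row, k) > max_identity for k in kept):  # (c) > 80% identical
            continue
        kept.append(row)
    return kept
```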

iv) Using this multiple alignment, generate a 1D-PSSM using the method described for PSI-BLAST.

v) Repeat for all master proteins in our fold library.

3D-profile generation

i) Perform a three-dimensional structural superposition using the SAP program (Orengo et al., 1992; Taylor and Orengo, 1989) between the master structure A0 and all other proteins within the same superfamily. Only structures that superpose with a weighted root mean square deviation < 6.0Å to the master structure are considered. Initially, the closest fitting (lowest RMS) structure is added to the alignment (A0 and B0 for example). A search is performed for the next candidate alignment. The alignment with the lowest RMS to EITHER A0 OR B0 is then used. Similarly, the resultant multiple structural alignment is built in a hierarchical fashion, progressively adding alignments that are closest to an existing member of the alignment. This ensures that at all times we are augmenting the alignment with the most confident available structural alignment. Only residues with a SAP equivalence score > 0 are considered in the alignment. The program SAP was obtained from http://mathbio.nimr.mrc.ac.uk/.
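
The order in which structures join the multiple alignment in step i can be sketched as a greedy procedure. This is an illustration only: it assumes the pairwise weighted RMS deviations have already been computed (for example by SAP) and are supplied in a dictionary, and the function and argument names are invented for this example.

```python
def build_order(master, names, rms, cutoff=6.0):
    """Return the order in which structures are added to the alignment.

    rms maps frozenset({a, b}) to the weighted RMS deviation between a and b.
    Each result entry is (added structure, existing member it is closest to, RMS).
    """
    def deviation(a, b):
        return rms.get(frozenset((a, b)), float("inf"))

    # Only structures within the cutoff of the master are considered at all.
    candidates = {n for n in names if n != master and deviation(master, n) < cutoff}
    aligned = [master]
    order = []
    while candidates:
        # Pick the candidate closest to ANY structure already in the alignment.
        value, candidate, partner = min(
            (deviation(member, c), c, member) for c in candidates for member in aligned
        )
        order.append((candidate, partner, value))
        aligned.append(candidate)
        candidates.remove(candidate)
    return order
```

With a master A0, a structure C0 that is far from A0 but close to B0 is added through B0 once B0 has joined the alignment.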

ii) Use the residue equivalences from the structural alignment to augment the 1D-profile of A0 with 1D-profiles from B0, C0... . Note that this is at a residue by residue level. This yields a profile with sequences (A0, A1, A2,..AnA, B0, B1, B2,..BnB, C0, C1,C2,..CnC).
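
The residue-by-residue merging in step ii might look like the sketch below. It assumes each profile is held as a list of alignment columns (one list of residues per position of the master), and that the structural alignment is supplied as (position in A0, position in B0, SAP equivalence score) triples. These data structures are assumptions made for the example, not the server's internal representation.

```python
def merge_columns(columns_a, columns_b, equivalences):
    """Augment the columns of A0's alignment with residues from B0's alignment.

    Only equivalences with a SAP score > 0 are used, as stated in step i.
    Positions of A0 without a usable equivalence keep only their own residues.
    """
    merged = [list(column) for column in columns_a]  # copy, leave the input intact
    for pos_a, pos_b, sap_score in equivalences:
        if sap_score > 0:
            merged[pos_a].extend(columns_b[pos_b])
    return merged
```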

iii) Repeat for all master proteins in our fold library.

Secondary Structure matching

For each library entry, a three-state secondary structure assignment (Coil, Helix, Strand) is made on a per-residue basis using STRIDE (Frishman & Argos, 1995). The three states are formed by the following grouping: (3₁₀ helices with alpha helices), (bridges with beta-strands), (turns and pi-helices with coil). Query sequences have their secondary structures predicted by PSI-Pred (Q3 77%; Jones, 1999). A simple scoring function for matching secondary structure types between two residues is implemented: matching identical secondary structure types gives a score of +1, and any other pairing gives -1.
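
The grouping and the scoring function can be written down directly. The sketch assumes the standard STRIDE one-letter codes (H alpha helix, G 3-10 helix, I pi-helix, E extended strand, B or b isolated bridge, T turn, C coil); the function names are made up for this example.

```python
# Reduction of STRIDE codes to the three states used here.
STRIDE_TO_THREE_STATE = {
    "H": "H", "G": "H",            # alpha and 3-10 helices -> Helix
    "E": "E", "B": "E", "b": "E",  # strands and bridges -> Strand
    "T": "C", "I": "C", "C": "C",  # turns, pi-helices and coil -> Coil
}


def three_state(stride_string):
    """Reduce a string of STRIDE codes to three states (H, E, C)."""
    return "".join(STRIDE_TO_THREE_STATE[code] for code in stride_string)


def ss_score(state_a, state_b):
    """+1 for identical secondary structure types, -1 otherwise."""
    return 1 if state_a == state_b else -1
```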

Solvation Potential

Solvation potential is modelled using the approach of Jones et al. (1992). The potential is a term for scoring the preference of an amino acid to occupy a specific structure position with a given exposure. This pseudopotential is derived from our set of representative protein structures by relating the frequency with which an amino acid type occurs at a specific degree of residue burial to the frequency with which all other amino acid types occur at that degree of burial. The degree of burial of a residue is defined as the ratio between its solvent accessible surface area (as calculated by DSSP; Kabsch & Sander, 1983) and its overall surface area. 21 bins in 5% accessibility increments are used, ranging from 0% (buried) to 100% (exposed). The coarseness of this potential means cross-validation is unnecessary.
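
A minimal sketch of the binning and of a frequency-based pseudopotential follows. Several details are assumptions because the page does not give them: values are assigned to the nearest 5% bin, the score is an energy-like negative log ratio (lower means more favourable), and a pseudocount is added to avoid taking the logarithm of zero. The exact normalisation used by the server may differ.

```python
import math
from collections import Counter

N_BINS = 21  # 0%, 5%, ..., 100%


def burial_bin(accessible_area, total_area):
    """Index (0..20) of the nearest 5% relative-accessibility bin."""
    percent = 100.0 * accessible_area / total_area
    return min(N_BINS - 1, max(0, int(percent / 5.0 + 0.5)))


def solvation_table(observations, pseudocount=1.0):
    """Build {(amino_acid, bin): score} from (amino_acid, bin) observations.

    score = -ln( P(amino acid | bin) / P(amino acid) ), an assumed form.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    bin_counts = Counter(b for _, b in observations)
    total = len(observations)
    table = {}
    for aa in aa_counts:
        for b in bin_counts:
            p_aa_given_bin = (pair_counts[(aa, b)] + pseudocount) / (
                bin_counts[b] + pseudocount * len(aa_counts)
            )
            p_aa = aa_counts[aa] / total
            table[(aa, b)] = -math.log(p_aa_given_bin / p_aa)
    return table
```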

Bi-directional scoring

It is known that matching a query sequence to a template PSSM is not the same as matching a template sequence to a query PSSM. Often homologies can be detected in one direction and not in the other. To account for this, each query sequence is scanned against the sequence library using PSI-Blast, and a 1D-profile is generated for it in exactly the same way as the 1D-profiles are generated for the library sequences.

Searching the probe against the 3D-PSSM library

For each probe, the 3D-PSSM library is scanned using the global dynamic programming algorithm that was developed for our fold recognition algorithm FOLDFIT (Russell et al., 1998). The score for a match between a residue in the probe and a residue in the library sequence is calculated as the sum of the secondary structure, solvation potential and PSSM scores. Three passes of dynamic programming are performed for each query-library sequence match. Each pass differs in the PSSM used for the scoring, with secondary structure and solvation being held constant.

Pass 1: Library sequence is matched to the query PSSM.

Pass 2: Query sequence is matched to the library 1D-PSSM.

Pass 3: Query sequence is matched to the library 3D-PSSM.

The final score is simply the maximum of the scores from the three passes. An affine gap penalty of 10 to open and 1 per gap extension is used based on preliminary trials. End gaps were also penalised.
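
The three-pass scheme can be sketched with a standard global alignment using affine gaps. This is not the FOLDFIT code. It is a textbook three-state dynamic programme, and it assumes that a gap of length k costs 10 + (k - 1) x 1, which is one common reading of "10 to open and 1 per gap extension". The callables supplying the PSSM, secondary structure and solvation scores are placeholders.

```python
NEG_INF = float("-inf")


def global_affine_score(n, m, match_score, gap_open=10.0, gap_extend=1.0):
    """Best global alignment score (end gaps penalised) for lengths n and m.

    match_score(i, j) scores pairing query residue i with library residue j.
    """
    # M: last column is a match; X: query residue against a gap;
    # Y: library residue against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = best_prev + match_score(i - 1, j - 1)
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


def three_pass_score(n, m, pssm_passes, ss_score, solvation_score):
    """Maximum alignment score over the passes.

    pssm_passes holds one (i, j) -> score callable per pass; the secondary
    structure and solvation terms are the same in every pass.
    """
    return max(
        global_affine_score(
            n, m,
            lambda i, j, pssm=pssm: pssm(i, j) + ss_score(i, j) + solvation_score(i, j),
        )
        for pssm in pssm_passes
    )
```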

The significance of a match is evaluated by fitting a linear relationship between log(number of hits up to a score) and log(total score). Only the top end of the distribution is used, and the possibility that the correct hit contributes to the tail of the distribution is accounted for by removing the top-scoring hit and all consecutive entries belonging to the same superfamily. The top end of the distribution is defined using a penalty function algorithm as described in Kelley et al. (1996). The probability of obtaining a match with that score by chance is converted to a theoretical error rate per query (ET).
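
The log-log fit can be illustrated with an ordinary least-squares line. This sketch makes several assumptions: "number of hits up to a score" is read as the rank of a hit when scores are sorted from best to worst, scores are positive, the tail is simply the top `n_tail` scores rather than being chosen by the penalty function method, and the removal of the top hit and its superfamily is left to the caller. The function names are hypothetical.

```python
import math


def fit_loglog(scores, n_tail):
    """Fit ln(rank) = intercept + slope * ln(score) over the top n_tail scores."""
    top = sorted(scores, reverse=True)[:n_tail]
    xs = [math.log(s) for s in top]
    ys = [math.log(rank) for rank in range(1, len(top) + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_hits(score, slope, intercept):
    """Number of hits expected at or above this score under the fitted line."""
    return math.exp(intercept + slope * math.log(score))
```

A query score well above the background distribution gives an expected number of chance hits far below 1.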

References

Frishman, D. & Argos, P. (1995). Knowledge-based secondary structure assignment. Proteins: Structure, Function, and Genetics 23, 566-579.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to fold recognition. Nature 358, 86-89.

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.

Kelley, L. A., Gardner, S. P. & Sutcliffe, M. J. (1996). An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 9, 1063-1065.

Orengo, C. A., Brown, N. P. & Taylor, W. R. (1992). Fast structure alignment for protein databank searching. Proteins: Structure, Function, and Genetics 14, 139-167.

Russell, R. B., Saqi, M. A. S., Bates, P. A., Sayle, R. A. & Sternberg, M. J. E. (1998). Recognition of analogous and homologous folds: assessment of prediction success and associated alignment accuracy using empirical matrices. Protein Eng. 11, 1-9.

Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1-22.

7. Frequently Asked Questions

How can I view my models (on-line and off-line)?
For Unix and Linux users we recommend downloading Rasmol (http://www.umass.edu/microbio/rasmol/index2.htm). You can then set up the MIME type in your Netscape browser so that the command "rasmol -script %s" is executed either on filenames with a .rasmol suffix or when the header "application/x-rasmol" is received.

For PC users there are several options:

1. Download CHIME from http://www.mdlchime.com/chime/. You have to register, but it's free. This is basically a Rasmol plug-in for your browser, and it renders the molecule within the browser. The only disadvantage is that it's a bit hard to manipulate the model, as there is no command-line language as there is with ordinary Rasmol. However, you can rotate, colour, select atoms, etc.
2. Get RasTop (http://www.bernstein-plus-sons.com/software/RasTop_1.3.1/RasTop.zip). I found this on the net, downloaded and installed it, and it worked a treat. It is a GUI front end to Rasmol. You can either right-click on the model gif image, choose "Save Target As..", and load the saved file into RasTop, or you can download all your results from the link on the results page, in which case all the models will be present where you uncompress them, in somedirectory/models/bignumber.model. Then: 1) Go to File->Load Rasmol Script. 2) Find the directory where the blahblah.rasmol files are (in the models directory off the directory where you downloaded the results). 3) Change the "Files of Type" selection to "All files". 4) Double-click on the .rasmol file you want.
3. Run Rasmol from the command line. This is a hassle and becomes very difficult when you have directory names with spaces in them. However, you don't need to download anything other than Rasmol for the PC this way. Go to Start->Run in Windows. Browse to the location of your Rasmol executable and select it. Then add " -script nameoffile.rasmol" to the command line, so that you get a command line looking something like: "C:\Downloaded Executables\raswin.exe" -script c:\3dpssm\models\e25e806b645f2.jobc1awla_.rasmol
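
If you have Python installed, one way around the problem of spaces in directory names is to pass the command as a list of arguments, so that no manual quoting is needed. This is only a sketch; the paths are the ones from the example above and will differ on your machine, and the helper name is made up.

```python
import subprocess


def rasmol_command(executable, script_path):
    """Build the argument list for 'rasmol -script <file>'."""
    return [executable, "-script", script_path]


command = rasmol_command(
    r"C:\Downloaded Executables\raswin.exe",
    r"c:\3dpssm\models\e25e806b645f2.jobc1awla_.rasmol",
)

# Uncomment to launch Rasmol; each list element is passed as one argument,
# so the space in "Downloaded Executables" needs no extra quoting.
# subprocess.run(command, check=True)
```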

Missing Results
If you have made a submission to the server but have not received any results, this is usually because of an error in the E-mail address supplied with the submission. Please E-mail me (Lawrence Kelley) and I will see if I can find your submission.

Installing the server locally
A mechanism for installing 3D-PSSM on your local intranet, or mirroring the service elsewhere on the internet, is currently under investigation. Please let me know if this would be particularly valuable.

Why are there X's in my query sequence?
X's are placed in your query by the program 'seg', which is by default run at the beginning of the 3D-PSSM job. The purpose of 'seg' is to find low-complexity or coiled-coil regions of your query sequence and mask them out with X's. This can be vital when searching the sequence database with PSI-Blast (as we do). Low-complexity or coiled-coil regions can match a great many sequences in the database, thus causing "matrix drift", which permits large numbers of unrelated sequences to be dragged into the sequence profile of your query.

How can I get rid of the X's in my query sequence?
In the "Advanced Options" on the submission form there is the option: "Filter your query for low-complexity regions?". If you choose "No", this will prevent 'seg' (see above) from being run and hence no X's will be placed in your query. However, THIS MAY CAUSE YOUR JOB NOT TO FINISH OR BE KILLED. If your sequence genuinely contains large amounts of low-complexity or coiled-coil regions, the search of the sequence database may take so long that your job will be killed. Alternatively, the results you gain may be highly misleading, as the sequence profile for your query may have become highly distorted. USE THIS AT YOUR OWN DISCRETION.

8. Advanced Options

Currently there are 3 sections in the advanced options:

Global-local
This option is "No" by default. With this setting, end-gaps are penalised for both the query sequence and library sequence. This is helpful when your query sequence is a single domain, as there is quite a strong correlation between length of domain and structure. This provides us with the so-called "length-effect", whereby high rates of recognition can be achieved with end-gap penalisation when both query and library structure are known to constitute a single domain.

When "Yes" is selected, a slightly different dynamic programming algorithm is used to align your query sequence to our fold library entries. Global-local refers to a mechanism of alignment where end-gaps in the query sequence are not penalised, whereas end-gaps in the library structure are. Our library structures are largely composed of domains, and hence the entire sequence of a library entry should be aligned within the query sequence, whereas the query sequence may contain more than one domain.

If you have a long query (so probability suggests it contains more than one domain) or if you know from some other source that your query is likely to be multi-domain, select "Yes". If, however, you think your query is a single domain, select "No". If in doubt, you can always submit your query twice, once with each option, and see what results you get.
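
The difference between the two settings can be seen in a small dynamic programming sketch. For brevity it uses a single linear gap penalty rather than the affine penalty described under Methods, and a toy match/mismatch score; both are simplifications made for this example, and the function name is invented.

```python
def align_score(query, library, score, gap=10.0, global_local=False):
    """Alignment score with end gaps penalised (default) or global-local.

    In global-local mode, query residues that overhang either end of the
    library entry cost nothing; the library entry must still be aligned in full.
    """
    n, m = len(query), len(library)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        F[0][j] = -gap * j  # unaligned library residues are always penalised
    for i in range(1, n + 1):
        F[i][0] = 0.0 if global_local else -gap * i
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(query[i - 1], library[j - 1]),
                F[i - 1][j] - gap,
                F[i][j - 1] - gap,
            )
    if global_local:
        # Trailing query residues are free: take the best cell in the last column.
        return max(F[i][m] for i in range(n + 1))
    return F[n][m]


def toy_score(a, b):
    return 5.0 if a == b else -5.0
```

For a query such as "GGGGACDEGGGG" and a library entry "ACDE", the default setting is dominated by the end-gap penalties, whereas the global-local setting scores only the embedded match.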

I could automate this last step, but with a consequent doubling (approximately) in time spent processing the sequence. This may be feasible in the future.

Filter your query for low-complexity regions

By default this option is set to "Yes". This means, by default, your sequence is passed through a program called "seg" which searches for regions of your query sequence that are low-complexity or coiled coil regions. This is done because low-complexity and coiled coil sequences can generate vast numbers of spurious hits in PSI-Blast, thus hugely increasing processing time for no real gain, and possibly losing important sequence signals elsewhere in your query. The low-complexity regions are masked out by X's. If you see a row of X's in your results, this probably means seg detected and masked out those regions in your query.
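
To see how much of a sequence has been masked, the runs of X's in the sequence shown in your results can be located with a short script. This is a hypothetical helper and does not run seg itself; note that a genuine unknown residue written as X in your original sequence would be reported as well.

```python
import re


def masked_regions(sequence):
    """Return 1-based inclusive (start, end) positions of each run of X's."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", sequence.upper())]


def masked_fraction(sequence):
    """Fraction of the sequence that is X."""
    if not sequence:
        return 0.0
    return sequence.upper().count("X") / len(sequence)
```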

Seg is not perfect, and sometimes it will mask out regions of a sequence you may know (or think) are not low-complexity or coiled coil. If you find a critical part of your sequence being masked out by X's in this way, you can turn off the pre-filter by selecting "No".

Run only 3 iterations of PSI-Blast

This option permits you to limit the number of iterations of PSI-Blast run on your sequence. Sometimes, due to low-complexity, large helical content, richness in charged residues, or just bad luck, PSI-Blast may drift, causing it to pull in unrelated sequences on further iterations. To limit this problem I've added this option. If you are getting hits to long helical sequences, or if in the multiple sequence alignment you see lots of "heavy chain myosin" or sequences apparently unrelated to your query, try this option by picking "Yes".

9. Version History

Under construction
