3D-PSSM Help Page
The 3D-PSSM server is designed to take a PROTEIN sequence of interest to you and attempt to predict its 3-dimensional structure and its probable function. We have a library of known protein structures onto each of which your sequence is "threaded" and scored for compatibility. We use a variety of scoring components: 1D-PSSMs (sequence profiles built from relatively close homologues), 3D-PSSMs (more general profiles containing more remote homologues - see Methods), matching of secondary structure elements, and propensities of the residues in your query sequence to occupy varying levels of solvent accessibility.
If you select "Recognise a Fold" from the home page menu, you are presented with a submission form. You need to:
Your query sequence must be LESS THAN 800 RESIDUES. I am sorry for this limitation, but as the time taken for a 3D-PSSM job increases rapidly with query sequence length, this is currently necessary to allow reasonably equal use of the server by each user. As there are almost no domains in our fold library longer than 800 residues, a query sequence longer than 800 residues would EITHER not constitute a single domain that could be found in our library (hence making the search futile) OR the query sequence contains more than one domain. If you have a long query sequence, and you are not certain if it is a single domain or not, then splitting your sequence into smaller chunks (using your best guess as to the location of domain boundaries) is advised. A mechanism to automatically determine domain boundaries is currently under investigation.
The format of your query sequence should adhere to one of the following:
>optional description
SETVPPAPAASAAPEKPLAGKKAKKPAKAAAASKKKPAGPS
VSELIVQAASSS
>test_seq >test_DSC_SS * DC 0.882 WC 0.673 SC 0.671 FC 0.592 LH 0.877 GH 0.877 EH 0.877 FH 0.877 LH 0.877 EH 0.905 EH 0.893 VH 0.745 HH 0.621 KH 0.422 HC 0.457 SC 0.457 TC 0.457 VC 0.700 IC 0.890 GC 0.821 *
CLUSTAL W (1.8) multiple sequence alignment AAF01449 -----------------------------------MEVAYRFSQPHLEWNSYGHWRSSIA P.aeruginosa -------------------------------------------MPSNALWLRADQLSSVS S25660 MRLLRFCCVLDHLICFTSPVNTFLRYNAFTLCNGEFGMSHPALTQLRALRYCK-EIPALD S.paratyphi -------------------------------------MSHPALTQLRALRYFD-AIPALE S.enteritdis -------------------------------------MSHPALTQLRALRYFD-AIPALE S.typhi -------------------------------------MSHPALTQLRALRYFD-AIPALE S.typhimurium -------------------------------------MSHPALTXXXALRYFD-AIPALE K.pneumoniae -------------------------------------MSHPALTRLRALRYFA-VMPSLP Y.pestis -------------------------------------MFIGDASILKPIQWCATEHPELP AAF41147 ---------------------------------------MEHLFGKWLPDLPAAISDGIS CAB84215 ---------------------------------------MEHLFEEWLPDLPADVSDGIG N.gonorrheae ---------------------------------------MEHLFGKWLPDLPAPVSDGID S.putrefaciens ----------------------------------MNVTSLSFPYGESIQWFCADNTKNLP : AAF01449 LAGFGRPWVYARSVISHCDVEGSDSALLQLGNIPLGSLLFGEN-------------PYKR P.aeruginosa LHGHDRPWVFARSVAARSALEGSGFDLALLGTRSLGELLFSDS-------------AFER S25660 LCADGEPWLAGRTVVPVSTLSGPELALQKLGKTPLGRYLFTSS-------------TLTR S.paratyphi LCADGEPWLAGRTVVPESTLCGPEQVLQHLGKTPLGRYLFTSS-------------TLTR S.enteritdis ------------------------------------------------------------ S.typhi LCADGEPWLAGRTVVPESTLCGPEQVLQHLGKTPLGRYLFTSS-------------TLTR S.typhimurium LCADGEPWLAGRTVVPESTLCEPEQVLQHLGKTPLGRYLFTSS-------------TLTR K.pneumoniae LNADGEPWLAGRTVARESTLCGPELALQQLGQTPLGRYLFTSS-------------TLTR Y.pestis LFGDNVPWLLGRTVIPEETLSGPDRALVDLGTLPLGRYLFSGD-------------ALTR AAF41147 LKLDRIPVVEARSEC--RIGSAFWQNILDCGTRPLGERLFQAD------------LEGAR CAB84215 LKLDGIPVVAARSEC--RIGSAFWQNILDCGTRPLGERLFQAD------------LEGAR N.gonorrheae LKLDGTAVVQARSAC--SVGSAFWQNILDCGTRPLGERLFQAD------------LEGAR S.putrefaciens LCLDDVPWVFARTLIPQSLLSTRQADFLGLGTRPLGELLFSQDSFVPGRIEIARFATNSR AAF01449 SEIEVCRYPDACNASSRPA P.aeruginosa GPIEVCRYPAAGLPAEVRA S25660 
DFIEIGRD----------- S.paratyphi DFIEIGRD----------- S.enteritdis ------------------- S.typhi DFIEIGRD----------- S.typhimurium DFIEIGRD----------- K.pneumoniae DFIEIGRD----------- Y.pestis DYIHVGRQ----------- AAF41147 SAFEFAVA------GEGCG CAB84215 SAFEFAVF------GEGCG N.gonorrheae SAFEFAVS------SEGCG S.putrefaciens LAHLAQSL------AQNVE
(Further help on Advanced Options)
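If you want to check a query before submitting it, something like the following Python sketch may help. It reads a FASTA-style entry (the first format above) and tests it against the 800-residue limit. The function names and the handling of whitespace are my own assumptions for illustration and are not part of the server.

```python
MAX_RESIDUES = 800  # queries must be shorter than this


def read_fasta_sequence(text):
    """Return the residues of a single FASTA-style entry as one string.

    The '>' description line is optional. Spaces and line breaks inside
    the sequence are ignored.
    """
    residues = []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            continue  # description line, not part of the sequence
        residues.append("".join(line.split()))
    return "".join(residues).upper()


def is_acceptable_length(sequence, limit=MAX_RESIDUES):
    """True when the sequence is non-empty and strictly shorter than the limit."""
    return 0 < len(sequence) < limit


query = """>optional description
SETVPPAPAASAAPEKPLAGKKAKKPAKAAAASKKKPAGPS
VSELIVQAASSS"""
seq = read_fasta_sequence(query)
print(len(seq), is_acceptable_length(seq))
```

A sequence that fails the check would need to be split into smaller chunks at your best guess of the domain boundaries, as described above.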
The results of scanning your query sequence against our fold library will be returned to you by E-mail, usually within 10-20 minutes. The E-mail will look something like this
This mail contains the scores and alignments for the 20 most probable matches in our library. In addition, there is a PSI-Pred secondary structure prediction, and MOST IMPORTANTLY two hyperlinks. One is a link to a more interactive and more informative HTML version of your results, and the other is a link to a file containing these HTML results in a tarred, gzipped format.
As an example of such HTML results, look here for some novel assignments made to the Mycoplasma genitalium genome.
The HTML version of the results is separated into 2 frames. The top frame contains information on the matches found, the confidence with which the match is made (E-value), a button to view the multiple sequence alignment for your query sequence (resulting from a scan of the protein sequence database using PSI-Blast), and links to specific alignments and rudimentary models.
The second frame, in the bottom of the window, is the secondary structure prediction (by PSI-Pred), colour-coded to illustrate the confidence with which each residue's secondary structure class is predicted.
Downloading Results
The top frame is where most of the important information resides. At the top of the page is a link to allow you to download these results for viewing off-line. PLEASE NOTE: Your results will only reside on the server for 5 days. You must download them or resubmit your sequence if you want to view the results after this time.
Multiple Sequence Alignment
Below this is a button which, when pressed, will launch a separate window for you to view the multiple sequence alignment produced by PSI-Blast for your query sequence if any homologues could be found. (Note: If PSI-Blast could not find any homologous sequences, this button will not be clickable and will say "Multiple Alignment not applicable".) The multiple sequence alignment window can tell you several things about your sequence. From the pattern of the alignment you can get some idea as to the possible domain boundaries of your sequence. This can be useful if you want to study each domain separately by resubmitting sub-regions of your query sequence to the server. In addition you can quickly see if PSI-Blast has detected any homologies to known structures by searching for the "pdb" identifier in the names of the homologous sequences. Also, you may get a better idea of the function of your query by looking at the descriptions of the homologous sequences found. Finally, you may retrieve more detailed information about each of the homologues found by clicking on the links which take you to the NCBI.
PROSITE motifs
Below the multiple alignment button is a list of any PROSITE motifs found. It will probably be rare to find PROSITE matches in your sequence, but it is there just in case. You can click on the PROSITE accession number to get more details, and to determine whether the motif is highly specific, or rather general.
Top 20 Structural Hits
Below the multiple sequence alignment button is a table of information regarding the top 20 highest scoring structural matches to your query in our library. Each entry, or row, contains several pieces of information (from left to right):
KRENDHQ | WC | MILV | FY | ASTGP
E-value Key (% Certainty)
95% | 90% | 80% | 70% | 50%
View some pre-submitted results
Please cite us when using 3D-PSSM results in your work.
(1) CAFASP-1: Critical Assessment of Fully Automated Structure Prediction Methods
Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., Maccallum, R.M., Pawowski, K., Rost, B., Rychlewski, L. and Sternberg, M.J.
Proteins: Structure, Function and Genetics, Suppl 3:209-217 (1999)
(2) Recognition of Remote Protein Homologies Using Three-Dimensional Information to Generate a Position Specific Scoring Matrix in the program 3D-PSSM
L.A. Kelley, R. Maccallum and M.J.E. Sternberg
RECOMB 99, Proceedings of the Third Annual Conference on Computational Molecular Biology
Pages 218-225
Editors: Sorin Istrail, Pavel Pevzner, Michael Waterman
Publisher: The Association for Computing Machinery, New York, New York 10036
April 1999
1D-profile generation
i) Start with the sequence of the domain from the master protein (A0) of known structure in a superfamily.
ii) Search this master sequence against NRPROT using 20 iterations of PSI-BLAST with an expectation for including a sequence in the iteration (H) of 0.0005 and a theoretical expectation value (ET) of a hit < 0.0005. Note: PSI-Blast may find and subsequently lose homologous sequences during the iteration process. For this reason, all intermediate sequences, that is all sequences found between the first and last iteration, are stored and recombined at the end of the scan. In addition, we protect against "drifting" of the PSI-Blast PSSM by monitoring the loss of closely homologous sequences from one iteration to the next. Parameters are dynamically altered if such drifting is detected, and this prevents PSI-Blast from iteratively incorporating more and more erroneous sequences.
iii) The alignment generated by PSI-Blast is used explicitly. In cases where many (>200) sequences are retrieved, a variety of criteria are used to reduce the alignment to something more manageable. Sequences are removed that: (a) contain X characters, (b) overlap less than 75% of the query, (c) are >80% identical to other sequences in the alignment.
iv) Using this multiple alignment, generate a 1D-PSSM using the method described for PSI-BLAST.
v) Repeat for all master proteins in our fold library.
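The pruning in step iii can be sketched roughly as follows in Python. This is only an illustration: the source does not say exactly how overlap and identity are measured, so the definitions here (overlap as the fraction of query residues aligned to a residue, identity over columns where both rows have a residue, comparison against sequences already kept) are assumptions, as are all function and parameter names.

```python
def overlap_fraction(query_row, hit_row):
    """Fraction of query residues that are aligned to a residue in the hit."""
    query_positions = [i for i, c in enumerate(query_row) if c != "-"]
    covered = sum(1 for i in query_positions if hit_row[i] != "-")
    return covered / len(query_positions)


def identity(row_a, row_b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def thin_alignment(query_row, hit_rows, min_overlap=0.75,
                   max_identity=0.80, trigger=200):
    """Apply criteria (a) to (c) when more than `trigger` hits were retrieved."""
    if len(hit_rows) <= trigger:
        return list(hit_rows)  # small alignments are used as they are
    kept = []
    for row in hit_rows:
        if "X" in row:
            continue  # (a) contains X characters
        if overlap_fraction(query_row, row) < min_overlap:
            continue  # (b) overlaps less than 75% of the query
        if any(identity(row, other) > max_identity for other in kept):
            continue  # (c) more than 80% identical to a sequence already kept
        kept.append(row)
    return kept


query_row = "ACDEFGHIKL"
hits = ["ACDEFGHIKL", "ACDEFGHIKV", "ACXEFGHIKL", "----FGHIKL", "MCDWFGYIKV"]
print(thin_alignment(query_row, hits, trigger=0))
```

The `trigger=0` in the usage line simply forces the pruning to run on this tiny example; with the default of 200 the five rows would be returned unchanged.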
3D-profile generation
i) Perform a three-dimensional structural superposition using the SAP program (Orengo et al., 1992; Taylor and Orengo, 1989) between the master structure A0 and all other proteins within the same superfamily. Only structures that superpose with a weighted root mean square deviation < 6.0Å to the master structure are considered. Initially, the closest fitting (lowest RMS) structure is added to the alignment (A0 and B0 for example). A search is performed for the next candidate alignment. The alignment with the lowest RMS to EITHER A0 OR B0 is then used. Similarly, the resultant multiple structural alignment is built in a hierarchical fashion, progressively adding alignments that are closest to an existing member of the alignment. This ensures that at all times we are augmenting the alignment with the most confident available structural alignment. Only residues with a SAP equivalence score > 0 are considered in the alignment. The program SAP was obtained from http://mathbio.nimr.mrc.ac.uk/.
ii) Use the residue equivalences from the structural alignment to augment the 1D-profile of A0 with 1D-profiles from B0, C0... . Note that this is at a residue by residue level. This yields a profile with sequences (A0, A1, A2,..AnA, B0, B1, B2,..BnB, C0, C1,C2,..CnC).
iii) Repeat for all master proteins in our fold library
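The hierarchical build in step i can be pictured with a short Python sketch. It assumes the pairwise RMS values have already been computed (by SAP in our case) and are supplied as a dictionary; the data layout, the function name and the example numbers are hypothetical. It only decides the order in which structures join the alignment, not the alignment itself.

```python
def build_order(master, rms, cutoff=6.0):
    """Return the order in which structures are added to the master's alignment.

    rms maps frozenset({a, b}) to the RMS deviation between structures a and b.
    Only structures within `cutoff` of the master are considered. At each step
    the candidate closest to ANY current member is added next.
    """
    def deviation(a, b):
        return rms[frozenset((a, b))]

    names = {name for pair in rms for name in pair}
    candidates = {n for n in names
                  if n != master and deviation(master, n) < cutoff}
    members = [master]
    order = []
    while candidates:
        value, cand, partner = min(
            (deviation(m, c), c, m) for c in candidates for m in members
        )
        order.append((cand, partner, value))
        members.append(cand)
        candidates.remove(cand)
    return order


# Made-up RMS values for four structures in one superfamily.
example_rms = {
    frozenset(("A0", "B0")): 1.5,
    frozenset(("A0", "C0")): 4.0,
    frozenset(("A0", "D0")): 7.0,  # too far from the master, never considered
    frozenset(("B0", "C0")): 2.0,
    frozenset(("B0", "D0")): 1.0,
    frozenset(("C0", "D0")): 3.0,
}
print(build_order("A0", example_rms))
```

In this example C0 joins through B0 rather than through the master, because its alignment to B0 is the more confident of the two.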
Secondary Structure matching
For each library entry, a three-state secondary structure assignment (Coil, Helix, Strand) is made on a per-residue basis using STRIDE (Frishman & Argos, 1995). The three states are formed by the following grouping: (3-10 helices with alpha helices), (bridges with beta-strands), (turns and pi-helices with coil). Query sequences have their secondary structures predicted by PSI-Pred (Q3 77%; Jones, 1999). A simple scoring function for matching secondary structure types between two residues is used: identical secondary structure types score +1, and any other pairing scores -1.
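As a small illustration, the grouping and the match score could be written like this in Python. The one-letter codes are the usual STRIDE ones (H, G, I, E, B/b, T, C); the mismatch score of -1 and the names used are assumptions made for this sketch.

```python
# Reduction of STRIDE codes to three states, following the grouping above.
STRIDE_TO_THREE_STATE = {
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix grouped with alpha helix
    "E": "E",  # beta-strand
    "B": "E",  # isolated bridge grouped with strand
    "b": "E",  # isolated bridge (alternative STRIDE code)
    "T": "C",  # turn grouped with coil
    "I": "C",  # pi-helix grouped with coil
    "C": "C",  # coil
}


def three_state(stride_string):
    """Convert a string of STRIDE codes to Helix/Strand/Coil (H/E/C)."""
    return "".join(STRIDE_TO_THREE_STATE[code] for code in stride_string)


def ss_score(state_a, state_b, mismatch=-1):
    """+1 for identical three-state classes, `mismatch` otherwise (assumed -1)."""
    return 1 if state_a == state_b else mismatch


print(three_state("HGIEBbTC"))
```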
Solvation Potential
Solvation potential is modelled using the approach of Jones et al. (1992). The potential is a term for scoring the preference of an amino acid to occupy a specific structure position with a given exposure. This pseudopotential is derived from our set of representative protein structures by setting the frequency of the occurrence of an amino acid type with a specific degree of residue burial in relation to the occurrence of all other amino acid types with this degree of burial. The degree of burial of a residue is defined as the ratio between its solvent accessible surface area (as calculated by DSSP; Kabsch & Sander, 1983) and its overall surface area. 21 bins in 5% accessibility increments are used, ranging from 0% (buried) to 100% (exposed). The coarseness of this potential means cross-validation is unnecessary.
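To make the binning concrete, here is a rough Python sketch. How the 21 bins map onto the 0% to 100% range is not spelled out above, so the rule used here (bin = accessibility // 5, with exactly 100% in the last bin) is an assumption. The table it builds holds only the relative frequency of each amino acid within a burial bin; turning those frequencies into the final pseudopotential is not shown. All names are hypothetical.

```python
from collections import Counter, defaultdict

BIN_WIDTH = 5.0  # percent accessibility per bin, giving bins 0..20


def burial_bin(relative_accessibility_percent):
    """Map a relative accessibility (0 to 100%) to one of 21 bins."""
    clamped = max(0.0, min(100.0, relative_accessibility_percent))
    return int(clamped // BIN_WIDTH)


def burial_frequencies(observations):
    """Relative frequency of each amino acid type within each burial bin.

    observations: iterable of (amino_acid, relative_accessibility_percent).
    """
    counts = defaultdict(Counter)
    for amino_acid, accessibility in observations:
        counts[burial_bin(accessibility)][amino_acid] += 1
    table = {}
    for bin_index, per_residue in counts.items():
        total = sum(per_residue.values())
        table[bin_index] = {aa: n / total for aa, n in per_residue.items()}
    return table


# Made-up observations: two buried leucines, one buried and one exposed lysine.
observed = [("L", 2.0), ("L", 3.0), ("K", 4.0), ("K", 97.0)]
table = burial_frequencies(observed)
print(table[0], table[19])
```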
Bi-directional scoring
It is known that matching a query sequence to a template PSSM is not the same as matching a template sequence with a query PSSM. Often homologies can be detected in one direction and not in the other. To account for this, each query sequence is scanned against the sequence library using PSI-Blast, and a 1D-profile is generated in exactly the same way as the 1D-profiles for the library sequences.
Searching the probe against the 3D-PSSM library
For each probe, the 3D-PSSM library is scanned using the global dynamic programming algorithm that was developed for our fold recognition algorithm FOLDFIT (Russell et al., 1998). The score for a match between a residue in the probe and a residue in the library sequence is calculated as the sum of the secondary structure, solvation potential and PSSM scores. Three passes of dynamic programming are performed for each query-library sequence match. Each pass differs in the PSSM used for the scoring, with secondary structure and solvation being held constant.
Pass 1: Library sequence is matched to the query PSSM.
Pass 2: Query sequence is matched to the library 1D-PSSM
Pass 3: Query sequence is matched to the library 3D-PSSM
The final score is simply the maximum of the scores from the three passes. An affine gap penalty of 10 to open and 1 per gap extension is used, based on preliminary trials. End gaps are also penalised.
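The three-pass scheme can be sketched in Python as below. This is not the FOLDFIT code: it is a standard affine-gap global alignment (Gotoh-style recursion) written for illustration, and it assumes that a gap of length k costs 10 + (k - 1) * 1, which is one possible reading of the penalty described above. The per-residue score matrices and all names are hypothetical.

```python
NEG_INF = float("-inf")


def global_affine_score(scores, gap_open=10.0, gap_extend=1.0):
    """Best global alignment score with affine gaps; end gaps are penalised.

    scores[i][j] is the match score of query residue i against library residue j.
    """
    n, m = len(scores), len(scores[0])
    # match: last column is a residue pair; gq: query residue against a gap;
    # gl: library residue against a gap.
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gq = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gl = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gq[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        gl[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match[i][j] = scores[i - 1][j - 1] + max(
                match[i - 1][j - 1], gq[i - 1][j - 1], gl[i - 1][j - 1])
            gq[i][j] = max(match[i - 1][j] - gap_open,
                           gq[i - 1][j] - gap_extend,
                           gl[i - 1][j] - gap_open)
            gl[i][j] = max(match[i][j - 1] - gap_open,
                           gl[i][j - 1] - gap_extend,
                           gq[i][j - 1] - gap_open)
    return max(match[n][m], gq[n][m], gl[n][m])


def combine(ss, solvation, pssm):
    """Per-residue-pair score: secondary structure + solvation + PSSM."""
    return [[ss[i][j] + solvation[i][j] + pssm[i][j]
             for j in range(len(ss[0]))] for i in range(len(ss))]


def three_pass_score(ss, solvation, pssm_per_pass):
    """Maximum alignment score over the passes; only the PSSM term changes."""
    return max(global_affine_score(combine(ss, solvation, pssm))
               for pssm in pssm_per_pass)


ss = [[1, -1], [-1, 1]]
solvation = [[0, 0], [0, 0]]
passes = [[[1, 0], [0, 1]], [[4, 0], [0, 4]], [[2, 0], [0, 2]]]
print(three_pass_score(ss, solvation, passes))
```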
The significance of a match is evaluated by fitting a linear relationship between log(number of hits up to a score) and log(total score). Only the top end of the distribution is used, and the possibility of the correct hit contributing to the tail of the distribution is accounted for by removing the top scoring hit and all consecutive entries belonging to the same superfamily. The top end of the distribution is defined using a penalty function algorithm as described in Kelley et al. (1996). The probability of obtaining a match with that score by chance is converted to a theoretical error rate per query (ET).
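For readers who want to see the shape of this calculation, here is a simplified Python sketch of the log-log fit. It makes several assumptions that are not in the description above: "number of hits up to a score" is taken as the rank of a hit when scores are sorted from best to worst, the top end is simply the best `top_n` scores rather than the penalty-function definition, scores are positive, and the removal of the top hit and its superfamily is left out. All names are hypothetical.

```python
import math


def fit_loglog(scores, top_n):
    """Least-squares fit of log(rank) = intercept + slope * log(score)."""
    ranked = sorted(scores, reverse=True)[:top_n]
    xs = [math.log(score) for score in ranked]
    ys = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def expected_hits(score, intercept, slope):
    """Number of hits expected at or above `score` under the fitted line."""
    return math.exp(intercept + slope * math.log(score))


# Synthetic scores constructed so that rank = 100 * score ** -2 exactly.
synthetic = [math.sqrt(100.0 / rank) for rank in range(1, 11)]
intercept, slope = fit_loglog(synthetic, top_n=10)
print(round(slope, 3), round(expected_hits(10.0, intercept, slope), 3))
```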
References
Frishman, D. & Argos, P. (1995). Knowledge-based secondary structure assignment. Proteins: Structure, Function, and Genetics 23, 566-579.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to fold recognition. Nature 358, 86-89.
Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.
Kelley, L. A., Gardner, S. P. & Sutcliffe, M. J. (1996). An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 9, 1063-1065.
Orengo, C. A., Brown, N. P. & Taylor, W. R. (1992). Fast structure alignment for protein databank searching. Proteins: Structure, Function, and Genetics 14, 139-167.
Russell, R. B., Saqi, M. A. S., Bates, P. A., Sayle, R. A. & Sternberg, M. J. E. (1998). Recognition of analogous and homologous folds - Assessment of prediction success and associated alignment accuracy using empirical matrices. Protein Eng. 11, 1-9.
Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1-22.
How can I view my models (on-line and off-line)?
For Unix and Linux users we recommend downloading Rasmol (http://www.umass.edu/microbio/rasmol/index2.htm). You can then set up the MIME type in your Netscape browser so that the command "rasmol -script %s" is executed either on filenames with a .rasmol suffix or when the header "application/x-rasmol" is received.
For PC users there are several options
Missing Results
If you have made a submission to the server but have not received any results, this is usually because of an error in the E-mail address supplied with the submission. Please E-mail me (Lawrence Kelley) and I will try to find your submission.
Installing the server locally
A mechanism for installing 3D-PSSM on your local intranet, or mirroring the service elsewhere on the internet, is currently under investigation. Please let me know if this would be particularly valuable.
Why are there X's in my query sequence?
X's are placed in your query by the program 'seg', which is by default run at the beginning of the 3D-PSSM job. The purpose of 'seg' is to find low-complexity or coiled-coil regions of your query sequence and mask them out with X's. This can be vital when searching the sequence database with PSI-Blast (as we do). Low-complexity or coiled-coil regions can match a great many sequences in the database, causing "matrix drift" and permitting large numbers of unrelated sequences to be dragged into the sequence profile of your query.
How can I get rid of the X's in my query sequence?
In the "Advanced Options" on the submission form there is the option: "Filter your query for low-complexity regions?". If you choose "No" this will prevent 'seg' (see above) from being run and hence no X's will be placed in your query. However, THIS MAY CAUSE YOUR JOB NOT TO FINISH OR TO BE KILLED. If your sequence genuinely contains large amounts of low-complexity or coiled-coil regions, the search of the sequence database may take so long that your job will be killed. Alternatively, the results you obtain may be highly misleading, as the sequence profile for your query may have become highly distorted. USE THIS AT YOUR OWN DISCRETION.
Currently there are 3 sections in the advanced options:
Global-local
This option is "No" by default. With this setting, end-gaps are penalised for both the query sequence and the library sequence. This is helpful when your query sequence is a single domain, as there is quite a strong correlation between the length of a domain and its structure. This provides us with the so-called "length-effect", whereby high rates of recognition can be achieved with end-gap penalisation when both query and library structure are known to constitute a single domain.
When "Yes" is selected, a slightly different dynamic programming algorithm is used to align your query sequence to our fold library entries. Global-local refers to a mechanism of alignment where end-gaps in the query sequence are not penalised, whereas end-gaps in the library structure are. Our library structures are largely composed of domains, and hence the entire sequence of a library entry should be aligned within the query sequence, whereas the query sequence may contain more than one domain.
If you have a long query (so probability suggests it contains more than one domain) or if you know from some other source that your query is likely to be multi-domain, select "Yes". If, however, you think your query is a single domain, select "No". If in doubt, you can always submit your query twice, once with each option, and compare the results.
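The difference between the two settings can be seen in a toy Python example. This is not the server's code: for brevity it uses a single linear gap penalty instead of the affine one, and it reads "end-gaps in the query are not penalised" as meaning that unaligned query residues at either end cost nothing, while every library residue must still be accounted for. The function name, the gap value and the score matrix are all made up.

```python
def align_score(scores, gap=4.0, global_local=False):
    """Alignment score of a query (rows) against a library entry (columns).

    global_local=False: end gaps are penalised on both sides.
    global_local=True: query residues left unaligned at either end are free.
    """
    n, m = len(scores), len(scores[0])
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        table[0][j] = -gap * j  # leading unaligned library residues always cost
    for i in range(1, n + 1):
        # Leading unaligned query residues are free only in global-local mode.
        table[i][0] = 0.0 if global_local else -gap * i
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + scores[i - 1][j - 1],  # align the pair
                table[i - 1][j] - gap,  # query residue against a gap
                table[i][j - 1] - gap,  # library residue against a gap
            )
    if global_local:
        # Trailing unaligned query residues are free: take the best row.
        return max(table[i][m] for i in range(n + 1))
    return table[n][m]


# A 4-residue query whose middle two residues match a 2-residue library entry.
toy_scores = [[-1, -1], [5, -1], [-1, 5], [-1, -1]]
print(align_score(toy_scores), align_score(toy_scores, global_local=True))
```

With "No" the two flanking query residues are charged as gaps, so a good match to a short library domain is dragged down; with "Yes" the same match keeps its full score.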
I could automate this last step, but with a consequent doubling (approximately) in time spent processing the sequence. This may be feasible in the future.
Filter your query for low-complexity regions
By default this option is set to "Yes". This means, by default, your sequence is passed through a program called "seg" which searches for regions of your query sequence that are low-complexity or coiled coil regions. This is done because low-complexity and coiled coil sequences can generate vast numbers of spurious hits in PSI-Blast, thus hugely increasing processing time for no real gain, and possibly losing important sequence signals elsewhere in your query. The low-complexity regions are masked out by X's. If you see a row of X's in your results, this probably means seg detected and masked out those regions in your query.
Seg is not perfect, and sometimes it will mask out regions of a sequence you may know (or think) are not low-complexity or coiled coil. If you find a critical part of your sequence being masked out by X's in this way, you can turn off the pre-filter by selecting "No".
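If you want to see exactly which parts of your sequence were masked, a few lines of Python are enough. This sketch just looks for runs of X in the sequence as it appears in your results; it does not run seg, and it cannot tell a masked region from an X that was already in your input. The function names are my own.

```python
import re


def masked_regions(sequence):
    """Return 1-based, inclusive (start, end) positions for each run of X."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", sequence)]


def masked_fraction(sequence):
    """Fraction of the sequence that is X (0.0 for an empty sequence)."""
    return sequence.count("X") / len(sequence) if sequence else 0.0


example = "MKTXXXXAGHXXLL"
print(masked_regions(example), round(masked_fraction(example), 2))
```

If a region you care about shows up in this list, that is the case where resubmitting with the filter set to "No" may be worth trying.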
Run only 3 iterations of PSI-Blast
This option permits you to limit the number of iterations of PSI-Blast run on your sequence. Sometimes, due to low-complexity, large helical content, richness in charged residues, or just bad luck, PSI-Blast may drift causing it to pull in unrelated sequences on further iterations. To limit this problem I've added this option. If you are getting hits to long helical sequences, or if in the multiple sequence alignment you see lots of "heavy chain myosin" or sequences apparently unrelated to your query, try this option by picking "yes".
Under construction