Phyre2 Workshop

Phyre2 Biochem Soc Workshop 2015

Tutorial Links

Link to results	Walkthrough/interpretation
For an introduction to the basic methodology of Phyre2, you can visit the help page, which includes a short video. For more detail on interpreting results from Phyre2, you can look at the on-line interpretation help. For far greater detail, there is the recent paper on Phyre2 explaining both the method and how to navigate the results. Please feel free to submit your own sequence of interest to Phyre2 here. However, a job can often take a while (from 30 minutes to more than 2 hours), so below are some pre-calculated examples for you to examine.
These examples mostly illustrate problems and pitfalls rather than shining successes, as I thought this would be more useful in general.

0. Basic introduction	Here is a trivial example, using a globin-fold protein, just to get used to the Phyre2 interface.
1. When you open the page you will see information about the job in the top right and a link to download an off-line copy of the results. Scroll down to the first section, entitled "Summary".
2. Here you can see an image of the model built by Phyre2, coloured in a rainbow from blue at the N-terminus to red at the C-terminus. To the right you can see information about the template used to build the model (in this case the PDB code and information from SCOP) as well as the confidence in the model (100%) and the coverage of the input sequence (also 100%). Below this you will see a link "Interactive 3D view in JSmol". Click this link to launch the JSmol viewer in the box where the protein image was. Rotate the model. When finished, click "Close JSmol" beneath the image.
3. Scroll down until you see the heading "Secondary structure and disorder prediction". Click the link labelled "Show". This will reveal information about the input sequence (residues coloured by amino acid class) and the positions of predicted helices and disordered regions. In addition you will see small blue dots. Hover over one of these to reveal "Heme binding site". These are residues in the globin that bind to the heme group. This information is taken from the Conserved Domain Database.
4. Scroll down to the heading "Detailed template information". Here you will see a table containing a ranked list of models produced by Phyre2 for the input sequence. Each row represents a model based on a different template. The template used is shown in the second column. The "Alignment" column graphically summarises which region of the input sequence has been modelled (not very interesting in this case, as all templates match the input sequence for its complete length).
5. You will see that the top hit is the trivial case where the input sequence (which already has a solved structure) has been matched with itself (100% identity). Below that, at rank 2, you see a template that shares 53% identity, followed by one with 24% identity. Each of these templates is assigned 100% confidence. This reflects that these templates are confidently homologous to the input sequence, despite the variable sequence identity. Phyre2 can routinely identify templates and build models of good accuracy even when sequence identity is as low as 15%.
6. Choose a rank to look at and click the associated "Alignment" button. This will take you to a page showing detailed information about the alignment between the input sequence and the template. Here you can scroll down to the image of the model and again use the JSmol viewer to examine the model.
This covers the basic interpretation of Phyre2 results. In the steps below we'll look at more in-depth tools and more problematic cases.
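The identity percentages quoted in step 5 can be made concrete with a small sketch. It assumes identity is counted over aligned (non-gap) columns of a pairwise alignment; Phyre2's exact definition may differ, and the function name and the toy alignment below are purely illustrative.

```python
def percent_identity(query_aln: str, template_aln: str) -> float:
    """Percent identity over the aligned (non-gap) columns of a pairwise alignment.

    Both strings must be the same length, with '-' marking gaps.
    Assumption: gapped columns are ignored; other tools may count them differently.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    # Keep only columns where both sequences have a residue.
    aligned = [(q, t) for q, t in zip(query_aln, template_aln)
               if q != "-" and t != "-"]
    if not aligned:
        return 0.0
    matches = sum(q == t for q, t in aligned)
    return 100.0 * matches / len(aligned)


# Toy alignment (hypothetical, not taken from the tutorial job).
toy_query = "MKT-AYIAKQR"
toy_template = "MRTLAY-AKHR"
print(f"{percent_identity(toy_query, toy_template):.1f}% identity")
```

A sequence aligned to itself scores 100%, which is the trivial rank 1 case described above.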

1. Playing with Phyre Investigator	Here is a simple example for Phyre Investigator to get you used to the interface.
1. Scroll down to the Detailed template information section. You can see the Investigator buttons on the right-hand side, except for rank 3, which says "View investigator results". This is because I've already run that analysis. Please don't press the other Investigator buttons on the tutorial examples, as multiple people running the analysis is likely to cause a mess.
2. Click on "View investigator results" for the rank 3 hit, c3pt8B_. This will take you to the Investigator interface. The screen is divided into 3 main horizontal sections: the Info box, the 3D structure and Analyses section, and, at the bottom, the Sequence view.
3. In the Analyses section, click the Quality tab and below that click the 'ProQ2 quality assessment' button. The structure will be coloured mainly orange and yellow. Look at the key to the left. This indicates most of the structure is towards the 'Good' end of the spectrum. Look at the text box near the top of the page. It gives a brief summary of what this analysis (ProQ2) does.
4. Move your mouse down towards the Sequence view area. Note how, as you hover over residues in the sequence view, the corresponding residue in the 3D structure is highlighted. Clicking on a residue causes that position to 'spacefill'. You can clear that by clicking the 'clear selection' button just above the sequence view.
5. Hovering over a position in the sequence view also displays two bar graphs on the right of the middle section. These graphs display the preference for each residue type in the sequence profile ('Sequence Profile' graph) and the likelihood that a mutation to each of the 20 amino acids will have a phenotypic effect ('Mutations' graph).
6. In the 'Analyses' section there are 3 tabs: Quality, Function, and CDD. Under the 'Quality' tab you can investigate a number of features. Try clicking the 'Ramachandran Analysis' button. A few residues will be coloured green and red in the 3D structure. A new row will also appear in the sequence view, marking the same residues as those highlighted in the structure. The 'Bad' and 'Allowed' residues appear only in the loop regions, so there is probably not much to worry about.
7. Clicking the 'Disorder' button shows similarly that loop regions and the termini are the only regions with any significant disorder.
8. Let's look at the CDD tab. This tab only appears if information from the Conserved Domain Database is available for your sequence. In this case it has detected a heme-binding site. First click 'clear selection'. Now click each residue coloured red in the sequence view to spacefill it. You should have about 11 residues in spacefill mode, coloured red. As you click on each residue, have a look at the 'Mutations' graph. In almost all cases you can see that mutating the residue to anything other than that in the query sequence is likely to have a phenotypic effect.
9. Go to the 'Function' tab. Click through 'conservation', 'pocket detection' and 'mutational sensitivity', reading the text in the Info box for each analysis. Notice how the heme-binding site residues correlate well with these features.
10. Finally, click the protindb interface button to see those residues known to form an interface in the template structure.
There is too much to go through exhaustively, but I hope this gets you started.

2. A Bad Result	
1. First note the low confidence (41%) and low coverage (16%). Immediately you know you aren't going to learn too much from this. Note the PhyreAlarm icon. This pops up in such cases of low confidence and coverage.
2. Go down to the Sequence analysis section. Click the button for the PSI-Blast Pseudo-multiple sequence alignment. That opens a new window. One can see plenty of homologous sequences, which is good. It means the secondary structure prediction will be pretty accurate and the hidden Markov model for the sequence should be quite powerful. But the lack of any confident hits suggests that this may be a new fold, or just really remote from anything we have a structure for.
3. Look at the Secondary structure and disorder prediction. Click 'Show'. There is no significant disorder, and the protein is confidently predicted to be all alpha helices (SS confidence mainly red). Notice the gold helices? That indicates transmembrane helices. Click 'Hide' to close the secondary structure prediction panel.
4. Scroll down to Domain analysis and click 'Show'. There are only short blue and green matches, all well below any useful confidence threshold.
5. Click 'Hide' to hide the domain analysis. Scroll down to the Detailed template information. One can see that the rank 3 and 4 hits have red boxes highlighting the 40% identity between the query protein and the template. But then look how short they are. One can often get high sequence identities purely by chance from short alignments.
6. Scroll to the very bottom of the web page. You can choose to hide the Detailed template information if you like, to make this easier. You'll see the Transmembrane helix prediction section.
7. It looks like all we can get from this run is a possibly useful TM topology prediction. The image indicates the extracellular and cytoplasmic sides of the helices and their start and stop positions. This is probably a good candidate for PhyreAlarm. Maybe a new structure will come out in the weeks ahead that we can build a model on.
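The point in step 5 about short alignments can be illustrated with a rough calculation. The sketch below assumes a simple binomial model in which each aligned position matches independently with probability `p_match`; the value 0.1 and the alignment lengths are illustrative assumptions, not figures from this job, and real alignments are optimised and compared against many templates, so chance matches are more likely than this model suggests.

```python
from math import comb


def prob_at_least_matches(length: int, min_matches: int, p_match: float) -> float:
    """P(at least `min_matches` identical positions in `length` aligned positions)
    under a binomial model with independent positions (a simplifying assumption)."""
    return sum(
        comb(length, k) * p_match**k * (1.0 - p_match) ** (length - k)
        for k in range(min_matches, length + 1)
    )


P_MATCH = 0.1  # assumed chance of a match at one position (illustrative only)

# 40% identity over a short alignment versus the same identity over a long one.
short = prob_at_least_matches(length=10, min_matches=4, p_match=P_MATCH)
long_ = prob_at_least_matches(length=100, min_matches=40, p_match=P_MATCH)
print(f"40% identity by chance over 10 residues:  {short:.4f}")
print(f"40% identity by chance over 100 residues: {long_:.2e}")
```

Under these assumptions, 40% identity over 10 residues arises by chance a little over 1% of the time, whereas over 100 residues it is vanishingly unlikely. That is why the length of a hit matters as much as its identity.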

3. An Orphan	
1. This protein gives us a borderline interesting confidence value (85%) but very low coverage (35%). Again, maybe a job for PhyreAlarm.
2. Scroll to the Sequence analysis section. Click the View PSI-Blast Pseudo.... button. You get an empty window. This sequence appears to be an orphan, i.e. it hasn't even got sequence homologues. This should make you cautious about the secondary structure prediction. Close the empty window.
3. Scroll to the Secondary structure and disorder prediction section and click 'Show'. Looking at the secondary structure prediction, you can see that only some of the elements are confidently predicted (red in the SS confidence rows).
4. Scroll to Domain analysis and click 'Show'. This shows that the only semi-confident matches are all in roughly the same region of the protein.
5. Hide the Domain analysis and scroll down to Detailed template information. There are many hits to clearly very similar structures. (In the real world you should use the superposition tool at this point. But PLEASE DO NOT do that now - it was designed for a single user, so problems are bound to happen when 60 people do it at once!)
6. Look in the Template Information column of the main table. They are almost all hits to peptide deformylase. Although the confidence is borderline, the number of hits to this type of protein and their structural similarity to one another would certainly convince me that 'peptide deformylase' was at least an interesting lead for this protein.
7. Click on the Alignment button for the rank 2 hit, d1xeoa1. One can see a red boxed E residue. This is a catalytic residue in the template structure. But it is aligned to an A in the query protein. Check the alignments for the rank 3 and 4 hits as well. In each case an A in the query is aligned to a known E active site residue in the template. At this point you would delve deeper, looking into the papers on peptide deformylase and checking whether having an A at that position would disrupt catalytic activity. We'll leave that analysis to the reader!
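The check in step 7, finding which query residue sits opposite a known catalytic residue of the template, can be sketched as follows. The function name, the toy alignment and the residue numbering are all hypothetical; they are not the real d1xeoa1 alignment, which you should read from the Phyre2 alignment page.

```python
from typing import Optional


def query_residue_at_template_position(
    query_aln: str, template_aln: str, template_pos: int, template_start: int = 1
) -> Optional[str]:
    """Return the query residue aligned to template residue number `template_pos`.

    `template_start` is the residue number of the first template residue in the
    alignment. Returns None if that template residue is aligned to a gap or
    is not covered by the alignment.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    pos = template_start - 1
    for q, t in zip(query_aln, template_aln):
        if t == "-":
            continue  # gap in the template: no template residue consumed
        pos += 1
        if pos == template_pos:
            return None if q == "-" else q
    return None


# Toy alignment (hypothetical): template residue 4 is a catalytic E.
toy_query = "TGHAL-NHE"
toy_template = "SGHELTNH-"
residue = query_residue_at_template_position(toy_query, toy_template, template_pos=4)
print(f"Template E4 is aligned to query residue: {residue}")
```

If the query residue differs from the catalytic residue, as with the A opposite E here, that is the cue to go to the literature before trusting a functional assignment.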

4. A Multi-domain protein	Here is a more confident case (100%) but with a substantial number of missing residues (only 61% coverage). The Summary section reports that, according to the domain analysis, more of the protein (90%) could be modelled with high confidence. Maybe a job for intensive mode?
1. Have a look at the Domain analysis. There is a longer hit at rank 14. If you hover over that hit in the domain analysis (c2nvuB_), a pop-up will display some information. In the pop-up you can see that this template covered 511 residues (aligned) and 548 residues (modelled). The difference between these numbers (37 residues) is accounted for by modelled loops. But it says it has 75% identity.
2. Scroll down to the Detailed template information. We see lots of very high identity (>90%) hits, but they only cover just over half of the protein.
3. Scroll down to the rank 14 hit (c2nvuB_) and click on the Alignment button. There are certainly some quite large insertions and deletions present near the C-terminus (highlighted in dark red and orange respectively).
4. Finally, go back to the main page, up to Domain analysis, and scroll way down the domain analysis window. You can see that down at rank 128 some red (high confidence) hits are present for the C-terminal domain. By default, Phyre2 only models the top 20 hits. But intensive mode looks deeper down the list to find more matches across the user sequence.
Overall this looks like a great case for intensive mode.

5. The Dangers of Intensive	Things can sometimes go wrong, sadly...
1. The model in the summary panel looks pretty awful. This is 'spaghettification'. Why did it happen? Looking at the blue/red/orange bar to the right of the model, we seem to have at least two confidently predicted regions (red), connected by a weaker match (yellow), and a small amount of template-free ab initio modelling (blue) at the N-terminus.
2. Look at the Secondary structure and disorder prediction. There are a lot of question marks (disorder). In fact, near the bottom of this section it reports 41% disorder. This is at least part of the problem.
3. Scroll down to Domain analysis. Ranks 1-14 are all confident matches to the N-terminal region, and then we start to see some confident matches to the C-terminal region. Going further down to rank 38, we see a yellow, moderate match to the central region of the query sequence.
4. Scroll to the Detailed template information. Now we can start to see the problem. The top hit is a compact beta sheet and a helix with 56% identity. But below it is a 97% identical structure that looks quite different. The initial helix and first strand have unfolded from the rest of the sheet, leaving a non-compact structure with fewer internal contacts. But it's 97% identical, so Phyre2 chose that structure as a template. Open, non-compact structures with few constraints behave badly when input to a restraint-based program such as Poing! So this explains why the N-terminal region is spaghetti in the final model.
5. But what about the C-terminal region? Scroll to the very top of the web page again, back to the summary panel. You will see: '60% of residues modelled at >90% confidence (Details)'. Click on Details and you will be taken to the very bottom of the results page, to the Multi-template and ab initio information section. This shows which template-based models were chosen to enter the multi-template modelling and which regions of the query sequence they covered.
6. You can see that templates d2b4cg1 and c3j5mI were used to cover the C-terminal region. Clicking on the template identifiers takes you to their row in the detailed template information table. You can see that these models (templates) are non-compact, partially disordered structures. This fits with the disorder prediction earlier. They are also quite different from each other, which can be seen if you do a structural superposition between them (please don't, though, as it may mess things up for others at the workshop).
So, quite a few things went wrong here. But most of this was caused by the automated template selection heuristics I've built into Phyre2. I continue to work on this aspect of Intensive mode to encourage it to make sensible decisions. But I think the best approach is to allow users to do that selection themselves, and that is why I hope to have that functionality in the next version of Phyre2. In the meantime, I suggest the following: if you end up with results like this, stick to Normal mode and use what you think are the best models for the individual regions of the protein, without trying to connect them all together.

6. A better intensive run	Here's a slightly better case for intensive mode.
1. Click the Interactive 3D view in JSmol link. That N-terminal blue alpha-helix (which was built ab initio) probably shouldn't be where it is. It should probably pack better - but ab initio is tricky! Also, there appears to be some tangling in the red C-terminus. This is usually caused by disagreements between the input templates in that region.
2. In the summary section, click on the link called 'Details' below the confidence key. This takes you to the bottom of the results page, to the Multi-template and ab initio information table. This table shows you which templates were used, what regions of your sequence they covered, and their confidence.
3. In particular, note that template d1svma_ (bottom of the list) covers a significant extra region of the query protein at the N-terminus, but is missing a sizeable segment at the C-terminus. Luckily the other templates cover this region well already. This is where using multiple templates as a 'patchwork' can improve model coverage. In this case, use of intensive mode has managed to model an extra 60+ residues. It has also had a fair go at the missing first 20 residues, which have no template. The secondary structure prediction says this should be helix, and intensive mode (or rather the Poing system) has attempted to build a helix for these residues and pack them against the rest of the structure. However, whenever ab initio modelling is concerned, please take results with a large pinch of salt.
We are working very hard on methods to avoid 'spaghettification' and tangling from inconsistent templates, and on better methods of template selection, including user-defined selections. These new approaches should be incorporated into Phyre2 by the end of Summer 2015. Intensive mode often creates excellent full-length models that cannot be achieved by normal mode. The examples presented here are designed to illustrate how the process can occasionally go wrong, how to detect the problem, and how to diagnose the cause.
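The 'patchwork' idea from step 3 comes down to taking the union of the residue ranges covered by each template. Here is a minimal sketch; the query length and the template ranges are made-up numbers for illustration, not the values from this job, which are listed in the Multi-template and ab initio information table.

```python
from typing import Iterable, Set, Tuple

Region = Tuple[int, int]  # inclusive (start, end) residue numbers on the query


def covered_residues(regions: Iterable[Region]) -> Set[int]:
    """Union of all query residues covered by at least one template region."""
    covered: Set[int] = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return covered


def coverage_percent(regions: Iterable[Region], query_length: int) -> float:
    """Percentage of the query sequence covered by the given regions."""
    return 100.0 * len(covered_residues(regions)) / query_length


# Hypothetical example: one template covers most of the protein, and a second
# extends further towards the N-terminus but stops short of the C-terminus.
QUERY_LENGTH = 300
main_template = (85, 300)
extra_template = (21, 240)

single = coverage_percent([main_template], QUERY_LENGTH)
combined = coverage_percent([main_template, extra_template], QUERY_LENGTH)
gained = len(covered_residues([main_template, extra_template])) - len(
    covered_residues([main_template])
)
print(f"single template: {single:.1f}%, patchwork: {combined:.1f}% (+{gained} residues)")
```

Residues covered by no template at all (1-20 in this toy case) are the ones left to ab initio modelling, which is why they deserve the most scepticism.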

Running SuSPect to analyse mutations
7. Exploring a disease mutation	We now use a Phyre2 example to understand a disease-associated variant linked to diabetes.
1. The query was the human ATP-sensitive inward rectifier potassium channel 11 (UniProt ID Q14654). The variant Arg 201 His is associated with neonatal diabetes.
2. Scroll down to Detailed template information and view template 1. Click on "View Investigator Results". [Please do not run other Investigators, as several simultaneous requests cause problems!]
3. In the sequence viewer at the bottom, scroll until you reach residue 201 (R). Click on the R. In the 'Mutations' display on the far right you will see that nearly every variant is predicted to be disease-causing. You can also see the location of the variant in red on the structure.

8. Running SuSPect	1. In the box on the left you can input a UniProt ID and the variant. Delete the example text and put in Q14654 R201H. Tick the 'return structure' box and run SuSPect.
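The variant string entered above uses the compact form wild-type residue, position, mutant residue. The sketch below shows how such a string maps onto the "Arg 201 His" notation used earlier. It is an illustrative parser only and does not reflect how SuSPect itself reads its input.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(text: str):
    """Split a variant such as 'R201H' into (wild_type, position, mutant)."""
    match = VARIANT_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a valid variant string: {text!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if wild_type not in THREE_LETTER or mutant not in THREE_LETTER:
        raise ValueError(f"unknown amino acid code in {text!r}")
    return wild_type, position, mutant


def describe_variant(text: str) -> str:
    """Render a variant in three-letter form, e.g. 'Arg 201 His'."""
    wild_type, position, mutant = parse_variant(text)
    return f"{THREE_LETTER[wild_type]} {position} {THREE_LETTER[mutant]}"


print(describe_variant("R201H"))
```

Before submitting, it is worth checking that the residue at that position in the UniProt sequence really is the wild-type residue you typed.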

9. Inspecting SuSPect output	
1. View the page. You will see a score of 87 on a scale from 0 (benign) to 100 (pathogenic). A score above 50 is a pathogenic prediction.
2. If you click on the orange 87, you will see the features that led to the prediction.
3. Click on the structure to see the location of the wild-type residue.
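The scoring rule in step 1 is simple enough to write down directly. This sketch only encodes what is stated above (a 0 to 100 scale, with scores above 50 read as pathogenic); the function name and the label for scores of 50 or below are my own choices, not SuSPect output.

```python
def classify_suspect_score(score: float, threshold: float = 50) -> str:
    """Interpret a SuSPect-style score on the 0 (benign) to 100 (pathogenic) scale.

    Scores strictly above `threshold` are read as a pathogenic prediction.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return "pathogenic" if score > threshold else "not predicted pathogenic"


print(classify_suspect_score(87))
```

The score is best treated as a ranking aid: a value just above 50 deserves far less weight than the 87 seen for R201H.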