Why read this if you can simply download the files, right?
Well, I've seen too many avoidable errors in processing PROSS results and ordering genes that could have been prevented by reading this page. Even the easiest experiment with PROSS designs will take you 1,000 times longer than reading this. So...
This page discusses the following topics/questions:
- Result files description
- Visualizing the results, expected trends and when to omit mutations
- Design 9 is special and should be avoided in most cases
- Selection of sequences for experimental validation
- Proceeding toward experimental validation
- Troubleshooting 1: a low number of mutations (<6-7% in design 9)
- Troubleshooting 2: a high number of mutations (>15% in design 9)
- Troubleshooting 3: a low number of sequences in the multiple alignment (<50)
- Troubleshooting 4: a low number of sequences in a specific segment (<10)
- MSA figure: what useful information can I derive from it?
Result files description
At the bottom of your online results page are two download buttons.
One downloads a single file with the designed sequences only, and the other downloads all results. We strongly recommend either downloading all results or fully examining your online results page, which contains the same information (note that the page will be available for only 2 months).
The complete results directory will include the following files:
- A single file containing the amino-acid sequences of all nine designs (file name: all_designs_full_seq.fasta). Carefully check that the sequences are correct before ordering genes, especially if your input structure had missing amino acids (i.e., residues that did not crystallize), and make sure that these were added to the final sequences.
- A model of each design (.pdb)
- A model of the input structure after relaxation in PROSS (file suffix: _ref.pdb). Minor conformational changes are expected compared to the original input.
- Files for visualizing the mutations in PyMOL (.pml). One file per design and an additional file for all designs (all_designs.pml). The pml files require PyMOL 1.74 or newer. See next point for details.
- A figure showing the sequence distribution in the multiple sequence alignment (MSA) that was generated by PROSS and used to derive phylogenetic constraints to guide design (file name: msa_quality.png). Details below.
- A table showing all the mutations in each design (file name: list_of_mutations.tsv). It opens in Excel-like software and is very useful for comparing designs.
- A number of files with the suffixes .js, .json, and .md, plus the file msa_viewer.fasta. These files are irrelevant to you; we use them to present your online results page.
The online results page may contain additional text messages/warnings about your query that are not included in the downloaded files.
For experimental validation, you only need the sequence file.
However, we strongly recommend checking the messages about your query in the results page, particularly the ones related to alignment quality. Re-run your query if needed. See also Troubleshooting 3 and 4 later in this page.
We also recommend using our online viewer or the .pml files to inspect the designs prior to any gene order.
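As a starting point for checking the sequence file, here is a minimal Python sketch that parses FASTA text into a dictionary and prints the length of each design. The record names in the example (design_1, design_2) and the sequences are hypothetical; the headers actually written to all_designs_full_seq.fasta may differ.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after '>' is the header.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Hypothetical toy input, standing in for the contents of
# all_designs_full_seq.fasta.
example = ">design_1\nMKTAYIAK\nQRQISF\n>design_2\nMKTAYLAKQRQISF\n"
designs = parse_fasta(example)
for name, seq in designs.items():
    print(name, len(seq))
```

For a real run, replace the toy string with the file contents, for example `open("all_designs_full_seq.fasta").read()`.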
Visualizing the results, expected trends and when to omit mutations
There are two ways to view the results:
- Our online NGL viewer - click on the view buttons in the online results page.
The viewer will load with the help instructions displayed. Click on the x sign to close the help and reveal the designed model. Using the mouse, you can move and rotate the protein as well as zoom in and out. Hovering over protein positions will show their residue number and identity. The menu on the right side allows you to control some aspects of the view, for instance, showing or hiding mutated amino acids and other objects by clicking on the eye icon.
The protein backbone is shown as a grey cartoon. Mutated positions are shown in sticks (wild type and mutated amino acids are colored in grey and green respectively). On the right side, you can click on the "line" object to show all the other positions as thin sticks.
- PyMOL visualization (free PyMOL version).
The pml files require PyMOL 1.74 or newer. If you have an older PyMOL version, you may load the raw PDBs or use our online NGL viewer.
Launching the pml session: to inspect a design in PyMOL, simply double-click on the relevant .pml file. Alternatively, you may first launch PyMOL and then drag the .pml file onto PyMOL, but in this case you need to first drag in the .pdb file(s) that represent the design model(s) and the wild type model (suffix: _ref.pdb). Only then drag the .pml file.
What you see: a pml file ending with "_8.pml" for example, will show the mutations in design 8 in yellow sticks on the background of the wild type protein shown in green cartoon.
This pml will contain four objects. The first, with the suffix "_ref", represents the input structure after relaxation (green cartoon). The second, with the suffix "design_8", represents the output design; this object is unselected by default, and clicking on it shows the cartoon in cyan with the mutated positions in yellow. The third and fourth objects have the suffix "_muts" and represent selections of the mutated positions (try them).
When viewing the structures you would expect to see some/all of the following features:
- Most mutations will be on the surface and only a small fraction in the core.
- Surface charge and polarity will typically increase. In addition, charge distribution may change and in some cases the protein pI might change dramatically. New salt bridges and surface hydrogen bonds may be observed.
- You may observe some mutations to proline in loops or at helix N-termini/kinks.
- Some mutations improve secondary structure propensities. Others may be involved in better helix capping.
- Mutations in the core either improve packing / eliminate unsatisfied hydrogen bond donors or acceptors / make new hydrogen bonds / improve the secondary structure propensity / alleviate repulsion.
The most permissive design has the highest number of mutations and the other designs will typically be sub-sets of this design. Therefore, it is usually enough to examine thoroughly the permissive design. Nevertheless, we recommend using the mutation table provided to you both in the online page and in the downloaded files for easy comparison between designs. By examining the table columns you can quickly see whether there is a mutation(s) in one of the stricter designs that did not appear in the most permissive one.
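The comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes you have already extracted the mutations of each design into sets of strings such as "A12P" (the layout of list_of_mutations.tsv is not described here, so no parsing is attempted), and the design names and mutations below are hypothetical.

```python
def extra_mutations(mutations_by_design, permissive):
    """Return {design: sorted mutations absent from the permissive design}."""
    reference = set(mutations_by_design[permissive])
    extras = {}
    for name, muts in mutations_by_design.items():
        if name == permissive:
            continue
        # Mutations present in this design but not in the permissive one.
        missing = sorted(set(muts) - reference)
        if missing:
            extras[name] = missing
    return extras


# Hypothetical mutation sets for three designs.
table = {
    "design_7": {"A12P", "K45E"},
    "design_8": {"A12P", "K45E", "S80T", "L101M"},
    "design_9": {"A12P", "S80T", "L101M", "G130D"},
}
print(extra_mutations(table, "design_9"))
```

In this toy example, K45E appears in the stricter designs but not in the most permissive one, so it is worth inspecting separately.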
Consider removing mutations from the genes you order in the following cases:
- A given mutation does not make sense to you for any reason. Each PROSS mutation is independently stabilizing, and hence removing a single mutation here and there should still leave you with a stabilizing combination.
- Mutations in the vicinity of active sites, especially if you decided to run free of constraints.
- Core mutations from large to significantly smaller amino acids.
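Removing a mutation simply means restoring the wild-type residue at that position in the sequence you order. A minimal Python sketch, assuming the wild type and design sequences have equal length and that positions are 1-based indices into the sequence string (which may not match the PDB residue numbering); the sequences shown are hypothetical:

```python
def revert_to_wild_type(wt, design, positions):
    """Copy `design`, restoring the wild-type residue at each 1-based position."""
    if len(wt) != len(design):
        raise ValueError("sequences must have equal length")
    residues = list(design)
    for pos in positions:
        residues[pos - 1] = wt[pos - 1]
    return "".join(residues)


# Hypothetical toy sequences: the design differs from WT at positions 3, 6 and 9.
wt = "MKTAYIAKQR"
design = "MKPAYLAKER"

# Revert only the T3P mutation and keep the other two.
print(revert_to_wild_type(wt, design, [3]))
```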
Design 9 is special and should be avoided in most cases
As of October 2019, PROSS provides nine stabilized models. Designs 2-8 are equivalent to the 7 designs provided in the original release. In the new release, Design 9 is the most permissive one (it contains the highest number of mutations) while Design 1 is the strictest; neither was generated previously. Design 9 has a higher false-positive rate than the one we report in the PROSS publication. It was added to the online release for extreme cases in which the designs have a low number of mutations and this cannot be solved by adjusting the input parameters. If you plan to test only 3-4 designs and Design 8 has a satisfactory number of mutations (>8%), avoid Design 9. Of course, if you can test all nine designs, you are welcome to do that and we will appreciate your feedback :)
Selection of sequences for experimental validation
!!!Before reading the next text, make sure that you read the above point on Design 9!!!
We recommend testing 3-4 designs experimentally. Select the ones that are the most different from one another:
- Include the most permissive design that still meets the false-positive rate that we reported in the original PROSS publication (Design 8). If this design is already too permissive (>10-12% mutations), consider taking Design 7 instead and also read Troubleshooting 2 below. If Design 8 is highly constrained (<6-8% mutations), consider testing Design 9 instead, or rerunning PROSS with adjusted parameters.
- Select an additional design from the strictest ones (with >2% mutations).
- For the third and fourth designs, pick two designs in the middle range that differ from the other ones and from each other by at least five mutations.
- If you can't find four designs that are different enough from one another, consider testing a smaller number of designs; conversely, if more than four designs differ enough, consider testing more.
- All mutations are predicted to be independently stabilizing. Therefore, if a specific mutation does not make sense to you at all, you can usually exclude it with no worries. However, if it is close to other mutations you may want to re-consider or exclude the near ones also.
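The selection criteria above rely on two numbers: the percentage of positions mutated relative to wild type, and the number of differences between pairs of designs. Below is a minimal Python sketch of both, assuming all sequences have equal length. The toy sequences and design names are hypothetical, and the default of five differences is taken from the guideline above.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_mutated(wt, design):
    """Percentage of positions in `design` that differ from `wt`."""
    return 100.0 * count_differences(wt, design) / len(wt)


def too_similar_pairs(designs, min_diff=5):
    """Pairs of designs that differ by fewer than `min_diff` mutations."""
    return [
        (name_a, name_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(designs.items(), 2)
        if count_differences(seq_a, seq_b) < min_diff
    ]


# Hypothetical 20-residue toy sequences.
wt = "A" * 20
designs = {
    "design_2": "V" + "A" * 19,
    "design_5": "V" * 4 + "A" * 16,
    "design_8": "V" * 8 + "A" * 12,
}

for name, seq in designs.items():
    print(name, percent_mutated(wt, seq))
print(too_similar_pairs(designs))
```

Pairs reported by `too_similar_pairs` are candidates for dropping one member when choosing which designs to test.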
Proceeding toward experimental validation
- Select designs (usually 3-5) for experimental testing. The section above explains how to select them.
- For each selected design, align its amino acid sequence (provided in the results directory) with the WT sequence. The main point you want to verify is that no residue is missing. Sometimes there is missing density in crystal structures, leading to gaps in the primary sequence. PROSS aims to detect these gaps and fill them in; however, in some cases this is impossible due to incorrect information on the RCSB website.
The bottom line: YOU MUST VERIFY THE AMINO ACID SEQUENCES BEFORE ORDERING. ERRORS WILL BE PAINFUL (in money and time)
- Back-translate the amino acid sequences to DNA sequences (optimize for E. coli expression if you plan to express in bacteria). I use the DNAWorks website, which allows you to optimize for different organisms and to exclude undesired sequences in advance. Another option is EMBOSS.
- Order the full genes (we strongly recommend ordering full genes rather than inserting mutations one by one). Among the companies providing this service are Twist (requires large quantities), IDT (oligos order) and Genscript.
- Calculate the pI values of the ordered designs, for example with an online pI calculator. If these are significantly different from WT, consider changing the buffer pH.
- Express the selected designs and the wild type protein. Then use an appropriate assay to examine whether the PROSS designs are more stable than the wild type protein with respect to the problem that made you use PROSS originally. You can take ideas for useful assays from our original PROSS publication as well as other works using PROSS that are listed on our main page.
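The sequence check recommended above (verifying that no residue is missing relative to WT) can be partly automated. The sketch below uses Python's standard difflib module rather than a dedicated alignment tool, which is a simplification that works for near-identical sequences; treat it as a first-pass check, not a substitute for inspecting a proper alignment. The sequences are hypothetical.

```python
from difflib import SequenceMatcher


def compare_to_wild_type(wt, design):
    """Return (substitutions, length_changes) for a design versus WT.

    Substitutions are strings such as 'I6L' (1-based WT position).
    length_changes lists any insertions or deletions, which should not occur.
    """
    matcher = SequenceMatcher(None, wt, design, autojunk=False)
    substitutions, length_changes = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same-length replacement: report residue-by-residue substitutions.
            for k in range(i2 - i1):
                substitutions.append(f"{wt[i1 + k]}{i1 + k + 1}{design[j1 + k]}")
        else:
            # Insertion, deletion, or unequal replacement: a likely gap problem.
            length_changes.append((tag, wt[i1:i2], design[j1:j2], i1 + 1))
    return substitutions, length_changes


# Hypothetical toy sequences.
wt = "MKTAYIAKQRQISFVKSHFSRQ"
good_design = "MKTAYLAKQRQISFVKSHFSRQ"  # one substitution, full length
gapped_design = "MKTAYIAKQRQVKSHFSRQ"   # residues 12-14 (ISF) are missing

print(compare_to_wild_type(wt, good_design))
print(compare_to_wild_type(wt, gapped_design))
```

Any non-empty `length_changes` list means the design sequence is not the same length as WT in that region and should be fixed before ordering.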
Troubleshooting 1: a low number of mutations in PROSS designs
If Design 9 has <6-7% of its sequence mutated, try one of the following:
- Increasing alignment diversity (see Troubleshooting 3 below).
- Using fewer or no active site constraints.
Constraining active site regions using the PROSS constrain options typically has a wider effect than you expect. Therefore, it can be useful to avoid constraints if an initial constrained run yielded limited designs. However, in such a case we recommend that you inspect the final designs afterwards to make sure that there are no mutations in the active sites. PROSS mutations are independent, and hence you may omit active-site mutations from the experimentally tested constructs.
Reminder (see the point above about Design 9): I used Design 9 as a reference point for troubleshooting, but it is recommended for testing only when Design 8 has <6-8% mutations.
Troubleshooting 2: a high number of mutations in PROSS designs
If Design 9 has >13-15% (using Talaris2014*) or >18% (using Ref2015*) of its sequence mutated, we suspect that you are using a poor-quality model. If you indeed used a model rather than an experimental structure, try generating another 1-2 models using other software and submit them to PROSS as well. When ordering genes, include only mutations that appeared in all models and, in any case, omit mutations in long loop regions.
If you ran PROSS free of active-site constraints, such a high mutated fraction may indicate that you should rerun the target with constraints.
*Talaris2014 and Ref2015 are energy functions used by PROSS for the atomistic design steps. In your submission, you had to select one of them. The default is Talaris2014.
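The two troubleshooting checks on Design 9 can be summarized as a small triage function. This is a sketch only: the text gives ranges (<6-7%, >13-15%, >18%), so the single cut-off values below are assumptions picked from those ranges, and the function and constant names are hypothetical.

```python
# Assumed cut-offs (percent of sequence mutated in Design 9), chosen from
# the ranges quoted in Troubleshooting 1 and 2.
LOW_CUTOFF = 6.0
HIGH_CUTOFF = {"talaris2014": 15.0, "ref2015": 18.0}


def triage_design9(percent_mutated, energy_function="talaris2014"):
    """Classify the Design 9 mutation percentage."""
    if percent_mutated < LOW_CUTOFF:
        return "low: see Troubleshooting 1"
    if percent_mutated > HIGH_CUTOFF[energy_function]:
        return "high: see Troubleshooting 2"
    return "within the expected range"


print(triage_design9(5.0))
print(triage_design9(16.0))
print(triage_design9(16.0, energy_function="ref2015"))
```

Values that fall inside the quoted ranges (for example 6.5% or 14%) are borderline, so adjust the cut-offs to your own judgment.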
Troubleshooting 3: a low number of sequences in the multiple sequence alignment (<50)
PROSS generates a multiple sequence alignment (MSA) that is then used to guide the atomistic design. In standard cases, the MSA contains 200-1000 sequences. This number is reported as a message in the online results page. If the number of sequences in the MSA is <50, we suggest rerunning PROSS with alternative MSA parameters to increase the alignment diversity (details below). If parameter adjustment does not yield a significantly improved MSA, you may go ahead and test the designs, but note that they are considered less reliable. Specifically, if the number of sequences is <15-20, we recommend reverting to wild type any mutations from or to proline, cysteine, or tryptophan.
Proteins susceptible to alignment diversity problems are:
- Proteins for which an extremely high number of similar sequences are available in databases (for instance, proteins of popular viral targets). Read the help about the max_targets alignment parameter.
- Chimeric proteins. Read the help about the coverage alignment parameter.
- Proteins with a short evolutionary history. Read the help about the min_id alignment parameter.
- Heavily engineered proteins (not suited for design by PROSS).
Read also the point below "MSA figure: what useful information can I derive from it?"
Troubleshooting 4: a low number of sequences in a specific segment (<10)
As explained in Troubleshooting 3, PROSS generates an MSA for the complete protein. However, PROSS *does not* use all the sequence information at all target positions. Instead, PROSS divides the protein into structurally related segments and, for each segment, evaluates which sequences are similar enough to the target protein to guide design. As a result, while the MSA may include hundreds of sequences, in some regions PROSS may find most of them unsuitable to guide design, and hence mutations in these segments become less reliable. In the online results page, we report such segments under the alignment quality analysis title.
If one or a few of your segments are reported to have <10 homologous sequences, we recommend reverting the non-conservative mutations in them to wild type (e.g., mutations in the core, large-to-small mutations, and mutations from or to proline, cysteine, or tryptophan).
Keep in mind that predictions in these segments are less reliable, and consider reverting these segments completely to wild type if a first round of experimental validation shows bad or ambiguous results.
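Part of the reversion advice above can be applied mechanically. The Python sketch below flags mutations from or to proline, cysteine, or tryptophan that fall inside the reported low-coverage segments. It does not cover core or large-to-small mutations, which require structural inspection. The mutation string format ("A12P"), the segment boundaries, and all names are hypothetical; mutation positions and segment boundaries must use the same numbering.

```python
import re

# Residues singled out in the text: proline, cysteine, tryptophan.
RISKY_RESIDUES = set("PCW")


def flag_for_reversion(mutations, segments):
    """Return mutations from/to P, C or W that lie inside any segment.

    `mutations` are strings such as 'A12P'; `segments` are inclusive
    (start, end) ranges in the same residue numbering.
    """
    flagged = []
    for mutation in mutations:
        # Positions may be negative for segments at the termini.
        match = re.fullmatch(r"([A-Z])(-?\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation format: {mutation}")
        wt_res, pos, new_res = match.group(1), int(match.group(2)), match.group(3)
        in_segment = any(start <= pos <= end for start, end in segments)
        if in_segment and (wt_res in RISKY_RESIDUES or new_res in RISKY_RESIDUES):
            flagged.append(mutation)
    return flagged


# Hypothetical mutations and low-coverage segments.
mutations = ["A12P", "K45E", "W60F", "S80T", "C130S"]
low_coverage_segments = [(10, 20), (55, 65)]
print(flag_for_reversion(mutations, low_coverage_segments))
```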
A comment about the segment numbering - the numbers in the warnings are the same as in the PDB file. However, PROSS analysis also accounts for the N- and C-termini that often do not appear in the crystal structure. As a result, if your query had a warning about a segment at one of the termini, you may see numbers that do not appear in the PDB file including negative numbers.
MSA figure: what useful information can I derive from it?
The online results page and the downloaded results directory both contain a figure called msa_quality.png.
The right panel in the figure shows the sequence identity distribution of homologs in the alignment. Ideally, the homologs will be evenly distributed across the range from 99% to 35% sequence identity to wild type. However, if almost all the sequences share very high identity to the target (>60%), consider rerunning PROSS with a higher max_targets value (trials in steps of +2000 are reasonable). If instead there are few sequences and most have low sequence identity to the target (35%-40%), consider lowering the min_id value (in steps of about 3%).
The left panel shows the same for the distribution of homolog coverage (i.e., the length of the aligned segment between the target protein and the homolog sequence, as a % of target length). If you submitted a chimeric protein, you are likely to have problems related to coverage. Read more about the coverage parameter in the submission page.
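The identity-based rules of thumb above can be written as a small helper, assuming you have the per-homolog sequence identities (in %) available as a list. PROSS does not necessarily export these values, so this is purely illustrative. The thresholds for "almost all" (90%), "most" (50%), and "few sequences" (<50), as well as the starting parameter values in the example, are assumptions and not PROSS defaults.

```python
# Assumed thresholds; the text says only "almost all", "most" and "not too many".
ALMOST_ALL = 0.9
MOST = 0.5
FEW_SEQUENCES = 50


def suggest_msa_adjustment(identities, max_targets, min_id):
    """Suggest new max_targets/min_id values from homolog identities (in %)."""
    suggestion = {"max_targets": max_targets, "min_id": min_id}
    n = len(identities)
    if n == 0:
        return suggestion
    high_fraction = sum(i > 60.0 for i in identities) / n
    low_fraction = sum(35.0 <= i <= 40.0 for i in identities) / n
    if high_fraction >= ALMOST_ALL:
        # Alignment dominated by close homologs: allow more target sequences.
        suggestion["max_targets"] = max_targets + 2000
    elif n < FEW_SEQUENCES and low_fraction > MOST:
        # Few, distant homologs: relax the minimum identity cut-off.
        suggestion["min_id"] = min_id - 3
    return suggestion


# Hypothetical identity lists and starting parameter values.
close_homologs = [95, 90, 88, 72, 65, 61, 80, 77, 99, 70]
distant_homologs = [36, 38, 37, 39, 55, 36]
print(suggest_msa_adjustment(close_homologs, max_targets=3000, min_id=35))
print(suggest_msa_adjustment(distant_homologs, max_targets=3000, min_id=35))
```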
If at this point you are feeling lost, contact us.