PCRPi Webserver

INFORMATION and HELP


  • General information
  • How to use the server:
  • - Submit a job
    - Advanced parameters
    - Retrieve your prediction
    - Errors during prediction
  • Download benchmark and training datasets
  • Related Resources

  • GENERAL INFORMATION:

    PCRPi is a method for predicting critical residues (i.e. hot spot residues) in protein interfaces. PCRPi relies on the integration of seven measures or variables that account for energy, structural and evolutionary information by using Bayesian Networks. Figure 1 shows an overall description of the prediction process.

    prediction flowchart
    Figure 1. Overall description of the prediction process (click here to enlarge the figure).

    Interface residue attributes:
      1. Interaction engagement index (IE). The IE index gauges the proportion of side chain atoms (including the main-chain amino nitrogen and carboxyl oxygen) of a given interface residue that are engaged on atomic interactions with proteins in the complex. Non-bonded atomic interactions are described using CSU program [1]
      2. Topographical index (TOP). The TOP index estimates the structural micro-environment of a given interface residue and is calculated as the ratio between structurally neighbor residues and the average number of residues that the given residue type (e.g. Ala) interact with when located on a protein interface. Structurally neighbor residues are any residues not belonging to the same chain whose carbon alpha is enclosed in a sphere of 10 Angstroms of radius centered on the carbon alpha of the given residue. The average number of contacts by residue type is shown in table 1.
      chart
      Table 1. (a) Average; (b) Standard deviation (click here to enlarge the table).

      3. CON index. The CON index refers to the evolutionary conservation of residues that are in contact with a given interface residue. To analyze the conservation, sequence profiles are derived as described in our previous work [2]. In short, homologous sequences are culled from NR database of NCBI [3] using five iterations of PSI-BLAST [4] with an E-value of 0.0001. The homologous sequences are then filtered using ParseBlast [5] with default parameters to maximize the sequence sampling avoiding bias toward overrepresented protein families. The resulting sequence profile are given to al2co program [6] as an input and the al2co sequence conservation score is assigned to each individual residue. Finally, for a given interface residue i, the CON index is the ration between residues with an al2co_score greater or equal to 1.0 and the total number of residues in contact with residue i.
      4. ANCCON index. The ANCCON index refers to the raw al2co[6] sequence conservation score applied to each individual interface residue.
      5. 3DCON index. The 3DCON index is equivalent to the CON index but the al2co sequence conservation score is substituted by the 3D regional conservation scores as defined by Landgrad et al.[7], therefore this index quantifies structurally conserved patches.
      6. ANC3DCON index. The ANC3DCON index is equivalent to the ANCCON index but using the 3D regional conservation score as conservation measure.
      7. BE index. the BE index is equivalent to an in silico Alanine scanning. Each interface residue is mutated to Alanine and the effect of such mutation in the stability of the protein complex is estimated using FoldX[8]. Crystallographic waters, if any, are kept during the energy calculations. Therefore, BE reflects the difference in binding free energy between the unmutated (wild-type) and mutated complex.
    Bayesian Networks
      Each interface residue is characterized by 7 different measures (shown above) that are expressed in different units. In order to unify them into a common probabilistic framework, Bayesian Networks (BN). are employed. Three different Bayesian Networks architectures are used by the prediction method: (i) naïve; (ii) an Ab+ specific expert BN; and (iii) and Ab- specific expert BN (see Figure 1).
    Training datasets
      Experimentally validated data extracted from the AseDB, and the BID databases, and Kortemme and Baker’s [9] and Guerois et al.’s [8] works was compiled to generate the training datasets. Briefly, residues in protein complexes were labeled as critical or non-critical depending on effect of point mutations (Alanine scanning) on the stability of the protein complexes. If the mutation was increasing the binding energy in 2.0 Kcal.mol-1 or more the residue was annotated as critical or hot-spot residue, otherwise it was considered as non-critical. This information is used to train the BNs.

      Two different types of training set are considered: Ab+ and Ab- datasets. Ab+ is effectively the entire dataset (25 protein complexes, 636 interface residues, 300 of them experimentally annotated). Ab- dataset does not include non-evolutionary related protein complexes, i.e. Antigen-Antibody complexes. The reason of having two training datasets: Ab+ and Ab- is because some of the metrics used as input variables (ANCCON, ANC3DCON, CON, and 3DCON) rely on evolutionary information, i.e. sequence conservation, and in the case of (for instance) Antigen-Antibody complexes, this information is meaningless when referring to sequence conservation on CDRs (see Figure 1).
    More information is available in the paper describing the method.


    HOW TO USE THE SERVER?

    PCRPi Webserver has a user friendly and simple interface. In order to use the server you only need to provide either:
      The PDB code of the protein complex of interest (e.g. 2uzi) or upload the coordinates of the protein complex (PDB format only!) and provide the chain ID (as in the PDB file) of the protein in the complex that want to be analyzed. Alternatively, users can select the type of Bayesian Network to be use (i.e. naive or expert and the dataset used to train the system (i.e. Ab+ or Ab- dataset), see below ADVANCED PARAMETERS.

      As a result, users will get a list of interface residues sorted by the probability of being critical for the interaction (i.e. higher probability would mean that the residue is indeed important in the interaction and likely to be part of the hot-spot of the interaction). Probabilities are mapped onto the protein structure by substituting the 'b-factor' field in the PDB file by the probability multiplied by 100. This provides and easy and convenient way to visualize the predictions in molecular viewers. A Jmol applet is also implemented, so users can visualize the structure of the protein complex and prediction in their own internet browser. Links are provided to any of the files that are generated during the prediction process allowing users to store all the information in their local computers. A successful prediction will generated the following files:
        - Original PDB file (as in PDB Databank or as uploaded by users)
        - Remediated PDB file (PDB file where quality checks have been perfomed (see below)
        - B-factor substituted PDB file
        - A file listing the atomic interactions at the interface or selected chain
        - A file containing a list of predicted hot spot residues ranked by probability (plain text)
        - A log file
    Click here to access a sample output



    ADVANCED PARAMETERS


    By default PCRPi uses a naïve Bayesian network that is trained using protein complexes that includes non-evolutionary related complexes, e.g Antigen-Antibody (Ab+ dataset). On advanced options, the user can choose between naïve of expert Bayesian networks, and also the training dataset: Ab+ and Ab-. For more information, refer to the paper describing the method and to the general information shown above.



    RETRIEVE YOUR PREDICTION


    At the time of the submission an unique job identification code will be assigned to the task. The format of the job identification code is PCRPi_XXXXXXXX where XXXXXXXX is an unique string combination of letters and number. To retrieve the results or check the progress of the prediction process simply enter the full job identification code (including the PCRPi_ prefix) in the appropiate field at the submission web-site.



    ERRORS DURING PREDICTION


    There are some rare situations when PCRPi fails to deliver a prediction. The most common problem is when users submit (or select) a PDB containing a single chain or if more that one, there are no atomic interactions between chains. As PCRPi predicts critical residues on protein interfaces, the PDB *MUST* have at least two or more protein chains that are in contact (i.e. atomic interactions between residues in different chains). Also, keep in mind PCRPi filters protein chains by length; any chain shorter than 40 residues is discarded and not taken into consideration.

    There are some other cases that will lead to problems and errors during the prediction process and are the following ones:
      1. Fail during quality checks: A remediated PDB file is generated before running a prediction. The protein complex undergoes a set of quality checks that includes checking for missing atoms, romaters, and inserted residues. Being energy methods very sensitive to the quality of the crystal structure and missing atoms, this step is very important. If for some reason this step fails, the prediction process will stop.
      2. Method fails to find sequence homologous. Psi-Blast do not yield any sequences with significant E-value, and as some of the measures are evolutionary-based and require the construction of a sequence profile, the prediction process will abort.
      3. Other unlikely situations: Unable to connect to centralized DBs and/or queue system due to temporary network failure, machine shutdown, etc.
    In any case the prediction process is fully logged and log file is provided to users for their inspection.



    BENCHMARK and TRAINING SETS

      The sets of structures used to train and benchmark PCRPi are available here.



    RELATED RESOURCES

    REFERENCES


      [1] Sobolev, V., Sorokine, A., Prilusky, J., Abola, E.E. and Edelman, M. (1999) Automated analysis of interatomic contacts in proteins. Bioinformatics, 15, 327-332. CSU server
      [2] Fernandez-Fuentes, N., Rai, B.K., Madrid-Aliste, C.J., Fajardo, J.E. and Fiser, A. (2007) Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments. Bioinformatics, 23, 2558-2565. M4T server
      [3] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res, 35, D61-65. NCBI web-site
      [4] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389. Psi-blast web-server @ EBI
      [5] Rai, B.K., Madrid-Aliste, C.J., Fajardo, J.E. and Fiser, A. (2006) MMM: a sequence-to-structure alignment protocol. Bioinformatics, 22, 2691-2692. ParseBlast web-server
      [6] Pei, J. and Grishin, N. (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 17, 700-712. al2co web-server
      [7] Landgraf, R., Xenarios, I. and Eisenberg, D. (2001) Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J.Mol.Biol., 307, 1487.
      [8] Guerois, R., Nielsen, J.E. and Serrano, L. (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol, 320, 369-387. FoldX web-server
      [9] Kortemme, T. and Baker, D. (2002) A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci U S A, 99, 14116-14121. Robetta-Ala web-server