PROSPERous: an integrative tool for predicting protease cleavage sites

 
General

The PROSPERous server facilitates the in silico identification of cleavage sites of various proteases (see the full list below). It covers four major protease families, Aspartic (A), Cysteine (C), Metallo (M) and Serine (S), encompassing 75 individual proteases. A number of cleavage site scoring functions are provided, based on different cleavage site windows P4-Pn', where n = 1, 2, 3 or 4.

A complete analysis of a submitted substrate sequence involves the following steps:

  1. Input sequence(s): Submit the substrate sequence(s) in FASTA format.
  2. Select protease: Users need to specify a protease of interest in order to submit the sequence and predict potential cleavage sites of that protease.
  3. Select cleavage site P4-Pn': PROSPERous calculates scores at each position of the cleavage site window P4-Pn' (n = 1, 2, 3 or 4) based on the scoring functions. The choice of window therefore affects the prediction performance. Depending on the protease family of interest, the difference in prediction performance between window sizes varies between 1% and 4%.
  4. Select scoring function: Users need to specify one of seven scoring functions in order to make the prediction: Nearest Neighbor Similarity (NNS), Amino Acid Frequency (AAF), WebLogo-based Sequence conservation (WLS), BLOSUM62 Substitution Index (BSI), and their combinations AAF+NNS, WLS+BSI and NNS+WLS.
  5. Select top ranking results: PROSPERous gives users the option to list the top 1, 3, 5, 10 or 20 predicted results, which appear on the first result webpage.
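
The P4-Pn' window in step 3 can be made concrete with a short Python sketch. It only illustrates how candidate windows could be enumerated along a substrate sequence; the function name `p4_pn_windows` and its parameters are hypothetical and are not part of PROSPERous.

```python
def p4_pn_windows(sequence, n_prime=4, m=4):
    """Yield (p1_position, window) for every complete P4-Pn' window.

    The scissile bond lies between P1 and P1'. `p1_position` is the
    1-based position of the P1 residue. Each window holds m non-prime
    residues (P4..P1) followed by n_prime prime-side residues (P1'..Pn').
    """
    if n_prime not in (1, 2, 3, 4):
        raise ValueError("n_prime must be 1, 2, 3 or 4")
    # p1 is the 0-based index of the P1 residue; windows that would run
    # past either end of the sequence are skipped.
    for p1 in range(m - 1, len(sequence) - n_prime):
        yield p1 + 1, sequence[p1 - m + 1 : p1 + 1 + n_prime]


# Example: all P4-P1' windows of a short, made-up sequence.
for position, window in p4_pn_windows("MDCEVNNGSS", n_prime=1):
    print(position, window)
```

A larger `n_prime` yields longer windows and therefore fewer complete windows per sequence, since positions near the C-terminus no longer fit.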

 

Table 1. Statistics of the substrate datasets used to develop the PROSPERous server. All protease substrates were extracted from the MEROPS database (Rawlings et al., 2006; 2008). The substrate dataset of each protease can be downloaded by clicking the hyperlinked MEROPS ID of the corresponding protease in this table.

 

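Protease class | Protease (MEROPS ID) | Number of substrate sequences | Number of cleavage sites
Aspartic protease | pepsin A (A01.001) | 11 | 34
Aspartic protease | cathepsin D (A01.009) | 38 | 141
Aspartic protease | cathepsin E (A01.010) | 17 | 60
Aspartic protease | phytepsin (A01.020) | 5 | 22
Aspartic protease | nemepsin-2 (A01.068) | 4 | 123
Aspartic protease | HIV-1 retropepsin (A02.001) | 284 | 473
Cysteine protease | papain (C01.001) | 5 | 28
Cysteine protease | cathepsin L (C01.032) | 21 | 63
Cysteine protease | cathepsin L1 (Fasciola sp.) (C01.033) | 6 | 172
Cysteine protease | cathepsin S (C01.034) | 6 | 23
Cysteine protease | cathepsin K (C01.036) | 99 | 115
Cysteine protease | falcipain-2 (C01.046) | 3 | 120
Cysteine protease | cathepsin B (C01.060) | 22 | 45
Cysteine protease | falcipain-3 (C01.063) | 2 | 97
Cysteine protease | peptidase 1 (mite) (C01.073) | 7 | 20
Cysteine protease | cathepsin B-like peptidase, nematode (C01.101) | 4 | 43
Cysteine protease | calpain-1 (C02.001) | 45 | 87
Cysteine protease | calpain-2 (C02.002) | 38 | 125
Cysteine protease | caspase-1 (C14.001) | 44 | 54
Cysteine protease | caspase-3 (C14.003) | 304 | 426
Cysteine protease | caspase-7 (C14.004) | 81 | 96
Cysteine protease | caspase-6 (C14.005) | 64 | 174
Cysteine protease | caspase-8 (C14.009) | 43 | 61
Metallopeptidase | matrix metallopeptidase-1 (M10.001) | 28 | 59
Metallopeptidase | matrix metallopeptidase-8 (M10.002) | 22 | 76
Metallopeptidase | matrix metallopeptidase-2 (M10.003) | 705 | 1661
Metallopeptidase | matrix metallopeptidase-9 (M10.004) | 47 | 225
Metallopeptidase | matrix metallopeptidase-3 (M10.005) | 56 | 155
Metallopeptidase | matrix metallopeptidase-7 (M10.008) | 47 | 105
Metallopeptidase | matrix metallopeptidase-12 (M10.009) | 27 | 119
Metallopeptidase | matrix metallopeptidase-13 (M10.013) | 29 | 94
Metallopeptidase | membrane-type matrix metallopeptidase-1 (M10.014) | 38 | 116
Metallopeptidase | membrane-type matrix metallopeptidase-3 (M10.016) | 4 | 20
Metallopeptidase | matrix metallopeptidase-20 (M10.019) | 6 | 26
Metallopeptidase | membrane-type matrix metallopeptidase-6 (M10.024) | 6 | 40
Metallopeptidase | mirabilysin (M10.057) | 2 | 28
Metallopeptidase | meprin beta subunit (M12.004) | 13 | 32
Metallopeptidase | procollagen C-peptidase (M12.005) | 18 | 20
Metallopeptidase | ADAM10 peptidase (M12.210) | 12 | 20
Metallopeptidase | ADAM17 peptidase (M12.217) | 23 | 37
Metallopeptidase | ADAMTS4 peptidase (M12.221) | 17 | 57
Metallopeptidase | ADAMTS5 peptidase (M12.225) | 15 | 37
Metallopeptidase | insulysin (M16.002) | 12 | 56
Metallopeptidase | mitochondrial processing peptidase beta-subunit (M16.003) | 52 | 54
Metallopeptidase | eupitrilysin (M16.009) | 10 | 34
Metallopeptidase | aminopeptidase Ap1 (M28.002) | 9 | 38
Serine protease | chymotrypsin A (cattle-type) (S01.001) | 221 | 531
Serine protease | granzyme B (Homo sapiens-type) (S01.010) | 265 | 318
Serine protease | kallikrein-related peptidase 5 (S01.017) | 11 | 21
Serine protease | kallikrein-related peptidase 14 (S01.029) | 15 | 34
Serine protease | elastase-2 (S01.131) | 191 | 321
Serine protease | cathepsin G (S01.133) | 168 | 270
Serine protease | myeloblastin (S01.134) | 8 | 21
Serine protease | granzyme A (S01.135) | 210 | 261
Serine protease | granzyme B, rodent-type (S01.136) | 156 | 162
Serine protease | chymase (Homo sapiens-type) (S01.140) | 25 | 33
Serine protease | kallikrein-related peptidase 2 (S01.161) | 13 | 27
Serine protease | kallikrein-related peptidase 3 (S01.162) | 12 | 45
Serine protease | coagulation factor Xa (S01.216) | 14 | 27
Serine protease | thrombin (S01.217) | 91 | 113
Serine protease | plasmin (S01.233) | 45 | 100
Serine protease | glutamyl peptidase I (S01.269) | 512 | 959
Serine protease | HtrA2 peptidase (S01.278) | 18 | 55
Serine protease | subtilisin Carlsberg (S08.001) | 5 | 27
Serine protease | high alkaline protease (Alkaliphilus transvaalensis) (S08.028) | 1 | 24
Serine protease | peptidase K (S08.054) | 6 | 39
Serine protease | kexin (S08.070) | 37 | 58
Serine protease | furin (S08.071) | 78 | 90
Serine protease | proprotein convertase 1 (S08.072) | 30 | 61
Serine protease | proprotein convertase 2 (S08.073) | 24 | 45
Serine protease | cucumisin (S08.092) | 1 | 20
Serine protease | prolyl oligopeptidase (S09.001) | 14 | 22
Serine protease | signal peptidase I (S26.001) | 291 | 291
Serine protease | thylakoidal processing peptidase (S26.008) | 49 | 50
Serine protease | signalase (animal) 21 kDa component (S26.010) | 359 | 359

(P4-P4' sequence logo column: images not shown.)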

 

Detailed explanations of the individual fields of the input form are given below. Some fields contain default values.

 

Detailed explanations

Input sequence

Please input and submit the substrate sequence(s) in FASTA format. Owing to limited computational capacity, PROSPERous accepts a maximum of 100 substrate sequences per submission.

 

An example of two substrate sequences in FASTA format is shown below:

>P55957
MDCEVNNGSSLRDECITNLLVFGFLQSCSDNSFRRELDALGHELPVLAPQWEGYDELQTDGNRSSHSRLGRIEADSESQEDIIRNIARHLAQVGDSMDRSIPPGLVNGLALQLRNTSRSEEDRNRDLATALEQLLQAYPRDMEKEKTMLVLALLLAKKVASHTPSLLRDVFHTTVNFINQNLRTYVRSLARNGMD
>O75496
MNPSMKQKQEEIKENIKNSSVPRRTLKMIQPSASGSLVGRENELSAGLSKRKHRNDHLTSTTSSPGVIVPESSENKNLGGVTQESFDLMIKENPSSQYWKEVAEKRRKALYEALKENEKLHKEIEQKDNEIARLKKENKELAEVAEHVQYMAELIERLNGEPLDNFESLDNQEFDSEEETVEDSLVEDSEIGTCAEGTVSSSTDAKPCI
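
Because a single submission is limited to 100 sequences, a larger FASTA file has to be split into batches before it is submitted. The following Python sketch shows one way to do this. It covers only local parsing and batching, not the submission itself, and the names `read_fasta`, `batches` and `to_fasta` are hypothetical helpers rather than part of PROSPERous.

```python
from itertools import islice

MAX_PER_SUBMISSION = 100  # per-submission limit stated above


def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records, size=MAX_PER_SUBMISSION):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_fasta(batch):
    """Serialise a batch of records back to FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in batch)
```

Each batch returned by `batches` can then be converted back to FASTA text with `to_fasta` and submitted separately.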

 

When a large number of substrate sequences need to be analyzed, we recommend submitting them to the PROSPERous server automatically using the following source code rather than manually, which is considerably more efficient for large-scale prediction of cleavage sites.

Use commands like the following to automatically submit a large number of sequences to PROSPERous:

 

 

Select protease

Please select one of the proteases from the drop-down menu in order to submit your query sequence. PROSPERous can predict up to

 

 

The following are the advanced options with default values pre-specified.

Select cleavage site P4-Pn':

Users select the cleavage site window P4-Pn', where n = 1, 2, 3 or 4.

 

 

Cleavage site scoring functions:

Selecting an appropriate scoring function is critical to the performance of a prediction tool. PROSPERous uses a number of different scoring functions, including Nearest Neighbor Similarity (NNS), Amino Acid Frequency (AAF), WebLogo-based Sequence conservation (WLS), BLOSUM62 Substitution Index (BSI), and combinations of two scoring methods: AAF+NNS, WLS+BSI and NNS+WLS. These scoring functions are briefly described in the following section.

Nearest Neighbor Similarity (NNS):

This function evaluates the similarity between two peptide sequences A(m, n) and B(m, n), where m denotes the upstream (non-prime side) length and n the downstream (prime side) length, centered on the potential cleavage site of interest.
In this study, we set m = 4 and n = 1, 2, 3 or 4, i.e. P4-Pn' (corresponding to P4-P1', P4-P2', P4-P3' and P4-P4', respectively). The similarity between two peptides A(m, n) and B(m, n) is then defined as:

S(A, B) = sum(Score(A[i], B[i])), where i runs from 1 to m+n and Score(a, b) is the corresponding element of the BLOSUM62 substitution matrix.

A putative P(m, n) peptide is compared against all known cleavage sites (considered 'known neighbors') in the substrate dataset of a protease, and a substitution score is calculated for each comparison. The average of these substitution scores is the final prediction score for the putative site.

A higher prediction score indicates greater similarity between the predicted site and the known cleavage sites. Because this similarity-based score is averaged over all known cleavage sites, we call it the "Nearest Neighbor Similarity" (NNS) function. Potential cleavage sites can be predicted using this scoring function.
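
The NNS definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PROSPERous implementation: `nns_score` and `toy_score` are hypothetical names, the peptides are made up, and `toy_score` is a simple identity scorer standing in for a real BLOSUM62 lookup (which could, for instance, be loaded with Biopython's `substitution_matrices.load("BLOSUM62")`).

```python
def nns_score(candidate, known_sites, score):
    """Average similarity between `candidate` and every known cleavage site.

    S(A, B) = sum_i score(A[i], B[i]); the NNS score is the mean of
    S(candidate, B) over all B in `known_sites`.
    """
    if not known_sites:
        raise ValueError("at least one known cleavage site is required")
    total = 0
    for site in known_sites:
        if len(site) != len(candidate):
            raise ValueError("all peptides must share the same P4-Pn' length")
        total += sum(score(a, b) for a, b in zip(candidate, site))
    return total / len(known_sites)


def toy_score(a, b):
    """Stand-in for a BLOSUM62 lookup: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


# Made-up P4-P1' peptides, used only to exercise the function.
known = ["DEVDG", "DEVDS", "DQTDG"]
print(nns_score("DEVDA", known, toy_score))
```

Passing the substitution score as a function keeps the averaging logic independent of the matrix, so the toy scorer can be swapped for a BLOSUM62 lookup without changing `nns_score`.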

 

Amino acid frequency (AAF):

This is one of the most commonly used scoring methods for predicting the potential cleavage sites of a protease. Several previous tools use this scoring method, including PoPS (Boyd et al., 2005), CaSPredictor (Garay-Malpartida et al., 2005) and SitePrediction (Verspurten et al., 2009).
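
The exact AAF formula used by PROSPERous is not given here, so the following Python sketch shows only one common frequency-based formulation, chosen as an assumption: residue frequencies are tabulated per window position from the known cleavage sites, and a candidate is scored by summing the frequencies of its residues. The names `position_frequencies` and `aaf_score` and the example peptides are hypothetical.

```python
from collections import Counter


def position_frequencies(known_sites):
    """Return one {residue: relative frequency} dict per window position."""
    if not known_sites:
        raise ValueError("at least one known cleavage site is required")
    length = len(known_sites[0])
    if any(len(site) != length for site in known_sites):
        raise ValueError("all sites must have the same length")
    tables = []
    for i in range(length):
        counts = Counter(site[i] for site in known_sites)
        tables.append({aa: n / len(known_sites) for aa, n in counts.items()})
    return tables


def aaf_score(candidate, tables):
    """Sum, over positions, the observed frequency of the candidate residue.

    Assumed formulation for illustration; residues never seen at a
    position contribute 0.
    """
    if len(candidate) != len(tables):
        raise ValueError("candidate length must match the frequency tables")
    return sum(table.get(aa, 0.0) for aa, table in zip(candidate, tables))


# Made-up P4-P1' peptides, used only to exercise the functions.
tables = position_frequencies(["DEVDG", "DEVDS", "DQTDG"])
print(aaf_score("DEVDA", tables))
```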

 

 

BLOSUM62 Substitution Index (BSI):

The idea of applying the BLOSUM62 Substitution Matrix to predict cleavage sites has been previously described in (et al., 2005).

References:

 

Combination of two or more scoring functions:

PROSPERous further combines individual scoring functions in order to make more reliable and accurate predictions of cleavage sites. Three combinations are provided:

1. AAF+NNS

2. WLS+BSI

3. NNS+WLS

Prediction result analysis:

Analyze prediction scores
This option identifies which predicted sites are most likely to be genuine cleavage sites of the protease, as determined by a given score threshold.

Top 1, 3, 5, 10 and 20 ranking
Users can choose to display the top 1, 3, 5, 10 or 20 ranked predicted cleavage sites, which are shown on the first result webpage.
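
The two result options above, thresholding and top-ranked listing, amount to a filter followed by a sort. A minimal Python sketch, assuming predictions are available as (P1 position, window, score) tuples; the function `rank_sites` and the example values are hypothetical and do not reflect the server's output format.

```python
def rank_sites(scored_sites, top_k=5, threshold=None):
    """Return the `top_k` highest-scoring sites, optionally above `threshold`.

    `scored_sites` is an iterable of (p1_position, window, score) tuples.
    """
    if top_k not in (1, 3, 5, 10, 20):
        raise ValueError("top_k must be 1, 3, 5, 10 or 20")
    kept = [s for s in scored_sites if threshold is None or s[2] >= threshold]
    # Sort by score, highest first, and keep the first top_k entries.
    return sorted(kept, key=lambda s: s[2], reverse=True)[:top_k]


# Made-up predictions, used only to exercise the function.
predictions = [
    (10, "AAAAA", 0.2),
    (25, "DEVDG", 0.9),
    (40, "LLLLL", 0.5),
    (55, "DQTDS", 0.7),
]
for site in rank_sites(predictions, top_k=3, threshold=0.6):
    print(site)
```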
 

Performance evaluation of various scoring functions:

Provide performance comparison table here.

 

 

 

 

Other relevant tools for cleavage sites prediction

Predictor | Description | URL address | Features
PeptideCutter | Predicts potential cleavage sites cleaved by proteases or chemicals | http://au.expasy.org/tools/peptidecutter/ | Amino acid occurrence
PEPS | Rule-based endopeptidase cleavage site scoring matrices; predicts caspase-3 substrates | not available | Position-specific scoring matrices (PSSM)
CaSPredictor | Predicts caspase cleavage sites | http://icb.usp.br/~farmaco/Jose/CaSpredictorfiles | Amino acid substitution, amino acid frequency and the presence of 'PEST' sequences as input features
GraBCas | Position-specific scoring prediction of cleavage sites for caspases 1-9 and granzyme B | http://wwwalt.med-rz.uniklinik-saarland.de/med_fak/humangenetik/software/index.html | Improved position-specific scoring scheme
CASVM | SVM-based prediction of caspase cleavage sites | http://casbase.org/casvm/index.html | SVM classifiers trained with sequence segments containing the tetrapeptide cleavage sites
PoPS | Prediction of protease specificity | http://pops.csse.monash.edu.au/pops-cgi/ | Amino acid occurrence, position-specific scoring matrices (PSSM), predicted structure information such as substrate secondary structure and solvent accessibility
Cascleave | Predicts substrate cleavage sites of caspases | http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/Cascleave/ | Multiple sequential and structural features including local amino acid sequence profile, predicted solvent accessibility and native disorder features
SitePrediction | Predicts the cleavage of proteinase substrates | http://www.dmbr.ugent.be/prx/bioit2-public/SitePrediction/ | Amino acid occurrence, 'PEST' sequences, and structural information such as secondary structure and solvent accessibility

 

 

 

 

References:

Backes C, Kuentzer J, Lenhof HP, Comtesse N, Meese E. GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W208-213.

Barkan DT, Hostetter DR, Mahrus S, Pieper U, Wells JA, Craik CS, Sali A. Prediction of protease substrates using sequence and structure features. Bioinformatics. 2010 Jul 15;26(14):1714-1722.

Boyd SE, Pike RN, Rudy GB, Whisstock JC, Garcia de la Banda M. PoPS: a computational tool for modeling and predicting protease specificity. J Bioinform Comput Biol. 2005 Jun;3(3):551-585.

Chen CT, Yang EW, Hsu HJ, Sun YK, Hsu WL, Yang AS. Protease substrate site predictors derived from machine learning on multilevel substrate phage display data. Bioinformatics. 2008 Dec 1;24(23):2691-2697.

Garay-Malpartida HM, Occhiucci JM, Alves J, Belizário JE. CaSPredictor: a new computer-based tool for caspase substrate prediction. Bioinformatics. 2005 Jun;21 Suppl 1:i169-176.

Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. Protein identification and analysis tools on the ExPASy server. In: Walker JM (ed.), The Proteomics Protocols Handbook. Humana Press, Totowa, New Jersey; 2005. pp. 571-607.

Lohmüller T, Wenzler D, Hagemann S, Kiess W, Peters C, Dandekar T, Reinheckel T. Toward computer-based cleavage site prediction of cysteine endopeptidases. Biol Chem. 2003 Jun;384(6):899-909.

Piippo M, Lietzén N, Nevalainen OS, Salmi J, Nyman TA. Pripper: prediction of caspase cleavage sites from whole proteomes. BMC Bioinformatics. 2010 Jun 15;11:320.

Song J, Tan H, Boyd SE, Shen H, Mahmood K, Webb GI, Akutsu T, Whisstock JC, Pike RN. Bioinformatic approaches for predicting substrates of proteases. J Bioinform Comput Biol. 2011 Feb;9(1):149-178.

Song J, Tan H, Shen H, Mahmood K, Boyd SE, Webb GI, Akutsu T, Whisstock JC. Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics. 2010 Mar 15;26(6):752-760.

Verspurten J, Gevaert K, Declercq W, Vandenabeele P. SitePredicting the cleavage of proteinase substrates. Trends Biochem Sci. 2009 Jul;34(7):319-323.

Wee LJ, Tan TW, Ranganathan S. SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics. 2006 Dec 18;7 Suppl 5:S14.

Wee LJ, Tan TW, Ranganathan S. CASVM: web server for SVM-based prediction of caspase substrates cleavage sites. Bioinformatics. 2007 Dec 1;23(23):3241-3243.

Wee LJ, Tong JC, Tan TW, Ranganathan S. A multi-factor model for caspase degradome prediction. BMC Genomics. 2009 Dec 3;10 Suppl 3:S6.

Yang ZR. Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics. 2005 May 1;21(9):1831-1837.

Contact

If you have any queries about or suggestions to improve PROSPERous, please send Email to