HELP


PLANT SSP PREDICTION TOOL INSTRUCTIONS


The Plant SSP Prediction tool is an integrative pipeline that identifies Small Secreted Peptides (SSPs) in plants by combining several predictions: protein length; presence and location of signal peptide cleavage sites (by SignalP analysis); homology with previously identified SSPs and SSP families; and detection of transmembrane helices. Users can detect three types of SSPs (Known, Likely Known, and Putative SSPs) based on the criteria explained below.


PREDICTION INFORMATION AND CRITERIA

SSPs are predicted from several criteria: peptide length; the D-score estimated by the SignalP 4.1 server; homology analyses; and transmembrane helix prediction. The D-score is a combined value from the signal peptide and cleavage site prediction networks, and SignalP 4.1 uses it to discriminate signal peptides from non-signal peptides. According to SignalP, for eukaryotes a D-score > 0.45-0.5 indicates a signal peptide. SignalP analyses are run with the SignalP server, and transmembrane helices are predicted with the TMHMM server.

Homology analyses are performed with two different algorithms and databases: (1) an HMM search using the HMMER tool with 4,818 HMM profiles from 36 known Medicago truncatula (Mt) SSP families identified by de Bang et al., 2017, SSP families from the PlantSSP database based on five plant species identified by Ghorbani et al., 2015, the IMA family described in Grillet et al., 2018, and the PROSCOOP family described in Gully et al., 2019; and (2) a Smith-Waterman search using the SSearch tool against 3,348 Known SSP genes, including SSPs previously identified by PlantSSPdb, 132 IMA SSPs from Grillet et al., 2018, 14 PROSCOOP SSPs from Gully et al., 2019, and 1,974 SSPs identified by our group.

Our tool predicts three types of Small Secreted Peptides: Known, Likely Known, and Putative SSPs. The criteria used to identify Known and Putative SSPs are described in more detail in de Bang et al., 2017. Additionally, SSPs with significant homology to previously identified SSPs and a small protein size are classified as Likely Known SSPs, but these should be checked carefully.


SSP Type | Criteria
Known SSPs | significant homologies with known SSPs (considering HMM and Smith-Waterman searches; e-values ≤ 0.01); small protein length (≤ 200 aa); and SignalP D-score > 0.25
Likely Known SSPs | significant homologies with known SSPs (considering HMM and Smith-Waterman searches; e-values ≤ 0.01); and small protein length (≤ 250 aa)
Putative SSPs | no significant homologies with known SSPs or hit with one type of homology; small protein length (≤ 230 aa); SignalP D-score > 0.45; and no presence of TM domains

NOTES: Our prediction is an estimate of SSPs based on several parameters, and TM predictions can vary depending on the tool used. To be classified as a Known or Likely Known SSP, a sequence must have significant hits (e-values ≤ 0.01) in both homology analyses, the HMM and the Smith-Waterman searches. We recommend that predicted SSPs be further confirmed by expression evidence.
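The criteria in the table and notes above can be read as a simple decision rule. The following Python sketch is only an illustration of that reading and is not the tool's actual code: the function name `classify_ssp`, the order in which the rules are tested, the use of `None` for a missing value, and the label "Likely Known SSP" are assumptions made for this example.

```python
from typing import Optional

E_VALUE_CUTOFF = 0.01  # significance cutoff quoted in the criteria table


def classify_ssp(
    length: int,
    d_score: float,
    hmm_evalue: Optional[float],
    sw_evalue: Optional[float],
    tm_domains: Optional[int],
) -> str:
    """Apply the documented thresholds to one protein (hypothetical helper)."""
    hmm_hit = hmm_evalue is not None and hmm_evalue <= E_VALUE_CUTOFF
    sw_hit = sw_evalue is not None and sw_evalue <= E_VALUE_CUTOFF
    # Known and Likely Known SSPs need significant hits in BOTH searches.
    both_hits = hmm_hit and sw_hit

    if both_hits and length <= 200 and d_score > 0.25:
        return "Known SSP"
    if both_hits and length <= 250:
        return "Likely Known SSP"
    # Putative SSPs have no hit, or a hit in only one of the two searches.
    # A missing TM prediction (None) does not count as "no TM domains".
    if not both_hits and length <= 230 and d_score > 0.45 and tm_domains == 0:
        return "Putative SSP"
    return "Non SSP"


# Values taken from the output example in this help page.
print(classify_ssp(164, 0.725, 6.4e-68, 2.9e-77, 0))
print(classify_ssp(66, 0.118, None, None, None))
print(classify_ssp(65, 0.593, None, None, 0))
```

With the three proteins from the output example, this sketch gives the same labels as the tool reports for them (Known SSP, Non SSP, and Putative SSP).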



SPECIES AVAILABLE

This tool is suited to the analysis of protein sequences from multiple plant species, since its reference sequences and HMMs are built from sequences of the 58 plant species listed below in alphabetical order (with the number of sequences from each):

Aegilops tauschii (n = 1), Alloispermum scabrifolium (n = 1), Amborella trichopoda (n = 6), Arabidopsis lyrata (n = 7), Arabidopsis thaliana (n = 1,196), Arabis alpina (n = 4), Brassica napus (n = 4), Brassica rapa (n = 9), Cannabis sativa (n = 12), Cicer arietinum (n = 1), Citrus clementina (n = 3), Crocus sativus (n = 2), Cucumis melo (n = 1), Erythranthe guttata (n = 1), Espeletia schultzii (n = 1), Galinsoga quadriradiata (n = 1), Glycine max (n = 22), Gossypium hirsutum (n = 1), Helianthus mollis (n = 1), Helianthus nuttallii (n = 2), Helianthus porteri (n = 1), Helianthus praecox (n = 1), Helianthus schweinitzii (n = 3), Heliopsis helianthoides (n = 3), Impatiens balsamina (n = 1), Iostephane heterophylla (n = 2), Jatropha curcas (n = 2), Kingianthus paniculatus (n = 1), Lotus japonicus (n = 2), Medicago truncatula (n = 1,989), Monactis holwayae (n = 1), Morus notabilis (n = 3), Nicotiana attenuata (n = 1), Nicotiana tabacum (n = 3), Oldenlandia affinis (n = 1), Oryza sativa (n = 2), Otopappus epaleaceus (n = 1), Perymenium jelskii (n = 1), Perymenium macrocephalum (n = 1), Phaseolus vulgaris (n = 3), Philactis zinnioides (n = 1), Pisum sativum (n = 1), Populus trichocarpa (n = 1), Ricinus communis (n = 3), Sabazia liebmannii (n = 2), Sabazia sarmentosa (n = 1), Solanum lycopersicum (n = 9), Solanum tuberosum (n = 13), Stellaria media (n = 1), Tetrachyron orizabensis (n = 1), Theobroma cacao (n = 1), Tilesia baccata (n = 1), Tithonia rotundifolia (n = 1), Triticum urartu (n = 3), Viguiera phenax (n = 1), Vitis vinifera (n = 1), Wamalchitamia aurantiaca (n = 1), and Zea mays (n = 7).



PREPARING INPUT FILE

The tool analyses protein sequences in FASTA format. Please make sure that every input sequence is a protein sequence and has a unique ID.


Input Example:

>Medtr2g010580
TIKKITMNMILAIFFICSTLSCMNISLAQNSPQDFLEVHNQARDEVGVGPLYWEQTLEAYAQNYANKRIKNCE
LEHSMGPYGENLAEGYGEVNGTDSVKFWLSEKPNYDYNSNSCVNDECGHYTQIIWRDSVHLGCAKSKCK
NGWVFVICSYSPPGNVEGERPY
>MT35Noble_004404
MQGYKPISNTYRYDKVHKSESFGFNLSLFDSLIKRWITREVFAALRTKEYHTSSYTRKQENITIHL
>Medtr0001s0100
MGLILLQILFIQELFLPTLQVLKLVKLSPKVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
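To check an input file before submitting it, the FASTA text can be read locally. The sketch below is a minimal reader written for this help page, not part of the tool: the function name `read_fasta` is hypothetical, and taking the first whitespace-delimited word of the header as the ID is an assumption. It joins sequences that are wrapped over several lines, as in the first record of the input example, and reports the protein lengths.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Return {ID: sequence}; wrapped sequence lines are joined (hypothetical helper)."""
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an ID")
            # Assumption: the ID is the first word after '>'.
            current_id = fields[0]
            if current_id in records:
                raise ValueError(f"duplicate ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current_id] += line
    return records


EXAMPLE = """>Medtr2g010580
TIKKITMNMILAIFFICSTLSCMNISLAQNSPQDFLEVHNQARDEVGVGPLYWEQTLEAYAQNYANKRIKNCE
LEHSMGPYGENLAEGYGEVNGTDSVKFWLSEKPNYDYNSNSCVNDECGHYTQIIWRDSVHLGCAKSKCK
NGWVFVICSYSPPGNVEGERPY
>MT35Noble_004404
MQGYKPISNTYRYDKVHKSESFGFNLSLFDSLIKRWITREVFAALRTKEYHTSSYTRKQENITIHL
>Medtr0001s0100
MGLILLQILFIQELFLPTLQVLKLVKLSPKVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
"""

lengths = {seq_id: len(seq) for seq_id, seq in read_fasta(EXAMPLE).items()}
for seq_id, length in lengths.items():
    print(seq_id, length)
```

For the three example sequences, the lengths are 164, 66, and 65 aa, which agree with the Protein length column of the output example.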


HOW TO RUN IT

Paste the protein sequence(s) or select your FASTA file. Then press "Upload & Submit".

A thousand protein sequences will take about 1 minute to run.



OUTPUT FILE

The output is displayed as a table and can also be downloaded as a CSV file.

SSP families starting with "c" are from the PlantSSPdb (cluster number).


Output example:

ID | Protein length | Signal D-score | Homology with SSP HMM profiles | Homology e-value | Homology with known SSPs | Homology e-value | TM domains | SSP prediction result
Medtr2g010580 | 164 | 0.725 | CAPE | 6.4e-68 | CAPE_CAPE1_Mt | 2.9e-77 | 0 | Known SSP
MT35Noble_004404 | 66 | 0.118 | - | - | - | - | - | Non SSP
Medtr0001s0100 | 65 | 0.593 | - | - | - | - | 0 | Putative SSP
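The downloaded CSV file can be filtered with a few lines of Python. The sketch below assumes that the CSV has a header row and the same column order as the table above; the exact layout of the downloaded file is not documented here, so this is an assumption, as are the function name `predicted_ssps` and the sample text `SAMPLE_CSV`. Columns are read by position because the header "Homology e-value" appears twice.

```python
import csv
import io
from typing import List, Tuple

# Assumed CSV layout: same columns and order as the output example table.
SAMPLE_CSV = """ID,Protein length,Signal D-score,Homology with SSP HMM profiles,Homology e-value,Homology with known SSPs,Homology e-value,TM domains,SSP prediction result
Medtr2g010580,164,0.725,CAPE,6.4e-68,CAPE_CAPE1_Mt,2.9e-77,0,Known SSP
MT35Noble_004404,66,0.118,-,-,-,-,-,Non SSP
Medtr0001s0100,65,0.593,-,-,-,-,0,Putative SSP
"""


def predicted_ssps(csv_text: str) -> List[Tuple[str, str]]:
    """Return (ID, prediction) for every row not labelled 'Non SSP'."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    hits = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        protein_id, result = row[0], row[-1]
        if result != "Non SSP":
            hits.append((protein_id, result))
    return hits


for protein_id, result in predicted_ssps(SAMPLE_CSV):
    print(protein_id, result)
```

For a file saved to disk, the same function can be applied to the text returned by reading the file, for example `predicted_ssps(open(path, newline="").read())`, where `path` is the location of the downloaded file.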


ERROR MESSAGES

The tool can return three types of error messages:

  • "Input file is not FASTA". FASTA format is required.

  • "Input file does not have protein sequences". Protein sequences are required, and nucleotide sequences are not accepted.

  • "The query list contains identical protein IDs". Unique protein IDs are required, and duplicate IDs are not accepted.
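The three conditions above can be checked locally before submission. The sketch below is an approximation written for this help page and does not reproduce the tool's internal checks: the function name `check_input`, the order of the checks, and the rule that a sequence made only of the letters A, C, G, T, U, and N is treated as nucleotide are all assumptions.

```python
from typing import Optional

# Assumption: a sequence using only these letters is treated as nucleotide.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def check_input(text: str) -> Optional[str]:
    """Return one of the three error messages, or None if no problem is found."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return "Input file is not FASTA"

    ids, sequences = [], []
    for line in lines:
        if line.startswith(">"):
            ids.append(line[1:].strip())
            sequences.append("")
        else:
            # The first line is a header, so sequences is never empty here.
            sequences[-1] += line.upper()

    if any(not seq or set(seq) <= NUCLEOTIDE_LETTERS for seq in sequences):
        return "Input file does not have protein sequences"
    if len(set(ids)) != len(ids):
        return "The query list contains identical protein IDs"
    return None


print(check_input("MKTLLV"))
print(check_input(">gene1\nACGTACGTTTGA"))
print(check_input(">p1\nMKLV\n>p1\nMPLW"))
print(check_input(">p1\nMKLV\n>p2\nMPLW"))
```

The nucleotide rule is a heuristic: a very short peptide made only of A, C, G, and T residues would be flagged by this sketch even though it is a valid protein sequence.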



HOW TO CITE

The criteria used to identify Known and Putative SSPs are described in de Bang et al., 2017.



HELP

Scientific or technical issues: Clarissa Boschiero or Xinbin Dai.

More information about the SSP project: Patrick Zhao or Wolf Scheible.