Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence
Source:R/desc-15-PSSM.R
extractPSSM.Rd
This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides.
Usage
extractPSSM(
seq,
start.pos = 1L,
end.pos = nchar(seq),
psiblast.path = NULL,
makeblastdb.path = NULL,
database.path = NULL,
iter = 5,
silent = TRUE,
evalue = 10L,
word.size = NULL,
gapopen = NULL,
gapextend = NULL,
matrix = "BLOSUM62",
threshold = NULL,
seg = "no",
soft.masking = FALSE,
culling.limit = NULL,
best.hit.overhang = NULL,
best.hit.score.edge = NULL,
xdrop.ungap = NULL,
xdrop.gap = NULL,
xdrop.gap.final = NULL,
window.size = NULL,
gap.trigger = 22L,
num.threads = 1L,
pseudocount = 0L,
inclusion.ethresh = 0.002
)
Arguments
- seq
Character vector, as the input protein sequence.
- start.pos
Optional integer denoting the start position of the fragment window. Default is
1
, i.e. the first amino acid of the given sequence.- end.pos
Optional integer denoting the end position of the fragment window. Default is
nchar(seq)
, i.e. the last amino acid of the given sequence.- psiblast.path
Character string indicating the path of the
psiblast
program. If NCBI Blast+ was previously installed in the operation system, the path will be automatically detected.- makeblastdb.path
Character string indicating the path of the
makeblastdb
program. If NCBI Blast+ was previously installed in the system, the path will be automatically detected.- database.path
Character string indicating the path of a reference database (a FASTA file).
- iter
Number of iterations to perform for PSI-Blast.
- silent
Logical. Whether the PSI-Blast running output should be shown or not (May not work on some Windows versions and PSI-Blast versions), default is
TRUE
.- evalue
Expectation value (E) threshold for saving hits. Default is
10
.- word.size
Word size for wordfinder algorithm. An integer >= 2.
- gapopen
Integer. Cost to open a gap.
- gapextend
Integer. Cost to extend a gap.
- matrix
Character string. The scoring matrix name (default is
"BLOSUM62"
).- threshold
Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0.
- seg
Character string. Filter query sequence with SEG (
"yes"
,"window locut hicut"
, or"no"
to disable). Default is"no"
.- soft.masking
Logical. Apply filtering locations as soft masks? Default is
FALSE
.- culling.limit
An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with
best.hit.overhang
andbest_hit_score_edge
.- best.hit.overhang
Best Hit algorithm overhang value (A real value >= 0 and =< 0.5, recommended value: 0.1). Incompatible with
culling_limit
.- best.hit.score.edge
Best Hit algorithm score edge value (A real value >=0 and =< 0.5, recommended value: 0.1). Incompatible with
culling_limit
.- xdrop.ungap
X-dropoff value (in bits) for ungapped extensions.
- xdrop.gap
X-dropoff value (in bits) for preliminary gapped extensions.
- xdrop.gap.final
X-dropoff value (in bits) for final gapped alignment.
- window.size
An integer >= 0. Multiple hits window size, To specify 1-hit algorithm, use
0
.- gap.trigger
Number of bits to trigger gapping. Default is
22
.- num.threads
Integer. Number of threads (CPUs) to use in the BLAST search. Default is
1
.- pseudocount
Integer. Pseudo-count value used when constructing PSSM. Default is
0
.- inclusion.ethresh
E-value inclusion threshold for pairwise alignments. Default is
0.002
.
Value
The original PSSM, a numeric matrix which has
end.pos - start.pos + 1
columns and 20
named rows.
Details
For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.
Note
The function requires the makeblastdb
and psiblast
programs
to be properly installed in the operation system or
their paths provided.
The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+ for your operating system, see https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html for detailed instructions.
Ubuntu/Debian users can directly use the command
sudo apt-get install ncbi-blast+
to install NCBI Blast+.
For OS X users, download ncbi-blast- ... .dmg
then install.
For Windows users, download ncbi-blast- ... .exe
then install.
References
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
Author
Nan Xiao <https://nanx.me>
Examples
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") {
cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first")
} else {
x <- readFASTA(system.file(
"protseq/P00750.fasta",
package = "protr"
))[[1]]
dbpath <- tempfile("tempdb", fileext = ".fasta")
invisible(file.copy(from = system.file(
"protseq/Plasminogen.fasta",
package = "protr"
), to = dbpath))
pssmmat <- extractPSSM(seq = x, database.path = dbpath)
dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids)
}
#> Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first