Parallel Protein Sequence Similarity Calculation Based on Sequence Alignment (In-Memory Version)
Source:R/par-01-parSeqSim.R
parSeqSim.Rd
Parallel calculation of protein sequence similarity based on sequence alignment.
Usage
parSeqSim(
protlist,
cores = 2,
batches = 1,
verbose = FALSE,
type = "local",
submat = "BLOSUM62",
gap.opening = 10,
gap.extension = 4
)
Arguments
- protlist
A length
n
list containingn
protein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as""
.- cores
Integer. The number of CPU cores to use for parallel execution, default is
2
. Users can use theavailableCores()
function in the parallelly package to see how many cores they could use.- batches
Integer. How many batches should we split the pairwise similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the pairwise similarities into a single batch. Defaults to 1.
- verbose
Print the computation progress? Useful when
batches > 1
.- type
Type of alignment, default is
"local"
, can be"global"
or"local"
, where"global"
represents Needleman-Wunsch global alignment;"local"
represents Smith-Waterman local alignment.- submat
Substitution matrix, default is
"BLOSUM62"
, can be one of"BLOSUM45"
,"BLOSUM50"
,"BLOSUM62"
,"BLOSUM80"
,"BLOSUM100"
,"PAM30"
,"PAM40"
,"PAM70"
,"PAM120"
, or"PAM250"
.- gap.opening
The cost required to open a gap of any length in the alignment. Defaults to 10.
- gap.extension
The cost to extend the length of an existing gap by 1. Defaults to 4.
See also
See parSeqSimDisk
for the disk-based version.
Author
Nan Xiao <https://nanx.me>
Examples
if (FALSE) { # \dontrun{
# Be careful when testing this since it involves parallelization
# and might produce unpredictable results in some environments
library("Biostrings")
library("foreach")
library("doParallel")
s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]
plist <- list(s1, s2, s3, s4, s5)
(psimmat <- parSeqSim(plist, cores = 2, type = "local", submat = "BLOSUM62"))
} # }