Parallelized Protein Sequence Similarity Calculation Based on Sequence Alignment (Disk-Based Version)
Source: R/par-01-parSeqSim.R
Parallelized calculation of protein sequence similarity based on sequence alignment. This version caches the partial results of each batch on the hard drive and merges them at the end, which reduces memory usage.
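The batch-then-merge idea described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `toy_sim()`, `batched_sim_disk()`, and the cache file names are hypothetical, and the toy score (a Jaccard index over residue sets) merely stands in for an alignment-based similarity. The sketch assumes at least two sequences and no more batches than sequence pairs.

```r
# Hypothetical stand-in for an alignment score (NOT the method used by protr):
# Jaccard index over the sets of residues occurring in each sequence.
toy_sim <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])
  y <- unique(strsplit(b, "")[[1]])
  length(intersect(x, y)) / length(union(x, y))
}

# Hypothetical helper: compute pairwise scores batch by batch,
# cache every batch on disk, then merge the cached files.
batched_sim_disk <- function(seqs, batches = 2, path = tempdir()) {
  n <- length(seqs)
  pairs <- t(combn(n, 2)) # all index pairs with i < j, one pair per row
  # Assign each pair to one of `batches` roughly equal-sized groups
  grp <- ceiling(seq_len(nrow(pairs)) * batches / nrow(pairs))
  files <- file.path(path, sprintf("toy-batch-%03d.rds", seq_len(batches)))

  # Only one batch of scores is held in memory at a time
  for (b in seq_len(batches)) {
    idx <- pairs[grp == b, , drop = FALSE]
    score <- vapply(
      seq_len(nrow(idx)),
      function(k) toy_sim(seqs[[idx[k, 1]]], seqs[[idx[k, 2]]]),
      numeric(1)
    )
    saveRDS(cbind(idx, score), files[b])
  }

  # Merge: fill a symmetric matrix from the cached batches
  simmat <- diag(n) # self-similarity is 1 for this toy score
  for (f in files) {
    res <- readRDS(f)
    simmat[res[, 1:2, drop = FALSE]] <- res[, 3]
    simmat[res[, 2:1, drop = FALSE]] <- res[, 3]
  }
  unlink(files) # remove the cache files
  simmat
}

seqs <- list("ACDE", "ACDF", "GHIK")
batched_sim_disk(seqs, batches = 3)
```

Splitting the work into more batches lowers the peak memory needed for intermediate results at the cost of extra disk reads and writes; the merged matrix is the same regardless of the number of batches.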
Usage
parSeqSimDisk(
  protlist,
  cores = 2,
  batches = 1,
  path = tempdir(),
  verbose = FALSE,
  type = "local",
  submat = "BLOSUM62",
  gap.opening = 10,
  gap.extension = 4
)
Arguments
- protlist
A length n list containing n protein sequences. Each component of the list is a character string storing one protein sequence. Unknown sequences should be represented as "".
- cores
Integer. The number of CPU cores to use for parallel execution. Defaults to 2. Use the detectCores() function in the parallel package to see how many cores are available.
- batches
Integer. The number of batches to split the pairwise similarity computations into. This is useful when you have a large number of protein sequences and enough CPU cores, but not enough RAM to compute and hold all the pairwise similarities in a single batch. Defaults to 1.
- path
Directory for caching the results in each batch. Defaults to the temporary directory.
- verbose
Logical. Whether to print the computation progress. Defaults to FALSE.
- type
Type of alignment. Can be "global" or "local", where "global" represents Needleman-Wunsch global alignment and "local" represents Smith-Waterman local alignment. Defaults to "local".
- submat
Substitution matrix. Can be one of "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM100", "PAM30", "PAM40", "PAM70", "PAM120", or "PAM250". Defaults to "BLOSUM62".
- gap.opening
The cost required to open a gap of any length in the alignment. Defaults to 10.
- gap.extension
The cost to extend the length of an existing gap by 1. Defaults to 4.
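As the cores argument notes, detectCores() from the parallel package reports how many cores are available. The following is a small sketch of one way to pick a value for cores; leaving one core free for the rest of the system is an illustrative assumption here, not a recommendation from the package.

```r
library("parallel")

# detectCores() can return NA when the number of cores cannot be determined
available <- detectCores()

# Assumed policy for illustration: keep one core free, but never go below 1
cores <- if (is.na(available)) 1L else max(1L, available - 1L)
cores
```

The resulting value can then be passed as the cores argument of parSeqSimDisk().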
See also
See parSeqSim for the in-memory version.
Author
Nan Xiao <https://nanx.me>
Examples
if (FALSE) {
  # Be careful when testing this since it involves parallelisation
  # and might produce unpredictable results in some environments

  library("protr") # provides readFASTA() and parSeqSimDisk()
  library("Biostrings")
  library("foreach")
  library("doParallel")

  # Read five example protein sequences shipped with protr
  s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
  s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
  s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
  s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
  s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]

  # Build a list of 100 sequences by sampling the five with replacement
  set.seed(1010)
  plist <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)])

  # Compute pairwise similarities in 10 batches cached on disk
  psimmat <- parSeqSimDisk(
    plist,
    cores = 2, batches = 10, verbose = TRUE,
    type = "local", submat = "BLOSUM62"
  )
}