
Parallel Protein Sequence Similarity Calculation Between Two Sets Based on Sequence Alignment (In-Memory Version)
Source:R/par-01-parSeqSim.R
crossSetSim.RdParallel calculation of protein sequence similarity based on sequence alignment between two sets of protein sequences.
Usage
crossSetSim(
protlist1,
protlist2,
type = "local",
cores = 2,
batches = 1,
verbose = FALSE,
submat = "BLOSUM62",
gap.opening = 10,
gap.extension = 4
)Arguments
- protlist1
A length
nlist containingnprotein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as"".- protlist2
A length
nlist containingmprotein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as"".- type
Type of alignment, default is
"local", can be"global"or"local", where"global"represents Needleman-Wunsch global alignment;"local"represents Smith-Waterman local alignment.- cores
Integer. The number of CPU cores to use for parallel execution, default is
2. Users can use theavailableCores()function in the parallelly package to see how many cores they could use.- batches
Integer. How many batches should we split the similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the similarities into a single batch. Defaults to 1.
- verbose
Print the computation progress? Useful when
batches > 1.- submat
Substitution matrix, default is
"BLOSUM62", can be one of"BLOSUM45","BLOSUM50","BLOSUM62","BLOSUM80","BLOSUM100","PAM30","PAM40","PAM70","PAM120", or"PAM250".- gap.opening
The cost required to open a gap of any length in the alignment. Defaults to 10.
- gap.extension
The cost to extend the length of an existing gap by 1. Defaults to 4.
Author
Sebastian Mueller <https://alva-genomics.com>
Examples
if (FALSE) { # \dontrun{
# Be careful when testing this since it involves parallelization
# and might produce unpredictable results in some environments
library("Biostrings")
library("foreach")
library("doParallel")
s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]
plist1 <- list(s1 = s1, s2 = s2, s4 = s4)
plist2 <- list(s3 = s3, s4_again = s4, s5 = s5, s1_again = s1)
psimmat <- crossSetSim(plist1, plist2)
colnames(psimmat) <- names(plist1)
rownames(psimmat) <- names(plist2)
print(psimmat)
# s1 s2 s4
# s3 0.10236985 0.18858241 0.05819984
# s4_again 0.04921696 0.12124217 1.00000000
# s5 0.03943488 0.06391103 0.05714638
# s1_again 1.00000000 0.11825938 0.04921696
} # }