Skip to contents

The ssw package offers an R interface for SSW described in Zhao et al. (2013), a fast implementation of the Smith-Waterman algorithm for sequence alignment using SIMD. The package is currently built on ssw-py.

A short read sequence:

read <- "ACGT"

Exact alignment

Align the read against the reference sequence (TTTTACGTCCCCC) and print the results:

read %>% align("TTTTACGTCCCCC")
CIGAR start index 4: 4M
optimal_score: 8    sub-optimal_score: 0    
target_begin: 4 target_end: 7   
query_begin: 0  query_end: 3

Target:        4    ACGT    7
                    ||||
Query:         0    ACGT    3

ACGT

Get specific results, such as the alignment scores:

a <- read %>% align("TTTTACGTCCCCC")
a$alignment$optimal_score
[1] 8
a$alignment$sub_optimal_score
[1] 0

Deletion

read %>% align("TTTTACAGTCCCCC")
CIGAR start index 4: 2M1D2M
optimal_score: 5    sub-optimal_score: 0    
target_begin: 4 target_end: 8   
query_begin: 0  query_end: 3

Target:        4    ACAGT    8
                    ||*||
Query:         0    AC-GT    3

ACGT

Insertion with gap open

read %>% align("TTTTACTCCCCC", gap_open = 3)
CIGAR start index 4: 2M
optimal_score: 4    sub-optimal_score: 0    
target_begin: 4 target_end: 5   
query_begin: 0  query_end: 1

Target:        4    AC    5
                    ||
Query:         0    AC    1

ACGT

Insertion with no gap open penalty

read %>% align("TTTTACTCCCCC", gap_open = 0)
CIGAR start index 4: 2M1I1M
optimal_score: 6    sub-optimal_score: 0    
target_begin: 4 target_end: 6   
query_begin: 0  query_end: 3

Target:        4    AC-T    6
                    ||*|
Query:         0    ACGT    3

ACGT

Specify start index

a <- align("ACTG", "ACTCACTG", start_idx = 4)
a
CIGAR start index 0: 4M
optimal_score: 8    sub-optimal_score: 0    
target_begin: 0 target_end: 3   
query_begin: 0  query_end: 3

Target:        0    ACTC    3
                    |||*
Query:         0    ACTG    3

ACTG

Print the results from position 4:

a %>% print(start_idx = 4)
CIGAR start index 0: 4M
optimal_score: 8    sub-optimal_score: 0    
target_begin: 0 target_end: 3   
query_begin: 0  query_end: 3

Target:        0    ACTG    3
                    ||||
Query:         0    ACTG    3

ACTG

Forced alignment

Enforce no gaps by increasing the penalty:

a <- force_align("ACTG", "TTTTCTGCCCCCACG")
a

The results are truncated:

CIGAR start index 4: 3M
optimal_score: 6    sub-optimal_score: 0    
target_begin: 4 target_end: 6   
query_begin: 1  query_end: 3

Target:        4    CTG    6
                    |||
Query:         1    CTG    3

ACTG

Use formatter() to avoid the truncation:

b <- a %>% formatter()
b
[[1]]
[1] "TTTTCTGCCCCCACG"

[[2]]
[1] "   ACTG"

Or pretty-print the formatted results directly:

a %>% formatter(print = TRUE)
TTTTCTGCCCCCACG
   ACTG

References

Zhao, Mengyao, Wan-Ping Lee, Erik P Garrison, and Gabor T Marth. 2013. “SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications.” PloS ONE 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.