Exercise: Day 3 - Advanced Data Structures and Recursion

Developing a Smith-Waterman Aligner for two sequences

Generate some sample DNA data

#DNA example
seq_1 <- c("A","C","A","C","A","C","T","A")
seq_2 <- c("A","G","C","A","C","A","C","A")

seq_1

## [1] "A" "C" "A" "C" "A" "C" "T" "A"

seq_2

## [1] "A" "G" "C" "A" "C" "A" "C" "A"

Create a scoring matrix for nucleotides

#Score +1 for match, 0 for no matach
dist_nucleotides <- diag(4)
rownames(dist_nucleotides) <- c("A","C","T","G")
colnames(dist_nucleotides) <- c("A","C","T","G")
dist_identity_nucleotides <- dist_nucleotides
dist_identity_nucleotides

##   A C T G
## A 1 0 0 0
## C 0 1 0 0
## T 0 0 1 0
## G 0 0 0 1

Create a pairwise matrix using the sample sequences

s <- dist_nucleotides[seq_2,seq_1]
colnames(s) <- c(seq_1)
rownames(s) <- c(seq_2)
s

##   A C A C A C T A
## A 1 0 1 0 1 0 0 1
## G 0 0 0 0 0 0 0 0
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1

Constructing the core aligner

A. Use nested loops to populate a nucleotide similarity matrix where matches are +2 and non-matches are -1. Name the rows and columns.

##    A  C  T  G
## A  2 -1 -1 -1
## C -1  2 -1 -1
## T -1 -1  2 -1
## G -1 -1 -1  2

B. Using the distance matrix from (A.), create a pairwise distance matrix based on the sequences to align.

##    A  C  A  C  A  C  T  A
## A  2 -1  2 -1  2 -1 -1  2
## G -1 -1 -1 -1 -1 -1 -1 -1
## C -1  2 -1  2 -1  2 -1 -1
## A  2 -1  2 -1  2 -1 -1  2
## C -1  2 -1  2 -1  2 -1 -1
## A  2 -1  2 -1  2 -1 -1  2
## C -1  2 -1  2 -1  2 -1 -1
## A  2 -1  2 -1  2 -1 -1  2

C. Populate an empty matrix of zeroes using the dimensions of each sequence plus one additional cell to pad the upper-left corner, as shown.

##     A C A C A C T A
##   0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0

D. Fill the matrix from (C.) with the maximum similarity scores by way of nested loops to implement the Smith-Waterman algorithm.

##     A C A C  A  C  T  A
##   0 0 0 0 0  0  0  0  0
## A 0 2 1 2 1  2  1  0  2
## G 0 1 1 1 1  1  1  0  1
## C 0 0 3 2 3  2  3  2  1
## A 0 2 2 5 4  5  4  3  4
## C 0 1 4 4 7  6  7  6  5
## A 0 2 3 6 6  9  8  7  8
## C 0 1 4 5 8  8 11 10  9
## A 0 2 3 6 7 10 10 10 12

E. What is the optimal alignment score for the two sample sequences? What position in the maximum similarity-score matrix holds the highest value?

## [1] 12

## A 
## 9

## A 
## 9

Recursion

A. Implement a recursive Smith-Waterman algorithm to align the sample DNA sequences. Fill the maximum similarity scores matrix with a function that calls itself, recursively.

#F is the maximum similarity scores matrix, reset to be filled with all zeroes prior to this function call.
#sw_align is the recursive function, starting in the lower-right of F
sw_align(nrow(F),ncol(F))
F

##     A C A C  A  C  T  A
##   0 0 0 0 0  0  0  0  0
## A 0 2 1 2 1  2  1  0  2
## G 0 1 1 1 1  1  1  0  1
## C 0 0 3 2 3  2  3  2  1
## A 0 2 2 5 4  5  4  3  4
## C 0 1 4 4 7  6  7  6  5
## A 0 2 3 6 6  9  8  7  8
## C 0 1 4 5 8  8 11 10  9
## A 0 2 3 6 7 10 10 10 12

B. Time how long the recursive function takes. Find a way to speed it up and show the new completion time. Hint: have the function print which indices it is currently computing for a clue why it may be taking a long time.

#recursive function, sw_align
time_start = proc.time()
sw_align(nrow(F),ncol(F))
proc.time() - time_start

##    user  system elapsed 
##   5.153   0.010   5.166

#start inside the matrix to show example with added print statement
sw_align(2,2)

## Working on i= 2  and j= 2 
## Working on i= 1  and j= 1 
## Working on i= 0  and j= 0 
## Working on i= 0  and j= 1 
## Working on i= 1  and j= 0 
## Working on i= 1  and j= 2 
## Working on i= 0  and j= 1 
## Working on i= 0  and j= 2 
## Working on i= 1  and j= 1 
## Working on i= 0  and j= 0 
## Working on i= 0  and j= 1 
## Working on i= 1  and j= 0 
## Working on i= 2  and j= 1 
## Working on i= 1  and j= 0 
## Working on i= 1  and j= 1 
## Working on i= 0  and j= 0 
## Working on i= 0  and j= 1 
## Working on i= 1  and j= 0 
## Working on i= 2  and j= 0

#new function with improvement, sw_align_fast
time_start = proc.time()
sw_align_fast(nrow(F),ncol(F))
proc.time() - time_start

##    user  system elapsed 
##   0.005   0.000   0.006

Extending the algorithm

A. Modify the algorithm from (D., above; Constructing the core aligner) to store information about the direction of each cell’s most optimal neighbor(left, up, or upper-left).

# In this example the values correspond to:
# 0 or 1 no direction
# 2 upper-left; diagonal; match
# 3 up; deletion 
# 4 left; insertion

##     A C A C A C T A
##   0 0 0 0 0 0 0 0 0
## A 0 2 4 2 4 2 4 1 2
## G 0 3 2 3 2 3 2 1 3
## C 0 1 2 4 2 4 2 4 4
## A 0 2 3 2 4 2 4 4 2
## C 0 3 2 3 2 4 2 4 4
## A 0 2 3 2 3 2 4 4 2
## C 0 3 2 3 2 3 2 4 4
## A 0 2 3 2 3 2 3 2 2

B. Print the alignment according the optimal path.

## A-CACACTA
## AGCACAC-A