Exercise: Day 3 - Advanced Data Structures and Recursion
Developing a Smith-Waterman Aligner for two sequences
Generate some sample DNA data
#DNA example
seq_1 <- c("A","C","A","C","A","C","T","A")
seq_2 <- c("A","G","C","A","C","A","C","A")
seq_1
## [1] "A" "C" "A" "C" "A" "C" "T" "A"
seq_2
## [1] "A" "G" "C" "A" "C" "A" "C" "A"
Create a scoring matrix for nucleotides
#Score +1 for match, 0 for no matach
dist_nucleotides <- diag(4)
rownames(dist_nucleotides) <- c("A","C","T","G")
colnames(dist_nucleotides) <- c("A","C","T","G")
dist_identity_nucleotides <- dist_nucleotides
dist_identity_nucleotides
## A C T G
## A 1 0 0 0
## C 0 1 0 0
## T 0 0 1 0
## G 0 0 0 1
Create a pairwise matrix using the sample sequences
s <- dist_nucleotides[seq_2,seq_1]
colnames(s) <- c(seq_1)
rownames(s) <- c(seq_2)
s
## A C A C A C T A
## A 1 0 1 0 1 0 0 1
## G 0 0 0 0 0 0 0 0
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1
## C 0 1 0 1 0 1 0 0
## A 1 0 1 0 1 0 0 1
Constructing the core aligner
A. Use nested loops to populate a nucleotide similarity matrix where matches are +2 and non-matches are -1. Name the rows and columns.
## A C T G
## A 2 -1 -1 -1
## C -1 2 -1 -1
## T -1 -1 2 -1
## G -1 -1 -1 2
B. Using the distance matrix from (A.), create a pairwise distance matrix based on the sequences to align.
## A C A C A C T A
## A 2 -1 2 -1 2 -1 -1 2
## G -1 -1 -1 -1 -1 -1 -1 -1
## C -1 2 -1 2 -1 2 -1 -1
## A 2 -1 2 -1 2 -1 -1 2
## C -1 2 -1 2 -1 2 -1 -1
## A 2 -1 2 -1 2 -1 -1 2
## C -1 2 -1 2 -1 2 -1 -1
## A 2 -1 2 -1 2 -1 -1 2
C. Populate an empty matrix of zeroes using the dimensions of each sequence plus one additional cell to pad the upper-left corner, as shown.
## A C A C A C T A
## 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
## C 0 0 0 0 0 0 0 0 0
## A 0 0 0 0 0 0 0 0 0
D. Fill the matrix from (C.) with the maximum similarity scores by way of nested loops to implement the Smith-Waterman algorithm.
## A C A C A C T A
## 0 0 0 0 0 0 0 0 0
## A 0 2 1 2 1 2 1 0 2
## G 0 1 1 1 1 1 1 0 1
## C 0 0 3 2 3 2 3 2 1
## A 0 2 2 5 4 5 4 3 4
## C 0 1 4 4 7 6 7 6 5
## A 0 2 3 6 6 9 8 7 8
## C 0 1 4 5 8 8 11 10 9
## A 0 2 3 6 7 10 10 10 12
E. What is the optimal alignment score for the two sample sequences? What position in the maximum similarity-score matrix holds the highest value?
## [1] 12
## A
## 9
## A
## 9
Recursion
A. Implement a recursive Smith-Waterman algorithm to align the sample DNA sequences. Fill the maximum similarity scores matrix with a function that calls itself, recursively.
#F is the maximum similarity scores matrix, reset to be filled with all zeroes prior to this function call.
#sw_align is the recursive function, starting in the lower-right of F
sw_align(nrow(F),ncol(F))
F
## A C A C A C T A
## 0 0 0 0 0 0 0 0 0
## A 0 2 1 2 1 2 1 0 2
## G 0 1 1 1 1 1 1 0 1
## C 0 0 3 2 3 2 3 2 1
## A 0 2 2 5 4 5 4 3 4
## C 0 1 4 4 7 6 7 6 5
## A 0 2 3 6 6 9 8 7 8
## C 0 1 4 5 8 8 11 10 9
## A 0 2 3 6 7 10 10 10 12
B. Time how long the recursive function takes. Find a way to speed it up and show the new completion time. Hint: have the function print which indices it is currently computing for a clue why it may be taking a long time.
#recursive function, sw_align
time_start = proc.time()
sw_align(nrow(F),ncol(F))
proc.time() - time_start
## user system elapsed
## 5.153 0.010 5.166
#start inside the matrix to show example with added print statement
sw_align(2,2)
## Working on i= 2 and j= 2
## Working on i= 1 and j= 1
## Working on i= 0 and j= 0
## Working on i= 0 and j= 1
## Working on i= 1 and j= 0
## Working on i= 1 and j= 2
## Working on i= 0 and j= 1
## Working on i= 0 and j= 2
## Working on i= 1 and j= 1
## Working on i= 0 and j= 0
## Working on i= 0 and j= 1
## Working on i= 1 and j= 0
## Working on i= 2 and j= 1
## Working on i= 1 and j= 0
## Working on i= 1 and j= 1
## Working on i= 0 and j= 0
## Working on i= 0 and j= 1
## Working on i= 1 and j= 0
## Working on i= 2 and j= 0
#new function with improvement, sw_align_fast
time_start = proc.time()
sw_align_fast(nrow(F),ncol(F))
proc.time() - time_start
## user system elapsed
## 0.005 0.000 0.006
Extending the algorithm
A. Modify the algorithm from (D., above; Constructing the core aligner) to store information about the direction of each cell’s most optimal neighbor(left, up, or upper-left).
# In this example the values correspond to:
# 0 or 1 no direction
# 2 upper-left; diagonal; match
# 3 up; deletion
# 4 left; insertion
## A C A C A C T A
## 0 0 0 0 0 0 0 0 0
## A 0 2 4 2 4 2 4 1 2
## G 0 3 2 3 2 3 2 1 3
## C 0 1 2 4 2 4 2 4 4
## A 0 2 3 2 4 2 4 4 2
## C 0 3 2 3 2 4 2 4 4
## A 0 2 3 2 3 2 4 4 2
## C 0 3 2 3 2 3 2 4 4
## A 0 2 3 2 3 2 3 2 2
B. Print the alignment according the optimal path.
## A-CACACTA
## AGCACAC-A