Problem 47658. Calculate the similarity between DNA sequences
A DNA sequence contains only the letters A, C, G and T. (Each letter represents a small molecule, and a DNA sequence is a "macromolecular" chain of them.) Each letter in a DNA sequence is called a base, basepair, or nucleotide. Normally, DNA occurs as a double strand where each A is paired with a T and vice versa, and each C is paired with a G and vice versa. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the reverse complement of ACCTGAG is CTCAGGT. Rapid DNA sequencing machinery produces short strands of DNA, and these can be arbitrarily oriented, i.e. the DNA strand is 'read' from left to right or from right to left (depending on which side is 'grabbed' when reading starts), and it comes from either the forward or the reverse complement strand that was uncoiled to facilitate sequencing.
Here we are interested in computing the similarity between two strand sequences (called 'reads'). We define the similarity in terms of the length of the longest common subsequence, LCS, and the lengths of the two sequences, LS1 and LS2 respectively.
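The reverse complement rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the problem statement: the function name `reverse_complement` is a hypothetical choice, and the sketch assumes the input contains only the upper-case letters A, C, G and T.

```python
# Translation table that swaps A<->T and C<->G.
# Assumption: input uses only upper-case A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the letters."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACCTGAG"))  # CTCAGGT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.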
SimSeq(S1,S2) = 2*LCS / (LS1+LS2)
When S1 = S2 the similarity score is 1.0 (and likewise when S1 is equal to the reverse complement of S2).
When the sequences are dissimilar the similarity score approaches zero; it also approaches zero when one of the sequences is much longer than the other (alternative definitions that correct for the length difference are possible, but are of no concern here).
Note that the longest common subsequence LCS can be found in the forward or in the reverse complement direction. For instance, LCS('TACCTGAGA','GACCTGAGC') is equal to 7, since 'ACCTGAG' occurs in both S1 and S2 (here the match is in the forward direction).
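One possible way to put the definition together is sketched below in Python. It is a sketch under stated assumptions, not a reference solution: the names `reverse_complement`, `lcs_length` and `sim_seq` are hypothetical, LCS is taken to mean the classic longest common subsequence (letters need not be contiguous), the better of the forward and reverse complement directions is used, and two empty sequences are assumed to score 1.0.

```python
def reverse_complement(seq: str) -> str:
    """Swap A<->T and C<->G, then reverse (assumes upper-case ACGT only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming.

    Only the previous row of the DP table is kept in memory.
    """
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim_seq(s1: str, s2: str) -> float:
    """SimSeq(S1,S2) = 2*LCS / (LS1+LS2), using the better of both directions."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty reads are treated as identical
    forward = lcs_length(s1, s2)
    reverse = lcs_length(s1, reverse_complement(s2))
    return 2 * max(forward, reverse) / (len(s1) + len(s2))


print(lcs_length("TACCTGAGA", "GACCTGAGC"))  # 7
print(sim_seq("ACCTGAG", "CTCAGGT"))         # 1.0
```

For the example pair, the score is 2*7 / (9+9) = 14/18, roughly 0.78. The dynamic programme takes time proportional to LS1*LS2 per direction, which is reasonable for short reads.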
Problem Comments
2 Comments
Tim
on 1 Dec 2020
It would seem that the reverse complement of ACCTGAG should be CTCAGGT, not ACCTGAG.
Shlomo Geva
on 1 Dec 2020
Oops. That's correct Tim. It's a cut-and-paste bug...
The reverse complement of ACCTGAG is CTCAGGT.
Corrected now.