|
Softpanorama |
May the source be with you, but remember the KISS principle ;-)
Softpanorama Search
|
| Softpanorama: Algorithms and Data Structures | Recommended Books | Recommended Links | Pattern Matching | Searching Algorithms | Regular expressions | Humor | Etc |
The longest common substring problem is to find the longest string (or strings) that is a substring (or are substrings) of two or more strings. It should not be confused with the longest common subsequence problem. (For an explanation of the difference between a substring and a subsequence, see Substring vs. subsequence).
For example, The longest common substrings of the strings "ABAB", "BABA" and "ABBA" are the strings "AB" and "BA" of length 2. Other common substrings are "A", "B" and the empty string "".
ABAB ||| BABA || ABBA
A good lecture notes by Professor David Eppstein (Longest Common Subsequences, dated 09 Feb 2002,) is available on the Web:
In this lecture we examine another string matching problem, of finding the longest common subsequence of two strings.This is a good example of the technique of dynamic programming, which is the following very simple idea: start with a recursive algorithm for the problem, which may be inefficient because it calls itself repeatedly on a small number of subproblems. Simply remember the solution to each subproblem the first time you compute it, then after that look it up instead of recomputing it. The overall time bound then becomes (typically) proportional to the number of distinct subproblems rather than the larger number of recursive calls. We already saw this idea briefly in the first lecture.
As we'll see, there are two ways of doing dynamic programming, top down and bottom-up. The top down (memoizing) method is closer to the original recursive algorithm, so easier to understand, but the bottom up method is usually a little more efficient.
Excerpt from The Algorithm Design Manual: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.
The longest common substring problem is a special case of edit distance, when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from $t$ to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.
http://www.nist.gov/dads/HTML/longestCommonSubstring.html
See also longest common subsequence, shortest common superstring.
Note: The longest common substring is contiguous, while the longest common subsequence need not be.
Dan Hirschberg's pseudocode as an example of dynamic programming.
|
|||||||
This short article describes an extension of the Longest Common Sub-string (LCS) problem with maximal consecutive. Personally I used this algorithm in computing the gloss overlaps (Lesk algorithm) for finding the number of shared common words between two strings (it means gloss here).Some applications of the LCS problem include:
- Searching for proteins with similar DNA sequences: DNA sequences (genes) can be represented as sequences of four letters ACGT, corresponding to the four submolecules forming DNA. When biologists find a new sequence, they typically want to know what other sequences it is most similar to. One way of computing how similar two sequences are is to find the length of their longest common subsequence.
- Approximate string matching like spell checkers.
- File comparison (used in diff, CVS, source comparer, cheater finder, plagiarism detection).
Preparing the ground
The original LCS problem
Let us start with a simple approach, given are two strings: short string (sentence) and long string (text), as in the string matching problem. You want to know if all the words of the sentence appear in order (but possibly separated) in the text. If they do, we say that the sentence is a subsequence of the text. That is the "longest common" is the subsequence of "the longest shared common between two strings". If they do not occur in the text, it still makes sense to find the longest subsequence that occurs both in the sentence and in the text. This is the longest common subsequence problem:
- Given two strings/sequences X and Y. Each string is composed by a list of words or abbreviations. We denote m to be the length of X and n to be the length of Y.
- Z is called as common subsequence, if it is subsequence of both X and Y.
- The requirement is finding the common subsequence which has the maximum length among all the possible common subsequences of X and Y.
Basic solution
How would we solve this problem? Can we find the overlapping sub-problems such that the optimal solution to the LCS problem consists of optimal solutions to the sub-problems?
Dynamic programming gives us a way of making the solution more efficient but the idea does not tell us how to find this:
- Think of "optimal substructure'' property (break into optimal subproblems).
- Think of a recursive solution.
There are two ways to implement dynamic programming: first is top-down (memoization) and second is bottom-up. The top down method is closer to the original recursive algorithm, the bottom up method is usually a little more efficient.
Optimal substructure
Let
and
be sequences (xi is from prefix to i), and let
be any LCS of X and Y.
- If
, then
and
are LCS of
and
.
- If
, then
implies that Z is an LCS of
and Y.
- If
, then
implies that Z is an LCS of X and
.
Recursive Formulation
The major idea of the dynamic programming solution is to check whenever we want to solve a sub-problem, whether we've already done it before. Each time we need the solution to a sub-problem, we first look up if that one's already in the array table, and return right away if it is. Otherwise we will perform the computation and store the result.
We define an array to store the sub-problem results of the previous step: C[i, j] = k to be the length of the LCS of X[1.. i] and Y[1.. j], where 1 <= i <= m and 1 <= j <= n.
Question is: Would the optimal solution for C[i, j] consist of optimal solutions for C[i', j'], where i' <= i and j' <= j?
There is one thing we have to worry about, and that is when we fill in a cell C[i, j], we need to already know the values it depends on, which in this case are C[i-1, j-1], C[i , j-1], and C[i-1, j]. Due to this reason we'll traverse the array forwards, from the first row working up to the last and from the first column working up to the last.
I am looking for a module to solve (or appoximately solve) the longest common substring problem.
I found two modules on CPAN: String-LCSS-0.10 and String-Ediff-0.01.... ... ...
Hi, all. Thanks for the help. After spending some time with Algorithm::Diff, I've realized that Alg::Diff's LCS(Longest Common SubSequence) is not the same as String-LCSS (Longest Common Substring). Just wanted to share that (perhaps obvious) observation. LCSS does what I need, and is working fine for me...thanks!
Input description: A set S of strings
.
Problem description: What is the longest string S' such that for each
,
, the characters of S appear as a subsequence of
?
Discussion: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.
The longest common substring problem is a special case of edit distance (see Section
), when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from t to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.
Issues arising include:
- Are you looking for a common substring or scattered subsequence? - In detecting plagiarism or attempting to identify the authors of anonymous works, we might need to find the longest phrases shared between several documents. Since phrases are strings of consecutive characters, we need the longest common substring between the texts.
The longest common substring of a set of strings can be identified in linear time using suffix trees, discussed in Section
. The trick is to build a suffix tree containing all the strings, label each leaf with the set of strings that contain it, and then do a depth-first traversal to identify the deepest node that has descendents from each input string.
For the rest of this discussion, we will restrict attention to finding common scattered subsequences. Dynamic programming can be used to find the longest common subsequence of two strings, S and T, of n and m characters each. This algorithm is a special case of the edit-distance computation of Section
. Let M[i,j] denote the number of characters in the longest common substring of
and
. In general, if
, there is no way the last pair of characters could match, so
. If S[i] = T[j], we have the option to select this character for our substring, so
. This gives a recurrence that computes M, and thus finds the length of the longest common subsequence in O(nm) time. We can reconstruct the actual common substring by walking backward from M[n,m] and establishing which characters were actually matched along the way.
- What if there are relatively few sets of matching characters? - For strings that do not contain too many copies of the same character, there is a faster algorithm. Let r be the number of pairs of positions (i,j) such that
. Thus r can be as large as mn if both strings consist entirely of the same character, but r = n if the two strings are permutations of
. This technique treats the pairs of r as defining points in the plane.
The complete set of r such points can be found in O(n + m + r) time by bucketing techniques. For each string, we create a bucket for each letter of the alphabet and then partition all of its characters into the appropriate buckets. For each letter c of the alphabet, create a point (,t) from every pair
and
, where
and
are buckets for c.
A common substring represents a path through these points that moves only up and to the right, never down or to the left. Given these points, the longest such path can be found in
time. We will sort the points in order of increasing x-coordinate (breaking ties in favor of increasing y-coordinate. We will insert these points one by one in this order, and for each k,
, maintain the minimum y-coordinate of any path going through exactly k points. Inserting a new point will change exactly one of these paths by reducing the y-coordinate of the path whose last point is barely greater than the new point.
- What if the strings are permutations? - If the strings are permutations, then there are exactly n pairs of matching characters, and the above algorithm runs in
time. A particularly important case of this occurs in finding the longest increasing subsequence of a sequence of numbers. Sorting this sequence and then replacing each number by its rank in the total order gives us a permutation p. The longest common subsequence of p and
gives the longest increasing subsequence.
- What if we have more than two strings to align? - The basic dynamic programming algorithm can be generalized to k strings, taking
time, where n is the length of the longest string. This algorithm is exponential in the number of strings k, and so it will likely be too expensive for more than 3 to 4 strings. Further, the problem is NP-complete, so no better exact algorithm is destined to come along.
This problem of multiple sequence alignment has received considerable attention, and numerous heuristics have been proposed. Many heuristics begin by computing the pairwise alignment between each of the
pairs of strings, and then work to merge these alignments. One approach is to build a graph with a vertex for each character of each string. There will be an edge between
and
if the corresponding characters are matched in the alignment between S and T. Any k-clique (see Section
) in this graph describes a commonly aligned character, and all such cliques can be found efficiently because of the sparse structure of this graph.
Although these cliques will define a common subsequence, there is no reason to believe that it will be the longest such substring. Appropriately weakening the clique requirement provides a way to increase it, but still there can be no promises.
Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.
Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section
.
Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].
Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in
for constant-sized alphabets, using the four Russians technique.
Related Problems: Approximate string matching (see page
), shortest common superstring (see page
).
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Disclaimer:
Last modified: August 15, 2009