Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Softpanorama Search

Longest Common Substring Problem

News

Softpanorama: Algorithms and Data Structures Recommended Books Recommended Links Pattern Matching Searching Algorithms Regular expressions Humor Etc

The longest common substring problem is to find the longest string (or strings) that is a substring (or are substrings) of two or more strings. It should not be confused with the longest common subsequence problem. (For an explanation of the difference between a substring and a subsequence, see Substring vs. subsequence).

For example, The longest common substrings of the strings "ABAB", "BABA" and "ABBA" are the strings "AB" and "BA" of length 2. Other common substrings are "A", "B" and the empty string "".

 ABAB
  |||
  BABA
  ||
ABBA

A good lecture notes by Professor David Eppstein (Longest Common Subsequences, dated 09 Feb 2002,) is available on the Web:

In this lecture we examine another string matching problem, of finding the longest common subsequence of two strings.

This is a good example of the technique of dynamic programming, which is the following very simple idea: start with a recursive algorithm for the problem, which may be inefficient because it calls itself repeatedly on a small number of subproblems. Simply remember the solution to each subproblem the first time you compute it, then after that look it up instead of recomputing it. The overall time bound then becomes (typically) proportional to the number of distinct subproblems rather than the larger number of recursive calls. We already saw this idea briefly in the first lecture.

As we'll see, there are two ways of doing dynamic programming, top down and bottom-up. The top down (memoizing) method is closer to the original recursive algorithm, so easier to understand, but the bottom up method is usually a little more efficient.

Excerpt from The Algorithm Design Manual: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.

The longest common substring problem is a special case of edit distance, when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from $t$ to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.

http://www.nist.gov/dads/HTML/longestCommonSubstring.html

See also longest common subsequence, shortest common superstring.

Note: The longest common substring is contiguous, while the longest common subsequence need not be.

Dan Hirschberg's pseudocode as an example of dynamic programming.


Notes:
  • This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • The site contain some broken links as it develops like a living tree... Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate if you can mail us a correct link.
Google Search
Open directory

Research Index


Old News ;-)

CodeProject The Longest Common Substring with Maximal Consecutive. By Thanh Dao

This short article describes an extension of the Longest Common Sub-string (LCS) problem with maximal consecutive. Personally I used this algorithm in computing the gloss overlaps (Lesk algorithm) for finding the number of shared common words between two strings (it means gloss here).

Some applications of the LCS problem include:

Preparing the ground

The original LCS problem

Let us start with a simple approach, given are two strings: short string (sentence) and long string (text), as in the string matching problem. You want to know if all the words of the sentence appear in order (but possibly separated) in the text. If they do, we say that the sentence is a subsequence of the text. That is the "longest common" is the subsequence of "the longest shared common between two strings". If they do not occur in the text, it still makes sense to find the longest subsequence that occurs both in the sentence and in the text. This is the longest common subsequence problem:

Basic solution

How would we solve this problem? Can we find the overlapping sub-problems such that the optimal solution to the LCS problem consists of optimal solutions to the sub-problems?

Dynamic programming gives us a way of making the solution more efficient but the idea does not tell us how to find this:

There are two ways to implement dynamic programming: first is top-down (memoization) and second is bottom-up. The top down method is closer to the original recursive algorithm, the bottom up method is usually a little more efficient.

Optimal substructure

Let tex2html_wrap_inline219 and tex2html_wrap_inline253 be sequences (xi is from prefix to i), and let tex2html_wrap_inline255 be any LCS of X and Y.

Recursive Formulation

The major idea of the dynamic programming solution is to check whenever we want to solve a sub-problem, whether we've already done it before. Each time we need the solution to a sub-problem, we first look up if that one's already in the array table, and return right away if it is. Otherwise we will perform the computation and store the result.

We define an array to store the sub-problem results of the previous step: C[i, j] = k to be the length of the LCS of X[1.. i] and Y[1.. j], where 1 <= i <= m and 1 <= j <= n.

Question is: Would the optimal solution for C[i, j] consist of optimal solutions for C[i', j'], where i' <= i and j' <= j?

displaymath297

There is one thing we have to worry about, and that is when we fill in a cell C[i, j], we need to already know the values it depends on, which in this case are C[i-1, j-1], C[i , j-1], and C[i-1, j]. Due to this reason we'll traverse the array forwards, from the first row working up to the last and from the first column working up to the last.

Longest Common Substring

I am looking for a module to solve (or appoximately solve) the longest common substring problem.

I found two modules on CPAN: String-LCSS-0.10 and String-Ediff-0.01.

... ... ...

Hi, all. Thanks for the help. After spending some time with Algorithm::Diff, I've realized that Alg::Diff's LCS(Longest Common SubSequence) is not the same as String-LCSS (Longest Common Substring). Just wanted to share that (perhaps obvious) observation. LCSS does what I need, and is working fine for me...thanks!

Longest Common Substring

   

  figure22785

Input description: A set S of strings tex2html_wrap_inline31069 .

Problem description: What is the longest string S' such that for each tex2html_wrap_inline31073 , tex2html_wrap_inline31075 , the characters of S appear as a subsequence of tex2html_wrap_inline31077 ?

Discussion: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.      

The longest common substring problem is a special case of edit distance (see Section gif), when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from t to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.    

Issues arising include:

Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.

Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section gif.    

Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].

Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in tex2html_wrap_inline31137 for constant-sized alphabets, using the four Russians technique.  

Related Problems: Approximate string matching (see page gif), shortest common superstring (see page gif).  

Recommended Links



Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Disclaimer:

Last modified: August 15, 2009