Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Google   


Longest Common Substring Problem

(classic problem)

Definition: Find the longest substring of two or more strings.

See also longest common subsequence, shortest common superstring.

Note: The longest common substring is contiguous, while the longest common subsequence need not be.

Author: PEB

Implementation

(C and Mathematica)

More information

Dan Hirschberg's pseudocode as an example of dynamic programming.


Go to the Dictionary of Algorithms and Data Structures home page.

If you have suggestions, corrections, or comments, please get in touch with Paul E. Black.

Entry modified 17 December 2004.
HTML page formatted Mon Sep 11 09:46:04 2006.

Cite this as:
Paul E. Black, "longest common substring", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 17 December 2004. (accessed TODAY) Available from: http://www.nist.gov/dads/HTML/longestCommonSubstring.html

 

News

Longest Common Substring

  

  figure22785

 

Input description: A set S of strings tex2html_wrap_inline31069 .

Problem description: What is the longest string S' such that for each tex2html_wrap_inline31073 , tex2html_wrap_inline31075 , the characters of S appear as a subsequence of tex2html_wrap_inline31077 ?

Discussion: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.     

The longest common substring problem is a special case of edit distance (see Section gif), when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from t to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.   

Issues arising include:

Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.

Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section gif.    

Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].

Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in tex2html_wrap_inline31137 for constant-sized alphabets, using the four Russians technique.  

 

Related Problems: Approximate string matching (see page gif), shortest common superstring (see page gif).  

 

Recommended Links

Longest common subsequence problem - Wikipedia, the free encyclopedia

Longest common substring problem - Wikipedia, the free encyclopedia

1.7.8 Longest Common Substring

Longest Common Subsequences


Copyright © 1996-2008 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

 

Last modified: February 28, 2008