Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Softpanorama Search

Searching Algorithms

News

See Also Recommended Books Recommended Links Lecture Notes and E-books Donald Knuth TAOCP Volume3 Animations
Linear Search Binary Search Hashing Trees AVL trees Fibonacci search Humor Etc

Searching algorithms are closely related to the concept of dictionaries. Dictionaries are data structures that support search, insert, and delete operations. One of the most effective representations is a hash table. Typically, a simple function is applied to the key to determine its place in the dictionary. Another efficient search algorithms on sorted tables is binary search.

If the dictionary is not sorted then heuristic methods of dynamic reorganization of the dictionary are of great value. One of the simplest are cache-based methods: several recently used keys are stored a special data structure that permits fast search (for example is always sorted). For example keeping the last N recently found values at the top of the table (or list) dramatically improved performance. Other cache-based approach are also possible. In the simplest form the cache can be merged with the dictionary:

In the paper On self-organizing sequential search heuristics, Communications of the ACM, v.19 n.2, p.63-67, Feb. 1976 , R.L. Rivest presents a set of methods for dynamically reordering a sequential list containing N records in order to increase search efficiency. The method Ai (for i between 1 and N) performs the following operation each time that a record R has been successfully retrieved: Move R forward i positions in the list, or to the front of the list if it was in a position less than i. The method A1 is called the transposition method, and the method AN-1 is called the move-to-front method.  See also J. H. Hester , D. S. Hirschberg, Self-organizing linear search, ACM Computing Surveys (CSUR), v.17 n.3, p.295-311, Sept. 1985

Here is a good description of the problem area given by Alison Cawsey  in 1998 (alison-ds98):

Motivation

As mentioned above, string searching algorithms are important in all sorts of applications that we meet everyday. In text editors, we might want to search through a very large document (say, a million characters) for the occurence of a given string (maybe dozens of characters). In text retrieval tools, we might potentially want to search through thousands of such documents (though normally these files would be indexed, making this unnecessary). Other applications might require string matching algorithms as part of a more complex algorithm (e.g., the Unix program ``diff'' that works out the differences between two simiar text files).

Sometimes we might want to search in binary strings (ie, sequences of 0s and 1s). For example the ``pbm'' graphics format is based on sequences of 1s and 0s. We could express a task like ``find a wide white stripe in the image'' as a string searching problem.

In all these applications the naive algorithm (that you might first think of) is rather inefficient. There are algorithms that are only a little more complex which give a very substantial increase in efficiency. In this section we'll first introduce the naive algorithm, then two increasingly sophisticated algorithms that give gains in efficiency. We'll end up discussing how properties of your string searching problem might influence choice of algorithm and average case efficiency, and how you might avoid having to search at all!

A Naive Algorithm

The simplest algorithm can be written in a few lines:

Procedure NaiveSearch(s1, s2: String): Integer;
{Returns an index of S2, corresponding to first match of S1 with S2, or}
{ -1 if there is no match}
var i, j: integer;
match: boolean;
begin
   i:=1; match := FALSE;
   while match = FALSE and i <= length(s1) do 
     begin
       j := 1;
       {Try matching from start to end of s2, quitting when no match}
       while GetChar(s1, i) = GetChar(s2, j) and (j <= length(s2)) do
          begin
           j := j+1;           {Increment both pointers while it matches}
           i := i+1
          end;
       match := j>length(s2);       {There's a match if you got to the end of s2}
       i:= i-j+2                    {move the pointer to the first string back to
                                    {just after where you left off}
     end;
   if match
      then NaiveSearch := i-1       {If it's a match, return index of S1}
                                    {corresponding to start of match
      else NaiveSearch := -1
end;

We'll illustrate the algorithms by considering a search of string 'abc' (s2) in document 'ababcab' (s1). The following diagram should be fairly self-explanatory (the îndicates where the matching characters are checked each round):

'ababcab'               i=1,j=1 
'abc'
 ^                      matches: increment i and j

'ababcab'               i=2.j=2
'abc'
  ^                     matches: increment i and j

'ababcab'               i=3,j=3
'abc'   
   ^                    match fails: exit inner loop, i=i-j+2 

'ababcab'               i=2, j=1        
 'abc'                  match fails, exit inner loop, i=i-j+2
  ^

'ababcab'               i=3, j=1
  'abc'                 matches: increment i and j
   ^

'ababcab'               i=4, j=2
  'abc'                 matches: increment i and j
    ^

'ababcab'               i=5, j=3
  'abc'                 matches: increment i and j
     ^

                        i=6, j=4, exit loop (j>length(s2)), 
                        i = 4, match=TRUE, 
                        NaiveSearch = 3

Note that for this example 7 comparisons were required before the match was found. In general, if we have a string s1 or length N, s2 of length M then a maximum of approx (M-1)*N matches may be required, though often the number required will be closer to N (if few partial matches occur). To illustrate these extremes consider: s1= 'aaaabaaaabaaaaab', s2 = 'aaaaa', and s1= 'abcdefghi', s2= 'fgh'.

The Knuth-Morris-Pratt Algorithm

[Note: In this discussion I assume that we the index 1 indicates the first character in the string. Other texts may assume that 0 indicates the first character, in which case all the numbers in the examples will be one less].

The Knuth-Morris-Pratt (KMP) algorithm uses information about the characters in the string you're looking for to determine how much to `move along' that string after a mismatch occurs. To illustrate this, consider one of the examples above: s1= 'aaaabaaaabaaaaab', s2 = 'aaaaa'. Using the naive algorithm you would start off something like this:

'aaaabaaaabaaaaab'
'aaaaa'
 ^

'aaaabaaaabaaaaab'
'aaaaa'
  ^

'aaaabaaaabaaaaab'
'aaaaa'
   ^

'aaaabaaaabaaaaab'
'aaaaa'
    ^

'aaaabaaaabaaaaab'   
'aaaaa'
     ^                  match fails, move s2 up one..

'aaaabaaaabaaaaab'
 'aaaaa'
  ^

etc etc

but in fact if we look at s2 (and the 'b' in s1 that caused the bad match) we can tell that there is no chance that a match starting at position 2 will work. The 'b' will end up being matched against the 4th character in s2, which is an 'a'. Based on our knowledge of s2, what we really want is the last iteration above replaced with:

'aaaabaaaabaaaaab'
     'aaaaa'
      ^

We can implement this idea quite efficiently by associating with each element position in the searched for string the amount that you can safely move that string forward if you get a mismatch in that element position. For the above example..

a 1     If mismatch in first el, just move string on 1 place.
a 2     If mismatch here, no point in trying just one place, as
        that'll involve matching with the same el (a) so move 2 places
a 3
a 4
a 5

In fact the KMP algorithm is a little more cunning than this. Consider the following case:

'aaaab'   i=3,j=3
'aab'
   ^

We can only move the second string up 1, but we KNOW that the first character will then match, as the first two elements are identical, so we want the next iteration to be:

'aaaab'  i=3,j=2
 'aab'
   ^

Note that i has not changed. It turns out that we can make things work by never decrementing i (ie, just moving forward along s1), but, given a mismatch, just decrementing j by the appropriate amount, to capture the fact that we are moving s2 up a bit along s1, so the position on s2 corresponding to i's position is lower. We can have an array giving, for each position in s2, the position in s2 that you should backup to in s2 given a mismatch (while holding the position in s1 constant). We'll call this array next[j].

j       s2[j]   next[j]
1       a       0
2       b       1
3       a       1
4       b       2
5       b       3
6       a       1
7       a       2

In fact next[1] is treated as a special case. In fact, if the first match fails we want to keep j fixed and increment i. One way to do this is to let next[1] = 0, and then have a special case that says if j=0 then increment i and j.

'abababbaa'      i=5, j=5
'ababbaa'
     ^          mismatch, so j = next[j]=3

'abababbaa'      i=5, j=3
  'ababbaa'
     ^
-------------------
'abaabbaa'      i=4, j=4
'ababbaa'
    ^           mismatch, so j = next[j]= 2

'abaababbaa'      i=5, j=2
  'ababbaa'
    ^
-------------------
'bababbaa'      i=1, j=1
'ababbaa'
 ^              mismatch, so j = next[j]= 0, increment i and j.

'abaababbaa'      i=2, j=1
 'ababbaa'
  ^

It's easy enough to implement this algorithm once you have the next[..] array. The bit that is mildly more tricky is how to calculate next[..] given a string. We can do this by trying to match a string against itself. When looking for next[j] we'd find the first index k such that s2[1..k-1] = s2[j-k+1..j-1], e.g:

'ababbaa'      s2[1..2] = s2[3..4]
  'aba....'    so next[5] = 2.
     ^

(Essentially we find next[j] by sliding forward the pattern along itself, until we find a match of the first k-1 characters with the k-1 characters before (and not including) position j).

The detailed implementations of these algorithms are left as an exercise for the reader - it's pretty easy, so long as you get the boundary cases right and avoid out-by-one errors.

The KMP algorithm is extremely simple once we have the next table:

i := 1; j:= i;
while (i <= length(s1) and j <= length(s2)) do
begin
  if (j = 0) or (GetChar(s1,i) = GetChar(s2, j)))  then begin
     i := i+1; j:= j+1
  end
  else j := next[j]
end;

(If j = length(s2) when the loop exits we have a match and can return something appropriate, such as the index in s1 where the match starts).

The Boyer-Moore Algorithm

Although the above algorithm is quite cunning, it doesnt help that much unless the strings you are searching involve alot of repeated patterns. It'll still require you to go all along the document (s1) to be searched in. For most text editor type applications, the average case complexity is little better than the naive algorithm (O(N), where N is the length of s1). (The worst case for the KMP is N+M comparisons - much better than naive, so it's useful in certain cases).

The Boyer-Moore algorithm is significantly better, and works by searching the target string s2 from right to left, while moving it left to right along s1. The following example illustrates the general idea:

'the caterpillar'    Match fails:
'pill'               There's no space (' ') in the search string, so move it
    ^                right along 4 places

'the caterpillar'   Match fails. There's no e either, so move along 4
    'pill'
        ^

'the caterpillar'    'l' matches, so  continue trying to match right to left
        'pill'
            ^

'the caterpillar'    Match fails. But there's an 'i' in 'pill' so move along
        'pill'       to position where the 'i's line up.
           ^
'the caterpillar'    Matches, as do all the rest..
         'pill'
             ^

This still only requires knowledge of the second string, but we require an array containing an indication, for each possible character that may occur, where it occurs in the search string and hence how much to move along. So, index['p']=1, index['i']=2, index['l'] = 4 (index the rightmost 'l' where repetitions) but index['r']=0 (let the value be 0 for all characters not in the string). When a match fails at a position i in the document, at a character C we move along the search string to a position where the current character in the document is above the index[C]th character in the string (which we know is a C), and start matching again at the right hand end of the string. (This is only done when this actually results in the string being moved right - otherwise the string is just moved up one place, and the search started again from the right hand end.)

The Boyer-Moore algorithm in fact combines this method of skipping over characters with a method similar to the KMP algorithm (useful to improve efficiency after you've partially matched a string). However, we'll just assume the simpler version that skips based on the position of a character in the search string.

It should be reasonably clear that, if it is normally the case that a given letter doesnt appear at all in the search string, then this algorithm only requires approx N/M character comparisons (N=length(s1), M=length(s2)) - a big improvement on the KMP algorithm, which still requires N. However, if this is not the case then we may need up to N+M comparisons again (with the full version of the algorithm). Fortunately, for many applications we get close to the N/M performance. If the search string is very large, then it is likely that a given character WILL appear in it, but we still get a good improvement compared with the other algorithms (approx N*2/alphabet_size if characters are randomly distributed in a string).


Notes:
  • This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • The site contain some broken links as it develops like a living tree... Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate if you can mail us a correct link.
Google Search
Open directory

Research Index


Old News ;-)

[May 6, 2008] C Minimal Perfect Hashing Library 0.8 by Davi de Castro Reis

About: C Minimal Perfect Hashing Library is a portable LGPL library to create and to work with minimal perfect hashing functions. The library encapsulates the newest and more efficient algorithms available in the literature in an easy-to-use, production-quality, fast API. The library is designed to work with big entries that cannot fit in the main memory. It has been used successfully for constructing minimal perfect hashing functions for sets with billions of keys.

Changes: This version adds the internal memory bdz algorithm and utility functions to (de)serialize minimal perfect hash functions from mmap'ed memory regions. The new bdz algorithm for minimal perfect hashes requires 2.6 bits per key and is the fastest one currently available in the literature.

Joseph Culberson's Binary Search Tree Research Binary Search Trees

This research is all concerned with the development of an analysis and simulations of the effect of mixed deletions and insertions in binary search trees. Previously it was believed that the average depth of a node in a tree subjected to updates was decreased towards an optimal O(log n) when using the usual algorithms (See e.g. Knuth Vol. 3 [8] ) This was supported to some extent by a complete analysis for trees of only three nodes by Jonassen and Knuth [7] in 1978. Eppinger [6] performed some simulations showing that this was likely false, and conjectured, based on the experimental evidence, that the depth grew as O(log^3 n), but offered no analysis or explanation as to why this should occur. Essentially, this problem concerning one of the most widely used data structures had remained open for 20 years.

Both my MSc [1] and Ph.D. [2] focused on this problem. This research indicates that in fact the depth grows as O(n^{1/2}). A proof under certain simplifying assumptions was given in the theoretical analysis in Algorithmica [3], while a set of simulations was presented together with a less formal analysis of the more general case, in the Computer Journal [4] , intended for a wider audience. A complete analysis for the most general case is still open.

More recently P. Evans [5] has demonstrated that asymmetry may be even more dangerous in smaller doses! Algorithms with only occasional asymmetric moves tend to develop trees with larger skews, although the effects take much longer.

    Publications

  1. Joseph Culberson. Updating Binary Trees. MSc Thesis, University of Waterloo Department of Computer Science, 1984. (Available as Waterloo Research Report CS-84-08.) Abstract
  2. Joseph Culberson. The Effect of Asymmetric Deletions on Binary Search Trees. Ph. D. Thesis University of Waterloo Department of Computer Science, May 1986. (Available as Waterloo Research Report CS-86-15.) Abstract
  3. Joseph Culberson J. Ian Munro. Analysis of the standard deletion algorithms in exact fit domain binary search trees. Algorithmica, vol 6, 295-311, 1990. Abstract
  4. Joseph Culberson J. Ian Munro. Explaining the behavior of binary search trees under prolonged updates: A model and simulations. The Computer Journal, vol 32(1), 68-75, February 1989. Abstract
  5. Joseph C. Culberson and Patricia A. Evans Asymmetry in Binary Search Tree Update Algorithms Technical Report TR94-09

    Other References

  6. Jeffery L. Eppinger. An empirical study of insertion and deletion in binary trees. Communications of the ACM vol. 26, September 1983.
  7. Arne T. Jonassen and Donald E. Knuth A trivial algorithm whose analysis isn't. Journal of Computer and System Sciences, 16:301-322, 1978.
  8. D. E. Knuth Sorting and Searching Volume III, The Art of Computer Programming Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1973.

Joseph Culberson

Simulations of dynamic sequential search algorithms

In [3], R.L. Rivest presents a set of methods for dynamically reordering a sequential list containing N records in order to increase search efficiency. The method Ai (for i between 1 and N) performs the following operation each time that a record R has been successfully retrieved: Move R forward i positions in the list, or to the front of the list if it was in a position less than i. The method A1 is called the transposition method, and the method AN-1 is called the move-to-front method.

Recommended Links


In case of broken links please try to use Google search. If you find the page please notify us about new location
Google     

Hashing

Algorithms/data animations

Linear Search

NIST: Linear search

Linear search - encyclopedia article about Linear search.

Review of Linear Search

In the previous labs we have already dealt with linear search, when we talked about lists. Linear Search in Scheme is probably the simplest example of a storage/retrieval datastructure due to the number of primitive operations on lists that we can use. For instance, creation of the datastructure just requires defining a null list.

There are two types of linear search we will examine. The first is search in an unordered list; we have seen this already in the lab about cockatoos and gorillas. All of the storage retrieval operations are almost trivial: insertion is just a single cons of the element on the front of the list. Search involves recurring down the list, each time checking whether the front of the list is the element we need. Deletion is similar to search, but we cons the elements together as we travel down the list until we find the element, at which point we return the cdr of the list. We also did analysis on these algorithms and deduced that the runtime is O(n).

C++ Examples - Linear Search in a Range

On the Competitiveness of Linear Search

C++ Notes: Algorithms: Linear Search

CSC 108H - Lecture Notes
... Linear Search (Sequential). Start with the first name, and continue looking until
x is found. Then print corresponding phone number. ... Linear Search (on a Vector). ...
 

Linear Search: 1.1 Description/Definition
The simplest of these searches is the Linear Search. A Linear Search is simply searching
through a line of objects in order to find a particular item. ...
 

Linear Search
Linear Search. No JDK 1.3 support, applet not shown.&nbsp;
www.cs.usask.ca/.../csconcepts/2002_7/static/ tutorial/introduction/linearsearchapp/linearsearch.html - 2k - Cached - Similar pages

 

Move to the front (self-organizing) modification

Linear Search  the Transpose Method

Here the heuristic is slightly different from the Move-To-Front method : once found, the transition is swapped with the immediately preceding one, performing an incremental bubble sort at each access.

Of course, these two techniques provide a sensitive speed-up only if the probabilities for each transition to be accessed are not uniform.

These are the seven representations that caught our attention at first but that set of containers will be broadened when a special need is to be satisfied.
What is obvious here is that each of these structures has advantages and drawbacks, whether on space complexity or on time complexity. All these constraints have to be taken in account in order to construct code adapted to particuliar situations.

Linear Search : the Move-To-Front Method

Self-organizing linear search
... Self-organizing linear search. Full text, pdf formatPdf (1.73 MB). ... ABSTRACT Algorithms
that modify the order of linear search lists are surveyed. ...
portal.acm.org/citation.cfm?id=5507 - Similar pages

Linear search in lists

Binary Search

NIST: binary search

Definition: Search a sorted array by repeatedly dividing the search interval in half. Begin with an interval covering the whole array. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise narrow it to the upper half. Repeatedly check until the value is found or the interval is empty.

Generalization (I am a kind of ...)
dichotomic search.

Aggregate parent (I am a part of or used in ...)
binary insertion sort, ideal merge, suffix array.

Aggregate child (... is a part of or used in me.)
divide and conquer.

See also linear search, interpolation search, Fibonaccian search, jump search.

Note: Run time is O(ln n). The search implementation in C uses 0-based indexing.

Author: PEB

Implementation

search (C) n is the highest index of the 0-based array; possible key indexes, k, are low < k <= high in the loop. (Scheme), Worst-case behavior annotated for real time (WOOP/ADA), including bibliography.

en.wikipedia.org/wiki/Binary_search

Topic #9 Binary search trees

Search Trees

Binary Search Trees

Binary Search Trees http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_bin.htm

Binary Search Tree http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/binarySearchTree.htm

Trees

AVL trees

(see also Libraries -- there are a several libraries that implementat AVL trees)

Fibonacci search

(see The Fibonacci Numbers and Golden section in Nature for information about Fibinacci, who introduced the decimal number system into Europe and first suggested a problem that lead to Fibonacci numbers)



Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Disclaimer:

Last modified:  August 15, 2009