These page is very incomplete and other pages give a much better overview of
the various compression algorithms. The following types of compression are well documented
elsewhere:
Bijective compression means that for any file X, F( F'( X ) ) == X. (F is
either the compressor or decompressor, and F' is its opposite number.) This is
definitely not the case for most conventional compression algorithms. This type of compression is important in crypto algorithms.
Compression always rely on non-randomness of the text: random sequence of
bytes is poorly compressible.
In general the compression can be viewed as a conversion of static text into
a special program, In other words compression consists of converting a static test into some kind of
code for a specially constructed (may be unique for this case) abstract machine.
Programs in this abstract machine are shorter then the original text while
output of the program after processing by an interpreter (decompressor) is
exactly the same (original) text. For example if we first process the text
and construct a dictionary of words used and then replace each word with the
index to a dictionary that will be a primitive compression that might work well
on text with a small dictionary and frequently repeated words. For example it
might work well for the answer to some lengthy form like "Yes No n/a Yes No
n/a n/a Yes No
No No Yes n/a n/a No". In this case the dictionary consists of three
entries
and each reply can be encoded in two bits.
It is important to understand that the more you know about the structure of
the text, the more specialized processor you and construct and the higher level
of compression you can achieve. For example, if you know that a particular text
is, say, an http log then you can compress it several times better then using
any generic algorithms.
Compression is also connected with the pattern matching. The notion of a longest common subsequence (LCS) of two strings is
widely used to compare files and can be used in compression by splitting
text into several parts if diffing them against each other. In the most
primitive case such diffing can be based on lines or paragraphs. The "diff" command of UNIX system
implement an algorithm of finding the shortest sequence of editing
command able to convert one string to another. Informally, the result of a diff
algorithm gives the
minimum number of operations (insert a symbol, or delete a symbol) to
transform one string into the other. This is
related to an edit distance, called the
In its turn, the search of LCS is connected with efficient string
matching algorithms. The first discovered linear-time string-matching algorithm is from
Morris and Pratt (1970). It has been improved by Knuth, Morris, and
Pratt (1976). The search behaves like a recognition process by
automaton, and a character of the text is compared to a character of the
pattern no more than \log_\Phi(m+1) (\Phi is the
golden ratio (1+\sqrt 5)/2). The Boyer and Moore's
algorithm (1977) is considered as the most efficient string-matching
algorithm in usual applications. A simplified version of it (or the
entire algorithm) is often implemented in text editors for the "search"
and "substitute" commands. Several variants of Boyer and Moore's
algorithm avoid the quadratic behavior when searching for all
occurrences of the pattern. In 1975, Aho and Corasick designed an O(n\log\sigma)
algorithm to solve this problem, with a running time independent of the
number of patterns. It is implemented by the "fgrep" command under the
UNIX operating system. In applications where the text is to be searched for several
patterns, it is the text that needs to be preprocessed. Even if no
further information is known on their syntactic structure, it is
possible and indeed extremely efficient to built an index that supports
searches. Data structures to represent indexes on text files are: suffix
trees (Weiner 1973, McCreight 1976, Ukkonen 1994), direct acyclic word
graph (Blumer et al., 1985), suffix automata (Crochemore, 1986), and
suffix arrays (Manber and Myers, 1993). All algorithms (except for
suffix arrays) build the index in time O(n\log\sigma).
- Aho, A.V. 1990. Algorithms for finding patterns in strings. In:
Handbook of Theoretical Computer Science, Algorithms and
Complexity, Vol. A, ch. 5, pp 255--330. J. van Leeuwen ed.,
Elsevier, Amsterdam.
- Bell, T.C., Cleary J.G. and Witten, I.H. 1990. Text
Compression. Prentice Hall, Englewood Cliffs, New Jersey.
- Cormen, T.H., Leiserson C.E. and Rivest, R.L. 1990.
Introduction to Algorithms, ch. 34, pp 853--885. MIT Press.
- Crochemore, M. and Rytter W. 1994. Text Algorithms.
Oxford University Press.
- Gonnet, G.H. and Baeza-Yates, R.A. 1991. Handbook of
Algorithms and Data Structures, ch. 7, pp 251--288.
Addison-Wesley.
- Nelson, M. 1992. The Data Compression Book. M&T Books.
- Sedgewick R. 1990. Algorithms in C, ch. 19 and 22.
Addison-Wesley.
- Stephen, G.A. 1994. String Searching Algorithms. World
Scientific Press
Notes:
- Those pages are written by people for whom English is not a
native language. Some amount of grammar and spelling errors
should be expected.
- This is a Spartan WHYFF (We Help You For Free) site. It
cannot replace the best teachers and
the
best books.
- The site contain some obsolete pages as it develops like a
living tree... Some links on older pages
are broken. Please
try to use Google, Open directory, etc. to find a replacement link
(see
HOWTO search the WEB for details).
We would appreciate if you can
mail us a correct link.
|
|
In 1987, I was asked by a magazine editor to write an article about data
compression. I wrote a manuscript and an accompanying program, sent them to
the editor, and forgot about them. The next time I heard from him I was told
that the magazine was discontinued. So I uploaded my program, a simple
implementation of the LZSS compression algorithm (see below) to PC-VAN,
a big Japanese BBS run by NEC. That was May 1, 1988.
Soon a number of hobby programmers gathered and began improving on that
program. The project culminated in Kazuhiko Miki's archiver LArc,
which was fairly widey used in Japan. (Dr. Miki was then a medical
specialist working at a governmental office. I heard he left office and
began work on freeware/shareware promotion.)
The LZSS algorithm is based on a very simple idea. Suppose I'm going to
write "compression" here. But probably I've already used that word before in
this file. If I used that word 57 characters before, I might as well write
"go 57 characters backward, and read 11 characters," or <57,11> for short.
In general, when I've already used the string of characters among the recent
4096 characters, say, I encode the string by a <position,length> pair.
In Storer's [8] terminology, this is a sliding
dictionary algorithm, analyzed first by Ziv and Lempel [14]
and then by Storer and Szymanski [9], among others.
Later versions of my LZSS implementations and Miki's LArc used
binary search trees to make string search faster; see Bell [1].
Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding
dictionary [14] and dynamic dictionary [15]
in Storer's [8] terminology. The LZW algorithm
[12] belongs to the latter. Most pre-LHarc
compression tools, such as 'compress', 'ARC', and 'PKARC', used LZW.
During the summer of 1988, I wrote another compression program, LZARI.
This program is based on the following observation: Each output of LZSS is
either a single character or a <position,length> pair. A single character
can be coded as an integer between 0 and 255. As for the <length> field, if
the range of <length> is 2 to 257, say, it can be coded as an integer
between 256 and 511. Thus, I can say that there are 512 kinds of
"characters," and the "characters" 256 through 511 are accompanied by a
<position> field. These 512 "characters" can be Huffman-coded, or better
still, algebraically coded. The <position> field can be coded in the same
manner. In LZARI I used an adaptive algebraic compression
[13], [2] to encode the "characters," and
static algebraic compression to encode the <position> field. (There were
several versions of LZARI; some of them were slightly different from
the above description.) The compression of LZARI was very tight,
though rather slow.
Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked
very hard to make LZARI faster. Most importantly, he replaced
LZARI's algebraic compression by dynamic Huffman coding.
His program, LZHUF, was very successful. It was much faster than
my LZARI. As for compression ratio, Huffman cannot beat algebraic
compression, but the difference turned out to be very small.
Yoshi rewrote the compression engine of LZHUF in assembler, and
added a nifty user interface. His archiver, LHarc, soon became the de
facto standard among Japanese BBS users. After Prof. Kenjirou Okubo, a
mathematician, introduced LHarc to the United States, it became
world-famous. Other vendors began using similar techniques: sliding
dictionary plus statistical compressions such as Huffman and Shannon-Fano.
(I wondered why they used Shannon-Fano rather than Huffman which is
guaranteed to compress tighter than Shannon-Fano. As it turned out, a
then-popular book on compression published in U.S. contained a wrong
description and buggy sample programs, such that Shannon-Fano outperformed
(buggy) Huffman on many files.)
Although LHarc was much faster than LZARI, we weren't quite
satisfied with its speed. Because LHarc was based on dynamic Huffman,
it had to update Huffman tree every time it received a character. Yoshi and
I tried other dynamic Huffman algorithms [5],
[10], [11], but improvements were not as great as we
desired.
So I took a different step: replacing LHarc's dynamic Huffman by a
static Huffman method.
Traditional static Huffman coding algorithm first scans the input file to
count character distribution, then builds Huffman tree and encodes the file.
In my approach, the input file is read only once. It is first compressed by
a sliding dictionary method like LZARI and LHarc, and at the
same time the distributions of the "characters" (see above) and positions
are counted. The output of this process is stored in main memory. When the
buffer in memory is full (or the input is exhausted), the Huffman trees are
constructed, and the half-processed content of the buffer is actually
compressed and output.
In static Huffman, the Huffman tree must be stored in the compressed
file. In the traditional approach this information consumes hundreds of
bytes. My approach was to standardize Huffman trees so that (1) each left
subtree is no deeper than its right counterpart, and (2) the leaves at the
same level are sorted in ascending order. In this way the Huffman tree can
be uniquely specified by the lengths of the codewords. Moreover, the
resulting table is again compressed by the same Huffman algorithm.
To make the decoding program simpler, the Huffman tree is adjusted so
that the codeword lengths do not exceed 16 bits. Since this adjusting is
rarely needed, the algorithm is made very simple. It does not create optimal
length-limited Huffman trees; see e.g. [6] for an optimal
algorithm. Incidentally, my early program had a bug here, which was quickly
pointed out and corrected by Yoshi.
The sliding dictionary algorithm is also improved by Yoshi using a
"PATRICIA tree" data structure; see McCreight [7] and Fiala
and Greene [4].
After completing my algorithm, I learned that Brent [3]
also used a sliding dictionary plus Huffman coding. His method, SLH,
is simple and elegant, but since it doesn't find the most recent longest
match, the distribution of match position becomes flat. This makes the
second-stage Huffman compression less efficient.
On the basis of these new algorithms, Yoshi began to rewrite his LHarc,
but it took him so long (remember he was a busy doctor!) that I decided to
write my own archiver. My archiver was quite recklessly named 'ar'.
(Actually I appended version numbers as in 'ar002' for version 0.02.) I
should have named it 'har' (after my name), say, because 'ar' collides with
the name of UNIX's archiver. I didn't want my program to compete with
LHarc, but I wanted many people to try the algorithm, so I wrote it in
pure ANSI C. This is the reason 'ar' lacked many bells and whistles
necessary for a real archiver.
Note: The version of 'ar002' most often found in the U.S. had a
bug. Line 24 of maketbl.c should read, of course,
while (i <= 16) {
weight[i] = 1U << (16 - i); i++;
}
Somehow the bug didn't show up when compiled by Turbo C.
Yoshi finally showed us his new archiver written in C. It was tentatively
named LHx. He then rewrote the main logic in assembler. Yoshi and I
wrote an article describing his new archiver, which would be named LH,
in the January, 1991, issue of "C Magazine" (in Japanese). The suffix 'arc'
of LHarc was deliberately dropped because the people who sold ARC
did not want others to use the name.
Then we learned that for the new DOS 5.0, LH meaned LoadHigh, an internal
command. We decided to rename LH to LHA.
Also, I was told that the algorithm described in Fiala and Greene
[4] got patented ("Textual Substitution Data Compression
With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990.
Actually they got three patents! The other two were: "Start, Step, Stop
Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and
"Search Tree Data Structure Encoding for Textual Substitution Data
Compression Systems," Application Ser. No. 07/187,699.)
Furthermore, I learned that the original Ziv-Lempel compression method
(Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch,
4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of
the Free Software Foundation, author of the EMACS editor and leader of the
GNU project, ceased to use 'compress' program any more because its LZW
algorithm got patented.
Are algorithms patentable? (See [16].) If these patents
should turn out to be taken seriously, all compression programs now in use
may infringe some of these patents. (Luckily, not all claims made by those
algorithm patents seems to be valid.)
The foregoing is a slight modification of what I wrote in 1991. The year
1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka
University. This opportunity should have given me more free time, but as it
turned out I got ever busier. I stopped hacking on my compression
algorithms; so did Yoshi.
Luckily, all good things in LHA were taken over, and all bad
things abandoned, by the new great archiver
zip and the compression tool
gzip. I admire the efforts of Jean-loup
Gailly and others.
A brief historical comment on PKZIP: At one time a programmer for
PK and I were in close contact. We exchanged a lot of ideas. No wonder
PKZIP and LHA are so similar.
Another historical comment: LHICE and ICE are definitely
not written by Yoshi (or me or anyone I know). I think they are faked
versions of LHarc.
REFERENCES
- [1]
- Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on
Communications, COM-34(12):1176--1182, 1986.
- [2]
- Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text
Compression. Prentice Hall, 1990.
- [3]
- R. P. Brent. A linear algorithm for data compression. The Australian
Computer Journal, 19(2):64--68, 1987.
- [4]
- Edward R. Fiala and Daniel H. Greene. Data compression with finite
windows. Communications of the ACM, 32(4):490--505, 1989.
- [5]
- Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms,
6:163--180, 1985.
- [6]
- Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for
optimal length-limited Huffman codes. Journal of the Association for
Computing Machinery, 37(3):464--473, 1990.
- [7]
- Edward M. McCreight. A space-economical suffix tree construction
algorithm. Journal of the Association for Computing Machinery,
23(2):262--272, 1976.
- [8]
- James A. Storer. Data Compression: Methods and Theory.
Computer Science Press, Rockville, MD., 1988.
- [9]
- James A. Storer and Thomas G. Szymanski. Data compression via
textual substitution. Journal of the Association for Computing
Machinery, 29(4):928--951, 1982.
- [10]
- Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes.
Journal of the Association for Computing Machinery, 34(4):825--845,
1987.
- [11]
- Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM
Transactions on Mathematical Software, 15(2):158--167, 1989.
- [12]
- Terry A. Welch. A technique for high-performance data compression.
IEEE Computer}, 17(6):8--19, 1984.
- [13]
- Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic
coding for data compression. Communications of the ACM, 30(6):520--540,
1987.
- [14]
- Jacob Ziv and Abraham Lempel. A universal algorithm for sequential
data compression. IEEE Transactions on Information Theory,
IT-23(3):337--343, 1977.
- [15]
- Jacob Ziv and Abraham Lempel. Compression of individual sequences
via variable-rate coding. IEEE Transactions on Information Theory,
IT-24(5):530--536, 1978.
- [16]
- Edward N. Zalta. Are algorithms patentable? Notices of the American
Mathematical Society, 35(6):796--799, 1988.
Haruhiko Okumura,
okumura@matsusaka-u.ac.jp
Last modified: Tue Mar 17 17:02:03 1998
The column "html" shows the size of the article in
html format and if clicked opens it in a new window. The
"zip" column contains the zipped article and images or
source if any. Note that brighter rows mean new or
updated article. If you take the time to read the
articles, please take the time to tell me what do you
think of them.
| Title |
Html |
Zip |
Last update |
| Arithmetic coding, entropy coder |
20k |
7k |
22-Jul-1999 |
| Bwt, block reduction |
21k |
8k |
22-Jul-1999 |
| Canonical huffman. |
15k |
5k |
23-Jul-1999 |
| Crc-32, the standard Crc |
6k |
7k |
10-Aug-1999 |
| Finite context modeling |
37k |
12k |
16-Nov-1999 |
| Flexible parsing, improvement of lz |
10k |
4k |
24-Jul-1999 |
| Lz77, also called lzss |
40k |
14k |
23-Jul-1999 |
| Lzw, for gif decoding |
27k |
9k |
23-Jul-1999 |
| Mtf, a transformation scheme |
9k |
3k |
24-Jul-1999 |
| Implementing ppmc with hash tables |
111k |
327k |
21-March-2000 |
| Quasi Static model |
19k |
6k |
13-Aug-1999 |
| Range coder |
24k |
8k |
17-Nov-1999 |
| Rle, Basic scheme |
7k |
3k |
24-Jul-1999 |
As you can read in the index I'm going to publish as
soon as possible new articles, if you want to get an
email when this happens, please fill this
form,
moreover you can use it to tell me which file format do
you prefer. If you have any question about the articles
feel free to email me.
My goal is to offer the best articles about
compression for free. Thus I need your feedback on the
articles to further improve them: What you don't know.
What was confusing or unclear. What difficulties you
found while implementing it. Or what you miss. Teeling
me so is the only way of further improving articles for
you, and depends only of you.
Note that since my first articles both my knowledge
about compression and english have greatly improved,
thus you can notice a difference in quality between old
articles or the latest ones. Unfortunately I don't have
the time to rewrite all the articles (as I would like
to) so instead please tell me which articles do you find
worth rewriting, and I'll do so. |
[Oct 31, 2004] Data
Compression -- mini-tutorial by Greg Goebel.
The spread of computing has led to an explosion in the volume of data to
be stored on hard disks and sent over the Internet. This growth has led to a
need for "data compression", that is, the ability to reduce the amount of
storage or Internet bandwidth required to handle data.
This document provides a survey of data compression techniques. The focus
is on the most prominent data compression schemes, particularly popular
archivers, JPEG image compression, and MPEG video and audio compression.
Katz, Phillip W.
Publication Date: April 19,
2000
Age 37. Passed away unexpectedly
on Fri., April 14, 2000. Beloved
son of Hildegard and beloved
brother of Cynthia. Also
survived by other relatives and
friends. Phil was a graduate of
UWM Computer Science Engineering
Program. He was the author of
the PKZIP/PKUNZIP software and
owner of PKWARE Inc. Co. Private
services have been held.
Memorials to the charity of your
choice would be appreciated.
ZWASKA FUNERAL HOME 354-5330
Serving the Family |
The Rise and Fall of a Software Star - Phil Katz
Loved Code -- and Liquor
compression FAQ
Data
Compression Debra A. Lelewer and Daniel S. Hirschberg
Sometimes you can find old O'Reilly graphic formats book on the web too:
Encyclopedia of Graphics File Formats, 2nd Edition
The Complete Reference on CD-ROM with Links to Internet Resources
By
James D. Murray, William vanRyper
2nd Edition May 1996
It contains a chapter about graphic algorithms, see
TOC
AnandTech - All About
Compression, Part I
Games++ Games & Game Programming-Algorithms, Tutorials, Source Code, Direct X,
C-C++, Java, and Much More.
E186-1 Michael
Schindler
Queue, Huffman, Compression and Encryption Applets in C++
Forums - Efficient Huffman implementation
...it is certainly
possible to construct the Huffman code without an explicit tree. The main
observation is that there is a canonical Huffman code entirely defined by the
code lengths of the letters. And, there is a simple data structure to compute
the latter in linear time after a sorting phase without explicitly
constructing the tree. This does save half the memory required for an
explicit tree. For example, see my Huffman implementation in the Vcodex
package at
www.research.att.com/sw/tools.
LZW(Lempel-Ziv-Welch) is the most common algorithm used in computer graphics.
This lossless method of data compression is found in several image file formats,
such as GIF and
TIFF, and is also part of the V.42bis modem compression standard and
PostScript Level 2.
In 1977, Abraham Lempel and Jakob Ziv created the first of what we now call
the LZ family of substitutional compressors. The LZ77 compression algorithms are
commonly found in text compression and archiving programs, such as compress,
zoo, lha, pkzip, and arj. The LZ78 compression algorithms are more commonly used
to compress binary data, such as bitmaps.
In 1984, while working for Unisys, Terry Welch modified the LZ78 compressor
for implementation in high-performance disk controllers. The result was the LZW
algorithm that is commonly found today. It is covered by U.S. Patent
4,558,302 (plus its foreign counterparts, issued or pending). All patents
are held by Unisys Corporation. That means that without obtaining a license from
Unisys you formally cannot read or write a GIF file :-(. See
TF_WP_unisys
for more information.
- Unisys LZW Page
- **** DataCompression.info
- LZ78-LZW and derivatives -- a very nice collection of links. Better that
this one...
- LZW and GIF
explained John Barkaus's "LZW and GIF explained"
>From: john@cooper.cooper.EDU (John Barkaus)
Newsgroups: comp.graphics
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)
LZW and GIF explained----Steve Blackstock
I hope this little document will help enlighten those of you out there who
want to know more about the Lempel-Ziv Welch
compression algorithm,
and, specifically, the implementation that GIF uses.
Before we start, here's a little terminology, for the purposes of this
document:
- "character":
- a fundamental data element. In normal text files, this is just a single
byte. In raster images, which is what we're interested in, it's an index
that specifies the color of a given pixel. I'll refer to an arbitray
character as "K".
- "charstream":
- a stream of characters, as in a data file.
- "string":
- a number of continuous characters, anywhere from one to very many
characters in length. I can specify an arbitrary string as "[...]K".
- "prefix":
- almost the same as a string, but with the implication that a prefix
immediately precedes a character, and a prefix can have a length of zero.
So, a prefix and a character make up a string. I will refer to an arbitrary
prefix as "[...]".
- "root":
- a single-character string. For most purposes, this is a character, but
we may occasionally make a distinction. It is [...]K, where [...] is empty.
- "code":
- a number, specified by a known number of bits, which maps to a string.
- "codestream":
- the output stream of codes, as in the "raster data"
- "entry":
- a code and its string.
- "string table":
- a list of entries; usually, but not necessarily, unique.
That should be enough of that.
LZW is a way of
compressing data that takes advantage of repetition of strings in the data.
Since raster data usually contains a lot of this repetition,
LZW is a good way of
compressing and decompressing it.
For the moment, lets consider normal
LZW encoding and
decoding. GIF's variation on the concept is just an extension from there.
LZW manipulates
three objects in both
compression and decompression: the charstream, the codestream,
and the string table. In
compression, the charstream is the input and the codestream is the output.
In decompression, the codestream is the input and the charstream is the
output. The string table is a product of both compression and
decompression, but is never passed from one to the other.
The first thing we do in
LZW compression is
initialize our string table. To do this, we need to choose a code
size (how many bits) and know how many values our characters can possibly
take. Let's say our code size is 12 bits, meaning we can store 0->FFF,
or 4096 entries in our string table. Lets also say that we have 32
possible different characters. (This corresponds to, say, a
picture in which there are 32 different colors possible for each pixel.)
To initialize the table, we set code#0 to character#0, code #1 to character#1,
and so on, until code#31 to character#31. Actually, we are
specifying that each code from 0 to 31 maps to a root. There will
be no more entries in the table that have this property.
Now we start compressing data. Let's first define something
called the "current prefix". It's just a prefix that we'll store
things in and compare things to now and then. I will refer to it
as "[.c.]". Initially, the current prefix has nothing in it.
Let's also define a "current string", which will be the current prefix plus
the next character in the charstream. I will refer to the current
string as "[.c.]K", where K is some character. OK, look at the
first character in the charstream. Call it P. Make [.c.]P
the current string. (At this point, of course, it's just the root
P.) Now search through the string table to see if [.c.]P appears in it.
Of course, it does now, because our string table is initialized to have all
roots. So we don't do anything. Now make [.c.]P the
current prefix. Look at the next character in the charstream.
Call it Q. Add it to the current prefix to form [.c.]Q, the
current string. Now search through the string table to see if [.c.]Q
appears in it. In this case, of course, it doesn't.
Aha! Now we get to do something. Add [.c.]Q (which is PQ in
this case) to the string table for code#32, and output the code for [.c.] to
the codestream. Now start over again with the current prefix being
just the root Q. Keep adding characters to [.c.] to form [.c.]K,
until you can't find [.c.]K in the string table. Then output the
code for [.c.] and add [.c.]K to the string table. In pseudo-code,
the algorithm goes something like this:
- Initialize string table;
- [.c.] <- empty;
- K <- next character in charstream;
- Is [.c.]K in string table?
- yes:
- [.c.] <- [.c.]K;
- go to [3];
- no:
- add [.c.]K to the string table;
- output the code for [.c.] to the codestream;
- [.c.] <- K;
- go to [3];
It's as simple as that! Of course, when you get to step [3] and
there aren't any more characters left, you just output the code for [.c.] and
throw the table away. You're done.
Wanna do an example? Let's pretend we have a four-character
alphabet: A,B,C,D. The charstream looks like ABACABA.
Let's compress it. First, we initialize our string table to:
#0=A, #1=B, #2=C, #3=D. The first character is A, which is in the
string table, so [.c.] becomes A. Next we get AB, which is not in
the table, so we output code #0 (for [.c.]), and add AB to the string table as
code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in
the string table, so output code #1, and add BA to the string table as code
#5. [.c.] becomes A. Next we get AC, which is not in the string
table. Output code #0, and add AC to the string table as code #6.
Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the
table. Output #2 for C, and add CA to table as code#7.
Now [.c.] becomes A. Next we get AB, which IS in the string table,
so [.c.] gets AB, and we look at ABA, which is not in the string table, so
output the code for AB, which is #4, and add ABA to the string table as code
#8. [.c.] becomes A. We can't get any more characters, so we just
output #0 for the code for A, and we're done. So, the codestream
is #0#1#0#2#4#0.
A few words (four) should be said here about efficiency: use a
hashing strategy. The search through the string table
can be computationally intensive, and some hashing is well worth the effort.
Also, note that "straight
LZW" compression
runs the risk of overflowing the string table - getting to a code which can't
be represented in the number of bits you've set aside for codes.
There are several ways of dealing with this problem, and GIF implements a very
clever one, but we'll get to that.
An important thing to notice is that, at any point during the
compression, if [...]K
is in the string table, [...] is there also. This fact suggests an
efficient method for storing strings in the table. Rather than
store the entire string of K's in the table, realize that any string can be
expressed as a prefix plus a character: [...]K. If
we're about to store [...]K in the table, we know that [...] is already there,
so we can just store the code for [...] plus the final character K.
Ok, that takes care of
compression. Decompression is perhaps more difficult
conceptually, but it is really easier to program.
Here's how it goes: We again have to start with an initialized
string table. This table comes from what knowledge we have about
the charstream that we will eventually get, like what possible values the
characters can take. In GIF files, this information is in the
header as the number of possible pixel values. The beauty of LZW, though, is that
this is all we need to know. We will build the rest of the string
table as we decompress the codestream. The compression is done in
such a way that we will never encounter a code in the codestream that we can't
translate into a string.
We need to define something called a "current code", which I will refer to
as "<code>", and an "old-code", which I will refer to as "<old>".
To start things off, look at the first code. This is now <code>.
This code will be in the intialized string table as the code for a root.
Output the root to the charstream. Make this code the old-code
<old>. *Now look at the next code, and make it <code>.
It is possible that this code will not be in the string table, but let's
assume for now that it is. Output the string corresponding to
<code> to the codestream. Now find the first character in the
string you just translated. Call this K. Add this to
the prefix [...] generated by <old> to form a new string [...]K.
Add this string [...]K to the string table, and set the old-code <old> to the
current code <code>. Repeat from where I typed the asterisk, and
you're all set. Read this paragraph again if you just skimmed
it!!! Now let's consider the possibility that <code> is not in the
string table. Think back to compression, and try to
understand what happens when you have a string like P[...]P[...]PQ appear in
the charstream. Suppose P[...] is already in the string table, but
P[...]P is not. The compressor will parse out P[...], and find
that P[...]P is not in the string table. It will output the code
for P[...], and add P[...]P to the string table. Then it will get
up to P[...]P for the next string, and find that P[...]P is in the table, as
the code just added. So it will output the code for P[...]P if it
finds that P[...]PQ is not in the table. The decompressor is
always "one step behind" the compressor. When the decompressor
sees the code for P[...]P, it will not have added that code to it's string
table yet because it needed the beginning character of P[...]P to add to the
string for the last code, P[...], to form the code for P[...]P.
However, when a decompressor finds a code that it doesn't know yet, it will
always be the very next one to be added to the string table. So it
can guess at what the string for the code should be, and, in fact, it will
always be correct. If I am a decompressor, and I see code#124, and
yet my string table has entries only up to code#123, I can figure out what
code#124 must be, add it to my string table, and output the string.
If code#123 generated the string, which I will refer to here as a prefix,
[...], then code#124, in this special case, will be [...] plus the first
character of [...]. So just add the first character of [...] to
the end of itself. Not too bad. As an example (and a
very common one) of this special case, let's assume we have a raster image in
which the first three pixels have the same color value. That is,
my charstream looks like: QQQ.... For the sake of
argument, let's say we have 32 colors, and Q is the color#12. The
compressor will generate the code sequence 12,32,.... (if you don't know why,
take a minute to understand it.) Remember that #32 is not in the initial
table, which goes from #0 to #31. The decompressor will see #12
and translate it just fine as color Q. Then it will see #32 and
not yet know what that means. But if it thinks about it long
enough, it can figure out that QQ should be entry#32 in the table and QQ
should be the next string output. So the decompression pseudo-code
goes something like:
- Initialize string table;
- get first code: <code>;
- output the string for <code> to the charstream;
- <old> = <code>;
- <code> <- next code in codestream;
- does <code> exist in the string table?
- yes:
- output the string for <code> to the charstream;
- [...] <- translation for <old>;
- K <- first character of translation for <code>;
- add [...]K to the string table;
- <old> <- <code>;
- no:
- [...] <- translation for <old>;
- K <- first character of [...];
- output [...]K to charstream and add it to string table;
- <old> <- <code>
- go to [5];
Again, when you get to step [5] and there are no more codes, you're
finished. Outputting of strings, and finding of initial characters
in strings are efficiency problems all to themselves, but I'm not going to
suggest ways to do them here. Half the fun of programming is
figuring these things out!
Now for the GIF variations on the theme. In part of the header of
a GIF file, there is a field, in the Raster Data stream, called "code size".
This is a very misleading name for the field, but we have to live with it.
What it is really is the "root size". The actual size, in bits, of
the compression codes
actually changes during
compression/decompression, and I will refer to that size here as the "compression
size". The initial table is just the codes for all the roots, as
usual, but two special codes are added on top of those. Suppose
you have a "code size", which is usually the number of bits per pixel in the
image, of N. If the number of bits/pixel is one, then N must be 2:
the roots take up slots #0 and #1 in the initial table, and the two special
codes will take up slots #4 and #5. In any other case, N is the
number of bits per pixel, and the roots take up slots #0 through #(2**N-1),
and the special codes are (2**N) and (2**N + 1). The initial
compression size will
be N+1 bits per code. If you're encoding, you output the codes
(N+1) bits at a time to start with, and if you're decoding, you grab (N+1)
bits from the codestream at a time. As for the special codes:
<CC> or the clear code, is (2**N), and <EOI>, or end-of-information, is (2**N
+ 1). <CC> tells the compressor to re-initialize the string table,
and to reset the
compression size to (N+1). <EOI> means there's no more in the
codestream. If you're encoding or decoding, you should start
adding things to the string table at <CC> + 2. If you're encoding,
you should output <CC> as the very first code, and then whenever after that
you reach code #4095 (hex FFF), because GIF does not allow compression sizes to be
greater than 12 bits. If you're decoding, you should reinitialize
your string table when you observe <CC>. The variable
compression sizes are
really no big deal. If you're encoding, you start with a
compression size of
(N+1) bits, and, whenever you output the code (2**(compression
size)-1), you bump the
compression size up one bit. So the next code you output will
be one bit longer. Remember that the largest
compression size is 12
bits, corresponding to a code of 4095. If you get that far, you
must output <CC> as the next code, and start over. If you're
decoding, you must increase your
compression size AS
SOON AS YOU write entry #(2**(compression
size) - 1) to the string table. The next code you READ will be one
bit longer. Don't make the mistake of waiting until you need to
add the code (2**compression
size) to the table. You'll have already missed a bit from the last
code. The packaging of codes into a bitsream for the raster data
is a potential stumbling block for the novice encoder or decoder.
The lowest order bit in the code should coincide with the lowest available bit
in the first available byte in the codestream. For example, if
you're starting with 5-bit
compression codes, and your first three codes are, say, <abcde>, <fghij>,
<klmno>, where e, j, and o are bit#0, then your codestream will start off
like:
byte#0: hijabcde
byte#1: .klmnofg
So the differences between straight
LZW and GIF
LZW are:
two additional special codes and variable
compression sizes.
If you understand LZW,
and you understand those variations, you understand it all!
Just as sort of a P.S., you may have noticed that a
compressor has a little bit of flexibility at
compression time.
I specified a "greedy" approach to the
compression, grabbing
as many characters as possible before outputting codes. This is,
in fact, the standard LZW
way of doing things, and it will yield the best
compression ratio.
But there's no rule saying you can't stop anywhere along the line and just
output the code for the current prefix, whether it's already in the table or
not, and add that string plus the next character to the string table.
There are various reasons for wanting to do this, especially if the strings
get extremely long and make hashing difficult. If you need to, do
it.
Hope this helps out. ----steve blackstock
- LZW
compression - a whatis definition
- LZW Data Compression
- Interactive
LZW Compression
-
The LZW algorithm
- The LZW
compression algorithm
-
I recommend first check
DataCompression.info -
Tutorials, Reference, Presentations
Creating a
''One to One Adaptive Huffman Compression-Decompression Program for the real
world of 8 bit byte files
The Mandala
Centre - Compression and Security - One on one compression FAQ
One to One
Compression -- This site discusses a characteristic of some compression
algorithms that the author refers to as One to One compression. In a nutshell,
this property means that for any file X, F( F'( X ) ) == X. (F is either the
compressor or decompressor, and F' is its opposite number.) This is definitely
not the case for most conventional compression algorithms.
Bijective Arithmetic Encoding with Really Good End Treatment
DataCompression.info - Adaptive Huffman Coding
... Adaptive Huffman Encoding, Rate, A
library to perform adpative ... an implementation of
Vitter's dynamic Huffman compressor, adapted so that it is bijective.
...
patenting
gifs, lzw compression, unisys gif patent, software for gifs, products using lzw
compression algorithm, png file format, development of png file format
Gifs and patents
The patent for LZW compression algorithm is held by Unisys since
1985. Compuserve Gif uses this algorithm. In 1994, Unisys implemented their
patent rights by asking for license fees from developers of products that read
or write gifs. This announcement took developers by surprise and caused quite
a stir on the Internet. There were rumours that Unisys might charge web
developers for usage of gifs on their sites. At the same time, it was argued
that Gifs are the products of LZW compression and the patent does not extend
to the usage of the end product. Actually, web developers had nothing to fear
since Unisys planned to collect license fees only from software companies
whose products employ LZW algorithm.
Web developers should not be worried. They are free to use gifs on their
web sites. However, if you've developed a software that creates or modifies
gifs, you would have to pay licensing fees to Unisys.
The business acumen of the people at Unisys has to be admired. It seems
that they had waited for Gifs to become popular and beneficial (from a web
developers' point of view) before implementing the patent rights. However,
there was an interesting and fortunate (?) side effect of this story. It lead
to the development of the PNG file format. PNG is a much better and more
versatile image format than both JPG and GIF. It has all the bells and
whistles of these two file formats.
At the time of writing the browser support for PNG is still quite patchy,
though the future does look bright.
LZW compression: conjugations patented
Google Groups View Thread LZW Patent Expiry
Can the LZW algorithm be used freely in all countries after the US
patent expiry in June 2003? I have gathered this information from
newsgroups:
1. The main US patent expires the 20 June 2003. The valid time can not
be extended, if they do not make adjustments/extensions to the patent.
US patent:
http://l2.espacenet.com/espacenet/viewer?PN=US4558302&CY=ep&LG=en&DB=EPD
2. There are other US patents covering this field, but these are filed
in 1999, and are therefore not enforceable. Are there other US
patents?
3. Unisys claim that they have patents in Canada, France, Germany,
U.K. and Italy, but they have never sued any company in European
court though they have had chances. There are some references in the
US patent: CA1223965, DE3476617D, EP0129439. Are these patents
enforceable? Are there other patents?
4. Patents are pending in Japan. The US patent refers to these:
JP2610084B2, JP5068893B, JP60116228, JP7249996, JP8237138. Do we have
to worry about these?
Can anyone confirm the information above and answer the questions?
Does this mean that we can use the lzw algorithm freely in all
countries after June 2003?
Etc
Metalinks
Copyright © 1996-2007 by Dr. Nikolai Bezroukov.
www.softpanorama.org was
created as a service to the UN Sustainable Development Networking Programme (SDNP)
in the author free time.
Submit
comments This document is an industrial compilation designed and created
exclusively for educational use and is placed under the copyright of the
Open Content License(OPL).
Original materials copyright belong to respective owners. Quotes are made
for educational purposes only in compliance with the fair use doctrine.
Standard disclaimer: The statements, views and opinions presented on
this web page are those of the author and are not endorsed by, nor do they necessarily
reflect, the opinions of the author present and former employers, SDNP or any other
organization the author may be associated with. We do not warrant the correctness
of the information provided or its fitness for any purpose.
Last modified:
March 15, 2008
Gone are the days... (Score:5)
I used to go down to the local computer store, which had bins and bins of the latest shareware, all on precious 5 1/4 disks. Each one held some sort of magic that would transform my XT with Hercules graphics into a completely absorbing experience.
Video games, clones of major applications, dinky little Pascal compilers, my first version of Spacewar....
But there was a key to all of that magic. Back then, there were no auto-installing CDs. There was no "setup.exe" There would just be a single file, with that ever-familiar extension: ".ZIP"
I had been on the scene long enough to know what was up, so I not only had PKZIP/PKUNZIP installed on my 4 meg harddrive, but I even had it in the PATH.
A few keystrokes later, the magic was unlocked.
We don't know how much we owe to this great man. I genuinely mourn his passing.
37 (Score:4)
by roman_mir (125474) on Saturday April 22 2000, @06:36PM (#1116035) Homepage Journal(to those of us who remember Vladimir Visotskiy) Na zifre 37, kovaren bog, rebrom vopros postavil: ili, ili
Na etom rubeje legli i Bairon i Rembo a nineshnie kak-to proskochili...