
Compression Algorithms


This page is very incomplete, and other pages give a much better overview of the various compression algorithms; most of the common types of compression are well documented elsewhere.

Bijective compression means that for any file X, F( F'( X ) ) == X. (F is either the compressor or decompressor, and F' is its opposite number.) This is definitely not the case for most conventional compression algorithms. This type of compression is important in crypto algorithms.

The more we know the better we compress

Lossless compression reduces bits by identifying and eliminating statistical redundancy. The more heavily the source (say, a text file) is preprocessed and studied, the more efficiently it can be compressed. For example, no generic algorithm can compete on regular articles and books with algorithms designed specifically for compressing text. You can create an even more specialized algorithm, for example one designed specifically to compress Unix syslog files, and it will beat any generic text compression algorithm, and so on and so forth.

Compression always relies on the non-randomness of the data: a truly random sequence of bytes is essentially incompressible.

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage.[7] DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, Gzip and PNG. LZW (Lempel–Ziv–Welch) is used in GIF images. Also noteworthy is the LZR (Lempel-Ziv–Renau) algorithm, which serves as the basis for the Zip method. LZ methods use a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). Current LZ-based coding schemes that perform well are Brotli and LZX. LZX is used in Microsoft's CAB format.

Compression as creation of a special abstract machine and encoding the source as a program for this machine

In general, compression can be viewed as the conversion of a static text into a special program.

In other words, compression consists of converting a static text into some kind of code for a specially constructed (possibly unique for this case) abstract machine. Programs for this abstract machine are shorter than the original text, while the output of a program, after processing by an interpreter (the decompressor), is exactly the same (original) text.

For example, if we first process the text and construct a dictionary of the words used, and then replace each word with its index in the dictionary, we get a primitive compressor that might work well on text with a small vocabulary and frequently repeated words. For example, it might work well for the answers to some lengthy form, like "Yes No n/a Yes No n/a n/a Yes No No No Yes n/a n/a No". In this case the dictionary consists of three entries and each reply can be encoded in two bits.
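
As an illustration only (not part of any real compressor), here is a minimal C sketch of this primitive scheme; the three-entry dictionary and the bit-packing layout are invented for the example, and a real tool would also have to store the dictionary along with the packed bits:

    #include <stdio.h>
    #include <string.h>

    /* Three-entry dictionary: each reply fits in 2 bits. */
    static const char *dict[] = { "Yes", "No", "n/a" };

    int main(void) {
        const char *replies[] = { "Yes", "No", "n/a", "Yes", "No", "n/a", "n/a",
                                  "Yes", "No", "No", "No", "Yes", "n/a", "n/a", "No" };
        int nreplies = (int)(sizeof replies / sizeof replies[0]);
        unsigned char packed[8] = { 0 };   /* 15 replies x 2 bits = 30 bits */
        int nbits = 0;

        for (int i = 0; i < nreplies; i++) {
            int code = 0;
            while (strcmp(replies[i], dict[code]) != 0)
                code++;                               /* dictionary lookup     */
            packed[nbits / 8] |= code << (nbits % 8); /* append the 2-bit code */
            nbits += 2;
        }
        printf("%d replies packed into %d bytes\n", nreplies, (nbits + 7) / 8);
        return 0;   /* prints "15 replies packed into 4 bytes" */
    }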

It is important to understand that the more you know about the structure of the text, the more specialized a processor you can construct and the higher the level of compression you can achieve. For example, if you know that a particular text is, say, an HTTP log, then you can compress it several times better than with any generic algorithm.

Compression and pattern matching

Compression is also connected with pattern matching. The notion of a longest common subsequence (LCS) of two strings is widely used to compare files, and it can be used in compression by splitting the text into several parts and diffing them against each other. In the most primitive case such diffing can be based on lines or paragraphs.

The "diff" command of UNIX system implement an algorithm of finding the shortest sequence of editing command able to convert one string to another. Informally, the result of a diff algorithm gives the minimum number of operations (insert a symbol, or delete a symbol) to transform one string into the other. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution, and with weights associated to operations.

Hirschberg (1975) presents the computation of the LCS in linear space. This idea can be used very efficiently for compression of HTTP logs (for example Apache logs): a window of, say, the last 1K lines often contains at least one string that is very close to the current one (in the Levenshtein metric), so the whole string can be encoded as a reference to that "etalon" string plus the few operations needed to convert the "etalon" into the current string.

In its turn, the search for the LCS is connected with efficient string matching algorithms. The first linear-time string-matching algorithm discovered is due to Morris and Pratt (1970); it was improved by Knuth, Morris, and Pratt (1976). The search behaves like a recognition process by an automaton, and a character of the text is compared to a character of the pattern no more than log_Φ(m+1) times, where Φ is the golden ratio (1+√5)/2. Boyer and Moore's algorithm (1977) is considered the most efficient string-matching algorithm in usual applications; a simplified version of it (or the entire algorithm) is often implemented in text editors for the "search" and "substitute" commands. Several variants of Boyer and Moore's algorithm avoid its quadratic behavior when searching for all occurrences of the pattern.

For searching a text for several patterns simultaneously, Aho and Corasick (1975) designed an O(n log σ) algorithm (σ being the size of the alphabet) whose running time is independent of the number of patterns; it is implemented by the "fgrep" command under the UNIX operating system. When the same text is to be searched for many different patterns, it is the text that needs to be preprocessed: even if no further information is known about its syntactic structure, it is possible, and indeed extremely efficient, to build an index that supports searches. Data structures used to represent such indexes are suffix trees (Weiner 1973, McCreight 1976, Ukkonen 1994), directed acyclic word graphs (Blumer et al., 1985), suffix automata (Crochemore, 1986), and suffix arrays (Manber and Myers, 1993). All of them (except suffix arrays) build the index in time O(n log σ).
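
As a small illustration of the Morris-Pratt/Knuth-Morris-Pratt idea mentioned above, here is a C sketch of a KMP search; it reports all occurrences of the pattern and never moves backwards in the text. The names and the fixed pattern-length limit are assumptions of this example:

    #include <stdio.h>
    #include <string.h>

    /* Knuth-Morris-Pratt search: build the failure function of the pattern,
       then scan the text without ever backing up in it.                     */
    static void kmp_search(const char *text, const char *pat) {
        int n = (int)strlen(text), m = (int)strlen(pat);
        int fail[128];                        /* assumes pattern shorter than 128 */

        fail[0] = 0;
        for (int i = 1, k = 0; i < m; i++) {  /* failure function */
            while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
            if (pat[i] == pat[k]) k++;
            fail[i] = k;
        }
        for (int i = 0, k = 0; i < n; i++) {  /* scan the text */
            while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
            if (text[i] == pat[k]) k++;
            if (k == m) {                     /* full match ending at position i */
                printf("match at %d\n", i - m + 1);
                k = fail[k - 1];
            }
        }
    }

    int main(void) {
        kmp_search("abacabadabacaba", "abacaba");   /* matches at 0 and 8 */
        return 0;
    }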

Notion of the optimal compression algorithm

If you can create a dictionary of the symbols or words used in the text (or just split the source into meaningful, repeatable chunks), an optimal way of encoding the resulting symbols is Huffman encoding. Please note that for ordinary English text, calculating the frequency of individual letters and compressing based on that frequency is far from an optimal approach: words are a much better target. If splitting into "words" is impossible, then digraphs are a better deal than single letters.

Huffman coding - Wikipedia, the free encyclopedia

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".[1]

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in linear time to the number of input weights if these weights are sorted.[2] However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods.

... ... ...

Compression

A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A binary tree is generated from left to right, taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:

Symbol   Code
a1       0
a2       10
a3       110
a4       111

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
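
For reference, both figures follow directly from the table above: the average length is 0.4·1 + 0.35·2 + 0.2·3 + 0.05·3 = 1.85 bits/symbol, while the entropy is −(0.4·log2 0.4 + 0.35·log2 0.35 + 0.2·log2 0.2 + 0.05·log2 0.05) ≈ 1.74 bits/symbol.
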
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n − 1 internal nodes. A Huffman tree that omits unused symbols produces the most optimal code lengths.

The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent, then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probability. With the previous 2 nodes merged into one node (thus not considering them anymore), and with the new node being now considered, the procedure is repeated until only one node remains, the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest probability) from the queue.
   2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.

Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.
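
A compact C sketch of this construction follows (illustrative only, not any particular library's API). To keep it short, the priority queue is replaced by a linear scan for the two lightest unmerged nodes, which raises the running time to O(n²) but produces exactly the same tree; the array layout and names are invented for the example, and the weights are the {0.4, 0.35, 0.2, 0.05} set from above.

    #include <stdio.h>

    #define NSYM 4

    /* Node pool: leaves occupy indexes 0..NSYM-1, internal nodes are appended. */
    static double weight[2 * NSYM - 1];
    static int    parent[2 * NSYM - 1];
    static int    used[2 * NSYM - 1];

    /* Return the index of the lightest node not yet merged (stand-in for
       "remove the node of highest priority from the queue").               */
    static int pick_min(int count) {
        int best = -1;
        for (int i = 0; i < count; i++)
            if (!used[i] && (best < 0 || weight[i] < weight[best]))
                best = i;
        used[best] = 1;
        return best;
    }

    int main(void) {
        double p[NSYM] = { 0.4, 0.35, 0.2, 0.05 };
        int count = NSYM;

        for (int i = 0; i < NSYM; i++) weight[i] = p[i];

        while (count < 2 * NSYM - 1) {           /* NSYM-1 merges build the tree */
            int a = pick_min(count), b = pick_min(count);
            weight[count] = weight[a] + weight[b];
            parent[a] = parent[b] = count;
            count++;
        }
        for (int i = 0; i < NSYM; i++) {         /* depth of a leaf = code length */
            int len = 0;
            for (int j = i; j != 2 * NSYM - 2; j = parent[j]) len++;
            printf("a%d: %d bits\n", i + 1, len);
        }
        return 0;
    }

Run on these weights it reports 1, 2, 3 and 3 bits for a1..a4, matching the codes 0, 10, 110 and 111 derived above.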

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:
1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is at the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.

Although linear-time given sorted input, in the general case of arbitrary input, using this algorithm requires pre-sorting. Thus, since sorting takes O(n log n) time in the general case, both methods have the same overall complexity.
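
Assuming the leaf weights are already sorted in increasing order, a C sketch of the two-queue construction might look like the following; the integer weights and array-backed queues are choices made for this example, and ties between the fronts of the two queues are broken in favor of the first queue (the variance-minimizing choice discussed just below):

    #include <stdio.h>

    #define NSYM 5

    int main(void) {
        /* Leaf weights, already sorted in increasing order. */
        int weight[2 * NSYM - 1] = { 1, 2, 4, 4, 9 };
        int parent[2 * NSYM - 1] = { 0 };

        int q1[NSYM], h1 = 0, t1 = NSYM;   /* first queue: the leaves       */
        int q2[NSYM], h2 = 0, t2 = 0;      /* second queue: merged subtrees */
        for (int i = 0; i < NSYM; i++) q1[i] = i;

        for (int next = NSYM; next < 2 * NSYM - 1; next++) {
            int child[2];
            for (int k = 0; k < 2; k++) {  /* dequeue the lighter of the two fronts */
                if (h2 == t2 || (h1 < t1 && weight[q1[h1]] <= weight[q2[h2]]))
                    child[k] = q1[h1++];
                else
                    child[k] = q2[h2++];
            }
            weight[next] = weight[child[0]] + weight[child[1]];
            parent[child[0]] = parent[child[1]] = next;
            q2[t2++] = next;               /* new subtree goes to the rear of queue 2 */
        }
        for (int i = 0; i < NSYM; i++) {   /* depth of a leaf = code length */
            int len = 0;
            for (int j = i; j != 2 * NSYM - 2; j = parent[j]) len++;
            printf("weight %d: %d bits\n", weight[i], len);
        }
        return 0;
    }

For the weights 1, 2, 4, 4, 9 it prints code lengths 4, 4, 3, 2 and 1; only a single pass is needed because the front of one of the two queues always holds the lightest remaining node.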

In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.

[Figure omitted: an example of optimized Huffman coding of the French sentence "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs"; the plain Huffman coding tree for the same text would have a different structure.]

Decompression

Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori.

A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B·2^B bits of information (where B is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well.

In any case, since the compressed data can include unused "trailing bits" the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).
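
A sketch of the "prepend the tree bit by bit" method described above, with 0 standing for an internal (parent) node and 1 for a leaf followed by the 8 bits of its symbol; the node structure and the toy bit writer (which prints '0'/'1' characters instead of packing real bits) are inventions of this example:

    #include <stdio.h>
    #include <stdlib.h>

    /* A leaf stores a symbol; an internal node stores two children. */
    struct node { struct node *left, *right; unsigned char sym; };

    /* Toy bit writer: emits bits as characters instead of packing them. */
    static void put_bit(int b) { putchar(b ? '1' : '0'); }

    /* Pre-order serialization: 0 = internal node, 1 = leaf + 8 bits of symbol. */
    static void write_tree(const struct node *t) {
        if (t->left == NULL) {                        /* leaf */
            put_bit(1);
            for (int i = 7; i >= 0; i--) put_bit((t->sym >> i) & 1);
        } else {                                      /* internal node */
            put_bit(0);
            write_tree(t->left);
            write_tree(t->right);
        }
    }

    static struct node *leaf(unsigned char c) {
        struct node *n = calloc(1, sizeof *n); n->sym = c; return n;
    }
    static struct node *join(struct node *l, struct node *r) {
        struct node *n = calloc(1, sizeof *n); n->left = l; n->right = r; return n;
    }

    int main(void) {
        /* The tree of the a1..a4 example above: a1=0, a2=10, a3=110, a4=111. */
        struct node *root = join(leaf('1'), join(leaf('2'), join(leaf('3'), leaf('4'))));
        write_tree(root);      /* 3 internal + 4 leaf nodes -> 3 + 4*9 = 39 bits */
        putchar('\n');
        return 0;
    }

A matching reader simply recurses: it reads one bit and either reads 8 more bits to rebuild a leaf, or rebuilds an internal node from two recursive calls.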

Main properties

The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. This requires that a frequency table must be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.

Optimality

Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown. Also, if symbols are not independent and identically distributed, a single code may be insufficient for optimality. Other methods such as arithmetic coding and LZW coding often have better compression capability: Both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, useful when input probabilities are not precisely known or vary significantly within the stream. However, these methods have higher computational complexity. Also, both arithmetic coding and LZW were historically a subject of some concern over patent issues. However, as of mid-2010, the most commonly used techniques for these alternatives to Huffman coding have passed into the public domain as the early patents have expired.

However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically distributed random variables, combining symbols ("blocking") reduces inefficiency in a way that approaches optimality as the number of symbols combined increases. Huffman coding is optimal when each input symbol is a known independent and identically distributed random variable having a probability that is the inverse of a power of two.

Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^−1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb coding is a provably optimal run-length code.

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. This reflects the fact that compression is not possible with such an input.

Variations

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations is given by "Code and Parse Trees for Lossless Source Encoding".[4]

n-ary Huffman coding

The n-ary Huffman algorithm uses the {0, 1, ... , n − 1} alphabet to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In these cases, additional 0-probability place holders must be added. This is because the tree must form an n to 1 contractor; for binary coding, this is a 2 to 1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo n−1, then the set of source words will form a proper Huffman tree.
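
For example, with a ternary code (n = 3) each merge removes n − 1 = 2 nodes, so the number of source words must be congruent to 1 modulo 2: six source words would therefore get one 0-probability placeholder added, bringing the count to 7, before the tree is built.
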

Adaptive Huffman coding

A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized adaptive arithmetic coding, which is more flexible and gives better compression.

Huffman template algorithm

Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max_i [ w_i + length(c_i) ], a problem first applied to circuit design.

Length-limited Huffman coding/minimum variance Huffman coding

Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.

Huffman coding with unequal letter costs

In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.

Huffman coding with unequal letter costs is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding, though it has been solved by Karp whose solution has been refined for the case of integer costs by Golin.

Optimal alphabetic binary trees (Hu–Tucker coding)

In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, A = {a, b, c} could not be assigned code H(A, C) = {00, 1, 01}, but instead should be assigned either H(A, C) = {00, 01, 1} or H(A, C) = {0, 10, 11}. This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem,[5] which has some similarities to the Huffman algorithm but is not a variation of it. These optimal alphabetic binary trees are often used as binary search trees.

The canonical Huffman code

Main article: Canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found from calculating these lengths, rendering Hu–Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code corresponding to the example is {000, 001, 01, 10, 11}, which, having the same codeword lengths as the original solution, is also optimal. But in canonical Huffman code, the result is {110, 111, 00, 01, 10}.
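
A small C sketch of the canonical assignment: given only the code lengths, codewords are handed out from the shortest length upward, each one being the previous code plus one, left-shifted whenever the length grows. The lengths below are those of the five-symbol example in the text, and the program prints the canonical code {110, 111, 00, 01, 10}:

    #include <stdio.h>

    #define NSYM 5

    int main(void) {
        /* Codeword lengths for a1..a5 from the example above. */
        int len[NSYM] = { 3, 3, 2, 2, 2 };
        int code[NSYM];
        int maxlen = 3;

        /* Canonical assignment: shorter codes first; within one length the
           symbols keep their original order.                                */
        int next = 0, prev_len = 0;
        for (int l = 1; l <= maxlen; l++)
            for (int s = 0; s < NSYM; s++)
                if (len[s] == l) {
                    if (prev_len) next = (next + 1) << (l - prev_len);
                    code[s] = next;
                    prev_len = l;
                }

        for (int s = 0; s < NSYM; s++) {   /* print each code as a bit string */
            printf("a%d = ", s + 1);
            for (int b = len[s] - 1; b >= 0; b--)
                putchar('0' + ((code[s] >> b) & 1));
            putchar('\n');
        }
        return 0;
    }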

 

References

  1. Aho, A.V. 1990. Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer Science, Algorithms and Complexity, Vol. A, ch. 5, pp 255--330. J. van Leeuwen ed., Elsevier, Amsterdam.
  2. Bell, T.C., Cleary J.G. and Witten, I.H. 1990. Text Compression. Prentice Hall, Englewood Cliffs, New Jersey.
  3. Cormen, T.H., Leiserson C.E. and Rivest, R.L. 1990. Introduction to Algorithms, ch. 34, pp 853--885. MIT Press.
  4. Crochemore, M. and Rytter W. 1994. Text Algorithms. Oxford University Press.
  5. Gonnet, G.H. and Baeza-Yates, R.A. 1991. Handbook of Algorithms and Data Structures, ch. 7, pp 251--288. Addison-Wesley.
  6. Nelson, M. 1992. The Data Compression Book. M&T Books.
  7. Sedgewick R. 1990. Algorithms in C, ch. 19 and 22. Addison-Wesley.
  8. Stephen, G.A. 1994. String Searching Algorithms. World Scientific Press

HTTP compression

HTTP compression - Wikipedia, the free encyclopedia

Content-Encoding tokens

The official list of tokens available to servers and clients is maintained by IANA.[4]

In addition to the officially registered tokens, a number of unofficial or non-standardized tokens are used in the wild by either servers or clients.

Servers that support HTTP compression

The compression in HTTP can also be achieved by using the functionality of server-side scripting languages like PHP, or programming languages like Java.



Old News ;-)

[May 13, 2011] Lzip

freshmeat.net

Lzip is a lossless data compressor based on the LZMA algorithm, with very safe integrity checking and a user interface similar to the one of gzip or bzip2. Lzip decompresses almost as quickly as gzip and compresses better than bzip2, which makes it well suited for software distribution and data archiving. Lziprecover is a data recovery tool for lzip compressed files able to repair slightly damaged files, recover badly damaged files from two or more copies, and extract undamaged members from multi-member files.

ed_avis

Re: Other LZMA tools

I think you are right that the standalone 'lzma' program (replacing the older lzmash) has a very basic data format. But still, it works, and is the more established tool. I would be happy for lzip to replace it if lzip is better, but to do that it should include support for decompressing legacy .lzma files.

(I note that the gzip format has provision for alternative compression methods but nobody ever seems to use it.)

> As for lrzip, it is actually an extension of rzip---and the two are more of a proof-of-concept than a realworld-workable format.

The file format may be basic but the tool is very good. It usually compresses better than plain LZMA (the algorithm, used in both lzma-utils and lzip) and faster too. LZMA is better for all-purpose use but for batch compression tasks where you don't mind relatively high memory usage, lrzip can give a big improvement. For some Subversion dump files I back up overnight it gave a fourfold increase in compression for about the same speed.

jengelh

Re: Other LZMA tools

As I gather, lzma-utils-produced files lack magic identification bytes and a checksum, and if you believe forum archives, lzma-utils did not manage to come up with a suitable new format in more than two years. It is about time lzip came along-7z sounds nice too, but seems to have gotten no ground in the Unix world due to subpar unix integration.

As for lrzip, it is actually an extension of rzip-and the two are more of a proof-of-concept than a realworld-workable format.

[Jan 30, 2008] History of Data Compression in Japan

In 1987, I was asked by a magazine editor to write an article about data compression. I wrote a manuscript and an accompanying program, sent them to the editor, and forgot about them. The next time I heard from him I was told that the magazine was discontinued. So I uploaded my program, a simple implementation of the LZSS compression algorithm (see below) to PC-VAN, a big Japanese BBS run by NEC. That was May 1, 1988.

Soon a number of hobby programmers gathered and began improving on that program. The project culminated in Kazuhiko Miki's archiver LArc, which was fairly widely used in Japan. (Dr. Miki was then a medical specialist working at a governmental office. I heard he left office and began work on freeware/shareware promotion.)

The LZSS algorithm is based on a very simple idea. Suppose I'm going to write "compression" here. But probably I've already used that word before in this file. If I used that word 57 characters before, I might as well write "go 57 characters backward, and read 11 characters," or <57,11> for short. In general, when I've already used the string of characters among the recent 4096 characters, say, I encode the string by a <position,length> pair.

In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others.

Later versions of my LZSS implementations and Miki's LArc used binary search trees to make string search faster; see Bell [1].

Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding dictionary [14] and dynamic dictionary [15] in Storer's [8] terminology. The LZW algorithm [12] belongs to the latter. Most pre-LHarc compression tools, such as 'compress', 'ARC', and 'PKARC', used LZW.

During the summer of 1988, I wrote another compression program, LZARI. This program is based on the following observation: Each output of LZSS is either a single character or a <position,length> pair. A single character can be coded as an integer between 0 and 255. As for the <length> field, if the range of <length> is 2 to 257, say, it can be coded as an integer between 256 and 511. Thus, I can say that there are 512 kinds of "characters," and the "characters" 256 through 511 are accompanied by a <position> field. These 512 "characters" can be Huffman-coded, or better still, algebraically coded. The <position> field can be coded in the same manner. In LZARI I used an adaptive algebraic compression [13], [2] to encode the "characters," and static algebraic compression to encode the <position> field. (There were several versions of LZARI; some of them were slightly different from the above description.) The compression of LZARI was very tight, though rather slow.

Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked very hard to make LZARI faster. Most importantly, he replaced LZARI's algebraic compression by dynamic Huffman coding.

His program, LZHUF, was very successful. It was much faster than my LZARI. As for compression ratio, Huffman cannot beat algebraic compression, but the difference turned out to be very small.

Yoshi rewrote the compression engine of LZHUF in assembler, and added a nifty user interface. His archiver, LHarc, soon became the de facto standard among Japanese BBS users. After Prof. Kenjirou Okubo, a mathematician, introduced LHarc to the United States, it became world-famous. Other vendors began using similar techniques: sliding dictionary plus statistical compressions such as Huffman and Shannon-Fano. (I wondered why they used Shannon-Fano rather than Huffman which is guaranteed to compress tighter than Shannon-Fano. As it turned out, a then-popular book on compression published in U.S. contained a wrong description and buggy sample programs, such that Shannon-Fano outperformed (buggy) Huffman on many files.)

Although LHarc was much faster than LZARI, we weren't quite satisfied with its speed. Because LHarc was based on dynamic Huffman, it had to update Huffman tree every time it received a character. Yoshi and I tried other dynamic Huffman algorithms [5], [10], [11], but improvements were not as great as we desired.

So I took a different step: replacing LHarc's dynamic Huffman by a static Huffman method.

Traditional static Huffman coding algorithm first scans the input file to count character distribution, then builds Huffman tree and encodes the file. In my approach, the input file is read only once. It is first compressed by a sliding dictionary method like LZARI and LHarc, and at the same time the distributions of the "characters" (see above) and positions are counted. The output of this process is stored in main memory. When the buffer in memory is full (or the input is exhausted), the Huffman trees are constructed, and the half-processed content of the buffer is actually compressed and output.

In static Huffman, the Huffman tree must be stored in the compressed file. In the traditional approach this information consumes hundreds of bytes. My approach was to standardize Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords. Moreover, the resulting table is again compressed by the same Huffman algorithm.

To make the decoding program simpler, the Huffman tree is adjusted so that the codeword lengths do not exceed 16 bits. Since this adjusting is rarely needed, the algorithm is made very simple. It does not create optimal length-limited Huffman trees; see e.g. [6] for an optimal algorithm. Incidentally, my early program had a bug here, which was quickly pointed out and corrected by Yoshi.

The sliding dictionary algorithm is also improved by Yoshi using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].

After completing my algorithm, I learned that Brent [3] also used a sliding dictionary plus Huffman coding. His method, SLH, is simple and elegant, but since it doesn't find the most recent longest match, the distribution of match position becomes flat. This makes the second-stage Huffman compression less efficient.

On the basis of these new algorithms, Yoshi began to rewrite his LHarc, but it took him so long (remember he was a busy doctor!) that I decided to write my own archiver. My archiver was quite recklessly named 'ar'. (Actually I appended version numbers as in 'ar002' for version 0.02.) I should have named it 'har' (after my name), say, because 'ar' collides with the name of UNIX's archiver. I didn't want my program to compete with LHarc, but I wanted many people to try the algorithm, so I wrote it in pure ANSI C. This is the reason 'ar' lacked many bells and whistles necessary for a real archiver.

Note: The version of 'ar002' most often found in the U.S. had a bug. Line 24 of maketbl.c should read, of course,
    while (i <= 16) {
        weight[i] = 1U << (16 - i);  i++;
    }
Somehow the bug didn't show up when compiled by Turbo C.

Yoshi finally showed us his new archiver written in C. It was tentatively named LHx. He then rewrote the main logic in assembler. Yoshi and I wrote an article describing his new archiver, which would be named LH, in the January, 1991, issue of "C Magazine" (in Japanese). The suffix 'arc' of LHarc was deliberately dropped because the people who sold ARC did not want others to use the name.

Then we learned that for the new DOS 5.0, LH meant LoadHigh, an internal command. We decided to rename LH to LHA.

Also, I was told that the algorithm described in Fiala and Greene [4] got patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990. Actually they got three patents! The other two were: "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699.)

Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, ceased to use 'compress' program any more because its LZW algorithm got patented.

Are algorithms patentable? (See [16].) If these patents should turn out to be taken seriously, all compression programs now in use may infringe some of these patents. (Luckily, not all claims made by those algorithm patents seems to be valid.)


The foregoing is a slight modification of what I wrote in 1991. The year 1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka University. This opportunity should have given me more free time, but as it turned out I got ever busier. I stopped hacking on my compression algorithms; so did Yoshi.

Luckily, all good things in LHA were taken over, and all bad things abandoned, by the new great archiver zip and the compression tool gzip. I admire the efforts of Jean-loup Gailly and others.

A brief historical comment on PKZIP: At one time a programmer for PK and I were in close contact. We exchanged a lot of ideas. No wonder PKZIP and LHA are so similar.

Another historical comment: LHICE and ICE are definitely not written by Yoshi (or me or anyone I know). I think they are faked versions of LHarc.

REFERENCES

[1]
Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
[2]
Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.
[3]
R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
[4]
Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
[5]
Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
[6]
Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
[7]
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
[8]
James A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD., 1988.
[9]
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
[10]
Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
[11]
Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
[12]
Terry A. Welch. A technique for high-performance data compression. IEEE Computer}, 17(6):8--19, 1984.
[13]
Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
[14]
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
[15]
Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
[16]
Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.

Haruhiko Okumura,
okumura@matsusaka-u.ac.jp

Last modified: Tue Mar 17 17:02:03 1998

[Oct 31, 2004] Data Compression -- mini-tutorial by Greg Goebel.

The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for "data compression", that is, the ability to reduce the amount of storage or Internet bandwidth required to handle data.

This document provides a survey of data compression techniques. The focus is on the most prominent data compression schemes, particularly popular archivers, JPEG image compression, and MPEG video and audio compression.

[April 22, 2000] Slashdot Phillip W. Katz, Creator Of PKZIP, Dead At 37

Gone are the days... (Score:5) by zpengo (99887) on Saturday April 22 2000, @06:31PM (#1116033) Homepage

Oh, the memories...

I used to go down to the local computer store, which had bins and bins of the latest shareware, all on precious 5 1/4 disks. Each one held some sort of magic that would transform my XT with Hercules graphics into a completely absorbing experience.

Video games, clones of major applications, dinky little Pascal compilers, my first version of Spacewar....

But there was a key to all of that magic. Back then, there were no auto-installing CDs. There was no "setup.exe" There would just be a single file, with that ever-familiar extension: ".ZIP"

I had been on the scene long enough to know what was up, so I not only had PKZIP/PKUNZIP installed on my 4 meg harddrive, but I even had it in the PATH.

A few keystrokes later, the magic was unlocked.

We don't know how much we owe to this great man. I genuinely mourn his passing.

37 (Score:4) by roman_mir (125474) on Saturday April 22 2000, @06:36PM (#1116035) Homepage Journal

So many celebrities, poets, actors, revolutionaries, warriors, politicians etc. have died at 33 and 37. I tell you, if you pass 37 you'll probably live a long life.

(to those of us who remember Vladimir Visotskiy) Na zifre 37, kovaren bog, rebrom vopros postavil: ili, ili. Na etom rubeje legli i Bairon i Rembo, a nineshnie kak-to proskochili... [At the number 37 a treacherous God put the question point-blank: either, or. Both Byron and Rimbaud fell at that line, but today's ones have somehow slipped past...]

Sad day.... by maelstrom (Score:1) Saturday April 22 2000, @06:37PM

*sigh* (Score:5) by Seumas (6865) on Saturday April 22 2000, @06:37PM (#1116040)

This is the first I've heard of his death and I have to say that it really makes me feel sad. I'm not aware of much that he's done outside of PKZIP, but I sure remember using ZIP for everything online (especially when a 2400 baud modem was considered fast and a zipped file could halve your online time).

Huffman, Postel, Stevens . . . Now P.W. Katz. I feel guilty for not ever considering any of these people beyond what their program does or does not do for me -- or how I benefitted from their books, until after their death. To think that while we're all out there unzipping our latest copy of the Jargon file or stashing a bunch of porn in a password protected ZIP file, this guy was suffering a serious problem which eventually took his life at the age of *thirty-seven*.

I'm only 22. I spend all my time working at a desk. I haven't been in-shape for almost six years. I could be next. I could be next and I haven't offered a damn thing to the computer or internet community. These people -- and many others, have.

I hope that we'll remember these things in subsequent posts in reply to this article. The last thing we need is another disgustingly barbaric replay of the posts we saw when W. Richard Stevens died.

I hope you have peace, Phillip.

W. Richard Stevens Slashdot Article [slashdot.org]
W. Richard Stevens Home Page [kohala.com]
David Huffman Slashdot Article [slashdot.org]
Jon Postel Slashdot Article [slashdot.org]
Jon Postel's Home Page [isi.edu]

Re:PK vs. SEA controversy (Score:4) by farrellj (563) on Saturday April 22 2000, @07:04PM (#1116054) Homepage Journal

O.K., here is the story as I remember it.

Phil wrote a better compression program that was compatible with System Enhancements Associates' (SEA) program called ARC. So they litigated. And so Phil went off and found a better algorithm for compression, and brought out PKZIP.

Many people in the BBS community thought that SEA was a little heavyhanded (Perception, I don't know the reality), and moved to PKZIP. Others moved over for the speed and the better compression. The rest is history.

See also "arc wars" [cs.hut.fi]MIT Jargon File ver 299. This story seems to have been dropped from the current Jargon File [jargon.org] for some reason.

ttyl
Farrell McGovern

Former Sysop, Data/SFnet (One of the first few hundred Fidonet BBSs!) and Solsbury Hill, founding member of PODSnet.

This is really sad! (Score:5) by DeepDarkSky (111382) on Saturday April 22 2000, @07:04PM (#1116055)

I definitely remember Phil Katz and all the controversy surrounding him, and how grateful I was to have discovered his programs. I remember the first compression program which was SEA's ARC program. It was very slow. Then my friend and I discovered PKARC and PKXARC, which were much faster than ARC. As PKARC gained popularity because of its overall superiority, SEA sued Phil Katz, and he in turn created PKPAK/PKUNPAK (I think it was still paired like that). The PKPAK series didn't last long. The PKZIP series came out next, and that was the series that created the ubiquitous ZIP format that we see today. If I remember correctly, PKZ204G was the last official DOS version of the program, and there were plenty of trojans, etc. that were going around, and Phil created self-authenticating zip files, etc. Lots of neat little cool things. I also remember that other programs were giving PK a run for his money, such as ARJ and LHARC, but they never achieved the overall speed/performance/compression that PKZIP ever did (they were often better in one thing or another but not overall). Then WINZIP came out, and I kind of lost sight of PK.

I still have thousands of ZIP files that were zipped with PKZIP. If it wasn't for him, I wouldn't have been as into computers as I am, it was because of those early days of playing around with PKARC and PKXARC that really got me started. I am terribly sad to see him go and in such (I think) indignant way.

Re:Would like to know the rest of the story (Score:3) by Trepidity (597) <delirium-slashdot.hackish@org> on Saturday April 22 2000, @07:18PM (#1116062) Homepage

Well, PKZip seems to have stopped development around 1993, well before WinZip became popular. PKZip v2.04g was pretty much the last version I know of, and it came out February 1, 1993. Up until then there had been fairly frequent updates, but throughout 1993, 1994, and 1995, PKZip v2.04g for DOS remained the standard compression tool.

Only then, after 2-3 years of no updates, did other tools like WinZip become popular. PKZip finally made a Windows product in 1997 or 1998, but they were long gone by then. I'm not sure what led to the development halt, but the original stuff is fine coding...

Re:Would like to know the rest of the story (Score:5) by Harinath (26296) on Saturday April 22 2000, @07:28PM (#1116067) Homepage

IIRC, the ZIP file format was made public domain, thus allowing the Info-ZIP people to write a program that reads ZIP archives, which in turn allowed WinZip to not have to license software from PKware.

Unlike LZW, the ZIP "deflate" algorithms (LZ77 + Shannon Fano encoding) are unencumbered. These compression algorithms are used in GNU Zip (gzip) partly for that reason. I think gzip can even read .zip archives with only one file inside. The zip algorithm is also in the zlib library, which is used in the PNG format, for one. The "deflate" algorithm is also described in RFC 1951.

So, thanks PK, for providing one of the tools that enable us to thumb our noses at Unisys :-)

Some more links... (Score:2) by Megane (129182) on Saturday April 22 2000, @07:31PM (#1116069)

From the Milwaukee Journal Sentinel:
The original obituary notice [jsonline.com]
A more complete article [jsonline.com]

One interesting quote:
"It was just a hobby," he said. "I didn't expect it to turn into a business."

I had a moderately successful shareware program myself during the '80s, and it sure didn't help my life much. Fortunately I have no interest in booze or drugs -- they just get in the way of hacking. And also fortunately, I let it go when it wasn't successful any more. Maybe a little later than I should have, but I did move on.

Phillip Katz's patents (Score:2) by ContinuousPark (92960) on Saturday April 22 2000, @07:32PM (#1116071)

Couldn't find much so far about him but I came across this page where several patents on data compression algorithms are mentioned and this led me to one of his patents [ibm.com], a so-called string searcher and compressor.

It would be interesting to know if what the patent describes is the technology behind PKZIP.

I first found out about this on CNET Download.com's front page; there was this little message in memoriam of PK but I don't think it was mentioned on News.com; that was strange. This is a sad event and I think it would be more convenient and respectful if we didn't get to know the details of his death just because it turned out to be a morbid and attention-attracting story for the media.

Re:How many people here registered pkzip? (Score:1) by Anonymous Coward on Saturday April 22 2000, @06:53PM (#1116049)

i didn't either.

do you think it would be appropriate to register now, and contribute the check to his estate?

[April 19, 2000] News - Paid Death Notices - Katz, Phillip W.

Katz, Phillip W.
Publication Date: April 19, 2000
Age 37. Passed away unexpectedly on Fri., April 14, 2000. Beloved son of Hildegard and beloved brother of Cynthia. Also survived by other relatives and friends. Phil was a graduate of UWM Computer Science Engineering Program. He was the author of the PKZIP/PKUNZIP software and owner of PKWARE Inc. Co. Private services have been held. Memorials to the charity of your choice would be appreciated.

ZWASKA FUNERAL HOME 354-5330 Serving the Family

Arturo Campos home page

The column "html" shows the size of the article in html format and if clicked opens it in a new window. The "zip" column contains the zipped article and images or source if any. Note that brighter rows mean new or updated article. If you take the time to read the articles, please take the time to tell me what do you think of them.
Title Html Zip Last update
Arithmetic coding, entropy coder 20k 7k 22-Jul-1999
Bwt, block reduction 21k 8k 22-Jul-1999
Canonical huffman. 15k 5k 23-Jul-1999
Crc-32, the standard Crc 6k 7k 10-Aug-1999
Finite context modeling 37k 12k 16-Nov-1999
Flexible parsing, improvement of lz 10k 4k 24-Jul-1999
Lz77, also called lzss 40k 14k 23-Jul-1999
Lzw, for gif decoding 27k 9k 23-Jul-1999
Mtf, a transformation scheme 9k 3k 24-Jul-1999
Implementing ppmc with hash tables 111k 327k 21-March-2000
Quasi Static model 19k 6k 13-Aug-1999
Range coder 24k 8k 17-Nov-1999
Rle, Basic scheme 7k 3k 24-Jul-1999

As you can read in the index I'm going to publish new articles as soon as possible; if you want to get an email when this happens, please fill in this form, which you can also use to tell me which file format you prefer. If you have any question about the articles feel free to email me.

My goal is to offer the best articles about compression for free. Thus I need your feedback on the articles to further improve them: what you don't know, what was confusing or unclear, what difficulties you found while implementing it, or what you miss. Telling me so is the only way of further improving the articles for you, and it depends only on you.

Note that since my first articles both my knowledge about compression and my English have greatly improved, thus you can notice a difference in quality between the old articles and the latest ones. Unfortunately I don't have the time to rewrite all the articles (as I would like to), so instead please tell me which articles you find worth rewriting, and I'll do so.

The Rise and Fall of a Software Star - Phil Katz Loved Code -- and Liquor

LZW and GIF explained John Barkaus's "LZW and GIF explained"

From: john@cooper.cooper.EDU (John Barkaus)
Newsgroups: comp.graphics
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)

LZW and GIF explained----Steve Blackstock

I hope this little document will help enlighten those of you out there who want to know more about the Lempel-Ziv Welch compression algorithm, and, specifically, the implementation that GIF uses.

Before we start, here's a little terminology, for the purposes of this document:

"character":
a fundamental data element. In normal text files, this is just a single byte. In raster images, which is what we're interested in, it's an index that specifies the color of a given pixel. I'll refer to an arbitrary character as "K".
"charstream":
a stream of characters, as in a data file.
"string":
a number of continuous characters, anywhere from one to very many characters in length. I can specify an arbitrary string as "[...]K".
"prefix":
almost the same as a string, but with the implication that a prefix immediately precedes a character, and a prefix can have a length of zero. So, a prefix and a character make up a string. I will refer to an arbitrary prefix as "[...]".
"root":
a single-character string. For most purposes, this is a character, but we may occasionally make a distinction. It is [...]K, where [...] is empty.
"code":
a number, specified by a known number of bits, which maps to a string.
"codestream":
the output stream of codes, as in the "raster data"
"entry":
a code and its string.
"string table":
a list of entries; usually, but not necessarily, unique.

That should be enough of that.

LZW is a way of compressing data that takes advantage of repetition of strings in the data. Since raster data usually contains a lot of this repetition, LZW is a good way of compressing and decompressing it.

For the moment, let's consider normal LZW encoding and decoding. GIF's variation on the concept is just an extension from there.

LZW manipulates three objects in both compression and decompression: the charstream, the codestream, and the string table. In compression, the charstream is the input and the codestream is the output. In decompression, the codestream is the input and the charstream is the output. The string table is a product of both compression and decompression, but is never passed from one to the other.

The first thing we do in LZW compression is initialize our string table. To do this, we need to choose a code size (how many bits) and know how many values our characters can possibly take. Let's say our code size is 12 bits, meaning we can store 0->FFF, or 4096 entries in our string table. Lets also say that we have 32 possible different characters. (This corresponds to, say, a picture in which there are 32 different colors possible for each pixel.) To initialize the table, we set code#0 to character#0, code #1 to character#1, and so on, until code#31 to character#31. Actually, we are specifying that each code from 0 to 31 maps to a root. There will be no more entries in the table that have this property.

Now we start compressing data. Let's first define something called the "current prefix". It's just a prefix that we'll store things in and compare things to now and then. I will refer to it as "[.c.]". Initially, the current prefix has nothing in it. Let's also define a "current string", which will be the current prefix plus the next character in the charstream. I will refer to the current string as "[.c.]K", where K is some character. OK, look at the first character in the charstream. Call it P. Make [.c.]P the current string. (At this point, of course, it's just the root P.) Now search through the string table to see if [.c.]P appears in it. Of course, it does now, because our string table is initialized to have all roots. So we don't do anything. Now make [.c.]P the current prefix. Look at the next character in the charstream. Call it Q. Add it to the current prefix to form [.c.]Q, the current string. Now search through the string table to see if [.c.]Q appears in it. In this case, of course, it doesn't. Aha! Now we get to do something. Add [.c.]Q (which is PQ in this case) to the string table for code#32, and output the code for [.c.] to the codestream. Now start over again with the current prefix being just the root Q. Keep adding characters to [.c.] to form [.c.]K, until you can't find [.c.]K in the string table. Then output the code for [.c.] and add [.c.]K to the string table. In pseudo-code, the algorithm goes something like this:

  1. Initialize string table;
  2. [.c.] <- empty;
  3. K <- next character in charstream;
  4. Is [.c.]K in string table?
    • yes:
      • [.c.] <- [.c.]K;
      • go to [3];
    • no:
      • add [.c.]K to the string table;
      • output the code for [.c.] to the codestream;
      • [.c.] <- K;
      • go to [3];

It's as simple as that! Of course, when you get to step [3] and there aren't any more characters left, you just output the code for [.c.] and throw the table away. You're done.

Wanna do an example? Let's pretend we have a four-character alphabet: A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is A, which is in the string table, so [.c.] becomes A. Next we get AB, which is not in the table, so we output code #0 (for [.c.]), and add AB to the string table as code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in the string table, so output code #1, and add BA to the string table as code #5. [.c.] becomes A. Next we get AC, which is not in the string table. Output code #0, and add AC to the string table as code #6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, which is not in the string table, so output the code for AB, which is #4, and add ABA to the string table as code #8. [.c.] becomes A. We can't get any more characters, so we just output #0 for the code for A, and we're done. So, the codestream is #0#1#0#2#4#0.
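
To make the pseudo-code and the worked example concrete, here is a minimal Python sketch of the same greedy loop (a dictionary stands in for the string table, and the function name lzw_compress is mine, not part of any standard library). Run on ABACABA with the alphabet A,B,C,D it reproduces the codestream #0#1#0#2#4#0:

  def lzw_compress(charstream, alphabet):
      """Plain LZW compression, following the pseudo-code above.

      Returns the list of output codes; `alphabet` lists the roots."""
      # 1. Initialize the string table with all roots.
      table = {ch: code for code, ch in enumerate(alphabet)}
      next_code = len(alphabet)

      prefix = ""                          # [.c.] starts out empty
      codes = []
      for k in charstream:                 # 3. K <- next character
          candidate = prefix + k           # the current string [.c.]K
          if candidate in table:           # 4. is [.c.]K in the table?
              prefix = candidate           #    yes: extend the prefix
          else:
              table[candidate] = next_code #    no: add [.c.]K to the table,
              next_code += 1
              codes.append(table[prefix])  #    output the code for [.c.],
              prefix = k                   #    and restart from the root K
      if prefix:                           # end of charstream: flush [.c.]
          codes.append(table[prefix])
      return codes

  print(lzw_compress("ABACABA", "ABCD"))   # -> [0, 1, 0, 2, 4, 0]

Note that the dictionary lookup is already the "hashing strategy" recommended below; this sketch also ignores table overflow, which GIF handles with its clear code.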

A few words (four) should be said here about efficiency: use a hashing strategy. The search through the string table can be computationally intensive, and some hashing is well worth the effort. Also, note that "straight LZW" compression runs the risk of overflowing the string table - getting to a code which can't be represented in the number of bits you've set aside for codes. There are several ways of dealing with this problem, and GIF implements a very clever one, but we'll get to that.

An important thing to notice is that, at any point during the compression, if [...]K is in the string table, [...] is there also. This fact suggests an efficient method for storing strings in the table. Rather than store the entire string of K's in the table, realize that any string can be expressed as a prefix plus a character: [...]K. If we're about to store [...]K in the table, we know that [...] is already there, so we can just store the code for [...] plus the final character K.
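
One way to picture this storage scheme (purely illustrative; the names and the dummy prefix code are hypothetical): key the table by a (prefix code, character) pair instead of by the whole string. A dictionary keyed this way is also exactly the hashing strategy suggested above:

  # Each entry [...]K is stored as (code of [...], K) -> code of [...]K.
  # Roots use a dummy prefix code of -1; 32 roots, as in the example above.
  NO_PREFIX = -1
  table = {(NO_PREFIX, chr(c)): c for c in range(32)}
  next_code = 32

  def add_entry(prefix_code, k):
      """Store [...]K without ever storing the full string."""
      global next_code
      table[(prefix_code, k)] = next_code
      next_code += 1

  def lookup(prefix_code, k):
      """Return the code for [...]K, or None if it is not in the table."""
      return table.get((prefix_code, k))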

Ok, that takes care of compression. Decompression is perhaps more difficult conceptually, but it is really easier to program.

Here's how it goes: We again have to start with an initialized string table. This table comes from what knowledge we have about the charstream that we will eventually get, like what possible values the characters can take. In GIF files, this information is in the header as the number of possible pixel values. The beauty of LZW, though, is that this is all we need to know. We will build the rest of the string table as we decompress the codestream. The compression is done in such a way that we will never encounter a code in the codestream that we can't translate into a string.

We need to define something called a "current code", which I will refer to as "<code>", and an "old-code", which I will refer to as "<old>". To start things off, look at the first code. This is now <code>. This code will be in the initialized string table as the code for a root. Output the root to the charstream. Make this code the old-code <old>. *Now look at the next code, and make it <code>. It is possible that this code will not be in the string table, but let's assume for now that it is. Output the string corresponding to <code> to the charstream. Now find the first character in the string you just translated. Call this K. Add this to the prefix [...] generated by <old> to form a new string [...]K. Add this string [...]K to the string table, and set the old-code <old> to the current code <code>. Repeat from where I typed the asterisk, and you're all set. Read this paragraph again if you just skimmed it!!!

Now let's consider the possibility that <code> is not in the string table. Think back to compression, and try to understand what happens when you have a string like P[...]P[...]PQ appear in the charstream. Suppose P[...] is already in the string table, but P[...]P is not. The compressor will parse out P[...], and find that P[...]P is not in the string table. It will output the code for P[...], and add P[...]P to the string table. Then it will get up to P[...]P for the next string, and find that P[...]P is in the table, as the code just added. So it will output the code for P[...]P if it finds that P[...]PQ is not in the table. The decompressor is always "one step behind" the compressor. When the decompressor sees the code for P[...]P, it will not have added that code to its string table yet, because it needed the beginning character of P[...]P to add to the string for the last code, P[...], to form the code for P[...]P. However, when a decompressor finds a code that it doesn't know yet, it will always be the very next one to be added to the string table. So it can guess at what the string for the code should be, and, in fact, it will always be correct.

If I am a decompressor, and I see code #124, and yet my string table has entries only up to code #123, I can figure out what code #124 must be, add it to my string table, and output the string. If code #123 generated the string, which I will refer to here as a prefix [...], then code #124, in this special case, will be [...] plus the first character of [...]. So just add the first character of [...] to the end of itself. Not too bad.

As an example (and a very common one) of this special case, let's assume we have a raster image in which the first three pixels have the same color value. That is, my charstream looks like: QQQ.... For the sake of argument, let's say we have 32 colors, and Q is color #12. The compressor will generate the code sequence 12, 32, .... (If you don't know why, take a minute to understand it.) Remember that #32 is not in the initial table, which goes from #0 to #31. The decompressor will see #12 and translate it just fine as color Q. Then it will see #32 and not yet know what that means. But if it thinks about it long enough, it can figure out that QQ should be entry #32 in the table and QQ should be the next string output. So the decompression pseudo-code goes something like:

  1. Initialize string table;
  2. get first code: <code>;
  3. output the string for <code> to the charstream;
  4. <old> = <code>;
  5. <code> <- next code in codestream;
  6. does <code> exist in the string table?
    • yes:
      • output the string for <code> to the charstream;
      • [...] <- translation for <old>;
      • K <- first character of translation for <code>;
      • add [...]K to the string table;
      • <old> <- <code>;
    • no:
      • [...] <- translation for <old>;
      • K <- first character of [...];
      • output [...]K to charstream and add it to string table;
      • <old> <- <code>
  7. go to [5];

Again, when you get to step [5] and there are no more codes, you're finished. Outputting strings and finding the initial characters of strings are efficiency problems all to themselves, but I'm not going to suggest ways to do them here. Half the fun of programming is figuring these things out!
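
Here, then, is a minimal Python sketch of the decompression pseudo-code, including the special "code not yet in the table" branch (the function name and the string-based table are my own choices for illustration). Fed the codestream from the compression example, it gives back ABACABA:

  def lzw_decompress(codestream, alphabet):
      """Plain LZW decompression, following the pseudo-code above."""
      # 1. Initialize the string table: one entry per root.
      table = {code: ch for code, ch in enumerate(alphabet)}
      next_code = len(alphabet)

      codes = iter(codestream)
      old = next(codes)                    # 2. the first code is always a root
      out = [table[old]]                   # 3. output its string
      for code in codes:                   # 5. <code> <- next code
          if code in table:                # 6. <code> is known
              entry = table[code]
          else:                            # 6. unknown: it must be the next entry,
              entry = table[old] + table[old][0]   #    [...] plus first char of [...]
          out.append(entry)
          k = entry[0]                     # K <- first character of the translation
          table[next_code] = table[old] + k        # add [...]K to the table
          next_code += 1
          old = code                       # <old> <- <code>
      return "".join(out)

  print(lzw_decompress([0, 1, 0, 2, 4, 0], "ABCD"))   # -> "ABACABA"

  # With a 32-symbol alphabet in which symbol #12 plays the role of Q, the
  # codestream 12, 32 takes the unknown-code branch and decodes to QQQ.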

Now for the GIF variations on the theme. In part of the header of a GIF file, there is a field, in the Raster Data stream, called "code size". This is a very misleading name for the field, but we have to live with it; what it really is is the "root size". The actual size, in bits, of the compression codes changes during compression/decompression, and I will refer to that size here as the "compression size". The initial table is just the codes for all the roots, as usual, but two special codes are added on top of those. Suppose you have a "code size", which is usually the number of bits per pixel in the image, of N. If the number of bits per pixel is one, then N must be 2: the roots take up slots #0 and #1 in the initial table, and the two special codes will take up slots #4 and #5. In any other case, N is the number of bits per pixel, the roots take up slots #0 through #(2**N - 1), and the special codes are (2**N) and (2**N + 1). The initial compression size will be N+1 bits per code. If you're encoding, you output the codes (N+1) bits at a time to start with, and if you're decoding, you grab (N+1) bits from the codestream at a time.

As for the special codes: <CC>, or the clear code, is (2**N), and <EOI>, or end-of-information, is (2**N + 1). <CC> tells the compressor to re-initialize the string table and to reset the compression size to (N+1). <EOI> means there's no more in the codestream. If you're encoding or decoding, you should start adding things to the string table at <CC> + 2. If you're encoding, you should output <CC> as the very first code, and then again whenever you reach code #4095 (hex FFF), because GIF does not allow compression sizes greater than 12 bits. If you're decoding, you should reinitialize your string table when you observe <CC>.

The variable compression sizes are really no big deal. If you're encoding, you start with a compression size of (N+1) bits, and, whenever you output the code (2**(compression size) - 1), you bump the compression size up one bit. So the next code you output will be one bit longer. Remember that the largest compression size is 12 bits, corresponding to a code of 4095. If you get that far, you must output <CC> as the next code, and start over. If you're decoding, you must increase your compression size AS SOON AS you write entry #(2**(compression size) - 1) to the string table. The next code you READ will be one bit longer. Don't make the mistake of waiting until you need to add the code (2**(compression size)) to the table; you'll have already missed a bit from the last code.

The packaging of codes into a bitstream for the raster data is a potential stumbling block for the novice encoder or decoder. The lowest order bit in the code should coincide with the lowest available bit in the first available byte in the codestream. For example, if you're starting with 5-bit compression codes, and your first three codes are, say, <abcde>, <fghij>, <klmno>, where e, j, and o are bit #0, then your codestream will start off like:

      byte#0: hijabcde
      byte#1: .klmnofg

So the differences between straight LZW and GIF LZW are: two additional special codes and variable compression sizes. If you understand LZW, and you understand those variations, you understand it all!
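
Because this LSB-first packing trips up many first attempts, here is a small Python sketch of just the packing step (not a full GIF encoder; the function name is mine). With a 5-bit compression size, three example codes come out exactly in the hijabcde / .klmnofg layout shown above:

  def pack_codes(codes, code_size_bits):
      """Pack codes into bytes LSB-first, as GIF raster data expects.

      The low-order bit of each code lands in the lowest unused bit of the
      current byte; codes simply spill over into the next byte."""
      out = bytearray()
      bit_buffer = 0                     # pending bits, oldest bit at bit 0
      bit_count = 0
      for code in codes:
          bit_buffer |= code << bit_count
          bit_count += code_size_bits
          while bit_count >= 8:          # emit every complete byte
              out.append(bit_buffer & 0xFF)
              bit_buffer >>= 8
              bit_count -= 8
      if bit_count:                      # flush the final partial byte
          out.append(bit_buffer & 0xFF)
      return bytes(out)

  # Three 5-bit codes standing in for <abcde>, <fghij>, <klmno> (bit #0 last):
  print(pack_codes([0b00001, 0b00010, 0b00100], 5))
  # -> bytes 0x41 and 0x10: 01000001 (hijabcde) and 00010000 (.klmnofg)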

Just as sort of a P.S., you may have noticed that a compressor has a little bit of flexibility at compression time. I specified a "greedy" approach to the compression, grabbing as many characters as possible before outputting codes. This is, in fact, the standard LZW way of doing things, and it generally yields the best compression ratio. But there's no rule saying you can't stop anywhere along the line and just output the code for the current prefix, whether it's already in the table or not, and add that string plus the next character to the string table. There are various reasons for wanting to do this, especially if the strings get extremely long and make hashing difficult. If you need to, do it.

Hope this helps out. ----steve blackstock


Recommended Links

FAQs

compression FAQ

eBooks

Data Compression, by Debra A. Lelewer and Daniel S. Hirschberg

Sometimes you can also find the old O'Reilly graphics file formats book on the web:

Encyclopedia of Graphics File Formats, 2nd Edition
The Complete Reference on CD-ROM with Links to Internet Resources

By James D. Murray, William vanRyper
2nd Edition May 1996

It contains a chapter about graphics compression algorithms; see the TOC.

Recommended Papers

AnandTech - All About Compression, Part I

Games++ Games & Game Programming-Algorithms, Tutorials, Source Code, Direct X, C-C++, Java, and Much More.

E186-1 Michael Schindler

Huffman code

Queue, Huffman, Compression and Encryption Applets in C++

Forums - Efficient Huffman implementation

...it is certainly possible to construct the Huffman code without an explicit tree. The main observation is that there is a canonical Huffman code entirely defined by the code lengths of the letters. And, there is a simple data structure to compute the latter in linear time after a sorting phase without explicitly constructing the tree. This does save half the memory required for an explicit tree. For example, see my Huffman implementation in the Vcodex package at www.research.att.com/sw/tools.
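
The canonical-code observation in this quote is easy to illustrate: given only the code lengths (however they were computed), sort the symbols by length and hand out consecutive codes, shifting left whenever the length grows. This is a generic Python sketch of that idea, not the Vcodex implementation:

  def canonical_huffman_codes(code_lengths):
      """Assign canonical Huffman codes from code lengths alone.

      `code_lengths` maps symbol -> length in bits.  No tree is built."""
      code = 0
      prev_len = 0
      codes = {}
      for sym, length in sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0])):
          code <<= (length - prev_len)   # pad with zeros when the length grows
          codes[sym] = format(code, "0{}b".format(length))
          code += 1
          prev_len = length
      return codes

  # Lengths such as these might come from frequencies like 45, 25, 20, 10:
  print(canonical_huffman_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
  # -> {'a': '0', 'b': '10', 'c': '110', 'd': '111'}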

LZH

LZW

LZW (Lempel-Ziv-Welch) is the most common compression algorithm used in computer graphics. This lossless method of data compression is found in several image file formats, such as GIF and TIFF, and is also part of the V.42bis modem compression standard and PostScript Level 2.

In 1977, Abraham Lempel and Jacob Ziv created the first of what we now call the LZ family of substitutional compressors. LZ77-based compressors are commonly found in text compression and archiving programs, such as zoo, lha, pkzip, and arj. LZ78-based compressors (of which LZW, used by the Unix compress utility, is the best known) are more commonly used to compress binary data, such as bitmaps.

In 1984, while working for Unisys, Terry Welch modified the LZ78 compressor for implementation in high-performance disk controllers. The result was the LZW algorithm that is commonly found today. It is covered by U.S. Patent 4,558,302 (plus its foreign counterparts, issued or pending), all held by Unisys Corporation. That means that without obtaining a license from Unisys you formally cannot read or write a GIF file :-( (though note that the main US patent expired in June 2003; see the patent discussion below). See TF_WP_unisys for more information.

Bijective compression

I recommend first checking DataCompression.info - Tutorials, Reference, Presentations

Creating a 'One to One' Adaptive Huffman Compression-Decompression Program for the real world of 8-bit byte files

The Mandala Centre - Compression and Security - One on one compression FAQ

One to One Compression -- This site discusses a characteristic of some compression algorithms that the author refers to as One to One compression. In a nutshell, this property means that for any file X, F( F'( X ) ) == X. (F is either the compressor or decompressor, and F' is its opposite number.) This is definitely not the case for most conventional compression algorithms.

Bijective Arithmetic Encoding with Really Good End Treatment

DataCompression.info - Adaptive Huffman Coding
Includes an implementation of Vitter's dynamic Huffman compressor, adapted so that it is bijective.

History


Gifs and patents

The patent for the LZW compression algorithm has been held by Unisys since 1985. CompuServe's GIF format uses this algorithm. In 1994, Unisys began enforcing its patent rights by asking for license fees from developers of products that read or write GIFs. This announcement took developers by surprise and caused quite a stir on the Internet. There were rumours that Unisys might charge web developers for the use of GIFs on their sites. At the same time, it was argued that GIFs are the product of LZW compression and that the patent does not extend to the use of the end product. In practice, web developers had nothing to fear, since Unisys planned to collect license fees only from software companies whose products employ the LZW algorithm.

Web developers should not be worried; they are free to use GIFs on their web sites. However, if you have developed software that creates or modifies GIFs, you would have to pay licensing fees to Unisys.

The business acumen of the people at Unisys has to be admired. It seems that they waited for GIFs to become popular and beneficial (from a web developer's point of view) before enforcing the patent rights. However, there was an interesting and fortunate (?) side effect of this story: it led to the development of the PNG file format. PNG is a much better and more versatile image format than both JPG and GIF. It has all the bells and whistles of these two file formats.

At the time of writing, browser support for PNG is still quite patchy, though the future does look bright.

LZW compression: conjugations patented

Google Groups View Thread LZW Patent Expiry

Can the LZW algorithm be used freely in all countries after the US
patent expiry in June 2003? I have gathered this information from
newsgroups:

1. The main US patent expires the 20 June 2003. The valid time can not
be extended, if they do not make adjustments/extensions to the patent.
US patent:
http://l2.espacenet.com/espacenet/viewer?PN=US4558302&CY=ep&LG=en&DB=EPD

2. There are other US patents covering this field, but these are filed
in 1999, and are therefore not enforceable. Are there other US
patents?

3. Unisys claim that they have patents in Canada, France, Germany,
U.K. and Italy, but they have never sued any company in European
court though they have had chances. There are some references in the
US patent: CA1223965, DE3476617D, EP0129439. Are these patents
enforceable? Are there other patents?

4. Patents are pending in Japan. The US patent refers to these:
JP2610084B2, JP5068893B, JP60116228, JP7249996, JP8237138. Do we have
to worry about these?


Can anyone confirm the information above and answer the questions?
Does this mean that we can use the lzw algorithm freely in all
countries after June 2003?


