Huffman compression
How do we avoid ambiguity for variable-length codes?
- Ensure that no code word is a prefix of another.
How to represent the prefix-free code?
- A binary trie
- chars in leaves
- codeword is path from root to leaf
How to write the trie for transmission?
- preorder traversal
Huffman algorithm.
Find the best prefix-free code that minimizes the number of bits
- Count frequency freq[i] for each character i in input
- Start with one node corresponding to each char i (sort by freq[i])
- Repeat until single trie formed:
- select two tries with min weight freq[i] and freq[j]
- merge into single trie with weight freq[i] + freq[j]
Implementation.
- Pass 1: tabulate char frequencies and build trie.
- Pass 2: encode file by traversing trie or lookup table.
Running time. Using a binary heap => N + RlogR where N is the input size, and R is radix size.
Disadvantages:
- Need preprocess the text to generate model
- Must transmit the model
LZW
Progressively learn and update model as you read text.
- more accurate modeling produces better compression
- Decoding must start from beginning
LZW compression
- Create ST associating W-bit codewords with string keys.
- Initialize ST with codewords for single-char keys.
- find longest string s in ST that is a prefix of unscanned part of input. (use trie to supoort longest prefix match)
- Write the W-bit code word associated with s.
- Add s + c to ST, where c is next char in the input.
Expansion
- create symble table (ST) associating string values with W-bit keys.
- Initialize ST to contain single-char values.
- read a W-bit key.
- Find associated string value in ST and write it out.
- Update ST
LZ77
1) Ternary Huffman codes.
Generalize the Huffman algorithm to codewords over the ternary alphabet (0, 1, and 2) instead of the binary alphabet. That is, given a bytestream, find a prefix-free ternary code that uses as few trits (0s, 1s, and 2s) as possible. Prove that it yields optimal prefix-free ternary code.
_Hint: _Combine smallest 3 probabilities at each step (instead of smallest 2). Don't forget to handle the case when the number of symbols is not of the form 3+2k for some integer k.
Repeat until single trie formed:
* select three tries with min weight freq[i], freq[j] and freq[k]
* merge into single trie with weight freq[i] + freq[j] + freq[k]
2)
Identify an optimal uniquely-decodable code that is neither prefix free nor suffix free.
- special separator
Identify two optimal prefix-free codes for the same input that have a different distribution of codeword lengths.
3) Move-to-front coding.
Design an algorithm to implement move-to-front encoding so that each operation takes logarithmic time in the worst case. That is, maintain alphabet of symbols in a list. A symbol is encoded as the number of symbols that precede it in the list. After encoding a symbol, move it to the front of the list.