Huffman compression

How do we avoid ambiguity for variable-length codes?

  • Ensure that no code word is a prefix of another.

How to represent the prefix-free code?

  • A binary trie
    • chars in leaves
    • codeword is path from root to leaf

How to write the trie for transmission?

  • preorder traversal
Huffman algorithm.

Find the best prefix-free code that minimizes the number of bits

  1. Count frequency freq[i] for each character i in input
  2. Start with one node corresponding to each char i (sort by freq[i])
  3. Repeat until single trie formed:
    1. select two tries with min weight freq[i] and freq[j]
    2. merge into single trie with weight freq[i] + freq[j]

Implementation.

  • Pass 1: tabulate char frequencies and build trie.
  • Pass 2: encode file by traversing trie or lookup table.

Running time. Using a binary heap => N + RlogR where N is the input size, and R is radix size.

Disadvantages:

  • Need preprocess the text to generate model
  • Must transmit the model

LZW

Progressively learn and update model as you read text.

  • more accurate modeling produces better compression
  • Decoding must start from beginning

LZW compression

  • Create ST associating W-bit codewords with string keys.
  • Initialize ST with codewords for single-char keys.
  • find longest string s in ST that is a prefix of unscanned part of input. (use trie to supoort longest prefix match)
  • Write the W-bit code word associated with s.
  • Add s + c to ST, where c is next char in the input.

Expansion

  • create symble table (ST) associating string values with W-bit keys.
  • Initialize ST to contain single-char values.
  • read a W-bit key.
  • Find associated string value in ST and write it out.
  • Update ST

LZ77

1) Ternary Huffman codes.

Generalize the Huffman algorithm to codewords over the ternary alphabet (0, 1, and 2) instead of the binary alphabet. That is, given a bytestream, find a prefix-free ternary code that uses as few trits (0s, 1s, and 2s) as possible. Prove that it yields optimal prefix-free ternary code.

_Hint: _Combine smallest 3 probabilities at each step (instead of smallest 2). Don't forget to handle the case when the number of symbols is not of the form 3+2k for some integer k.

Repeat until single trie formed:

* select three tries with min weight freq[i], freq[j] and freq[k]

* merge into single trie with weight freq[i] + freq[j] + freq[k]

2)

  • Identify an optimal uniquely-decodable code that is neither prefix free nor suffix free.

    • special separator
  • Identify two optimal prefix-free codes for the same input that have a different distribution of codeword lengths.

3) Move-to-front coding.

Design an algorithm to implement move-to-front encoding so that each operation takes logarithmic time in the worst case. That is, maintain alphabet of symbols in a list. A symbol is encoded as the number of symbols that precede it in the list. After encoding a symbol, move it to the front of the list.

results matching ""

    No results matching ""