Huffman Coding
Huffman Coding is a method of compressing data regardless of its type. It assigns a sequence of bits, called a codeword, to each symbol of a given text. The simplest approach is a fixed-length encoding, where every symbol is assigned a bit string of the same length m, with m ≥ ⌈log₂ n⌉ for an alphabet of n symbols. For example, an alphabet of n = 5 symbols needs at least m = 3 bits per symbol.
A variable-length encoding, by contrast, assigns codewords of different lengths to different symbols. This introduces a problem that fixed-length encoding does not have: how can we tell how many bits of an encoded text represent the first symbol? To avoid this ambiguity, we can use prefix-free codes, in which no codeword is a prefix of another codeword; the first codeword of an encoded text can then be read off unambiguously, and the same holds for every codeword that follows.
A convenient way to arrange such a set of codewords is a binary tree, with each symbol placed at a leaf. We label every left branch with a 0 and every right branch with a 1; following the branch labels from the root leads us to any symbol in the set.
- To encode a symbol, follow the path from the root of the tree down to the leaf labelled with that symbol, recording the branch labels along the way.
- This gives each symbol a unique codeword. Since every path ends at a leaf, and no leaf lies on the path to another leaf, no codeword can be a prefix of another codeword (a short code sketch after this list shows how codewords are read off a tree).
- The coding saves space when shorter codewords are assigned to symbols that appear more often and longer codewords to symbols that appear less often.
- This assignment can be constructed with the greedy algorithm invented by David Huffman.
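As a concrete illustration of reading codewords off such a tree, here is a minimal sketch in Python. The `Node` class, the `assign_codes` helper, and the tiny hand-built tree are assumptions made for this illustration only, not part of the original description.

```python
# Minimal sketch: derive codewords by walking a binary code tree.
# Node, assign_codes, and the hand-built tree below are illustrative assumptions.

class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol = symbol   # set only on leaf nodes
        self.left = left       # branch labelled 0
        self.right = right     # branch labelled 1

def assign_codes(node, prefix="", table=None):
    """The path from the root to each leaf, read as 0s and 1s, is its codeword."""
    if table is None:
        table = {}
    if node.symbol is not None:              # leaf reached: record the path
        table[node.symbol] = prefix or "0"
    else:
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table

# A tiny hand-built tree over {A, B, C}: A sits one level up, B and C one level down.
root = Node(left=Node(symbol="A"),
            right=Node(left=Node(symbol="B"), right=Node(symbol="C")))
print(assign_codes(root))   # {'A': '0', 'B': '10', 'C': '11'}
```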
The algorithm and a worked example are given below.
Huffman’s Algorithm:
Step 1: Initialize n one-node trees, one for each symbol of the alphabet, and record in each tree the frequency with which its symbol occurs in the text; this frequency serves as the tree's weight.

Step 2: Repeat the following until a single tree remains: find the two trees with the smallest weights, make them the left and right subtrees of a new tree, and record the sum of their weights in the new tree's root as its weight.
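The two steps map almost directly onto a priority queue. Below is a minimal sketch using Python's heapq module; the Node class, the function name build_huffman_tree, and the tie-breaking counter are assumptions made for illustration.

```python
import heapq
import itertools

class Node:
    """Leaves carry a symbol; internal nodes carry only a weight and two subtrees."""
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight = weight
        self.symbol = symbol
        self.left = left
        self.right = right

def build_huffman_tree(frequencies):
    # Step 1: one single-node tree per symbol, weighted by its frequency.
    tick = itertools.count()   # tie-breaker so equal weights never compare Node objects
    heap = [(w, next(tick), Node(w, symbol=s)) for s, w in frequencies.items()]
    heapq.heapify(heap)
    # Step 2: repeatedly merge the two lightest trees until one tree remains.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # lightest tree becomes the left subtree
        w2, _, right = heapq.heappop(heap)   # second lightest becomes the right subtree
        merged = Node(w1 + w2, left=left, right=right)
        heapq.heappush(heap, (merged.weight, next(tick), merged))
    return heap[0][2]

# Frequencies taken from the worked example below.
root = build_huffman_tree({"A": 0.35, "B": 0.1, "C": 0.2, "D": 0.2, "_": 0.15})
print(round(root.weight, 2))   # 1.0 -- the root's weight is the sum of all frequencies
```

Combining this with a tree walk like the assign_codes sketch above yields a codeword table. Note that ties between equal weights can be broken either way, so different runs may produce different, but equally optimal, codes.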
The tree constructed in this way is called a Huffman tree. It defines a Huffman code in the manner described above.
Example: Consider a five-symbol alphabet {A, B, C, D, _} with the following frequencies of occurrence in a text made up of these symbols:

| Symbol | A | B | C | D | _ |
| --- | --- | --- | --- | --- | --- |
| Frequency | 0.35 | 0.1 | 0.2 | 0.2 | 0.15 |
Figure: Example of constructing a Huffman coding tree
👉 The resulting codewords, read off the tree, are as follows:

| Symbol | A | B | C | D | _ |
| --- | --- | --- | --- | --- | --- |
| Codeword | 11 | 100 | 00 | 01 | 101 |
Hence, DAD is encoded as 011101, and 10011011011101 is decoded as BAD_AD. With the frequencies given and the codeword lengths obtained, the average number of bits per symbol in this code is:
2 * 0.35 + 3 * 0.1 + 2 * 0.2 + 2 * 0.2 + 3 * 0.15 = 2.25
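The arithmetic above, as well as the encoding and decoding of the sample strings, can be checked with a few lines of Python. The codeword table used here is the one read from the figure; a different tie-breaking order during construction could produce different (but equally short, on average) codewords.

```python
# Check the worked example: codewords as read from the figure above.
codes = {"A": "11", "B": "100", "C": "00", "D": "01", "_": "101"}
freqs = {"A": 0.35, "B": 0.1, "C": 0.2, "D": 0.2, "_": 0.15}

def encode(text):
    return "".join(codes[symbol] for symbol in text)

def decode(bits):
    inverse = {cw: s for s, cw in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:                 # prefix-free code: the first full match is correct
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)

print(encode("DAD"))                # 011101
print(decode("10011011011101"))     # BAD_AD
average = sum(len(codes[s]) * freqs[s] for s in codes)
print(round(average, 2))            # 2.25
```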
FAQs:
1. What is simple Huffman coding?
Huffman Coding is a method of compressing data independent of its type. In outline:
- Calculate the frequency of each character.
- Sort the characters based on their frequencies in ascending order.
- Make each distinct character a leaf node.
- Repeatedly combine the two lowest-frequency nodes into a new internal node whose frequency is their sum, until a single tree remains.
- You can refer to the above Example 👆
- Huffman coding is a greedy approach, and the example above is built exactly this way; a short sketch of the first two steps follows below.
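To tie the checklist above back to the example, the sketch below covers only the first two steps (counting and sorting frequencies); the sample text and variable names are assumptions, and the remaining steps correspond to the tree construction sketched under Huffman's Algorithm.

```python
from collections import Counter

text = "BAD_AD"                                        # any sample text will do
freq = Counter(text)                                   # step 1: frequency of each character
ordered = sorted(freq.items(), key=lambda kv: kv[1])   # step 2: ascending by frequency
print(ordered)   # e.g. [('B', 1), ('_', 1), ('A', 2), ('D', 2)]
# The remaining steps (leaves, internal nodes, final tree) follow the construction sketched earlier.
```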