Huffman compression algorithm

Ahead of the start of the course "Algorithms for Developers", we have prepared a translation of another useful article for you.

Huffman coding is a data compression algorithm that captures the basic idea behind file compression. In this article, we will talk about fixed-length and variable-length encoding, uniquely decodable codes, the prefix rule, and building a Huffman tree.

We know that every character is stored as a sequence of 0's and 1's and takes 8 bits. This is called fixed-length encoding because every character uses the same fixed number of bits of storage.

Suppose we are given a text. How can we reduce the amount of space needed to store a single character?

The main idea is variable-length encoding. We can use the fact that some characters in a text occur more often than others to design an algorithm that represents the same sequence of characters in fewer bits. In variable-length encoding, we assign each character a variable number of bits, depending on how often it occurs in the given text. In the end, some characters may take as little as 1 bit, while others take 2 bits, 3 bits or more. The only problem with variable-length encoding is the subsequent decoding of the sequence.

How can we, knowing only the sequence of bits, decode it unambiguously?

Consider the string "aabacdab". It has 8 characters, and with fixed-length encoding it needs 64 bits of storage. Note that the frequencies of the characters "a", "b", "c" and "d" are 4, 2, 1 and 1 respectively. Let's try to represent "aabacdab" in fewer bits, using the fact that "a" occurs more often than "b", and "b" occurs more often than "c" and "d". We start by encoding "a" with the single bit 0, we assign "b" the two-bit code 11, and we use the three-bit codes 100 and 011 for "c" and "d".

As a result, we get:

Character   Code
a           0
b           11
c           100
d           011

So, using the codes above, the string "aabacdab" is encoded as 00110100011011 (0|0|11|0|100|011|0|11). The main problem, however, is decoding. When we try to decode the string 00110100011011, we get an ambiguous result, because it can be read as:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
and so on.

To avoid this ambiguity, we must make sure that our encoding satisfies the prefix rule, which in turn guarantees that the codes can be decoded in only one way. The prefix rule requires that no code is a prefix of another code. By a code we mean the bits used to represent one particular character. In the example above, 0 is a prefix of 011, which violates the prefix rule. So, if our codes satisfy the prefix rule, we can decode the bit string unambiguously.
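
The prefix rule is easy to test for a small code table. Below is a minimal sketch; the function name isPrefixFree is an assumption made for this illustration and is not part of the listings later in the article.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true when no code in the list is a prefix of a different code.
// Two identical codes also count as a violation.
bool isPrefixFree(const std::vector<std::string>& codes) {
    for (std::size_t i = 0; i < codes.size(); ++i) {
        for (std::size_t j = 0; j < codes.size(); ++j) {
            if (i == j) continue;
            // codes[i] is a prefix of codes[j] when codes[j] starts with it.
            if (codes[j].compare(0, codes[i].size(), codes[i]) == 0) {
                return false;
            }
        }
    }
    return true;
}
```

For the codes 0, 11, 100, 011 the function returns false, because 0 is a prefix of 011. For the codes 0, 10, 110, 111 it returns true.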

Let's revisit the example above. This time we assign the characters "a", "b", "c" and "d" codes that satisfy the prefix rule.

Character   Code
a           0
b           10
c           110
d           111

With this encoding, the string "aabacdab" is encoded as 00100110111010 (0|0|10|0|110|111|0|10). This time we can decode 00100110111010 unambiguously and get back our original string "aabacdab".
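
With a prefix-free table, decoding can simply read bits from left to right and emit a character as soon as the bits read so far match a code. The sketch below assumes the table a=0, b=10, c=110, d=111, for which "aabacdab" concatenates to 0|0|10|0|110|111|0|10. The function name decodePrefixCode is hypothetical, and a lookup map is used here instead of the tree that the listings later in the article use.

```cpp
#include <map>
#include <string>

// Decodes bits left to right. This only works because the codes are
// prefix-free: the first match can never be the start of a longer code.
std::string decodePrefixCode(const std::map<std::string, char>& symbolOf,
                             const std::string& bits) {
    std::string text;
    std::string pending;  // bits read since the last emitted character
    for (char bit : bits) {
        pending += bit;
        std::map<std::string, char>::const_iterator it = symbolOf.find(pending);
        if (it != symbolOf.end()) {
            text += it->second;
            pending.clear();
        }
    }
    // Leftover bits in pending would mean truncated input; this sketch
    // silently ignores them.
    return text;
}
```

Calling decodePrefixCode with that table and the bit string 00100110111010 returns "aabacdab".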

Huffman coding

Now that we have covered variable-length encoding and the prefix rule, let's talk about Huffman coding.

The method is based on building a binary tree. In this tree, a node is either a leaf or an internal node. Initially, all nodes are leaves (terminals), each of which holds a character and its weight (that is, its frequency of occurrence). Internal nodes hold a weight, the sum of the weights of their children, and references to two child nodes. By general convention, the bit "0" means following the left branch and the bit "1" means following the right branch. A full tree has N leaves and N-1 internal nodes. It is recommended to discard unused characters when building a Huffman tree so that the resulting codes have optimal length.

We will use a priority queue to build the Huffman tree, where the node with the lowest frequency gets the highest priority. The construction steps are described below:

  1. Create a leaf node for each character and add it to the priority queue.
  2. While there is more than one node in the queue, do the following:
    • Remove the two nodes with the highest priority (lowest frequency) from the queue;
    • Create a new internal node with these two nodes as children and with a frequency equal to the sum of the frequencies of the two nodes;
    • Add the new node to the priority queue.
  3. The one remaining node is the root, and this completes the construction of the tree.
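
The steps above can be sketched on bare weights, without storing the tree itself. This is only an illustration: mergeWeights is a hypothetical helper, and it records the weight of each internal node in the order the loop in step 2 creates it.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Returns the weight of every internal node, in creation order.
std::vector<int> mergeWeights(const std::vector<int>& weights) {
    // Min-heap: the lowest weight has the highest priority (step 1).
    std::priority_queue<int, std::vector<int>, std::greater<int> > pq(
        weights.begin(), weights.end());
    std::vector<int> created;
    while (pq.size() > 1) {                 // step 2
        int first = pq.top(); pq.pop();     // two lowest weights
        int second = pq.top(); pq.pop();
        created.push_back(first + second);  // new internal node
        pq.push(first + second);            // back into the queue
    }
    return created;                         // last entry is the root (step 3)
}
```

For the weights 15, 7, 6, 6, 5 (the frequencies used in the example that follows), the internal nodes are created with weights 11, 13, 24 and 39: first 5+6, then 6+7, then 11+13, and finally 15+24 for the root.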

Suppose we have some text that consists only of the characters "a", "b", "c", "d" and "e", and their frequencies are 15, 7, 6, 6 and 5, respectively. Below are illustrations that show the steps of the algorithm.

[Illustrations: step-by-step construction of the Huffman tree]

The path from the root to any leaf node spells out the optimal prefix code (also known as the Huffman code) for the character associated with that leaf.
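
Since a code is the path from the root to a leaf, its length in bits equals the depth of that leaf. The sketch below computes only these depths. It is an assumption-laden illustration: huffmanCodeLengths is a hypothetical name, and nodes are kept in a parent-index array rather than in the pointer-based Node structure used in the listings later in the article.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Returns the Huffman code length (in bits) of each character.
// A text with a single distinct character yields length 0 for it.
std::map<char, int> huffmanCodeLengths(const std::map<char, int>& freq) {
    typedef std::pair<int, int> Item;  // (weight, node index)
    std::vector<int> parent;           // parent[i] == -1 means "no parent yet"
    std::vector<char> symbols;         // symbols[i] belongs to leaf i
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > pq;

    for (const auto& entry : freq) {
        symbols.push_back(entry.first);
        parent.push_back(-1);
        pq.push(Item(entry.second, static_cast<int>(parent.size()) - 1));
    }
    while (pq.size() > 1) {
        Item a = pq.top(); pq.pop();
        Item b = pq.top(); pq.pop();
        parent.push_back(-1);  // new internal node
        int merged = static_cast<int>(parent.size()) - 1;
        parent[a.second] = merged;
        parent[b.second] = merged;
        pq.push(Item(a.first + b.first, merged));
    }

    std::map<char, int> lengths;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        int depth = 0;
        // Walk from the leaf up to the root, counting edges.
        for (int node = static_cast<int>(i); parent[node] != -1; node = parent[node]) {
            ++depth;
        }
        lengths[symbols[i]] = depth;
    }
    return lengths;
}
```

For the frequencies a=15, b=7, c=6, d=6, e=5, "a" gets a 1-bit code and the other four characters get 3-bit codes, so the encoded text takes 15*1 + (7+6+6+5)*3 = 87 bits.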

[Figure: Huffman tree]

Below you will find implementations of the Huffman compression algorithm in C++ and Java:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
#include <vector>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('\0', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :\n" << '\n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << '\n';
	}

	cout << "\nOriginal string was :\n" << text << '\n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "\nEncoded string is :\n" << str << '\n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
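	int index = -1;
	cout << "\nDecoded string is: \n";
	// index is the position of the last consumed bit, so keep decoding
	// until it reaches the final bit (this also covers 1-bit last codes)
	while (index < (int)str.size() - 1) {
		decode(root, index, str);
	}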
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('\0', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :\n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("\nOriginal string was :\n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("\nEncoded string is :\n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
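		System.out.println("\nDecoded string is: \n");
		// index is the position of the last consumed bit, so keep decoding
		// until it reaches the final bit (this also covers 1-bit last codes)
		while (index < sb.length() - 1) {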
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Note: the input string takes 47 * 8 = 376 bits of memory, while the encoded string takes only 194 bits, i.e. the data is compressed by about 48%. In the C++ program above, we use the string class to store the encoded string in order to keep the program readable.

An efficient priority queue needs O(log(N)) time per insertion, a full binary tree with N leaves has 2N-1 nodes, and the Huffman tree is a full binary tree. The algorithm therefore runs in O(Nlog(N)) time, where N is the number of distinct characters.

Sources:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo


Source: www.hab.com
