Huffman compression algorithm

Kusati kwatanga kosi "Algorithms kune Vagadziri" vakakugadzirira iwe shanduro yeimwe nyaya inobatsira.

Huffman coding ndeye data compression algorithm inogadzira iyo yakakosha pfungwa yekumanikidza faira. Muchinyorwa chino, tichataura nezve yakagadziriswa uye inosiyana kureba encoding, yakasarudzika decodable macode, prefix mitemo, uye kuvaka Huffman muti.

Isu tinoziva kuti hunhu hwega hwega hunochengetwa senhevedzano ye0 uye 1 uye inotora 8 bits. Izvi zvinodaidzwa kunzi fixed urefu encoding nekuti hunhu hwega hwega hunoshandisa nhamba yakagadziriswa yemabhiti kuchengeta.

Ngatitii tine chinyorwa. Tinogona sei kuderedza huwandu hwenzvimbo inodiwa kuchengetedza chimiro chimwe chete?

Pfungwa huru ndeyekusiyana kureba encoding. Tinogona kushandisa chokwadi chekuti mamwe mavara ari muchinyorwa anoitika kakawanda kupfuura mamwe (ona pano) kugadzira algorithm inomiririra kutevedzana kwemavara mumabhiti mashoma. Mukusiyana kwehurefu hwe encoding, tinopa mavara nhamba inosiyana yemabhiti, zvichienderana nekuti anoonekwa kakawanda sei murugwaro rwakapihwa. Pakupedzisira, mamwe mavara anogona kutora zvishoma se1 bit, nepo mamwe achitora 2 bits, 3 kana kupfuura. Dambudziko rekusiyana-siyana kureba encoding ingori inotevera decoding yekutevedzana.

Sei, nekuziva kutevedzana kwebhiti, kuigadzirisa zvisina kujeka?

Funga nezvemutsetse "baba". Iyo ine mavara masere, uye kana ichikodha hurefu hwakatarwa, inoda makumi matanhatu nemana mabhiti kuichengeta. Ziva kuti frequency yechiratidzo "a", "b", "c" ΠΈ "D" zvakaenzana 4, 2, 1, 1 zvakateerana. Ngatiedzei kufungidzira "baba" zvishoma, uchishandisa chokwadi chekuti "ku" zvinoitika kakawanda kupfuura "B"uye "B" zvinoitika kakawanda kupfuura "c" ΠΈ "D". Ngatitange nekodha "ku" nechidimbu chimwe chakaenzana ne0, "B" isu tichapa maviri-bit kodhi 11, uye tichishandisa matatu mabheti 100 uye 011 isu tichaisa encode. "c" ΠΈ "D".

Somugumisiro, tichawana:

a
0

b
11

c
100

d
011

Saka mutsara "baba" isu ticha encode se 00110100011011 (0|0|11|0|100|011|0|11)uchishandisa macode ari pamusoro. Zvisinei, dambudziko guru richava mukudhirodha. Kana isu tichiedza decode tambo 00110100011011, tichawana mhedzisiro, nekuti inogona kumiririrwa se:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
uye zvakadaro.

Kuti tidzivise kusajeka uku, isu tinofanirwa kuona kuti encoding yedu inogutsa pfungwa yakadai se prefix mutemo, izvo zvinoreva kuti macode anogona chete kudhindwa nenzira imwechete yakasarudzika. Iyo prefix mutemo unovimbisa kuti hapana kodhi prefix yeimwe. Nekodhi, tinoreva mabits anoshandiswa kumiririra humwe hunhu. Mumuenzaniso uri pamusoro 0 chivakashure 011, izvo zvinotyora mutemo wechivakashure. Saka, kana makodhi edu achigutsa mutemo wekutanga, saka isu tinogona kusarudzika kusarudza (uye zvinopesana).

Ngatitarisei zvakare muenzaniso uri pamusoro. Panguva ino tichagovera zviratidzo "a", "b", "c" ΠΈ "D" macode anogutsa mutemo wechivakashure.

a
0

b
10

c
110

d
111

Ne encoding iyi, tambo "baba" ichave encoded as 00100100011010 (0|0|10|0|100|011|0|10). Uye pano 00100100011010 tinenge tatokwanisa kudhirodha zvisina kujeka uye kudzokera kutambo yedu yekutanga "baba".

Huffman coding

Zvino zvataita nekusiyana-siyana kureba encoding uye mutemo wekutanga, ngatitaure nezve Huffman encoding.

Nzira yacho inobva pakusikwa kwebhinari miti. Mariri, node inogona kuva yekupedzisira kana yemukati. Pakutanga, node dzose dzinoonekwa semashizha (terminals), iyo inomiririra chiratidzo pachayo uye uremu hwayo (kureva, kuwanda kwechiitiko). Manodhi emukati ane huremu hwechimiro uye anoreva zvipfukuto zviviri. Nekubvumirana kwakawanda, zvishoma "0" inomiririra kutevera bazi rekuruboshwe, uye "1" - kurudyi. mumuti wakakwana N mashizha uye N-1 node dzemukati. Zvinokurudzirwa kuti kana uchigadzira muti weHuffman, zviratidzo zvisina kushandiswa zviraswe kuti uwane macode akanyanya kureba.

Isu tichashandisa mutsara wepamberi kuvaka muti weHuffman, uko iyo node ine yakaderera frequency ichapihwa yakanyanya kukosha. Matanho ekugadzira anotsanangurwa pazasi:

  1. Gadzira shizha node kune yega yega hunhu uye wovawedzera kune yekutanga mutsara.
  2. Kunyange paine pepa rinopfuura rimwe mumutsetse, ita zvinotevera:
    • Bvisa node mbiri nepamusoro pekutanga (yakaderera frequency) kubva pamutsetse;
    • Gadzira node itsva yemukati, apo idzi node mbiri dzichava vana, uye kuwanda kwekuitika kunenge kwakaenzana nehuwandu hwehuwandu hwemanodhi maviri aya.
    • Wedzera node itsva kune yekutanga mutsara.
  3. Node yakasara ichava mudzi, uye izvi zvichapedzisa kuvakwa kwemuti.

Fungidzira kuti tine mavara ane mavara chete "a", "b", "c", "d" ΠΈ "uye", uye maitikiro adzo ari 15, 7, 6, 6, uye 5, zvichiteerana. Pazasi pane mifananidzo inoratidza matanho eiyo algorithm.

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Nzira kubva pamudzi kuenda kune chero magumo node inochengetedza iyo yakakwana prefix kodhi (inozivikanwawo seHuffman kodhi) inoenderana nehunhu hwakabatana neiyo yekupedzisira node.

Huffman compression algorithm
Huffman tree

Pazasi iwe unowana kuisirwa kweHuffman compression algorithm muC ++ uye Java:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Cherechedza: chiyeuchidzo chinoshandiswa netambo yekupinda ndeye 47 * 8 = 376 bits uye tambo yakavharidzirwa inongova 194 bits i.e. data inomanikidzwa neinenge 48%. Muchirongwa cheC ++ pamusoro, tinoshandisa kirasi yetambo kuchengetedza tambo yakavharidzirwa kuti chirongwa chiverengeke.

Nekuti inoshanda yakakosha mutsara data zvimiro zvinoda pane imwe neimwe kuiswa O(log(N)) nguva, asi mune yakazara binary muti ne N anosiya aripo 2N-1 nodes, uye muti weHuffman muti wakakwana webhinari, ipapo algorithm inopinda mukati O(Nlog(N)) nguva, kupi N - Vanhu.

Sources:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Dzidza zvakawanda nezvekosi.

Source: www.habr.com

Voeg