Huffman compression algorithm

Pele thupelo e qala "Algorithms bakeng sa Baetsi" e lokiselitse phetolelo ea boitsebiso bo bong bo molemo.

Huffman coding ke data compression algorithm e etsang mohopolo oa mantlha oa compression ea faele. Sehloohong sena, re tla bua ka khouto e tsitsitseng le e fapaneng ea bolelele ba likhoutu, likhoutu tse ikhethileng ka mokhoa o ikhethileng, melao ea sehlomathiso, le ho aha sefate sa Huffman.

Rea tseba hore tlhaku e 'ngoe le e 'ngoe e bolokiloe e le tatellano ea 0's le 1' me e nka li-bits tse 8. Sena se bitsoa "fixed length encoding" hobane sebapali se seng le se seng se sebelisa palo e tsitsitseng ea li-bits ho boloka.

Ha re re re fuoa mongolo. Re ka fokotsa sebaka se hlokahalang joang ho boloka sebapali se le seng?

Taba ea mantlha ke khouto ea bolelele bo fapaneng. Re ka sebelisa taba ea hore litlhaku tse ling li hlaha hangata ho feta tse ling (bona mona) ho hlahisa algorithm e tla emela tatellano e tšoanang ea litlhaku ka likotoana tse fokolang. Ho khouto ea bolelele bo fapaneng, re abela litlhaku palo e fapaneng ea likotoana, ho latela hore na li hlaha hangata hakae temaneng e fanoeng. Qetellong, litlhaku tse ling li ka nka hanyane joalo ka 1 bit, ha tse ling li ka nka likotoana tse 2, tse 3 kapa ho feta. Bothata ba khouto ea bolelele bo fapaneng ke feela ho hlophisoa ho latelang ha tatellano.

Joang, ka ho tseba tatellano ea likotoana, u e tseba joang ka mokhoa o sa tsitsang?

Nahana ka moeli "batlabe". E na le litlhaku tse 8, 'me ha e khouta bolelele bo tsitsitseng, e tla hloka li-bits tse 64 ho e boloka. Hlokomela hore maqhubu a letšoao "a", "b", "c" и "D" e lekana le 4, 2, 1, 1 ka ho latellana. Ha re leke ho nahana "batlabe" likotoana tse fokolang, ho sebelisa taba ea hore "ho" e etsahala hangata hofeta "B"le "B" e etsahala hangata hofeta "c" и "D". Ha re qale ka ho khouta "ho" ka karoloana e le 'ngoe e lekanang le 0, "B" re tla abela khoutu ea 11-bit, 'me re sebelisa likotoana tse tharo 100 le 011 re tla kenyelletsa "c" и "D".

Ka lebaka leo, re tla fumana:

a
0

b
11

c
100

d
011

Kahoo mola "batlabe" re tla encode joalo ka 00110100011011 (0|0|11|0|100|011|0|11)ka ho sebedisa dikhoutu tse ka hodimo. Leha ho le joalo, bothata bo ka sehloohong e tla ba ho decoding. Ha re leka ho hlakisa khoele 00110100011011, re fumana sephetho se sa utloahaleng, kaha se ka emeloa joalo ka:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
joalo-joalo.

Ho qoba ho se hlaka hona, re tlameha ho etsa bonnete ba hore khouto ea rona e khotsofatsa mohopolo o kang molao oa sehlongoapele, e leng se bolelang hore likhoutu li ka hlalosoa feela ka tsela e le 'ngoe e ikhethang. Molao wa sehlongwapele o netefatsa hore ha ho khoutu eo e leng sehlongwapele sa e nngwe. Ka khoutu, re bolela likotoana tse sebelisoang ho emela sebopeho se itseng. Mohlala o ka holimo 0 ke sehlongoapele 011, e tlolang molao wa sehlongwapele. Kahoo, haeba likhoutu tsa rona li khotsofatsa molao oa sehlongoapele, re ka khetha ka mokhoa o ikhethileng (le ka tsela e fapaneng).

Ha re hlahlobeng mohlala o ka holimo hape. Lekhetlong lena re tla abela matšoao "a", "b", "c" и "D" dikhoutu tse kgotsofatsang molao wa sehlongwapele.

a
0

b
10

c
110

d
111

Ka encoding ena, khoele "batlabe" e tla encoded joalo ka 00100100011010 (0|0|10|0|100|011|0|10). 'Me mona 00100100011010 re tla be re se re khona ho khetholla ka mokhoa o hlakileng le ho khutlela khoeleng ea rona ea pele "batlabe".

Huffman coding

Kaha joale re se re sebetsane le khouto ea bolelele bo fapaneng le molao oa sehlomapele, ha re bueng ka khouto ea Huffman.

Mokhoa o thehiloe ho thehoeng ha lifate tsa binary. Ho eona, node e ka ba ea ho qetela kapa ea ka hare. Qalong, li-node tsohle li nkoa e le makhasi (li-terminals), tse emelang letšoao ka boeona le boima ba lona (ke hore, khafetsa ea ketsahalo). Li-node tse ka hare li na le boima ba sebopeho 'me li bua ka li-node tse peli tsa litloholo. Ka tumellano e akaretsang, bit "0" e emela ho latela lekala le letšehali, le "1" - ka ho le letona. sefateng se tletseng N mahlaku le N-1 nodes ka hare. Ho khothaletsoa hore ha ho etsoa sefate sa Huffman, matšoao a sa sebelisoeng a lahloe ho fumana likhoutu tsa bolelele bo nepahetseng.

Re tla sebelisa mokoloko oa pele ho aha sefate sa Huffman, moo node e nang le maqhubu a tlaase ka ho fetisisa e tla fuoa pele. Mehato ea kaho e hlalositsoe ka tlase:

  1. Theha node ea lekhasi bakeng sa sebapali se seng le se seng 'me u se kenye moleng oa bohlokoa.
  2. Le hoja ho na le maqephe a fetang bonngoe moleng, etsa se latelang:
    • Tlosa li-node tse peli ka pele-pele (maqhubu a tlaase) ho tloha moleng;
    • Theha node e ncha ea ka hare, moo li-node tsena tse peli e tla ba bana, 'me khafetsa ea ketsahalo e tla lekana le kakaretso ea maqhubu a li-node tsena tse peli.
    • Kenya node e ncha moleng oa bohlokoa.
  3. Node e setseng e tla ba motso, 'me sena se tla phetha kaho ea sefate.

Nahana hore re na le mongolo o nang le litlhaku feela "a", "b", "c", "d" и "le", mme maqhubu a tsona a hlahang ke 15, 7, 6, 6, le 5, ka ho latellana. Ka tlase ke litšoantšo tse bontšang mehato ea algorithm.

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Tsela e tsoang motsong ho ea sebakeng leha e le sefe sa ho qetela e tla boloka khoutu e nepahetseng ea pele (eo hape e tsejoang e le khoutu ea Huffman) e tsamaellanang le sebopeho se amanang le node eo ea ho qetela.

Huffman compression algorithm
Sefate sa Huffman

Ka tlase u tla fumana ts'ebetsong ea Huffman compression algorithm ho C++ le Java:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Ela hloko: mohopolo o sebelisoang ke khoele ea ho kenya ke 47 * 8 = 376 bits le khoele e kentsoeng ke likotoana tsa 194 feela ke hore. data e petelitsoe ke hoo e ka bang 48%. Lenaneong la C ++ le ka holimo, re sebelisa sehlopha sa likhoele ho boloka khoele e kentsoeng ho etsa hore lenaneo le balehe.

Hobane libopeho tse sebetsang hantle tsa lethathamo la data li hloka ho kengoa ka 'ngoe O(log(N)) nako, empa ka ho feletseng binary sefate le N makhasi a teng 2N-1 nodes, 'me sefate sa Huffman ke sefate sa binary se feletseng, joale algorithm e kena O(Nlog(N)) nako, kae N - Litlhaku.

Lisebelisoa:

st.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Ithute haholoanyane ka thupelo.

Source: www.habr.com

Eketsa ka tlhaloso