Huffman compression algorithm

Ngaphambi kokuqala kwekhosi "Umgaqo-nkqubo wabaPhuhlisi" ndikulungiselele inguqulelo yomnye umbandela oluncedo.

Ikhowudi ye-Huffman yi-algorithm yoxinzelelo lwedatha eyenza ingcamango esisiseko yoxinzelelo lwefayile. Kweli nqaku, siza kuthetha malunga ne-fixed and variable length encoding, unique code decodable, rules prefix, and building a Huffman tree.

Siyazi ukuba umlinganiswa ngamnye ugcinwa njengolandelelwano lwe-0 kunye ne-1 kwaye ithatha iibhithi ezisi-8. Oku kubizwa ngokuba yi-encoding yobude obumiselweyo ngenxa yokuba unobumba ngamnye usebenzisa inani elimiselweyo lamasuntswana ukugcina.

Masithi sinikwe umbhalo. Sinokuyinciphisa njani indawo efunekayo yokugcina umlinganiswa omnye?

Uluvo oluphambili lubhalo lobude obuguquguqukayo. Singasebenzisa into yokuba abanye abalinganiswa kwisicatshulwa bavela rhoqo kunabanye (bona apha) ukuphuhlisa i-algorithm eya kubonisa ukulandelelana okufanayo kwabasebenzi kumasuntswana ambalwa. Kwi-encoding yobude obuguquguqukayo, sinikezela ngoonobumba inani eliguquguqukayo lamasuntswana, kuxhomekeke ekubeni avela kangaphi kwisicatshulwa esinikiweyo. Ekugqibeleni, abanye abalinganiswa banokuthatha kancinci njenge-1 bit, ngelixa abanye banokuthatha amasuntswana ama-2, u-3 okanye ngaphezulu. Ingxaki ngobude obuguquguqukayo bekhowudi kukuphela kwekhowudi yokulandelelana elandelayo.

Njani, ukwazi ukulandelelana kwamasuntswana, ukucacisa ngokungathandabuzekiyo?

Cinga ngomgca "uxolo". Inoonobumba abasi-8, kwaye xa kufakwa ikhowudi yobude obumiselweyo, iya kufuna amasuntswana angama-64 ukuyigcina. Qaphela ukuba isimboli rhoqo "a", "b", "c" ΠΈ "D" ilingana no-4, 2, 1, 1 ngokulandelelanayo. Makhe sizame ukuba nomfanekiso-ngqondweni "uxolo" amasuntswana ambalwa, usebenzisa into yokuba "ukuya" yenzeka rhoqo kune "B", kwaye "B" yenzeka rhoqo kune "c" ΠΈ "D". Masiqale ngekhowudi "ukuya" ngentwana enye elingana no0, "B" siyakwabela ikhowudi yamasuntswana amabini u-11, kwaye sisebenzisa amasuntswana amathathu i-100 kunye no-011 siya kuhlanganisa "c" ΠΈ "D".

Ngenxa yoko, siya kufumana:

a
0

b
11

c
100

d
011

Ngoko umgca "uxolo" siya kufakwa ikhowudi njenge 00110100011011 (0|0|11|0|100|011|0|11)usebenzisa iikhowudi ezingentla. Nangona kunjalo, eyona ngxaki iphambili iya kuba kukucacisa iikhowudi. Xa sizama ukucacisa umtya 00110100011011, sifumana isiphumo esingacacanga, kuba sinokumelwa njenge:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
njalo njalo.

Ukunqanda oku kungacaci, kufuneka siqinisekise ukuba usetyenziso lwekhowudi luyanelisa uluvo olunje umthetho wesimaphambili, nto leyo ethetha ukuba iikhowudi zinokuchazwa ngendlela enye eyodwa. Umgaqo wesimaphambili uqinisekisa ukuba akukho khowudi sisimaphambili senye. Ngekhowudi, sithetha amasuntswana asetyenziselwa ukumela umlinganiswa othile. Kulo mzekelo ungasentla 0 sisimaphambili 011, eyaphula umthetho wesimaphambili. Ngoko ke, ukuba iikhowudi zethu ziyanelisa umgaqo wesimaphambili, ngoko ke singakwazi ukukhetha (kunye nangenye indlela).

Masiphinde siwuqwalasele lo mzekelo ungasentla. Ngeli xesha siza kwabela iisimboli "a", "b", "c" ΠΈ "D" iikhowudi ezanelisa umthetho wesimaphambili.

a
0

b
10

c
110

d
111

Ngale khowudi, umtya "uxolo" izakufakwa ngekhowudi njenge 00100100011010 (0|0|10|0|100|011|0|10). Kwaye apha 00100100011010 siya kuba nakho ukucacisa ngokungathandabuzekiyo kwaye sibuyele kumtya wethu wokuqala "uxolo".

Huffman ikhowudi

Ngoku kuba siye sajongana nobhalo lobude obuguquguqukayo kunye nomthetho wesimaphambili, makhe sithethe malunga ne-Huffman encoding.

Indlela isekelwe ekudalweni kwemithi yokubini. Kuyo, i-node ingaba yinto yokugqibela okanye yangaphakathi. Ekuqaleni, zonke iinqununu zibhekwa njengamagqabi (i-terminals), ezimele isimboli ngokwayo kunye nobunzima bayo (oko kukuthi, ukuphindaphinda okwenzekayo). Iingqungquthela zangaphakathi ziqulethe ubunzima bomlingiswa kwaye zibhekiselele kwiindawo ezimbini zokuzala. Ngokuvumelana ngokubanzi, bit "0" imele ukulandela isebe lasekhohlo, kwaye "1" - ngasekunene. emthini ogcweleyo N amagqabi kunye N-1 iindawo zangaphakathi. Kucetyiswa ukuba xa kusakhiwa umthi we-Huffman, iisimboli ezingasetyenziswanga zilahlwe ukuze kufumaneke iikhowudi zobude obufanelekileyo.

Siza kusebenzisa umgca ophambili ukwakha umthi we-Huffman, apho i-node ene-frequency ephantsi iya kunikwa ingqwalasela ephezulu. Amanyathelo okwakha achazwe ngezantsi:

  1. Yenza i-node yegqabi kumlinganiswa ngamnye kwaye wongeze kumgca obalulekileyo.
  2. Ngelixa kukho ngaphezu kwephepha elinye emgceni, yenza oku kulandelayo:
    • Susa ii-nodes ezimbini ngokubaluleka okuphezulu (i-frequency ephantsi) ukusuka kumgca;
    • Yenza i-node entsha yangaphakathi, apho ezi nodi zimbini ziya kuba ngabantwana, kwaye ukwenzeka rhoqo kuya kulingana nenani le-frequencies yezi nodi zimbini.
    • Yongeza i-node entsha kumgca wokubaluleka.
  3. I-node eseleyo kuphela iya kuba yingcambu, kwaye oku kuya kugqiba ukwakhiwa komthi.

Khawucinge ukuba sinesicatshulwa esinamagama kuphela "a B C D" ΠΈ "kwaye", kwaye i-frequencies yazo yi-15, 7, 6, 6, kunye no-5, ngokulandelanayo. Ngezantsi yimifanekiso ebonisa amanyathelo e-algorithm.

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Huffman compression algorithm

Umendo osuka kwingcambu ukuya kuyo nayiphi na isiphelo sendawo iya kugcina eyona khowudi yesimaphambili (ekwaziwa ngokuba yikhowudi ye-Huffman) ehambelana nomlinganiswa onxulumene naloo nodi yesiphelo.

Huffman compression algorithm
Umthi weHuffman

Ngezantsi uya kufumana ukuphunyezwa kwe-algorithm yoxinzelelo lwe-Huffman kwi-C++ kunye neJava:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Qaphela: imemori esetyenziswe yintambo yokufaka i-47 * 8 = i-376 bits kunye nentambo ye-encoded kuphela i-194 bits i.e. idatha icinezelwe malunga ne-48%. Kwiprogram ye-C ++ ngasentla, sisebenzisa iklasi yomtya ukugcina umtya we-encoded ukwenza inkqubo ifundeke.

Ngenxa yokuba iziseko zedatha yoluhlu oluphambili olusebenzayo lufuna ufako ngalunye O(log(N)) ixesha, kodwa kumthi wokubini opheleleyo nge N amagqabi akhoyo 2N-1 iindawo, kwaye umthi we-Huffman ngumthi wokubini opheleleyo, emva koko i-algorithm ingena O(Nlog(N)) ixesha, phi N - Abalinganiswa.

Imithombo:

xh.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Funda ngakumbi malunga nekhosi.

umthombo: www.habr.com

Yongeza izimvo