Huffman matsawa algorithm

Kafin a fara karatun "Algorithms don Developers" an shirya muku fassarar wani abu mai amfani.

Huffman codeing shine algorithm na matsa bayanai wanda ke tsara ainihin ra'ayin matsar fayil. A cikin wannan labarin, za mu yi magana game da ƙayyadaddun bayanai masu tsayi da masu canzawa, lambobi na musamman waɗanda za'a iya yankewa, ƙa'idodin ƙa'ida, da gina bishiyar Huffman.

Mun san cewa kowane hali ana adana shi azaman jerin 0's da 1's kuma yana ɗaukar ragi 8. Ana kiran wannan ƙayyadaddun ɓoye bayanan tsayi saboda kowane hali yana amfani da ƙayyadadden adadin rago don adanawa.

A ce an ba mu rubutu. Ta yaya za mu iya rage adadin sararin da ake buƙata don adana hali ɗaya?

Babban ra'ayi shine canza yanayin tsawon tsayi. Za mu iya amfani da gaskiyar cewa wasu haruffa a cikin rubutun suna faruwa sau da yawa fiye da wasu (gani nan) don haɓaka algorithm wanda zai wakilci jerin haruffa iri ɗaya a cikin ƴan kaɗan. A madaidaicin rikodi na tsawon, muna ba da haruffa nau'ikan nau'ikan ragi, dangane da sau nawa suke bayyana a cikin rubutun da aka bayar. A ƙarshe, wasu haruffa na iya ɗaukar kaɗan kamar 1 bit, yayin da wasu na iya ɗaukar 2 rago, 3 ko fiye. Matsala tare da madaidaicin rikodi na tsawon shine kawai yankewar jeri na gaba.

Ta yaya, sanin jerin ragowa, yanke shi ba tare da wata shakka ba?

Yi la'akari da layi "abinci". Yana da haruffa 8, kuma lokacin shigar da tsayayyen tsayi, zai buƙaci rago 64 don adana shi. Lura cewa mitar alamar "a", "b", "c" и "D" ba yayi daidai da 4, 2, 1, 1. Bari mu yi ƙoƙari mu yi tunanin "abinci" 'yan kaɗan, ta amfani da gaskiyar cewa "zuwa" yana faruwa akai-akai fiye da "B" bada kuma "B" ba yana faruwa akai-akai fiye da "c" и "D" ba. Bari mu fara da codeing "zuwa" tare da bit guda daidai da 0, "B" ba Za mu sanya code-bit 11 guda biyu, kuma ta amfani da ragowa uku 100 da 011 za mu ɓoye. "c" и "D" ba.

A sakamakon haka, za mu samu:

a
0

b
11

c
100

d
011

Don haka layin "abinci" za mu encode kamar yadda 00110100011011 (0|0|11|0|100|011|0|11)ta amfani da lambobin da ke sama. Koyaya, babbar matsalar zata kasance a cikin yanke hukunci. Lokacin da muka yi ƙoƙarin ƙaddamar da kirtani 00110100011011, za mu sami sakamako mai ma'ana, tun da ana iya wakilta shi kamar:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
da sauransu.

Don guje wa wannan shubuha, dole ne mu tabbatar da cewa shigar da mu ya gamsar da irin wannan ra'ayi kamar mulkin prefix, wanda hakan ke nuna cewa lambobin za a iya yanke su ta hanya ɗaya ta musamman. Ƙa'idar prefix ta tabbatar da cewa babu lambar da ke zama prefix na wani. Ta lamba, muna nufin raƙuman da aka yi amfani da su don wakiltar wani hali. A cikin misalin da ke sama 0 prefix ne 011, wanda ya saba wa ka'idar prefix. Don haka, idan lambobin mu sun gamsar da ƙa'idar prefix, to za mu iya yanke hukunci na musamman (kuma akasin haka).

Bari mu sake duba misalin da ke sama. A wannan karon za mu sanya alamomin "a", "b", "c" и "D" ba lambobin da suka gamsar da ƙa'idar prefix.

a
0

b
10

c
110

d
111

Tare da wannan rikodin, kirtani "abinci" za a yi rikodin kamar yadda 00100100011010 (0|0|10|0|100|011|0|10). Kuma a nan 00100100011010 za mu riga mun sami damar yanke maƙasudi ba tare da wata shakka ba kuma mu dawo zuwa asalin asalin mu "abinci".

Huffman coding

Yanzu da muka yi magana game da maɓalli na tsawon tsayi da ƙa'idar prefix, bari mu yi magana game da shigar da huffman.

Hanyar ta dogara ne akan ƙirƙirar bishiyoyin binary. A ciki, kumburi na iya zama ko dai na ƙarshe ko na ciki. Da farko, duk nodes ana la'akari da ganye (tashoshi), wanda ke wakiltar alamar kanta da nauyinta (wato, yawan abin da ya faru). Ƙungiyoyin ciki sun ƙunshi nauyin halayen kuma suna nufin nodes na zuriya guda biyu. By general yarjejeniya, bit «0» wakiltar bin reshe na hagu, da «1» - a dama. a cikin cikakken itace N ganye da N-1 nodes na ciki. Ana ba da shawarar cewa lokacin gina bishiyar Huffman, a jefar da alamomin da ba a yi amfani da su ba don samun ingantattun lambobi masu tsayi.

Za mu yi amfani da layin fifiko don gina bishiyar Huffman, inda za a ba da kullin tare da mafi ƙarancin mita mafi girma. An bayyana matakan ginin a ƙasa:

  1. Ƙirƙirar kumburin ganye don kowane hali kuma ƙara su zuwa layin fifiko.
  2. Yayin da akwai takarda fiye da ɗaya a cikin jerin gwano, yi haka:
    • Cire nodes biyu tare da mafi girman fifiko (mafi ƙarancin mitar) daga jerin gwano;
    • Ƙirƙirar sabon kumburi na ciki, inda waɗannan nodes biyu za su kasance yara, kuma yawan abin da ya faru zai kasance daidai da jimlar mitoci na waɗannan nodes biyu.
    • Ƙara sabon kumburi zuwa layin fifiko.
  3. Kullin da ya rage kawai shine tushen, kuma wannan zai kammala aikin ginin bishiyar.

Ka yi tunanin cewa muna da wani rubutu wanda ya ƙunshi haruffa kawai "a", "b", "c", "d" и "da", kuma mitoci na faruwarsu shine 15, 7, 6, 6, da 5, bi da bi. A ƙasa akwai zane-zane waɗanda ke nuna matakan algorithm.

Huffman matsawa algorithm

Huffman matsawa algorithm

Huffman matsawa algorithm

Huffman matsawa algorithm

Huffman matsawa algorithm

Hanya daga tushen zuwa kowane kullin ƙarshen zai adana mafi kyawun lambar prefix (wanda kuma aka sani da lambar Huffman) daidai da halin da ke da alaƙa da wannan kullin ƙarshen.

Huffman matsawa algorithm
Itacen Huffman

A ƙasa zaku sami aiwatar da Huffman matsawa algorithm a cikin C++ da Java:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Note: Ƙwaƙwalwar ajiyar da igiyar shigarwa ke amfani da ita ita ce 47 * 8 = 376 bits kuma igiyar da aka yi rikodin ita ce kawai 194 bits watau. An matsa bayanai da kusan 48%. A cikin shirin C++ da ke sama, muna amfani da ajin kirtani don adana kirtani da aka ƙulla don sanya shirin a iya karantawa.

Saboda ingantaccen tsarin bayanan layin fifiko yana buƙatar kowane sakawa O(log(N)) lokaci, amma a cikin cikakken binaryar itace tare da N ganye ba 2N-1 nodes, kuma itacen Huffman cikakkiyar bishiyar binaryar ce, sannan algorithm yana shiga O(Nlog(N)) lokaci, ku N - Halaye.

Sources:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Koyi game da kwas.

source: www.habr.com

Add a comment