Algoritma kompresi Huffman

Sadurunge miwiti kursus "Algoritma kanggo Pangembang" disiapake kanggo sampeyan terjemahan saka materi liyane migunani.

Huffman coding minangka algoritma kompresi data sing ngrumusake ide dhasar kompresi file. Ing artikel iki, kita bakal ngomong babagan enkoding dawa tetep lan variabel, kode unik sing bisa didekode, aturan awalan, lan mbangun wit Huffman.

Kita ngerti yen saben karakter disimpen minangka urutan 0 lan 1 lan njupuk 8 bit. Iki diarani enkoding dawa tetep amarga saben karakter nggunakake jumlah bit sing padha kanggo disimpen.

Ayo kita duwe teks. Kepiye carane bisa nyuda jumlah papan sing dibutuhake kanggo nyimpen karakter siji?

Ide utama yaiku enkoding dawa variabel. Kita bisa nggunakake kasunyatan sing sawetara karakter ing teks kedadeyan luwih kerep tinimbang liyane (ndeleng kene) kanggo ngembangake algoritma sing bakal makili urutan karakter sing padha ing bit sing luwih sithik. Ing enkoding dawa variabel, kita nemtokake karakter nomer variabel bit, gumantung sepira kerepe padha katon ing teks tartamtu. Pungkasane, sawetara karakter bisa njupuk sethithik 1 bit, dene liyane bisa njupuk 2 bit, 3 utawa luwih. Masalah karo enkoding dawa variabel mung dekoding urutan sabanjure.

Carane ngerti urutan bit, decode iku unambiguously?

Coba baris "abacdab". Wis 8 karakter, lan nalika ngodhe dawa tetep, iku perlu 64 bit kanggo nyimpen. Elinga yen frekuensi simbol "a", "b", "c" ΠΈ "D" padha karo 4, 2, 1, 1. Coba mbayangno "abacdab" bit kurang, nggunakake kasunyatan sing "kanggo" dumadi luwih kerep tinimbang "B"lan "B" dumadi luwih kerep tinimbang "c" ΠΈ "D". Ayo diwiwiti kanthi coding "kanggo" kanthi siji bit padha karo 0, "B" kita bakal nemtokake kode loro-bit 11, lan nggunakake telung bit 100 lan 011 kita bakal encode "c" ΠΈ "D".

AkibatΓ©, kita bakal entuk:

a
0

b
11

c
100

d
011

Dadi baris "abacdab" kita bakal encode minangka 00110100011011 (0|0|11|0|100|011|0|11)nggunakake kode ing ndhuwur. Nanging, masalah utama bakal ana ing dekoding. Nalika kita nyoba kanggo decode senar 00110100011011, kita entuk asil sing ambigu, amarga bisa dituduhake minangka:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
lan liya-liyane.

Kanggo ngindhari ambiguitas iki, kita kudu mesthekake yen enkoding kita nyukupi konsep kayata aturan awalan, sing nuduhake manawa kode kasebut mung bisa didekode kanthi cara sing unik. Aturan awalan mesthekake yen ora ana kode minangka awalan liyane. Miturut kode, kita tegese bit sing digunakake kanggo makili karakter tartamtu. Ing conto ing ndhuwur 0 yaiku ater-ater 011, sing nglanggar aturan awalan. Dadi, yen kode kita marem aturan awalan, banjur kita bisa unik decode (lan kosok balene).

Coba deleng maneh conto ing ndhuwur. Wektu iki kita bakal nemtokake kanggo simbol "a", "b", "c" ΠΈ "D" kode sing marem aturan awalan.

a
0

b
10

c
110

d
111

Kanthi enkoding iki, string "abacdab" bakal dienkode minangka 00100100011010 (0|0|10|0|100|011|0|10). Lan kene 00100100011010 kita wis bisa decode unambiguously lan bali menyang senar asli kita "abacdab".

Huffman coding

Saiki kita wis ngrampungake enkoding dawa variabel lan aturan awalan, ayo ngomong babagan enkoding Huffman.

Cara kasebut adhedhasar nggawe wit binar. Ing kono, simpul bisa dadi final utawa internal. Kaping pisanan, kabeh simpul dianggep godhong (terminal), sing makili simbol dhewe lan bobote (yaiku frekuensi kedadeyan). Node internal ngemot bobot karakter lan nuduhake rong node turunan. Miturut persetujuan umum, dicokot "0" nggantosi nderek cabang kiwa, lan "1" - ing sisih tengen. ing wit lengkap N godhong lan N-1 node internal. Disaranake nalika mbangun wit Huffman, simbol sing ora digunakake dibuwak kanggo entuk kode dawa sing optimal.

Kita bakal nggunakake antrian prioritas kanggo mbangun wit Huffman, ing ngendi simpul kanthi frekuensi paling murah bakal diwenehi prioritas paling dhuwur. Langkah-langkah konstruksi diterangake ing ngisor iki:

  1. Gawe simpul godhong kanggo saben karakter lan tambahake menyang antrian prioritas.
  2. Nalika ana luwih saka siji lembar ing antrian, tindakake ing ngisor iki:
    • Copot rong simpul kanthi prioritas paling dhuwur (frekuensi paling murah) saka antrian;
    • Nggawe simpul internal anyar, ing ngendi rong simpul kasebut bakal dadi bocah, lan frekuensi kedadeyan bakal padha karo jumlah frekuensi saka rong simpul kasebut.
    • Tambah simpul anyar menyang antrian prioritas.
  3. Siji-sijine simpul sing isih ana bakal dadi oyod, lan iki bakal ngrampungake pambangunan wit kasebut.

Mbayangno yen kita duwe sawetara teks sing mung kalebu karakter "a b c d" ΠΈ "lan", lan frekuensi kedadeyane yaiku 15, 7, 6, 6, lan 5. Ing ngisor iki ana ilustrasi sing nggambarake langkah-langkah algoritma.

Algoritma kompresi Huffman

Algoritma kompresi Huffman

Algoritma kompresi Huffman

Algoritma kompresi Huffman

Algoritma kompresi Huffman

Path saka oyod menyang simpul pungkasan bakal nyimpen kode awalan sing optimal (uga dikenal minangka kode Huffman) sing cocog karo karakter sing ana gandhengane karo simpul pungkasan kasebut.

Algoritma kompresi Huffman
wit Huffman

Ing ngisor iki sampeyan bakal nemokake implementasine algoritma kompresi Huffman ing C++ lan Jawa:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Wigati: memori digunakake dening senar input 47 * 8 = 376 bit lan senar dienkode mung 194 bit i.e. data dikompres udakara 48%. Ing program C ++ ndhuwur, kita nggunakake kelas senar kanggo nyimpen senar dienkode supaya program bisa diwaca.

Amarga struktur data antrian prioritas sing efisien mbutuhake saben sisipan O(log(N)) wektu, nanging ing wit binar lengkap karo N godhong saiki 2N-1 simpul, lan wit Huffman iku wit binar lengkap, banjur algoritma mlaku ing O(Nlog(N)) wektu, ngendi N - Karakter.

Sumber:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Sinau luwih lengkap babagan kursus kasebut.

Source: www.habr.com

Add a comment