Algoritma komprési Huffman

Sateuacan ngamimitian kursus "Algoritma pikeun Pamekar" disiapkeun pikeun anjeun tarjamahan tina bahan mangpaat séjén.

Huffman coding mangrupikeun algoritma komprési data anu ngarumuskeun ide dasar komprési file. Dina artikel ieu, urang bakal ngobrol ngeunaan encoding panjang tetep sarta variabel, kode uniquely decodable, aturan awalan, sarta ngawangun tangkal Huffman.

Urang terang yén unggal karakter disimpen salaku sekuen 0 sareng 1 sareng nyandak 8 bit. Ieu disebut encoding panjang tetep sabab unggal karakter ngagunakeun jumlah tetep bit sarua pikeun nyimpen.

Anggap we dibere téks. Kumaha urang bisa ngurangan jumlah spasi diperlukeun pikeun nyimpen hiji karakter tunggal?

Gagasan utama nyaéta encoding panjang variabel. Urang bisa ngagunakeun kanyataan yén sababaraha karakter dina téks lumangsung leuwih sering ti batur (tingali di dieu) pikeun ngembangkeun hiji algoritma anu bakal ngagambarkeun runtuyan karakter sarua dina bit pangsaeutikna. Dina encoding panjang variabel, urang nangtukeun karakter jumlah variabel bit, gumantung kana sabaraha sering aranjeunna muncul dina téks nu tangtu. Antukna, sababaraha karakter tiasa nyandak sakedik 1 bit, sedengkeun anu sanésna nyandak 2 bit, 3 atanapi langkung. Masalah sareng panyandian panjang variabel ngan ukur dekoding salajengna tina sekuen.

Kumaha, nyaho runtuyan bit, decode eta unambiguously?

Mertimbangkeun garis "abacdab". Éta ngagaduhan 8 karakter, sareng nalika ngodekeun panjangna tetep, peryogi 64 bit pikeun nyimpen éta. Catet yén frékuénsi simbol "a", "b", "c" и "D" sarua 4, 2, 1, 1. Coba bayangkeun "abacdab" saeutik bit, ngagunakeun kanyataan yén "ka" lumangsung leuwih sering ti "B"jeung "B" lumangsung leuwih sering ti "c" и "D". Hayu urang mimitian ku coding "ka" kalawan hiji bit sarua jeung 0, "B" urang bakal napelkeun kode dua-bit 11, sarta ngagunakeun tilu bit 100 jeung 011 urang bakal encode. "c" и "D".

Hasilna, urang bakal meunang:

a
0

b
11

c
100

d
011

Jadi garis "abacdab" urang bakal encode salaku 00110100011011 (0|0|11|0|100|011|0|11)ngagunakeun kodeu di luhur. Nanging, masalah utama nyaéta dina decoding. Nalika urang nyobian decode string 00110100011011, urang meunang hasil ambigu, sabab bisa digambarkeun salaku:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
jeung sajabana

Pikeun ngahindarkeun ambiguitas ieu, urang kedah mastikeun yén encoding urang nyugemakeun konsép sapertos aturan awalan, anu dina gilirannana nunjukkeun yén kodeu ngan ukur tiasa dikodekeun dina hiji cara anu unik. Aturan awalan mastikeun yén euweuh kode anu mangrupa awalan sejen. Ku kode, urang hartosna bit dipaké pikeun ngagambarkeun karakter nu tangtu. Dina conto di luhur 0 mangrupa awalan 011, nu ngalanggar aturan awalan. Janten, upami kodeu urang nyugemakeun aturan awalan, maka urang tiasa sacara unik decode (sabalikna).

Hayu urang balikan deui conto di luhur. waktos Ieu urang bakal napelkeun pikeun simbol "a", "b", "c" и "D" kode anu nyugemakeun aturan awalan.

a
0

b
10

c
110

d
111

Kalayan encoding ieu, string "abacdab" bakal disandikeun salaku 00100100011010 (0|0|10|0|100|011|0|10). sarta di dieu 00100100011010 urang bakal geus bisa unambiguously decode sarta balik deui ka string aslina urang "abacdab".

Huffman coding

Ayeuna urang geus diurus encoding panjang variabel jeung aturan awalan, hayu urang ngobrol ngeunaan encoding Huffman.

Metoda ieu dumasar kana kreasi tangkal binér. Di jerona, titik tiasa janten final atanapi internal. Mimitina, sadaya titik dianggap daun (terminal), anu ngagambarkeun simbol sorangan sareng beuratna (nyaéta, frékuénsi kajadian). Titik internal ngandung beurat karakter sareng ngarujuk kana dua titik turunan. Ku perjangjian umum, bit "0" ngagambarkeun handap cabang kénca, jeung "1" - dibeulah katuhu. dina tangkal pinuh N daun jeung N-1 titik internal. Dianjurkeun yén nalika ngawangun tangkal Huffman, simbol henteu kapake dipiceun pikeun ménta kode panjang optimal.

Urang bakal ngagunakeun antrian prioritas pikeun ngawangun tangkal Huffman, dimana titik kalayan frékuénsi panghandapna bakal dibéré prioritas pangluhurna. Léngkah-léngkah pangwangunan digambarkeun di handap ieu:

  1. Jieun titik daun pikeun tiap karakter tur tambahkeun kana antrian prioritas.
  2. Bari aya leuwih ti hiji lambar dina antrian, ngalakukeun di handap:
    • Cabut dua titik kalayan prioritas pangluhurna (frekuensi panghandapna) tina antrian;
    • Jieun titik internal anyar, dimana dua titik ieu bakal barudak, sarta frékuénsi lumangsungna bakal sarua jeung jumlah tina frékuénsi dua titik ieu.
    • Tambahkeun titik anyar kana antrian prioritas.
  3. Hiji-hijina titik sésana bakal akar, sarta ieu bakal ngalengkepan pangwangunan tangkal.

Bayangkeun yén urang gaduh sababaraha téks anu ngan ukur diwangun ku karakter "a", "b", "c", "d" и "sareng", sarta frékuénsi lumangsungna maranéhanana nyaéta 15, 7, 6, 6, jeung 5, masing-masing. Ieu di handap aya ilustrasi anu ngagambarkeun léngkah-léngkah algoritma.

Algoritma komprési Huffman

Algoritma komprési Huffman

Algoritma komprési Huffman

Algoritma komprési Huffman

Algoritma komprési Huffman

Jalur ti akar ka titik tungtung bakal nyimpen kode awalan optimal (ogé katelah kode Huffman) pakait jeung karakter pakait sareng titik tungtung éta.

Algoritma komprési Huffman
tangkal Huffman

Di handap ieu anjeun bakal mendakan palaksanaan algoritma komprési Huffman dina C ++ sareng Java:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Catetan: mémori dipaké ku string input 47 * 8 = 376 bit jeung string disandikeun ngan 194 bit i.e. data dikomprés ku kira 48%. Dina program C ++ di luhur, kami nganggo kelas string pikeun nyimpen string disandikeun sangkan program bisa dibaca.

Kusabab struktur data antrian prioritas efisien merlukeun per sisipan O(log(N)) waktos, tapi dina tangkal binér kumplit jeung N daun hadir 2N-1 titik, sarta tangkal Huffman mangrupakeun tangkal binér lengkep, lajeng algoritma dijalankeun dina O(Nlog(N)) waktos, dimana N - Aksara.

sumber:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Diajar langkung seueur ngeunaan kursus.

sumber: www.habr.com

Tambahkeun komentar