Huffman siqish algoritmi

Kurs boshlanishidan oldin "Ishlab chiquvchilar uchun algoritmlar" siz uchun boshqa foydali materialning tarjimasini tayyorladi.

Huffman kodlash - bu fayllarni siqishning asosiy g'oyasini shakllantiradigan ma'lumotlarni siqish algoritmi. Ushbu maqolada biz qattiq va o'zgaruvchan uzunlikdagi kodlash, noyob dekodlanadigan kodlar, prefiks qoidalari va Huffman daraxtini qurish haqida gapiramiz.

Biz bilamizki, har bir belgi 0 va 1 lar ketma-ketligi sifatida saqlanadi va 8 bitni egallaydi. Bu qattiq uzunlikdagi kodlash deb ataladi, chunki har bir belgi saqlash uchun bir xil sobit sonli bitlardan foydalanadi.

Aytaylik, bizda matn bor. Bitta belgini saqlash uchun zarur bo'lgan joy miqdorini qanday kamaytirishimiz mumkin?

Asosiy g'oya o'zgaruvchan uzunlikdagi kodlashdir. Biz matndagi ba'zi belgilar boshqalarga qaraganda tez-tez sodir bo'lishidan foydalanishimiz mumkin (bu yerga qarang) bir xil belgilar ketma-ketligini kamroq bitlarda ifodalaydigan algoritmni ishlab chiqish. O'zgaruvchan uzunlikdagi kodlashda biz belgilarga berilgan matnda qanchalik tez-tez paydo bo'lishiga qarab, o'zgaruvchan sonli bitlarni tayinlaymiz. Oxir-oqibat, ba'zi belgilar 1 bitgacha, boshqalari esa 2 bit, 3 yoki undan ko'proq vaqt olishi mumkin. O'zgaruvchan uzunlikdagi kodlash bilan bog'liq muammo faqat ketma-ketlikning keyingi dekodlashidir.

Qanday qilib bitlar ketma-ketligini bilib, uni aniq dekodlash mumkin?

Chiziqni ko'rib chiqing "abacdab". U 8 ta belgidan iborat va belgilangan uzunlikni kodlashda uni saqlash uchun 64 bit kerak bo'ladi. Belgilangan chastotaga e'tibor bering "a", "b", "c" ΠΈ "D" mos ravishda 4, 2, 1, 1 ga teng. Keling, tasavvur qilishga harakat qilaylik "abacdab" kamroq bit, bu haqiqatdan foydalanib "to" nisbatan tez-tez uchraydi "B"va "B" nisbatan tez-tez uchraydi "c" ΠΈ "D". Keling, kodlashdan boshlaylik "to" bir bit 0 ga teng, "B" biz ikki bitli kod 11ni tayinlaymiz va uchta bit 100 va 011 yordamida biz kodlaymiz "c" ΠΈ "D".

Natijada biz quyidagilarni olamiz:

a
0

b
11

c
100

d
011

Shunday qilib, chiziq "abacdab" sifatida kodlaymiz 00110100011011 (0|0|11|0|100|011|0|11)yuqoridagi kodlardan foydalanish. Biroq, asosiy muammo dekodlashda bo'ladi. Biz satrni dekodlashga harakat qilganimizda 00110100011011, biz noaniq natijaga erishamiz, chunki u quyidagicha ifodalanishi mumkin:

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab 

...
va hokazo.

Ushbu noaniqlikni oldini olish uchun biz kodlashimiz kabi tushunchaga mos kelishini ta'minlashimiz kerak prefiks qoidasi, bu o'z navbatida kodlarni faqat bitta noyob usulda dekodlash mumkinligini anglatadi. Prefiks qoidasi hech qanday kod boshqasining prefiksi bo'lmasligini ta'minlaydi. Kod deganda biz ma'lum bir belgini ifodalash uchun ishlatiladigan bitlarni nazarda tutamiz. Yuqoridagi misolda 0 prefiks hisoblanadi 011, bu prefiks qoidasini buzadi. Shunday qilib, agar bizning kodlarimiz prefiks qoidasiga javob bersa, biz noyob tarzda dekodlashimiz mumkin (va aksincha).

Keling, yuqoridagi misolni qayta ko'rib chiqaylik. Bu safar biz belgilar uchun belgilaymiz "a", "b", "c" ΠΈ "D" prefiks qoidasini qondiradigan kodlar.

a
0

b
10

c
110

d
111

Ushbu kodlash bilan string "abacdab" sifatida kodlanadi 00100100011010 (0|0|10|0|100|011|0|10). Ammo 00100100011010 biz allaqachon aniq dekodlashimiz va asl satrimizga qaytishimiz mumkin bo'ladi "abacdab".

Huffman kodlash

Endi biz o'zgaruvchan uzunlikdagi kodlash va prefiks qoidasi bilan shug'ullanganimizdan so'ng, Huffman kodlash haqida gapiraylik.

Usul ikkilik daraxtlarni yaratishga asoslangan. Unda tugun yakuniy yoki ichki bo'lishi mumkin. Dastlab, barcha tugunlar ramzning o'zini va uning og'irligini (ya'ni paydo bo'lish chastotasini) ifodalovchi barglar (terminallar) hisoblanadi. Ichki tugunlar xarakterning og'irligini o'z ichiga oladi va ikkita nasl tuguniga ishora qiladi. Umumiy kelishuvga ko'ra, bit Β«0Β» chap novdadan keyin ifodalaydi va Β«1Β» - o'ngda. to'liq daraxtda N barglari va N-1 ichki tugunlar. Huffman daraxtini qurishda optimal uzunlik kodlarini olish uchun foydalanilmagan belgilarni olib tashlash tavsiya etiladi.

Biz Huffman daraxtini qurish uchun ustuvor navbatdan foydalanamiz, bu erda eng past chastotali tugunga eng yuqori ustunlik beriladi. Qurilish bosqichlari quyida tavsiflanadi:

  1. Har bir belgi uchun barg tugunini yarating va ularni ustuvor navbatga qo'shing.
  2. Navbatda bir nechta varaq bo'lsa, quyidagilarni bajaring:
    • Navbatdan eng yuqori ustuvorlik (eng past chastota) bo'lgan ikkita tugunni olib tashlang;
    • Yangi ichki tugunni yarating, bu erda bu ikki tugun bolalar bo'ladi va paydo bo'lish chastotasi bu ikki tugunning chastotalari yig'indisiga teng bo'ladi.
    • Ustuvor navbatga yangi tugun qo'shing.
  3. Qolgan yagona tugun ildiz bo'ladi va bu daraxtning qurilishini yakunlaydi.

Tasavvur qiling-a, bizda faqat belgilardan iborat matn bor "a B C D" ΠΈ "va", va ularning paydo bo'lish chastotalari mos ravishda 15, 7, 6, 6 va 5 ga teng. Quyida algoritm qadamlarini aks ettiruvchi rasmlar keltirilgan.

Huffman siqish algoritmi

Huffman siqish algoritmi

Huffman siqish algoritmi

Huffman siqish algoritmi

Huffman siqish algoritmi

Ildizdan istalgan so'nggi tugungacha bo'lgan yo'l ushbu tugun bilan bog'langan belgiga mos keladigan optimal prefiks kodini (Huffman kodi deb ham ataladi) saqlaydi.

Huffman siqish algoritmi
Huffman daraxti

Quyida C++ va Java-da Huffman siqish algoritmini amalga oshirishni topasiz:

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
	char ch;
	int freq;
	Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
	Node* node = new Node();

	node->ch = ch;
	node->freq = freq;
	node->left = left;
	node->right = right;

	return node;
}

// Comparison object to be used to order the heap
struct comp
{
	bool operator()(Node* l, Node* r)
	{
		// highest priority item has lowest frequency
		return l->freq > r->freq;
	}
};

// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
			unordered_map<char, string> &huffmanCode)
{
	if (root == nullptr)
		return;

	// found a leaf node
	if (!root->left && !root->right) {
		huffmanCode[root->ch] = str;
	}

	encode(root->left, str + "0", huffmanCode);
	encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
	if (root == nullptr) {
		return;
	}

	// found a leaf node
	if (!root->left && !root->right)
	{
		cout << root->ch;
		return;
	}

	index++;

	if (str[index] =='0')
		decode(root->left, index, str);
	else
		decode(root->right, index, str);
}

// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
	// count frequency of appearance of each character
	// and store it in a map
	unordered_map<char, int> freq;
	for (char ch: text) {
		freq[ch]++;
	}

	// Create a priority queue to store live nodes of
	// Huffman tree;
	priority_queue<Node*, vector<Node*>, comp> pq;

	// Create a leaf node for each character and add it
	// to the priority queue.
	for (auto pair: freq) {
		pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
	}

	// do till there is more than one node in the queue
	while (pq.size() != 1)
	{
		// Remove the two nodes of highest priority
		// (lowest frequency) from the queue
		Node *left = pq.top(); pq.pop();
		Node *right = pq.top();	pq.pop();

		// Create a new internal node with these two nodes
		// as children and with frequency equal to the sum
		// of the two nodes' frequencies. Add the new node
		// to the priority queue.
		int sum = left->freq + right->freq;
		pq.push(getNode('', sum, left, right));
	}

	// root stores pointer to root of Huffman Tree
	Node* root = pq.top();

	// traverse the Huffman Tree and store Huffman Codes
	// in a map. Also prints them
	unordered_map<char, string> huffmanCode;
	encode(root, "", huffmanCode);

	cout << "Huffman Codes are :n" << 'n';
	for (auto pair: huffmanCode) {
		cout << pair.first << " " << pair.second << 'n';
	}

	cout << "nOriginal string was :n" << text << 'n';

	// print encoded string
	string str = "";
	for (char ch: text) {
		str += huffmanCode[ch];
	}

	cout << "nEncoded string is :n" << str << 'n';

	// traverse the Huffman Tree again and this time
	// decode the encoded string
	int index = -1;
	cout << "nDecoded string is: n";
	while (index < (int)str.size() - 2) {
		decode(root, index, str);
	}
}

// Huffman coding algorithm
int main()
{
	string text = "Huffman coding is a data compression algorithm.";

	buildHuffmanTree(text);

	return 0;
}

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
	char ch;
	int freq;
	Node left = null, right = null;

	Node(char ch, int freq)
	{
		this.ch = ch;
		this.freq = freq;
	}

	public Node(char ch, int freq, Node left, Node right) {
		this.ch = ch;
		this.freq = freq;
		this.left = left;
		this.right = right;
	}
};

class Huffman
{
	// traverse the Huffman Tree and store Huffman Codes
	// in a map.
	public static void encode(Node root, String str,
							  Map<Character, String> huffmanCode)
	{
		if (root == null)
			return;

		// found a leaf node
		if (root.left == null && root.right == null) {
			huffmanCode.put(root.ch, str);
		}


		encode(root.left, str + "0", huffmanCode);
		encode(root.right, str + "1", huffmanCode);
	}

	// traverse the Huffman Tree and decode the encoded string
	public static int decode(Node root, int index, StringBuilder sb)
	{
		if (root == null)
			return index;

		// found a leaf node
		if (root.left == null && root.right == null)
		{
			System.out.print(root.ch);
			return index;
		}

		index++;

		if (sb.charAt(index) == '0')
			index = decode(root.left, index, sb);
		else
			index = decode(root.right, index, sb);

		return index;
	}

	// Builds Huffman Tree and huffmanCode and decode given input text
	public static void buildHuffmanTree(String text)
	{
		// count frequency of appearance of each character
		// and store it in a map
		Map<Character, Integer> freq = new HashMap<>();
		for (int i = 0 ; i < text.length(); i++) {
			if (!freq.containsKey(text.charAt(i))) {
				freq.put(text.charAt(i), 0);
			}
			freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
		}

		// Create a priority queue to store live nodes of Huffman tree
		// Notice that highest priority item has lowest frequency
		PriorityQueue<Node> pq = new PriorityQueue<>(
										(l, r) -> l.freq - r.freq);

		// Create a leaf node for each character and add it
		// to the priority queue.
		for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
			pq.add(new Node(entry.getKey(), entry.getValue()));
		}

		// do till there is more than one node in the queue
		while (pq.size() != 1)
		{
			// Remove the two nodes of highest priority
			// (lowest frequency) from the queue
			Node left = pq.poll();
			Node right = pq.poll();

			// Create a new internal node with these two nodes as children 
			// and with frequency equal to the sum of the two nodes
			// frequencies. Add the new node to the priority queue.
			int sum = left.freq + right.freq;
			pq.add(new Node('', sum, left, right));
		}

		// root stores pointer to root of Huffman Tree
		Node root = pq.peek();

		// traverse the Huffman tree and store the Huffman codes in a map
		Map<Character, String> huffmanCode = new HashMap<>();
		encode(root, "", huffmanCode);

		// print the Huffman codes
		System.out.println("Huffman Codes are :n");
		for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
			System.out.println(entry.getKey() + " " + entry.getValue());
		}

		System.out.println("nOriginal string was :n" + text);

		// print encoded string
		StringBuilder sb = new StringBuilder();
		for (int i = 0 ; i < text.length(); i++) {
			sb.append(huffmanCode.get(text.charAt(i)));
		}

		System.out.println("nEncoded string is :n" + sb);

		// traverse the Huffman Tree again and this time
		// decode the encoded string
		int index = -1;
		System.out.println("nDecoded string is: n");
		while (index < sb.length() - 2) {
			index = decode(root, index, sb);
		}
	}

	public static void main(String[] args)
	{
		String text = "Huffman coding is a data compression algorithm.";

		buildHuffmanTree(text);
	}
}

Eslatma: kirish satri tomonidan ishlatiladigan xotira 47 * 8 = 376 bit va kodlangan satr faqat 194 bit, ya'ni. ma'lumotlar taxminan 48% ga siqiladi. Yuqoridagi C++ dasturida biz dasturni o'qilishi mumkin bo'lishi uchun kodlangan satrni saqlash uchun string sinfidan foydalanamiz.

Chunki samarali ustuvor navbat ma'lumotlar tuzilmalari har bir qo'shishni talab qiladi O(log(N)) vaqt, lekin bilan to'liq ikkilik daraxtda N barglar mavjud 2N-1 tugunlar va Huffman daraxti to'liq ikkilik daraxtdir, keyin algoritm ishlaydi O(Nlog(N)) vaqt, qayerda N - Qahramonlar.

Manbalar:

en.wikipedia.org/wiki/Huffman_coding
en.wikipedia.org/wiki/Variable-length_code
www.youtube.com/watch?v=5wRPin4oxCo

Kurs haqida ko'proq bilib oling.

Manba: www.habr.com

a Izoh qo'shish