Pirms kursa sÄkuma
Huffman kodÄÅ”ana ir datu saspieÅ”anas algoritms, kas formulÄ failu saspieÅ”anas pamatideju. Å ajÄ rakstÄ mÄs runÄsim par fiksÄta un mainÄ«ga garuma kodÄÅ”anu, unikÄli dekodÄjamiem kodiem, prefiksu kÄrtulÄm un Hafmena koka izveidi.
MÄs zinÄm, ka katra rakstzÄ«me tiek saglabÄta kÄ 0 un 1 secÄ«ba un aizÅem 8 bitus. To sauc par fiksÄta garuma kodÄjumu, jo katra rakstzÄ«me saglabÄÅ”anai izmanto tÄdu paÅ”u fiksÄtu bitu skaitu.
PieÅemsim, ka mums ir dots teksts. KÄ mÄs varam samazinÄt vietas daudzumu, kas nepiecieÅ”ams vienas rakstzÄ«mes glabÄÅ”anai?
GalvenÄ ideja ir mainÄ«ga garuma kodÄÅ”ana. MÄs varam izmantot faktu, ka dažas rakstzÄ«mes tekstÄ parÄdÄs biežÄk nekÄ citas (
KÄ, zinot bitu secÄ«bu, to viennozÄ«mÄ«gi atÅ”ifrÄt?
Apsveriet lÄ«niju "abacdab". Tam ir 8 rakstzÄ«mes, un, kodÄjot fiksÄtu garumu, tÄ glabÄÅ”anai bÅ«s nepiecieÅ”ami 64 biti. Å emiet vÄrÄ, ka simbolu biežums "a", "b", "c" Šø "D" ir vienÄds ar attiecÄ«gi 4, 2, 1, 1. MÄÄ£inÄsim iedomÄties "abacdab" mazÄk bitu, izmantojot faktu, ka "uz" notiek biežÄk nekÄ "B"Un "B" notiek biežÄk nekÄ "c" Šø "D". SÄksim ar kodÄÅ”anu "uz" ar vienu bitu, kas vienÄds ar 0, "B" mÄs pieŔķirsim divu bitu kodu 11, un, izmantojot trÄ«s bitus 100 un 011, mÄs kodÄsim "c" Šø "D".
RezultÄtÄ mÄs iegÅ«sim:
a
0
b
11
c
100
d
011
TÄtad lÄ«nija "abacdab" mÄs iekodÄsim kÄ 00110100011011 (0|0|11|0|100|011|0|11)izmantojot iepriekÅ” minÄtos kodus. TomÄr galvenÄ problÄma bÅ«s dekodÄÅ”ana. Kad mÄs cenÅ”amies atÅ”ifrÄt virkni 00110100011011, mÄs iegÅ«stam neviennozÄ«mÄ«gu rezultÄtu, jo to var attÄlot Å”Ädi:
0|011|0|100|011|0|11 adacdab
0|0|11|0|100|0|11|011 aabacabd
0|011|0|100|0|11|0|11 adacabab
...
uc
Lai izvairÄ«tos no Ŕīs neskaidrÄ«bas, mums ir jÄnodroÅ”ina, lai mÅ«su kodÄjums atbilstu tÄdam jÄdzienam kÄ prefiksa noteikums, kas savukÄrt nozÄ«mÄ, ka kodus var atÅ”ifrÄt tikai vienÄ unikÄlÄ veidÄ. Prefiksa noteikums nodroÅ”ina, ka neviens kods nav cita prefikss. Ar kodu mÄs saprotam bitus, ko izmanto, lai attÄlotu noteiktu rakstzÄ«mi. IepriekÅ” minÄtajÄ piemÄrÄ 0 ir prefikss 011, kas pÄrkÄpj prefiksa noteikumu. TÄtad, ja mÅ«su kodi atbilst prefiksa likumam, mÄs varam unikÄli atÅ”ifrÄt (un otrÄdi).
ApskatÄ«sim iepriekÅ” minÄto piemÄru. Å oreiz mÄs pieŔķirsim simboliem "a", "b", "c" Šø "D" kodi, kas atbilst prefiksa noteikumam.
a
0
b
10
c
110
d
111
Izmantojot Å”o kodÄjumu, virkne "abacdab" tiks kodÄts kÄ 00100100011010 (0|0|10|0|100|011|0|10). Bet 00100100011010 mÄs jau varÄsim viennozÄ«mÄ«gi atÅ”ifrÄt un atgriezties pie savas sÄkotnÄjÄs virknes "abacdab".
Hafmena kodÄÅ”ana
Tagad, kad esam tikuÅ”i galÄ ar mainÄ«ga garuma kodÄjumu un prefiksa noteikumu, parunÄsim par Hafmena kodÄjumu.
Metodes pamatÄ ir binÄro koku izveide. TajÄ mezgls var bÅ«t gan galÄ«gs, gan iekÅ”Äjs. SÄkotnÄji visi mezgli tiek uzskatÄ«ti par lapÄm (terminÄļiem), kas attÄlo paÅ”u simbolu un tÄ svaru (tas ir, parÄdÄ«Å”anÄs biežumu). IekÅ”Äjie mezgli satur rakstzÄ«mes svaru un attiecas uz diviem pÄcnÄcÄjiem. PÄc vispÄrÄjas vienoÅ”anÄs, bit Ā«0Ā» apzÄ«mÄ sekoÅ”anu kreisajam zaram un Ā«1Ā» - pa labi. pilnÄ kokÄ N lapas un N-1 iekÅ”Äjie mezgli. Veidojot Hafmena koku, neizmantotos simbolus ieteicams izmest, lai iegÅ«tu optimÄlus garuma kodus.
MÄs izmantosim prioritÄro rindu, lai izveidotu Huffman koku, kur mezglam ar zemÄko frekvenci tiks pieŔķirta augstÄkÄ prioritÄte. BÅ«vniecÄ«bas soļi ir aprakstÄ«ti zemÄk:
- Katrai rakstzÄ«mei izveidojiet lapas mezglu un pievienojiet to prioritÄrajai rindai.
- KamÄr rindÄ ir vairÄk nekÄ viena lapa, rÄ«kojieties Å”Ädi:
- NoÅemiet no rindas divus mezglus ar augstÄko prioritÄti (zemÄko frekvenci);
- Izveidojiet jaunu iekÅ”Äjo mezglu, kur Å”ie divi mezgli bÅ«s bÄrni, un raÅ”anÄs biežums bÅ«s vienÄds ar Å”o divu mezglu frekvenÄu summu.
- Pievienojiet jaunu mezglu prioritÄrajai rindai.
- Vienīgais atlikuŔais mezgls būs sakne, un tas pabeigs koka būvniecību.
IedomÄjieties, ka mums ir kÄds teksts, kas sastÄv tikai no rakstzÄ«mÄm "a", "b", "c", "d" Šø "un", un to sastopamÄ«bas biežums ir attiecÄ«gi 15, 7, 6, 6 un 5. ZemÄk ir ilustrÄcijas, kas atspoguļo algoritma darbÄ«bas.
CeÄ¼Ä no saknes uz jebkuru gala mezglu tiks saglabÄts optimÄlais prefiksa kods (pazÄ«stams arÄ« kÄ Hafmena kods), kas atbilst ar Å”o gala mezglu saistÄ«tajai rakstzÄ«mei.
Huffman koks
ZemÄk jÅ«s atradÄ«sit Huffman saspieÅ”anas algoritma ievieÅ”anu C++ un Java:
#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;
// A Tree node
struct Node
{
char ch;
int freq;
Node *left, *right;
};
// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
Node* node = new Node();
node->ch = ch;
node->freq = freq;
node->left = left;
node->right = right;
return node;
}
// Comparison object to be used to order the heap
struct comp
{
bool operator()(Node* l, Node* r)
{
// highest priority item has lowest frequency
return l->freq > r->freq;
}
};
// traverse the Huffman Tree and store Huffman Codes
// in a map.
void encode(Node* root, string str,
unordered_map<char, string> &huffmanCode)
{
if (root == nullptr)
return;
// found a leaf node
if (!root->left && !root->right) {
huffmanCode[root->ch] = str;
}
encode(root->left, str + "0", huffmanCode);
encode(root->right, str + "1", huffmanCode);
}
// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
if (root == nullptr) {
return;
}
// found a leaf node
if (!root->left && !root->right)
{
cout << root->ch;
return;
}
index++;
if (str[index] =='0')
decode(root->left, index, str);
else
decode(root->right, index, str);
}
// Builds Huffman Tree and decode given input text
void buildHuffmanTree(string text)
{
// count frequency of appearance of each character
// and store it in a map
unordered_map<char, int> freq;
for (char ch: text) {
freq[ch]++;
}
// Create a priority queue to store live nodes of
// Huffman tree;
priority_queue<Node*, vector<Node*>, comp> pq;
// Create a leaf node for each character and add it
// to the priority queue.
for (auto pair: freq) {
pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
}
// do till there is more than one node in the queue
while (pq.size() != 1)
{
// Remove the two nodes of highest priority
// (lowest frequency) from the queue
Node *left = pq.top(); pq.pop();
Node *right = pq.top(); pq.pop();
// Create a new internal node with these two nodes
// as children and with frequency equal to the sum
// of the two nodes' frequencies. Add the new node
// to the priority queue.
int sum = left->freq + right->freq;
pq.push(getNode('', sum, left, right));
}
// root stores pointer to root of Huffman Tree
Node* root = pq.top();
// traverse the Huffman Tree and store Huffman Codes
// in a map. Also prints them
unordered_map<char, string> huffmanCode;
encode(root, "", huffmanCode);
cout << "Huffman Codes are :n" << 'n';
for (auto pair: huffmanCode) {
cout << pair.first << " " << pair.second << 'n';
}
cout << "nOriginal string was :n" << text << 'n';
// print encoded string
string str = "";
for (char ch: text) {
str += huffmanCode[ch];
}
cout << "nEncoded string is :n" << str << 'n';
// traverse the Huffman Tree again and this time
// decode the encoded string
int index = -1;
cout << "nDecoded string is: n";
while (index < (int)str.size() - 2) {
decode(root, index, str);
}
}
// Huffman coding algorithm
int main()
{
string text = "Huffman coding is a data compression algorithm.";
buildHuffmanTree(text);
return 0;
}
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
// A Tree node
class Node
{
char ch;
int freq;
Node left = null, right = null;
Node(char ch, int freq)
{
this.ch = ch;
this.freq = freq;
}
public Node(char ch, int freq, Node left, Node right) {
this.ch = ch;
this.freq = freq;
this.left = left;
this.right = right;
}
};
class Huffman
{
// traverse the Huffman Tree and store Huffman Codes
// in a map.
public static void encode(Node root, String str,
Map<Character, String> huffmanCode)
{
if (root == null)
return;
// found a leaf node
if (root.left == null && root.right == null) {
huffmanCode.put(root.ch, str);
}
encode(root.left, str + "0", huffmanCode);
encode(root.right, str + "1", huffmanCode);
}
// traverse the Huffman Tree and decode the encoded string
public static int decode(Node root, int index, StringBuilder sb)
{
if (root == null)
return index;
// found a leaf node
if (root.left == null && root.right == null)
{
System.out.print(root.ch);
return index;
}
index++;
if (sb.charAt(index) == '0')
index = decode(root.left, index, sb);
else
index = decode(root.right, index, sb);
return index;
}
// Builds Huffman Tree and huffmanCode and decode given input text
public static void buildHuffmanTree(String text)
{
// count frequency of appearance of each character
// and store it in a map
Map<Character, Integer> freq = new HashMap<>();
for (int i = 0 ; i < text.length(); i++) {
if (!freq.containsKey(text.charAt(i))) {
freq.put(text.charAt(i), 0);
}
freq.put(text.charAt(i), freq.get(text.charAt(i)) + 1);
}
// Create a priority queue to store live nodes of Huffman tree
// Notice that highest priority item has lowest frequency
PriorityQueue<Node> pq = new PriorityQueue<>(
(l, r) -> l.freq - r.freq);
// Create a leaf node for each character and add it
// to the priority queue.
for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
pq.add(new Node(entry.getKey(), entry.getValue()));
}
// do till there is more than one node in the queue
while (pq.size() != 1)
{
// Remove the two nodes of highest priority
// (lowest frequency) from the queue
Node left = pq.poll();
Node right = pq.poll();
// Create a new internal node with these two nodes as children
// and with frequency equal to the sum of the two nodes
// frequencies. Add the new node to the priority queue.
int sum = left.freq + right.freq;
pq.add(new Node('', sum, left, right));
}
// root stores pointer to root of Huffman Tree
Node root = pq.peek();
// traverse the Huffman tree and store the Huffman codes in a map
Map<Character, String> huffmanCode = new HashMap<>();
encode(root, "", huffmanCode);
// print the Huffman codes
System.out.println("Huffman Codes are :n");
for (Map.Entry<Character, String> entry : huffmanCode.entrySet()) {
System.out.println(entry.getKey() + " " + entry.getValue());
}
System.out.println("nOriginal string was :n" + text);
// print encoded string
StringBuilder sb = new StringBuilder();
for (int i = 0 ; i < text.length(); i++) {
sb.append(huffmanCode.get(text.charAt(i)));
}
System.out.println("nEncoded string is :n" + sb);
// traverse the Huffman Tree again and this time
// decode the encoded string
int index = -1;
System.out.println("nDecoded string is: n");
while (index < sb.length() - 2) {
index = decode(root, index, sb);
}
}
public static void main(String[] args)
{
String text = "Huffman coding is a data compression algorithm.";
buildHuffmanTree(text);
}
}
PiezÄ«me: atmiÅa, ko izmanto ievades virkne, ir 47 * 8 = 376 biti, un kodÄtÄ virkne ir tikai 194 biti, t.i. dati tiek saspiesti par aptuveni 48%. IepriekÅ” minÄtajÄ C++ programmÄ mÄs izmantojam virkÅu klasi, lai saglabÄtu kodÄto virkni, lai programma bÅ«tu lasÄma.
TÄ kÄ efektÄ«vai prioritÄro rindu datu struktÅ«rÄm ir nepiecieÅ”ama katra ievietoÅ”ana O(log(N)) laikÄ, bet pilnÄ«gÄ binÄrÄ kokÄ ar N lapas klÄt 2N-1 mezgliem, un Hafmena koks ir pilnÄ«gs binÄrs koks, tad tiek palaists algoritms O(Nlog(N)) laiks, kur N - PersonÄži.
Avoti:
Avots: www.habr.com