Huffman coding

Optimal codes - I
A code is optimal if it has the shortest expected codeword length
$$L = \sum_{i=1}^{m} p_i l_i.$$
This can be seen as an optimization problem:
$$\min \sum_{i=1}^{m} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{m} D^{-l_i} \le 1.$$

Optimal codes - II
Let's make two simplifying assumptions: no integer constraint on the codelengths, and the Kraft inequality holds with equality. We obtain a Lagrange-multiplier problem:
$$J = \sum_{i=1}^{m} p_i l_i + \lambda \left( \sum_{i=1}^{m} D^{-l_i} - 1 \right)$$
$$\frac{\partial J}{\partial l_j} = p_j - \lambda D^{-l_j} \log_e D = 0 \quad \Rightarrow \quad D^{-l_j} = \frac{p_j}{\lambda \log_e D}.$$

Optimal codes - III
Substitute $D^{-l_j} = \frac{p_j}{\lambda \log_e D}$ into the Kraft inequality:
$$\sum_{i=1}^{m} \frac{p_i}{\lambda \log_e D} = 1 \quad \Rightarrow \quad \lambda = \frac{1}{\log_e D},$$
so that $D^{-l_i} = p_i$, that is
$$l_i^* = -\log_D p_i,$$
the entropy, when we use base $D$ for logarithms. Note that
$$L^* = \sum_{i=1}^{m} p_i l_i^* = -\sum_{i=1}^{m} p_i \log_D p_i = H_D(X) \;!!$$

Optimal codes - IV
In practice the codeword lengths must be integer values, so the result obtained is a lower bound.
Theorem. The expected length of any instantaneous D-ary code for a r.v. X satisfies $L \ge H_D(X)$.
This fundamental result derives from the work of Shannon.

Optimal codes - V
What about the upper bound?
Theorem. Given a source alphabet (i.e. a r.v.) of entropy $H(X)$, it is possible to find an instantaneous binary code whose length satisfies
$$H(X) \le L < H(X) + 1.$$
A similar theorem could be stated if we use the wrong probabilities $q_i$ instead of the true ones $p_i$; the only difference is a term which accounts for the relative entropy.

The redundancy
It is defined as the average codeword length minus the entropy:
$$\text{redundancy} = L + \sum_i p_i \log p_i = L - H(X).$$
Note that $0 \le \text{redundancy} < 1$ (why?)

Compression ratio
It is the ratio between the average number of bits per symbol in the original message and the same quantity for the coded message, i.e.
$$C = \frac{\text{average original symbol length}}{\text{average compressed symbol length}} = \frac{\text{average original symbol length}}{L(X)}.$$

Uniquely decodable codes
The set of instantaneous codes is a small subset of the uniquely decodable codes. Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO. So we use instantaneous codes, which are easier to decode.

Summary
Average codeword length: $L \ge H(X)$ for uniquely decodable codes (and hence for instantaneous codes). In practice, for each r.v. X with entropy $H(X)$ we can build a code whose average codeword length satisfies
$$H(X) \le L < H(X) + 1.$$

Shannon-Fano coding
The main advantage of the Shannon-Fano technique is its simplicity. Source symbols are listed in order of non-increasing probability. The list is divided in such a way as to form two groups with probabilities as nearly equal as possible. Each symbol in the first group receives a 0 as the first digit of its codeword, while the others receive a 1. Each of these groups is then divided according to the same criterion, and additional code digits are appended. The process continues until each group contains only one message; a sketch of this splitting procedure follows.
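A minimal Python sketch of the splitting procedure just described (the function and variable names are mine, for illustration): the symbol list is assumed to be already sorted by non-increasing probability, and at each step the split point is chosen so that the two group probabilities are as nearly equal as possible.

    def shannon_fano(symbols):
        # symbols: list of (symbol, probability) pairs, sorted by
        # non-increasing probability. Returns {symbol: codeword}.
        codes = {s: "" for s, _ in symbols}

        def split(group):
            if len(group) <= 1:
                return
            total = sum(p for _, p in group)
            # choose the split point that balances the two groups best
            acc, best_i, best_diff = 0.0, 1, float("inf")
            for i in range(1, len(group)):
                acc += group[i - 1][1]
                diff = abs(2 * acc - total)  # |first group - second group|
                if diff < best_diff:
                    best_diff, best_i = diff, i
            for s, _ in group[:best_i]:      # first group: append a 0
                codes[s] += "0"
            for s, _ in group[best_i:]:      # second group: append a 1
                codes[s] += "1"
            split(group[:best_i])
            split(group[best_i:])

        split(list(symbols))
        return codes

    # The example on the next slide: probabilities 1/2 ... 1/32
    # yield the codewords 0, 10, 110, 1110, 11110, 11111.
    print(shannon_fano([("a", 1/2), ("b", 1/4), ("c", 1/8),
                        ("d", 1/16), ("e", 1/32), ("f", 1/32)]))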
Shannon-Fano coding - example

    Symb.  Prob.  Codeword
    a      1/2    0
    b      1/4    10
    c      1/8    110
    d      1/16   1110
    e      1/32   11110
    f      1/32   11111

H = 1.9375 bits, L = 1.9375 bits.

Shannon-Fano coding - exercise

    Symb.  Prob.
    *      12%
    ?       5%
    !      13%
    &       2%
    $      29%
    €      13%
    §      10%
    °       6%
    @      10%

Encode, using the Shannon-Fano algorithm.

Is Shannon-Fano coding optimal?

    Symb.  Prob.  S-F code  Alternative code
    a      0.35   00        0
    b      0.17   01        100
    c      0.17   10        101
    d      0.16   110       110
    e      0.15   111       111

The Shannon-Fano code gives L = 2.31 bits, while the alternative code gives L1 = 2.30 bits, with H = 2.2328 bits. Since another code achieves a smaller average length, Shannon-Fano coding is not, in general, optimal.

Huffman coding - I
There is another algorithm whose performance is slightly better than Shannon-Fano's: the famous Huffman coding. It works by constructing a tree bottom-up, with the symbols at the leaves. The two leaves with the smallest probabilities become siblings under a parent node whose probability equals the sum of the two children's probabilities.

Huffman coding - II
At this point the operation is repeated, considering the new parent node as well and ignoring its children. The process continues until only one parent node remains, with probability 1, which is the root of the tree. Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important). A heap-based sketch of this construction is given after the exercises below.

Huffman coding - example

    Symbol  a     b     c     d    e    f    g
    Prob.   0.05  0.05  0.1   0.2  0.3  0.2  0.1

[Figure: Huffman tree for the seven symbols; the successive merges create internal nodes with probabilities 0.1, 0.2, 0.3, 0.4, 0.6 and the root 1.0, with each left/right branch labeled 0/1.]

Huffman coding - example

    Symbol  Prob.  Codeword
    a       0.05   0000
    b       0.05   0001
    c       0.1    001
    d       0.2    01
    e       0.3    10
    f       0.2    110
    g       0.1    111

Exercise: evaluate H(X) and L(X).
H(X) = 2.5464 bits, L(X) = 2.6 bits !!

Huffman coding - exercise
Using the code of the previous example, code the sequence aeebcddegfced and calculate the compression ratio.
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Average original symbol length = 3 bits; average compressed symbol length = 34/13 bits. C = ...

Huffman coding - exercise
Using the same code, decode the sequence 0111001001000001111110.
Sol: dfdcadgf

Huffman coding - exercise

    Symb.  Prob.
    a      0.10
    b      0.03
    c      0.14
    0      0.40
    1      0.22
    2      0.04
    $      0.07

Encode with Huffman the sequence 01$cc0a02ba10 and evaluate entropy, average codeword length and compression ratio.

Huffman coding - exercise

    Symb.  Prob.
    0      0.16
    1      0.02
    2      0.15
    3      0.29
    4      0.17
    5      0.04
    %      0.17

Decode (if possible) the Huffman-coded bit stream 01001011010011110101...
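The heap-based sketch announced above (my own illustrative code, not from the slides): a min-heap holds the current subtrees, each represented by the partial codewords of its leaves, so popping twice pairs the two smallest probabilities exactly as in the bottom-up description. Under ties the pairing is arbitrary, so individual codewords may differ from the slides, but every valid Huffman code for a source has the same average length.

    import heapq
    from itertools import count

    def huffman_code(probs):
        # probs: {symbol: probability}. Returns {symbol: codeword}.
        tiebreak = count()  # breaks probability ties without comparing dicts
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)    # two smallest probabilities...
            p2, _, right = heapq.heappop(heap)   # ...become siblings
            merged = {s: "0" + c for s, c in left.items()}         # left branch: 0
            merged.update({s: "1" + c for s, c in right.items()})  # right branch: 1
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    # The seven-symbol example above: the average length comes out to 2.6 bits.
    probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
             "e": 0.3, "f": 0.2, "g": 0.1}
    code = huffman_code(probs)
    print(round(sum(p * len(code[s]) for s, p in probs.items()), 4))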
Huffman coding - notes
In Huffman coding, if at any time there is more than one way to choose the two smallest probabilities, any such pair may be chosen. Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it provides a more efficient implementation.

Huffman coding - notes
There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities. For example, for a source with probabilities 0.4, 0.2, 0.2, 0.1, 0.1 it is possible to obtain codeword lengths of 1, 2, 3, 4, 4 or of 2, 2, 2, 3, 3. It would be better to have the code whose codelengths have minimum variance, as that solution needs the minimum buffer space in the transmitter and in the receiver (the two alternatives are compared numerically in the sketch further below).

Huffman coding - notes
Schwarz defines a variant of the Huffman algorithm that allows building the code with minimum $l_{\max}$. There are several other variants; we will explain the most important ones in a while.

Optimality of Huffman coding - I
It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal. In other terms, the Huffman code has minimum redundancy. An upper bound for the redundancy has been found:
$$\text{redundancy} \le p_1 + 1 - \log_2 e + \log_2 \log_2 e \approx p_1 + 0.086,$$
where $p_1$ is the probability of the most likely symbol.

Optimality of Huffman coding - II
Why does the Huffman code "suffer" when there is one symbol with very high probability? Remember the notion of uncertainty:
$$p(x) \to 1 \quad \Rightarrow \quad -\log p(x) \to 0.$$
The main problem is given by the integer constraint on the codelengths!! This consideration opens the way to a more powerful coding... we will see it later.

Huffman coding - implementation
A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...). Nevertheless, encoding is very fast.

Huffman coding - implementation
However, the spatial and temporal complexity of the decoding phase is far more important, because, on average, decoding happens much more frequently than encoding. Consider a Huffman tree with n symbols: it has n leaves and n-1 internal nodes; each leaf stores a pointer to its symbol plus the information that it is a leaf, and each internal node stores two pointers, for a total of about 2n + 2(n-1), roughly 4n words (of 32 bits).

Huffman coding - implementation
With 1 million symbols that is 16 MB of memory! Moreover, traversing a tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes several page faults or cache misses. To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding.

canonical Huffman coding - I

    Symb.  Prob.  Code 1  Code 2  Code 3
    a      0.11   000     111     000
    b      0.12   001     110     001
    c      0.13   100     011     010
    d      0.14   101     010     011
    e      0.24   01      10      10
    f      0.26   11      00      11

[Figure: Huffman tree producing Code 1 (branch labels 0/1) and Code 2 (the complementary labels, shown in parentheses); the internal nodes have probabilities 0.23, 0.27, 0.47, 0.53 and the root 1.0.]
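Picking up the minimum-variance note from above, a short numerical check (my own, not from the slides) of the two length assignments for the source 0.4, 0.2, 0.2, 0.1, 0.1: both satisfy the Kraft inequality with equality and give the same average length, but their variances differ considerably, which is why the second is preferred.

    # Two valid Huffman codeword-length assignments for the same source
    p = [0.4, 0.2, 0.2, 0.1, 0.1]
    for lengths in ([1, 2, 3, 4, 4], [2, 2, 2, 3, 3]):
        L = sum(pi * li for pi, li in zip(p, lengths))
        var = sum(pi * (li - L) ** 2 for pi, li in zip(p, lengths))
        kraft = sum(2.0 ** -li for li in lengths)
        print(lengths, "L =", round(L, 2), "var =", round(var, 2), "Kraft =", kraft)
    # -> both codes: L = 2.2 bits, Kraft sum = 1.0;
    #    variance 1.36 for the first, 0.16 for the second.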
canonical Huffman coding - II

    Symb.  Code 3
    a      000
    b      001
    c      010
    d      011
    e      10
    f      11

This code cannot be obtained through a Huffman tree! We still call it a Huffman code because it is instantaneous and its codeword lengths are the same as those of a valid Huffman code.
Numerical sequence property: codewords with the same length are ordered lexicographically, and when the codewords are sorted in lexical order they are also in order from the longest to the shortest codeword.

canonical Huffman coding - III
The main advantage is that it is not necessary to store a tree in order to decode. We only need:
a list of the symbols, ordered according to the lexical order of the codewords;
an array with the first codeword of each distinct length.

canonical Huffman coding - IV
Encoding. Suppose there are n distinct symbols, and that for symbol i we have calculated the Huffman codelength $l_i$, with $l_i \le maxlength$ for every i.

numl[k] = number of codewords with length k
firstcode[k] = integer for the first code of length k
nextcode[k] = integer for the next codeword of length k to be assigned
symbol[-,-] = table used for decoding
codeword[i] = integer whose rightmost $l_i$ bits are the code for symbol i

    for k = 1 to maxlength { numl[k] = 0; }
    for i = 1 to n { numl[l_i] = numl[l_i] + 1; }
    firstcode[maxlength] = 0;
    for k = maxlength-1 downto 1 {
        firstcode[k] = ceil( (firstcode[k+1] + numl[k+1]) / 2 );
    }
    for k = 1 to maxlength { nextcode[k] = firstcode[k]; }
    for i = 1 to n {
        codeword[i] = nextcode[l_i];
        symbol[ l_i, nextcode[l_i] - firstcode[l_i] ] = i;
        nextcode[l_i] = nextcode[l_i] + 1;
    }

canonical Huffman - example

    i  Symb.  l_i  codeword
    1  a      2    01
    2  b      5    00000
    3  c      5    00001
    4  d      3    001
    5  e      2    10
    6  f      5    00010
    7  g      5    00011
    8  h      2    11

1. Evaluate the array numl (for lengths k = 1..5): numl = [0 3 1 0 4]
2. Evaluate the array firstcode: firstcode = [2 1 1 2 0]
3. Construct the arrays codeword and symbol:

    symbol  n=0  n=1  n=2  n=3
    k=1     -    -    -    -
    k=2     a    e    h    -
    k=3     d    -    -    -
    k=4     -    -    -    -
    k=5     b    c    f    g

canonical Huffman coding - V
Decoding. We have the arrays firstcode and symbol, plus:
nextinputbit() = function that returns the next input bit
firstcode[k] = integer for the first code of length k
symbol[k, n] = the symbol number n among those with codelength k

    v = nextinputbit(); k = 1;
    while v < firstcode[k] {
        v = 2*v + nextinputbit();
        k = k + 1;
    }
    return symbol[ k, v - firstcode[k] ];

canonical Huffman - example
Decode the bit stream 00111100000001001 (with firstcode = [2 1 1 2 0]):
001   -> symbol[3,0] = d
11    -> symbol[2,2] = h
10    -> symbol[2,1] = e
00000 -> symbol[5,0] = b
01    -> symbol[2,0] = a
001   -> symbol[3,0] = d
Decoded: dhebad
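To close, a runnable Python transcription of the two pseudocode fragments above (the array names numl, firstcode, nextcode and symbol follow the slides; the bit-string plumbing and 0-indexed lists are mine). On the eight-symbol example it reproduces the codewords, firstcode = [2, 1, 1, 2, 0] and the decoding of 00111100000001001 to dhebad.

    from math import ceil

    def canonical_code(lengths):
        # lengths: {symbol: codelength l_i}, listed in the order i = 1..n
        # of the slides. Returns (codeword, firstcode, symbol); the arrays
        # are indexed by code length k, with index 0 unused.
        maxlength = max(lengths.values())
        numl = [0] * (maxlength + 1)
        for l in lengths.values():            # numl[k]: codewords of length k
            numl[l] += 1
        firstcode = [0] * (maxlength + 1)     # firstcode[maxlength] = 0
        for k in range(maxlength - 1, 0, -1):
            firstcode[k] = ceil((firstcode[k + 1] + numl[k + 1]) / 2)
        nextcode = firstcode[:]
        codeword, symbol = {}, {}
        for s, l in lengths.items():
            codeword[s] = format(nextcode[l], "0%db" % l)  # rightmost l bits
            symbol[(l, nextcode[l] - firstcode[l])] = s
            nextcode[l] += 1
        return codeword, firstcode, symbol

    def decode(bits, firstcode, symbol):
        # bits: a string of '0'/'1'; implements the decoding loop above.
        out, pos = [], 0
        while pos < len(bits):
            v, k = int(bits[pos]), 1          # v = nextinputbit(); k = 1
            pos += 1
            while v < firstcode[k]:
                v = 2 * v + int(bits[pos])    # shift in the next input bit
                k += 1
                pos += 1
            out.append(symbol[(k, v - firstcode[k])])
        return "".join(out)

    lengths = {"a": 2, "b": 5, "c": 5, "d": 3, "e": 2,
               "f": 5, "g": 5, "h": 2}
    cw, fc, sym = canonical_code(lengths)
    print(cw)                                    # a: 01, b: 00000, ..., h: 11
    print(fc[1:])                                # [2, 1, 1, 2, 0]
    print(decode("00111100000001001", fc, sym))  # dhebad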