This problem presents Huffman encoding, which produces variable-length encodings based on the probability of occurrence of each different choice. More likely choices end up with shorter encodings than less likely choices.

The goal of this problem is to produce a Huffman code to encode student choices of majors. There are a total of 4 majors, and each has a probability associated with it. To produce a Huffman encoding, one begins with the 2 choices with the lowest probability. In this example, major 6-7 has a probability of 0.06, and major 6-1 has a probability of 0.09. Since these are the two lowest probabilities, these two choices are selected as the starting point for building our encoding tree. The root node that combines the two choices has a probability equal to the sum of the leaf nodes, or 0.15 in this example. We then label one side of this tree with a 0 and the other with a 1.

The next step is to find the next two smallest probabilities out of the remaining set of probabilities, where majors 6-1 and 6-7 have been replaced by node A, which has probability 0.15. In this case, our lowest probabilities are 0.15 (the probability of node A) and 0.41 (the probability of major 6-3). So we create a new node B that merges node A and major 6-3. This new node has probability 0.56. Again we label the two branches, one with a 0 and the other with a 1.

We now repeat this process one last time with the only two remaining choices, which are node B and major 6-2. This means that we make a new node C that merges node B and major 6-2. Note that the probability of this node is 1.0, because we have reached the top of our tree. Our final step is to label these last two branches.

Now that all the branches are labeled, we can traverse the tree from the root node to each leaf node in order to identify the encoding that has been assigned to the major associated with that leaf node.
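To make the construction concrete, here is a minimal Python sketch of the same greedy procedure, using the four majors and probabilities from the problem. The heap-based bookkeeping, the tie-breaking counter, and the convention that the first node popped gets the 0 branch are all illustrative choices, not part of the original problem; with this particular labeling, the sketch happens to reproduce the codes derived below.

```python
import heapq
import itertools

# Probabilities of each major, as given in the problem.
probs = {"6-1": 0.09, "6-2": 0.44, "6-3": 0.41, "6-7": 0.06}

# Heap entries are (probability, tie-breaker, node); a node is either a
# major name (leaf) or a (left, right) pair (an internal node like A, B, C).
tie = itertools.count()
heap = [(p, next(tie), name) for name, p in probs.items()]
heapq.heapify(heap)

# Repeatedly merge the two lowest-probability nodes, exactly as in the
# walkthrough: 6-7 + 6-1 -> A (0.15), A + 6-3 -> B (0.56), B + 6-2 -> C (1.0).
while len(heap) > 1:
    p0, _, first = heapq.heappop(heap)
    p1, _, second = heapq.heappop(heap)
    heapq.heappush(heap, (p0 + p1, next(tie), (first, second)))

root = heap[0][2]

def assign_codes(node, prefix=""):
    """Traverse from the root to each leaf, labeling branches 0 and 1."""
    if isinstance(node, str):          # leaf: a major name
        return {node: prefix}
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

print(assign_codes(root))
# {'6-2': '0', '6-7': '100', '6-1': '101', '6-3': '11'}
```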
We find that for major 6-1 the encoding is 101. For major 6-2, we end up with a 1-bit encoding of 0. Next we traverse the tree to identify the last two encodings and find that major 6-3 has been assigned the encoding 11, and major 6-7 the encoding 100. These encodings make sense, because we expect the major with the highest probability, in this case major 6-2, to end up with the shortest encoding. The next highest probability major is 6-3, so it ends up with the second shortest encoding, and so on.

We just saw that the encodings resulting from this Huffman encoding tree are: 101 for major 6-1, 0 for major 6-2, 11 for major 6-3, and 100 for major 6-7. Note that the Huffman encoding tree for this problem could also have been drawn like this. These two trees are identical in structure and result in the same encodings for the four majors.

Furthermore, a Huffman tree can result in more than one valid encoding. The only constraint in labeling the edges is that each node has both a 0 branch and a 1 branch; there is no constraint on which side is labeled 0 and which side 1. So, for example, we could have labeled the left side of node B with a 1 and the right side with a 0, instead of the way we originally labeled them. Note, however, that this results in a different but equally valid Huffman encoding. In this case the encoding for major 6-1 is 111, major 6-2 remains 0, major 6-3 becomes 10, and major 6-7 becomes 110. As long as the labeling is applied consistently across the selected tree, a valid Huffman encoding is produced.

We now add one more column to our table, which gives p * log2(1/p) for each of the majors. Using this information we can calculate the entropy, which is the average amount of information contained in each message. It is calculated by taking the sum of p * log2(1/p) over all choices of majors. For this problem, the entropy is 1.6.
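As a quick check on that figure, here is a small Python sketch that computes p * log2(1/p) for each major and sums the contributions; it reuses the probs dictionary from the earlier snippet, which is an illustrative name rather than anything from the original problem.

```python
import math

probs = {"6-1": 0.09, "6-2": 0.44, "6-3": 0.41, "6-7": 0.06}

# p * log2(1/p): each major's contribution to the entropy.
contributions = {m: p * math.log2(1 / p) for m, p in probs.items()}
for major, bits in contributions.items():
    print(f"{major}: {bits:.3f}")

entropy = sum(contributions.values())
print(f"entropy = {entropy:.2f} bits per major")  # ~1.60, the 1.6 quoted above
```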
We can now also calculate the average bits per major of the encodings that we have identified. This is calculated by multiplying the number of bits in each encoding by the probability of that major, and summing. Recall that our encoding for major 6-1 was 111, for major 6-2 it was 0, for major 6-3 it was 10, and finally for major 6-7 it was 110. This means that the average bits per major is 3 times 0.09, plus 1 times 0.44, plus 2 times 0.41, plus 3 times 0.06, which is 0.27 plus 0.44 plus 0.82 plus 0.18, or 1.71.

Note that this is slightly larger than the entropy, which is 1.6. This occurs because, while Huffman encoding is an efficient encoding that gets us most of the way there, some inefficiency remains: it encodes only one major at a time, rather than also considering the probabilities of seeing particular sequences of majors in a message that conveys the majors selected by a large number of students.
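To close, here is the same average-bits arithmetic as a short Python sketch, weighting each code length by its major's probability. It uses the second labeling's codes; the first labeling gives the same code lengths, so the average is identical either way.

```python
probs = {"6-1": 0.09, "6-2": 0.44, "6-3": 0.41, "6-7": 0.06}
codes = {"6-1": "111", "6-2": "0", "6-3": "10", "6-7": "110"}

# Average bits per major: sum of (code length) * (probability of that major).
avg_bits = sum(len(code) * probs[major] for major, code in codes.items())
print(f"average bits per major = {avg_bits:.2f}")  # 1.71, slightly above the ~1.6 entropy
```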