1 00:00:00,000 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,059 Commons license. 3 00:00:04,059 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,720 continue to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,350 To make a donation or view additional materials 6 00:00:13,350 --> 00:00:17,290 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,290 --> 00:00:18,294 at ocw.mit.edu. 8 00:00:28,437 --> 00:00:29,770 PROFESSOR: OK, let's get started. 9 00:00:32,390 --> 00:00:34,410 Let's get started, please. 10 00:00:34,410 --> 00:00:38,630 All right, last time we talked about information and entropy. 11 00:00:38,630 --> 00:00:40,970 The picture we had was of some kind 12 00:00:40,970 --> 00:00:44,435 of a source emitting symbols. 13 00:00:50,360 --> 00:00:54,230 Symbols-- let's say n of them. 14 00:00:54,230 --> 00:01:00,875 So it chooses from these symbols with probabilities P1 up to Pn. 15 00:01:04,110 --> 00:01:16,330 And then we talked about the expected information here, 16 00:01:16,330 --> 00:01:24,500 or the entropy, so the expected information 17 00:01:24,500 --> 00:01:27,800 you get when you see the symbol that's emitted by the source. 18 00:01:27,800 --> 00:01:31,670 And that was the average value of the information. 19 00:01:31,670 --> 00:01:36,590 So it was-- let's see, you take log of 1 20 00:01:36,590 --> 00:01:39,782 over P i for each of the possible symbols. 21 00:01:39,782 --> 00:01:41,240 And then you've got to weight it by 22 00:01:41,240 --> 00:01:44,960 the corresponding probability to get an expectation. 23 00:01:44,960 --> 00:01:48,480 And this was the entropy of the source. 24 00:01:48,480 --> 00:01:50,720 Or if you want to make explicit the source, 25 00:01:50,720 --> 00:01:55,190 you could say H of S for source-- 26 00:01:55,190 --> 00:01:59,413 capital S. All right? 27 00:01:59,413 --> 00:02:01,580 And then we were actually thinking of this operating 28 00:02:01,580 --> 00:02:02,360 repeatedly. 29 00:02:02,360 --> 00:02:06,440 So in the model we had last time, the source at each time 30 00:02:06,440 --> 00:02:08,870 chooses from one of these symbols with this probability. 31 00:02:08,870 --> 00:02:11,760 And it does it independently of choices at other times. 32 00:02:11,760 --> 00:02:14,390 So what the source actually generates 33 00:02:14,390 --> 00:02:20,810 is what's referred to as an iid sequence of symbols, 34 00:02:20,810 --> 00:02:25,852 independent, identically distributed. 35 00:02:25,852 --> 00:02:26,810 You'll see this a lot-- 36 00:02:32,570 --> 00:02:34,790 Or iid sequence of symbols. 37 00:02:39,860 --> 00:02:43,927 So the independent part of this refers to the fact 38 00:02:43,927 --> 00:02:45,510 that it makes the choice independently 39 00:02:45,510 --> 00:02:46,980 at each time instant. 40 00:02:46,980 --> 00:02:49,075 The identically distributed means 41 00:02:49,075 --> 00:02:50,700 that at each time instant, it goes back 42 00:02:50,700 --> 00:02:51,900 to these same probabilities. 43 00:02:51,900 --> 00:02:55,020 It's the same distribution that it uses each time. 44 00:02:55,020 --> 00:02:56,430 So that's what iid means-- 45 00:02:56,430 --> 00:02:59,220 so sort of a stationary probabilistic source 46 00:02:59,220 --> 00:03:01,560 with no dependence from one time instant to the next. 47 00:03:06,150 --> 00:03:10,320 Average information was measured in bits per symbol.
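Here is that entropy computation as a minimal Python sketch; the example distributions are made up for illustration.

```python
from math import log2

def entropy(probs):
    # H = sum over i of P_i * log2(1 / P_i); zero-probability
    # symbols contribute nothing to the sum.
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit per symbol: a fair binary choice
print(entropy([0.25] * 4))        # 2.0 bits per symbol: four equally likely symbols
print(entropy([0.7, 0.2, 0.1]))   # about 1.157 bits per symbol
```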
48 00:03:13,560 --> 00:03:16,260 And what we wanted to do was take those symbols 49 00:03:16,260 --> 00:03:24,450 and compress them to binary digits. 50 00:03:30,740 --> 00:03:32,780 OK, so we were going to-- 51 00:03:32,780 --> 00:03:34,790 you can compress them to other things. 52 00:03:34,790 --> 00:03:36,957 We were going to think of compressing them to binary 53 00:03:36,957 --> 00:03:41,000 digits because we're thinking of a channel that can take 0s and 1s 54 00:03:41,000 --> 00:03:43,710 or a signal that's in two possible states. 55 00:03:43,710 --> 00:03:46,820 So what we wanted to do was take each symbol or sequence 56 00:03:46,820 --> 00:03:50,640 of symbols and code it in the form of binary digits. 57 00:03:50,640 --> 00:03:51,140 Right? 58 00:03:53,960 --> 00:03:57,530 Now, each binary digit can, at most, 59 00:03:57,530 --> 00:03:59,300 carry one bit of information. 60 00:03:59,300 --> 00:04:03,525 If the binary digit is equally likely to be a 0 or a 1, 61 00:04:03,525 --> 00:04:05,150 then it carries one bit of information. 62 00:04:05,150 --> 00:04:07,400 So that tells you really that if you're going 63 00:04:07,400 --> 00:04:10,760 to code this, the code length-- 64 00:04:13,530 --> 00:04:17,190 let's see-- compress to binary digits, let's say, or encode. 65 00:04:20,140 --> 00:04:22,500 And what we need is the expected code length. 66 00:04:29,960 --> 00:04:37,880 L should be greater than or equal to H. So 67 00:04:37,880 --> 00:04:42,680 you need to transmit at least this many binary digits 68 00:04:42,680 --> 00:04:44,840 on average to convey the information that's 69 00:04:44,840 --> 00:04:46,430 coming out of the source-- 70 00:04:46,430 --> 00:04:50,015 per symbol or per time step. 71 00:04:50,015 --> 00:04:51,640 All right, so that was the basic setup. 72 00:04:56,040 --> 00:05:00,210 I've given you one of these bounds here. 73 00:05:00,210 --> 00:05:02,010 When we talked about codes, by the way, 74 00:05:02,010 --> 00:05:06,240 we decided that if we're talking about binary codes, 75 00:05:06,240 --> 00:05:12,670 we want to limit ourselves to what are called instantaneously 76 00:05:12,670 --> 00:05:19,420 decodable or prefix-free codes. 77 00:05:19,420 --> 00:05:21,220 And these are codes that correspond 78 00:05:21,220 --> 00:05:24,940 to the leaves of a code tree. 79 00:05:24,940 --> 00:05:27,820 So we had examples of this type. 80 00:05:27,820 --> 00:05:29,860 You want your symbols to be associated 81 00:05:29,860 --> 00:05:34,000 with the leaves of-- the end of the tree, not 82 00:05:34,000 --> 00:05:35,660 intermediate points. 83 00:05:35,660 --> 00:05:38,200 The reason being that, as you work 84 00:05:38,200 --> 00:05:40,990 your way down the tree-- 85 00:05:40,990 --> 00:05:44,800 by the way, I'm assuming that this picture makes sense 86 00:05:44,800 --> 00:05:48,520 to you in some fashion from recitation. 87 00:05:48,520 --> 00:05:50,800 But as you work your way down to the symbol, 88 00:05:50,800 --> 00:05:52,940 you don't encounter any other symbols on the way. 89 00:05:52,940 --> 00:05:54,398 So as soon as you hit the leaf, you 90 00:05:54,398 --> 00:05:55,840 know what symbol you've got. 91 00:05:55,840 --> 00:05:58,330 So we're limiting ourselves to codes 92 00:05:58,330 --> 00:06:00,190 of that type because some of the statements 93 00:06:00,190 --> 00:06:03,050 I make are not true if you don't have codes of this type. 94 00:06:03,050 --> 00:06:06,250 So I won't comment on that again.
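To make the prefix-free condition concrete, here is a small sketch that checks whether a candidate codebook has the property; the two codebooks shown are illustrative.

```python
def is_prefix_free(codewords):
    # Instantaneously decodable means no codeword is a prefix of another,
    # i.e. every codeword sits at a leaf of the code tree.
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free(["0", "10", "110", "111"]))  # True: all codewords are leaves
print(is_prefix_free(["0", "01", "11"]))          # False: "0" is a prefix of "01"
```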
95 00:06:06,250 --> 00:06:11,380 All right, so we've got that, the first inequality 96 00:06:11,380 --> 00:06:13,210 that I've put up there. 97 00:06:13,210 --> 00:06:16,000 And it turns out that Shannon showed 98 00:06:16,000 --> 00:06:21,078 how to actually construct codes that will give you 99 00:06:21,078 --> 00:06:22,120 a bound on the other side. 100 00:06:22,120 --> 00:06:26,840 Let me actually write it the way it is on the slide. 101 00:06:26,840 --> 00:06:30,730 So Shannon showed how to get codes that satisfy this-- so 102 00:06:30,730 --> 00:06:36,430 you can get codes satisfying this. 103 00:06:40,060 --> 00:06:41,830 So Shannon showed how to get within one 104 00:06:41,830 --> 00:06:44,740 of the lower bound in terms of the expected length 105 00:06:44,740 --> 00:06:45,260 of the code. 106 00:06:45,260 --> 00:06:47,860 So that was pretty good. 107 00:06:47,860 --> 00:06:51,250 But after coming up with this paper in '48 108 00:06:51,250 --> 00:06:54,760 and working on this for a while, neither he nor other luminaries 109 00:06:54,760 --> 00:06:58,300 in the field had found how to get the best such code, 110 00:06:58,300 --> 00:07:00,130 and that's what Huffman ended up doing. 111 00:07:00,130 --> 00:07:05,380 So we've talked about that already. 112 00:07:05,380 --> 00:07:07,900 OK, so Huffman showed how to get a code 113 00:07:07,900 --> 00:07:10,690 of minimum expected length per symbol 114 00:07:10,690 --> 00:07:12,632 with a very simple construction. 115 00:07:15,940 --> 00:07:22,890 Now, you can actually extend Huffman-- 116 00:07:22,890 --> 00:07:26,550 and maybe you talked about this in recitation as well. 117 00:07:26,550 --> 00:07:28,050 So you can code per symbol, or you 118 00:07:28,050 --> 00:07:31,890 can decide you're going to create super-symbols. 119 00:07:31,890 --> 00:07:37,590 Take the same source, but say that the symbols that it emits 120 00:07:37,590 --> 00:07:41,110 are the symbols from here grouped two at a time. 121 00:07:41,110 --> 00:07:43,650 So you're going to take the symbol emitted 122 00:07:43,650 --> 00:07:45,510 at some particular time and then the symbol 123 00:07:45,510 --> 00:07:48,930 at the following time and call that a super-symbol. 124 00:07:48,930 --> 00:07:51,750 And then take the next pair, and that's 125 00:07:51,750 --> 00:07:52,962 a super-symbol and so on. 126 00:07:52,962 --> 00:07:54,420 So you're doing the Huffman coding, 127 00:07:54,420 --> 00:07:57,225 but on pairs of symbols. 128 00:07:57,225 --> 00:08:00,480 So you can go through the same kind of construction. 129 00:08:00,480 --> 00:08:04,410 If you're assuming an iid source, then the probability 130 00:08:04,410 --> 00:08:07,983 of a paired super-symbol is easy to compute. 131 00:08:07,983 --> 00:08:09,900 It's just the product of the probabilities of the individual ones 132 00:08:09,900 --> 00:08:12,610 because they're independently emitted. 133 00:08:12,610 --> 00:08:16,530 And then the entropy of the resulting source 134 00:08:16,530 --> 00:08:19,440 here turns out to be twice the entropy of the source 135 00:08:19,440 --> 00:08:22,330 here because these are independent emissions, 136 00:08:22,330 --> 00:08:24,450 so the entropies will just add. 137 00:08:24,450 --> 00:08:28,200 So you can do the Huffman construction again. 138 00:08:28,200 --> 00:08:30,450 And what you discover is the same kind of thing 139 00:08:30,450 --> 00:08:37,830 except this is now the inequality, right?
140 00:08:37,830 --> 00:08:39,000 And the reason is-- 141 00:08:39,000 --> 00:08:42,750 well, here L is still the expected length per symbol. 142 00:08:42,750 --> 00:08:45,210 But you're doing pairs now, so the expected length 143 00:08:45,210 --> 00:08:47,310 for the pair is 2L. 144 00:08:47,310 --> 00:08:48,570 Right? 145 00:08:48,570 --> 00:08:50,670 The lower bound is the entropy of the source. 146 00:08:50,670 --> 00:08:52,110 That's 2H. 147 00:08:52,110 --> 00:08:55,090 The upper bound is the entropy of that source plus 1. 148 00:08:55,090 --> 00:08:56,940 So you can construct a code of that type. 149 00:08:56,940 --> 00:08:59,707 You can do it with Shannon's construction or Huffman's. 150 00:08:59,707 --> 00:09:01,290 And now see what you've managed to do. 151 00:09:01,290 --> 00:09:05,610 You've got a little tighter of a squeeze on the expected length. 152 00:09:14,100 --> 00:09:19,040 So we've gone from H plus 1 to H plus 1/2 153 00:09:19,040 --> 00:09:20,810 with this construction. 154 00:09:20,810 --> 00:09:25,910 If you took triples, this would just change to 1 over 3. 155 00:09:25,910 --> 00:09:29,030 If you took K-tuples, you'd get 1 over K. 156 00:09:29,030 --> 00:09:31,970 So if you encode larger and larger blocks, 157 00:09:31,970 --> 00:09:33,680 you can squeeze the expected length 158 00:09:33,680 --> 00:09:37,140 down to essentially what the entropy bound tells you. 159 00:09:41,250 --> 00:09:44,670 Now, Huffman-- you've spent time in recitation. 160 00:09:44,670 --> 00:09:48,380 I just thought I would quickly run through an example 161 00:09:48,380 --> 00:09:53,520 so that you have this fresh in your minds. 162 00:09:53,520 --> 00:09:56,440 So we start off with a set of symbols. 163 00:09:56,440 --> 00:09:58,890 This is kind of weak, but I hope you can see it. 164 00:09:58,890 --> 00:10:03,060 A set of symbols, A through D in this case, with probabilities 165 00:10:03,060 --> 00:10:04,560 associated with them. 166 00:10:04,560 --> 00:10:06,930 The Huffman process is to first sort 167 00:10:06,930 --> 00:10:09,570 these symbols in descending order of probability. 168 00:10:09,570 --> 00:10:11,580 So that's what I really start with. 169 00:10:11,580 --> 00:10:13,290 You take the two smallest ones and lump 170 00:10:13,290 --> 00:10:19,548 them together to get a paired symbol, rearrange, reorder. 171 00:10:19,548 --> 00:10:21,090 And then you do the same thing again. 172 00:10:21,090 --> 00:10:24,210 You take the two, combine them, reorder. 173 00:10:24,210 --> 00:10:27,660 Take the two smallest ones, combine them, reorder. 174 00:10:27,660 --> 00:10:31,440 And that's what you have for your reduction phase. 175 00:10:31,440 --> 00:10:33,210 And then you start to trace back. 176 00:10:33,210 --> 00:10:37,380 So when you trace back, you can pick the upper one to be 0, 177 00:10:37,380 --> 00:10:39,060 the lower one to be 1. 178 00:10:39,060 --> 00:10:41,950 And then every time you get a bifurcation, as you go back, 179 00:10:41,950 --> 00:10:46,110 you'll pick the upper one to be 0 and the lower one to be 1. 180 00:10:46,110 --> 00:10:48,990 And you start to build up your code word, right? 181 00:10:48,990 --> 00:10:51,070 So this one traces back. 182 00:10:51,070 --> 00:10:52,730 There's no bifurcation. 183 00:10:52,730 --> 00:10:53,620 This traces back. 184 00:10:53,620 --> 00:10:56,430 The 0 becomes 0001. 185 00:10:56,430 --> 00:10:59,410 And you go all the way like that. 186 00:10:59,410 --> 00:10:59,910 OK? 187 00:10:59,910 --> 00:11:06,475 So trace back-- let's see.
188 00:11:06,475 --> 00:11:07,430 Oh, was there a-- 189 00:11:07,430 --> 00:11:07,930 yeah. 190 00:11:07,930 --> 00:11:10,657 So the 1 here becomes a 1 0 and a 1 1. 191 00:11:10,657 --> 00:11:12,740 And then at the next step, you're all the way back 192 00:11:12,740 --> 00:11:15,590 with the Huffman code. 193 00:11:15,590 --> 00:11:16,640 Right? 194 00:11:16,640 --> 00:11:19,580 So that's the Huffman code for that set of symbols. 195 00:11:19,580 --> 00:11:20,960 It's a Huffman code. 196 00:11:20,960 --> 00:11:23,620 I shouldn't say the Huffman code because, if you notice, 197 00:11:23,620 --> 00:11:26,480 at various stages we had probabilities 198 00:11:26,480 --> 00:11:30,920 that were identical, like over here and over here and over 199 00:11:30,920 --> 00:11:31,700 here. 200 00:11:31,700 --> 00:11:34,790 And we could have chosen how to order these things 201 00:11:34,790 --> 00:11:37,787 and then how to do the subsequent grouping. 202 00:11:37,787 --> 00:11:39,620 And all of those will give you Huffman codes 203 00:11:39,620 --> 00:11:41,720 with the same minimum expected length. 204 00:11:47,270 --> 00:11:47,770 All right. 205 00:11:53,062 --> 00:11:55,270 All right, I want to give you another way of thinking 206 00:11:55,270 --> 00:11:59,245 about entropy and why it enters into coding. 207 00:12:02,470 --> 00:12:04,150 And here's the basic idea. 208 00:12:04,150 --> 00:12:07,570 All right, so we're still thinking about the source 209 00:12:07,570 --> 00:12:09,640 emitting independent symbols. 210 00:12:09,640 --> 00:12:11,630 It's an iid source. 211 00:12:11,630 --> 00:12:13,750 And we've got a very long string of emissions. 212 00:12:13,750 --> 00:12:23,910 So we've got a very long string of symbols emitted, 213 00:12:23,910 --> 00:12:31,800 maybe S1 at the first time, S17 here, S2 here, and so on. 214 00:12:31,800 --> 00:12:34,100 And the question is, in a very long string of symbols, 215 00:12:34,100 --> 00:12:37,130 how many times do you expect to see symbol S1? 216 00:12:37,130 --> 00:12:39,793 How many times do you expect to see symbol S2, and so on? 217 00:12:39,793 --> 00:12:41,210 Well, if you actually work it out, 218 00:12:41,210 --> 00:12:43,340 it turns out that the expected number 219 00:12:43,340 --> 00:12:56,150 of times we see SI in the K symbols 220 00:12:56,150 --> 00:13:01,260 is K times the probability of seeing SI. 221 00:13:01,260 --> 00:13:03,030 So it's what you'd expect. 222 00:13:03,030 --> 00:13:03,770 All right? 223 00:13:03,770 --> 00:13:06,140 So the expected number of times is that. 224 00:13:06,140 --> 00:13:09,470 Well, but that doesn't tell you the number of times 225 00:13:09,470 --> 00:13:12,117 you'll actually see it in any given experiment. 226 00:13:12,117 --> 00:13:14,450 We know that you need to think about standard deviations 227 00:13:14,450 --> 00:13:16,050 as well. 228 00:13:16,050 --> 00:13:24,910 So what this is saying is, for instance, for symbol SI, 229 00:13:24,910 --> 00:13:32,382 that we expect to get that many of symbol SI. 230 00:13:32,382 --> 00:13:34,340 But actually, there's a distribution around it. 231 00:13:34,340 --> 00:13:39,050 So you'll get a little histogram here. 232 00:13:39,050 --> 00:13:41,990 I'm not making any attempt to draw it very carefully, 233 00:13:41,990 --> 00:13:43,958 but there's a distribution. 234 00:13:43,958 --> 00:13:45,500 You run different experiments, you're 235 00:13:45,500 --> 00:13:50,300 going to get different numbers of SI in that run of K. Right?
236 00:13:50,300 --> 00:13:51,510 So there's a distribution. 237 00:13:51,510 --> 00:13:53,450 And it turns out you can actually 238 00:13:53,450 --> 00:13:56,360 write an explicit formula for the standard deviation. 239 00:14:02,820 --> 00:14:05,960 This is something you'll see if you do a probability course. 240 00:14:05,960 --> 00:14:08,540 It's actually very simple-- the square root of K times P i times 1 minus P i. 241 00:14:11,730 --> 00:14:13,170 So that's the standard deviation. 242 00:14:13,170 --> 00:14:18,750 So the standard deviation goes as root K. 243 00:14:18,750 --> 00:14:21,510 So the interesting thing is that the standard deviation, 244 00:14:21,510 --> 00:14:24,270 as a fraction of the expected number of successes, 245 00:14:24,270 --> 00:14:27,270 gets smaller and smaller 246 00:14:27,270 --> 00:14:29,490 as K becomes larger and larger. 247 00:14:29,490 --> 00:14:34,560 Or another way to see that is, if I normalize this, 248 00:14:34,560 --> 00:14:37,680 so I'm going to look at the number of successes 249 00:14:37,680 --> 00:14:43,990 divided by K. This histogram is going to cluster around P i. 250 00:14:47,340 --> 00:14:52,453 And the standard deviation now, because I've divided by K, 251 00:14:52,453 --> 00:14:54,120 actually ends up being the square root of 252 00:14:54,120 --> 00:15:02,040 P i times 1 minus P i, divided by the square root of K. All right? 253 00:15:04,800 --> 00:15:09,450 So what this says is if you get a run of K 254 00:15:09,450 --> 00:15:11,670 emissions of the symbol and you try to estimate 255 00:15:11,670 --> 00:15:16,560 the probability P i by taking the ratio of the number of times SI appears 256 00:15:16,560 --> 00:15:18,570 to the total run length K, you'll 257 00:15:18,570 --> 00:15:23,070 actually get a little spread here centered on P i. 258 00:15:23,070 --> 00:15:25,747 But the spread actually goes down as 1 over root K. 259 00:15:25,747 --> 00:15:28,330 So this is really what the law of large numbers is telling us. 260 00:15:28,330 --> 00:15:30,610 It's telling us that if you take a very long run, 261 00:15:30,610 --> 00:15:35,880 you almost certainly get a number of successes close to the expected number-- 262 00:15:35,880 --> 00:15:37,830 well, Kp i in this case. 263 00:15:37,830 --> 00:15:39,570 It's very tightly concentrated. 264 00:15:42,410 --> 00:15:44,910 All right, we don't want you to remember all these formulas. 265 00:15:44,910 --> 00:15:46,110 I have them on the slides. 266 00:15:46,110 --> 00:15:48,010 It's just there for fun. 267 00:15:48,010 --> 00:15:50,580 There's something else that I put on there 268 00:15:50,580 --> 00:15:51,900 that you can try out for fun. 269 00:15:51,900 --> 00:15:55,560 I don't want to talk through it, but you can use exactly this 270 00:15:55,560 --> 00:15:58,500 to analyze things like polling. 271 00:15:58,500 --> 00:16:02,460 Why is it that you can poll 2,500 people 272 00:16:02,460 --> 00:16:04,170 and say that I've got a margin of error 273 00:16:04,170 --> 00:16:07,770 of 1% as to how the election is going to turn out? 274 00:16:07,770 --> 00:16:10,778 Well, the answer is, actually, in exactly this. 275 00:16:10,778 --> 00:16:12,820 If we have time at the end, I'll come back to it. 276 00:16:12,820 --> 00:16:17,500 But it's easy enough that you can look at it yourself. 277 00:16:17,500 --> 00:16:20,730 So let's focus on what it is I wanted to show you. 278 00:16:25,150 --> 00:16:28,970 I picked Obama 0.55, but that was just as an illustration. 279 00:16:28,970 --> 00:16:29,670 [LAUGHTER] 280 00:16:29,670 --> 00:16:31,710 No. 281 00:16:31,710 --> 00:16:35,070 No political views to be imputed to that.
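Here is a quick simulation of that 1-over-root-K concentration, which is also the arithmetic behind the polling claim; the 0.55 and the sample sizes are illustrative choices, not figures from the lecture.

```python
import random

def empirical_spread(p, k, trials=1000):
    # Run many experiments of k iid draws each, and measure how much the
    # observed fraction of successes scatters around the true p.
    fracs = [sum(random.random() < p for _ in range(k)) / k
             for _ in range(trials)]
    mean = sum(fracs) / trials
    sd = (sum((f - mean) ** 2 for f in fracs) / trials) ** 0.5
    return mean, sd

for k in (100, 2500):
    mean, sd = empirical_spread(0.55, k)
    predicted = (0.55 * 0.45 / k) ** 0.5   # sqrt(P(1 - P) / K)
    print(k, round(mean, 3), round(sd, 4), round(predicted, 4))
# At k = 2500 the predicted spread is about 0.01: the 1% margin of error.
```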
282 00:16:35,070 --> 00:16:37,650 All right, so what we're saying is you've 283 00:16:37,650 --> 00:16:43,200 got K emissions of this symbol. 284 00:16:43,200 --> 00:16:45,210 And with very high probability, you've 285 00:16:45,210 --> 00:16:56,260 got Kp1 of S1, Kp2 of S2, and so on. 286 00:16:56,260 --> 00:16:58,350 So this is really what you're expecting to get, 287 00:16:58,350 --> 00:17:05,030 provided you've tossed this a large number of times. 288 00:17:05,030 --> 00:17:09,319 What's the probability of getting a sequence that has Kp1 289 00:17:09,319 --> 00:17:13,349 of S1, Kp2 of S2, and so on? 290 00:17:13,349 --> 00:17:17,780 So you've got to get S1 in Kp1 positions. 291 00:17:17,780 --> 00:17:19,640 What's the probability of that? 292 00:17:19,640 --> 00:17:23,480 And you've got to get S2 in Kp2 positions. 293 00:17:23,480 --> 00:17:27,109 So how do you work out those probabilities? 294 00:17:27,109 --> 00:17:29,220 We're invoking independence of all the emissions. 295 00:17:29,220 --> 00:17:31,560 So you can multiply probabilities. 296 00:17:31,560 --> 00:17:35,690 So what you have is S1 occurring with probability 297 00:17:35,690 --> 00:17:39,830 P1 to the power Kp1, because P1 is 298 00:17:39,830 --> 00:17:41,640 the probability with which S1 occurs, 299 00:17:41,640 --> 00:17:43,370 and it's happening Kp1 times. 300 00:17:43,370 --> 00:17:51,410 So you take it to that power, and then P2 to the Kp2, 301 00:17:51,410 --> 00:17:55,660 all the way up to Pn to the Kpn. 302 00:17:55,660 --> 00:17:56,540 OK? 303 00:17:56,540 --> 00:18:03,140 So this is the probability of getting a sequence like this. 304 00:18:05,417 --> 00:18:07,750 And what we've said is this is the only kind of sequence 305 00:18:07,750 --> 00:18:09,160 you're typically going to get. 306 00:18:09,160 --> 00:18:12,920 All the rest have very low probability of occurrence. 307 00:18:12,920 --> 00:18:15,190 So it must be that when I add up all these sequences, 308 00:18:15,190 --> 00:18:17,920 I get, essentially, probability 1. 309 00:18:17,920 --> 00:18:21,760 So the question then is how many such sequences are there. 310 00:18:21,760 --> 00:18:23,800 If a single sequence of this type 311 00:18:23,800 --> 00:18:27,690 has this probability, and the only sequences I'm going to get 312 00:18:27,690 --> 00:18:31,290 are sequences of this type effectively, 313 00:18:31,290 --> 00:18:33,220 and the probabilities have to sum to 1, 314 00:18:33,220 --> 00:18:35,680 then how many sequences do I have of this type? 315 00:18:38,740 --> 00:18:40,825 Do you agree that it's 1 over the probability? 316 00:18:43,402 --> 00:18:44,610 The number of such sequences? 317 00:18:44,610 --> 00:18:49,620 Because the number of sequences times 318 00:18:49,620 --> 00:18:53,420 this individual probability has to come out to be 1. 319 00:18:53,420 --> 00:18:54,060 Right? 320 00:18:54,060 --> 00:18:57,012 The number of sequences-- let me write this down. 321 00:18:57,012 --> 00:18:58,470 So that you see it a little better. 322 00:19:07,800 --> 00:19:12,240 The number of such-- 323 00:19:12,240 --> 00:19:14,130 let me call them typical sequences-- 324 00:19:18,550 --> 00:19:21,700 times the probability of any such sequence 325 00:19:21,700 --> 00:19:24,560 has got to be approximately 1. 326 00:19:24,560 --> 00:19:27,420 I say approximately because there are a few other sequences 327 00:19:27,420 --> 00:19:28,740 whose probabilities I would 328 00:19:28,740 --> 00:19:30,430 have to take account of.
329 00:19:30,430 --> 00:19:32,210 But this is essentially it. 330 00:19:32,210 --> 00:19:35,300 So the number of such sequences is 1 over this number. 331 00:19:35,300 --> 00:19:43,530 So the number of such sequences is P1 to the minus Kp1, 332 00:19:43,530 --> 00:19:51,520 times P2 to the minus Kp2, and so on. 333 00:19:56,600 --> 00:19:58,297 That's the number of such sequences. 334 00:19:58,297 --> 00:20:00,130 And essentially, these are all the sequences 335 00:20:00,130 --> 00:20:01,047 that I'm going to get. 336 00:20:04,120 --> 00:20:06,070 Well, if I take the log of this-- 337 00:20:11,210 --> 00:20:13,190 just visualize how the log works. 338 00:20:13,190 --> 00:20:15,850 Now I've got the log of a product, 339 00:20:15,850 --> 00:20:18,480 so that's going to be a sum of the individual logs. 340 00:20:18,480 --> 00:20:20,300 I've got the log of a power of something, 341 00:20:20,300 --> 00:20:24,340 so the power will come down to multiply the log. 342 00:20:24,340 --> 00:20:29,870 This comes out to be K times H of S exactly. 343 00:20:29,870 --> 00:20:33,410 OK, so the log of the number of sequences 344 00:20:33,410 --> 00:20:37,940 is K times H of S, K times the entropy. 345 00:20:40,930 --> 00:20:44,310 This is log to the base 2. 346 00:20:44,310 --> 00:20:52,050 So the number of sequences is equal to 2 to the KH. 347 00:20:52,050 --> 00:20:53,225 I'm saying equal to. 348 00:20:53,225 --> 00:20:54,600 I should be putting approximately 349 00:20:54,600 --> 00:20:57,700 equal to signs everywhere, but you get the idea. 350 00:20:57,700 --> 00:21:03,150 So the number of typical sequences is 2 to the KH. 351 00:21:03,150 --> 00:21:07,290 How many binary digits does it take to count 2 to the KH things? 352 00:21:11,210 --> 00:21:12,760 KH, right? 353 00:21:12,760 --> 00:21:13,915 So what I need is-- 354 00:21:16,954 --> 00:21:26,890 so I just need K times H of S binary digits to count the typical sequences. 355 00:21:34,980 --> 00:21:37,935 So how many binary digits do I need per symbol? 356 00:21:40,790 --> 00:21:42,500 It's just that divided by K because I've 357 00:21:42,500 --> 00:21:44,370 got a string of K symbols. 358 00:21:44,370 --> 00:21:48,450 So I need a number of binary digits equal to the entropy. 359 00:21:48,450 --> 00:21:51,320 So this is a quick way of seeing that entropy 360 00:21:51,320 --> 00:21:54,950 is very relevant to minimal coding of sequences of outputs 361 00:21:54,950 --> 00:21:57,330 from a source like this. 362 00:21:57,330 --> 00:22:01,500 All right, now I swept a lot of math under the rug. 363 00:22:01,500 --> 00:22:04,650 The math that makes this rigorous exists. 364 00:22:04,650 --> 00:22:08,490 We don't want to have any part of it here. 365 00:22:08,490 --> 00:22:10,110 But for those of you who are inclined, 366 00:22:10,110 --> 00:22:14,390 you can look in a book on information theory. 367 00:22:14,390 --> 00:22:15,530 There's a nice name for it. 368 00:22:15,530 --> 00:22:24,740 It's called the Asymptotic 369 00:22:24,740 --> 00:22:32,220 Equipartition Property. 370 00:22:32,220 --> 00:22:34,300 OK? 371 00:22:34,300 --> 00:22:37,010 It's basically saying that, asymptotically, the probability 372 00:22:37,010 --> 00:22:39,140 partitions into equal probabilities for all 373 00:22:39,140 --> 00:22:41,380 these typical sequences. 374 00:22:41,380 --> 00:22:41,880 All right. 375 00:22:45,560 --> 00:22:52,430 So all that is for Huffman and its application 376 00:22:52,430 --> 00:22:57,680 to symbols emitted independently by a source over time.
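As a numerical illustration of this counting argument, here is a sketch with a made-up two-symbol source; it shows how few binary digits per symbol the typical set needs, and how tiny that set is compared with all 2 to the K strings.

```python
from math import log2

def entropy(probs):
    # H = sum of P_i * log2(1 / P_i).
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Illustrative source: two symbols with probabilities 0.9 and 0.1.
H = entropy([0.9, 0.1])
K = 100

print(H)                 # about 0.469 bits per symbol
print(K * H)             # about 46.9 binary digits cover the typical set
print(2 ** (K * H - K))  # typical set / all 2^K strings: about 1e-16
```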
377 00:22:57,680 --> 00:22:59,960 But there are limitations to this. 378 00:22:59,960 --> 00:23:05,270 We've been working with Huffman coding under the assumption 379 00:23:05,270 --> 00:23:07,340 that the probabilities are given to us. 380 00:23:07,340 --> 00:23:10,010 But it's typically the case that the probabilities are not 381 00:23:10,010 --> 00:23:15,170 known for some arbitrary source that you're trying to code for. 382 00:23:15,170 --> 00:23:17,420 The probabilities might change with time as the source 383 00:23:17,420 --> 00:23:18,830 characteristics change. 384 00:23:18,830 --> 00:23:22,040 So you would need to detect that and recode, 385 00:23:22,040 --> 00:23:24,430 if you're going to do Huffman. 386 00:23:24,430 --> 00:23:26,990 And then the more important point 387 00:23:26,990 --> 00:23:30,350 perhaps is that sources are generally not iid. 388 00:23:30,350 --> 00:23:32,690 The sources of interest are not really 389 00:23:32,690 --> 00:23:36,950 generating independent identically 390 00:23:36,950 --> 00:23:38,870 distributed symbols. 391 00:23:38,870 --> 00:23:42,810 What's perhaps more true is that-- 392 00:23:42,810 --> 00:23:43,310 let's see. 393 00:23:43,310 --> 00:23:50,600 Oh, here-- once you're done compressing your source 394 00:23:50,600 --> 00:23:52,910 to binary digits where each binary digit carries 395 00:23:52,910 --> 00:23:54,950 a bit of information, then you've 396 00:23:54,950 --> 00:24:02,210 got something that essentially is not correlated over time. 397 00:24:02,210 --> 00:24:04,400 You've managed to kind of decouple it. 398 00:24:04,400 --> 00:24:08,660 But before that compression, these symbols are not really independent 399 00:24:08,660 --> 00:24:10,890 in typical cases of interest. 400 00:24:10,890 --> 00:24:15,110 So one important case, of course, is just English text. 401 00:24:15,110 --> 00:24:17,810 You can still code it symbol by symbol, 402 00:24:17,810 --> 00:24:19,910 but it's a very inefficient coding. 403 00:24:19,910 --> 00:24:22,420 If you wanted to do it symbol by symbol, 404 00:24:22,420 --> 00:24:24,140 let's just ignore uppercase. 405 00:24:24,140 --> 00:24:27,010 You've got 26 letters plus a space. 406 00:24:27,010 --> 00:24:30,110 So that's 27 symbols. 407 00:24:30,110 --> 00:24:32,810 Well, you could certainly code that with five binary digits 408 00:24:32,810 --> 00:24:36,050 because that would give you 32 things to count. 409 00:24:36,050 --> 00:24:38,630 You can do better with a code that 410 00:24:38,630 --> 00:24:40,280 approaches the entropy associated 411 00:24:40,280 --> 00:24:42,710 with a source of this type. 412 00:24:42,710 --> 00:24:45,730 That would be 4.755 bits. 413 00:24:45,730 --> 00:24:50,780 OK, so if you ignored dependence in English text 414 00:24:50,780 --> 00:24:54,620 and just treated each symbol as equally likely, 415 00:24:54,620 --> 00:24:56,152 you'd say that that's the entropy, 416 00:24:56,152 --> 00:24:58,610 and you could attempt to code it with something approaching 417 00:24:58,610 --> 00:24:59,110 that. 418 00:24:59,110 --> 00:25:02,310 But actually, not all symbols are equally likely. 419 00:25:02,310 --> 00:25:04,700 If you look at a typical distribution of frequencies-- 420 00:25:04,700 --> 00:25:07,010 and we saw this with Morse already. 421 00:25:07,010 --> 00:25:12,660 The E is much more common, then T, then A, O, I, N, and so on. 422 00:25:12,660 --> 00:25:16,160 So there is a distribution to this.
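To see where figures like these come from, here is a sketch that computes both the equally-likely number and a frequency-weighted entropy; the letter frequencies below are commonly quoted approximate values for English, included purely as an illustration (the figure quoted next in the lecture presumably comes from its own table, which would also include the space).

```python
from math import log2

# Approximate relative frequencies of English letters, in percent.
# Illustrative values only -- not the table used in the lecture.
freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
        'q': 0.10, 'z': 0.07}

total = sum(freq.values())
probs = [f / total for f in freq.values()]

print(log2(27))                             # about 4.755: 27 equally likely symbols
print(sum(p * log2(1 / p) for p in probs))  # about 4.2: weighting by frequency helps
```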
423 00:25:16,160 --> 00:25:20,300 But you can take account of that distribution and compute 424 00:25:20,300 --> 00:25:23,960 the associated entropy, and you get something a little bit 425 00:25:23,960 --> 00:25:27,590 smaller, 4.177 instead of the 4.7-something that we 426 00:25:27,590 --> 00:25:29,480 had before. 427 00:25:29,480 --> 00:25:31,610 Because not all letters are equally likely. 428 00:25:31,610 --> 00:25:35,720 But this is still thinking of it symbol by symbol, 429 00:25:35,720 --> 00:25:38,390 not recognizing dependence over time. 430 00:25:41,640 --> 00:25:45,230 But English and other languages are full of context. 431 00:25:45,230 --> 00:25:45,730 Right? 432 00:25:45,730 --> 00:25:50,260 If you know the preceding part of the text, 433 00:25:50,260 --> 00:25:55,100 you have a very good way to guess the next letter. 434 00:25:55,100 --> 00:25:57,070 Nothing can be said to be certain except death 435 00:25:57,070 --> 00:25:58,528 and-- well, you can-- in this case, 436 00:25:58,528 --> 00:26:00,260 you can give me the next three letters. 437 00:26:00,260 --> 00:26:01,330 Right? 438 00:26:01,330 --> 00:26:02,060 Anyone? 439 00:26:02,060 --> 00:26:02,430 AUDIENCE: It's taxes. 440 00:26:02,430 --> 00:26:03,388 PROFESSOR: Taxes, yeah. 441 00:26:07,440 --> 00:26:10,520 So even though X taken in isolation 442 00:26:10,520 --> 00:26:12,740 has a very low probability of occurrence, 443 00:26:12,740 --> 00:26:15,290 if you look at the histogram on the previous page, 444 00:26:15,290 --> 00:26:19,190 you see that the probability is 0.0017. 445 00:26:19,190 --> 00:26:21,273 Letters are not independently generated. 446 00:26:21,273 --> 00:26:23,690 Now, it turns out Shannon was actually one of the earliest 447 00:26:23,690 --> 00:26:27,300 to study this in experiments with his wife. 448 00:26:27,300 --> 00:26:30,680 He presented her with bits of text 449 00:26:30,680 --> 00:26:32,145 from one particular book and asked 450 00:26:32,145 --> 00:26:33,770 her to guess the next letter and so on. 451 00:26:33,770 --> 00:26:37,820 And he had a 1951 paper that actually launched 452 00:26:37,820 --> 00:26:39,680 a lot of this, because he had now developed 453 00:26:39,680 --> 00:26:42,170 the tools for talking about it. 454 00:26:42,170 --> 00:26:45,100 His estimate was much lower than the 4-point-something. 455 00:26:45,100 --> 00:26:49,280 It was more in the vicinity of one bit, 1 to 1.5 bits. 456 00:26:49,280 --> 00:26:55,100 So there's a lot of compression possible with English text 457 00:26:55,100 --> 00:26:57,510 because there's this kind of a dependence here. 458 00:27:06,500 --> 00:27:09,250 And just to tell you what it is that we're 459 00:27:09,250 --> 00:27:11,980 trying to compute when we compute entropy 460 00:27:11,980 --> 00:27:14,200 for these long sequences of symbols, 461 00:27:14,200 --> 00:27:18,700 we're sort of saying what's the joint entropy of a sequence 462 00:27:18,700 --> 00:27:24,440 of K symbols divided by K in the limit of K going to infinity. 463 00:27:24,440 --> 00:27:27,490 So this is what you might call H under bar. 464 00:27:27,490 --> 00:27:28,990 It's not over bar because I couldn't 465 00:27:28,990 --> 00:27:31,030 see how to do an over bar on my PowerPoint. 466 00:27:31,030 --> 00:27:34,897 But it's usually an over bar in the books. 467 00:27:34,897 --> 00:27:36,730 But this is really the object that you would 468 00:27:36,730 --> 00:27:38,250 like to get your hands on.
469 00:27:38,250 --> 00:27:41,470 For sequential text that has context in it, 470 00:27:41,470 --> 00:27:43,510 this is the kind of entropy that you really 471 00:27:43,510 --> 00:27:46,480 would like to be working with. 472 00:27:46,480 --> 00:27:47,800 OK. 473 00:27:47,800 --> 00:27:51,130 So that brings us to an approach to coding 474 00:27:51,130 --> 00:27:53,092 that's really focused-- 475 00:27:53,092 --> 00:27:54,550 coding or compression that's really 476 00:27:54,550 --> 00:27:56,180 focused on sequential text. 477 00:27:56,180 --> 00:28:00,740 And this is the Lempel-Ziv-Welch algorithm that's in the notes. 478 00:28:00,740 --> 00:28:03,610 Turns out that Lempel and Ziv or Ziv and Lempel 479 00:28:03,610 --> 00:28:05,860 had two earlier papers. 480 00:28:05,860 --> 00:28:09,640 And then Welch improved on it in an '84 paper. 481 00:28:09,640 --> 00:28:12,700 And what's in blue over there is a bit of a mouthful. 482 00:28:12,700 --> 00:28:14,840 And each word kind of means something, 483 00:28:14,840 --> 00:28:17,320 so I thought I'd list it all there. 484 00:28:17,320 --> 00:28:19,600 Maybe I've used too many of these words-- 485 00:28:19,600 --> 00:28:23,470 universal lossless compression of sequential or streaming data 486 00:28:23,470 --> 00:28:25,630 by adaptive variable length coding. 487 00:28:25,630 --> 00:28:29,580 And I'll come to talk about those terms on the next slide. 488 00:28:29,580 --> 00:28:32,500 And it turns out that this is a very widely used compression 489 00:28:32,500 --> 00:28:35,050 algorithm for all sorts of files. 490 00:28:35,050 --> 00:28:36,880 Sometimes it's for a part of it. 491 00:28:36,880 --> 00:28:38,650 Sometimes it's optional. 492 00:28:38,650 --> 00:28:40,300 Sometimes it's combined with Huffman, 493 00:28:40,300 --> 00:28:43,870 but all of these things that do compression 494 00:28:43,870 --> 00:28:48,880 pay homage to Lempel and Ziv at least. 495 00:28:48,880 --> 00:28:50,260 They were also patented. 496 00:28:50,260 --> 00:28:54,565 Actually, Unisys owned the patent on LZW for many years. 497 00:28:54,565 --> 00:28:55,690 These have all expired now. 498 00:28:55,690 --> 00:29:00,370 But while the patents were held, it made for a lot of heartburn 499 00:29:00,370 --> 00:29:02,697 because there were things being done 500 00:29:02,697 --> 00:29:04,780 without knowledge of the existence of the patents. 501 00:29:04,780 --> 00:29:09,230 And then people got hit with lawsuits and so on. 502 00:29:09,230 --> 00:29:14,200 Jacob Ziv, again part of this incredible heritage from MIT 503 00:29:14,200 --> 00:29:17,140 of people working here in the early days of information 504 00:29:17,140 --> 00:29:17,950 theory. 505 00:29:17,950 --> 00:29:20,560 He was a graduate student here at the same time as Huffman 506 00:29:20,560 --> 00:29:23,500 and many other people whose names surface in all of this. 507 00:29:26,470 --> 00:29:29,020 I was actually at an award ceremony of the IEEE, 508 00:29:29,020 --> 00:29:32,870 where Lempel got an award for his compression work. 509 00:29:32,870 --> 00:29:36,980 And people were given a whole minute for a thank you speech, 510 00:29:36,980 --> 00:29:37,990 a mini thank you speech. 511 00:29:37,990 --> 00:29:41,860 And everyone took their minute to mention this person and that 512 00:29:41,860 --> 00:29:43,820 and talk about the origins of the work. 513 00:29:43,820 --> 00:29:45,403 It's a lot to say in a minute but they 514 00:29:45,403 --> 00:29:47,440 managed to convey a lot. 
515 00:29:47,440 --> 00:29:49,582 Lempel came up and said, "thank you." 516 00:29:49,582 --> 00:29:51,220 [LAUGHTER] 517 00:29:51,220 --> 00:29:53,220 It seemed kind of fitting for someone whose life 518 00:29:53,220 --> 00:29:54,303 is devoted to compression. 519 00:29:54,303 --> 00:29:57,790 [LAUGHTER] 520 00:29:57,790 --> 00:30:00,610 I just couldn't help but crack up there. 521 00:30:00,610 --> 00:30:04,060 That was-- all right. 522 00:30:04,060 --> 00:30:05,590 Now the interesting thing about this 523 00:30:05,590 --> 00:30:09,112 is that there are theoretical guarantees 524 00:30:09,112 --> 00:30:10,570 that, under appropriate assumptions 525 00:30:10,570 --> 00:30:14,290 on the source, asymptotically, this process will 526 00:30:14,290 --> 00:30:16,630 attain that bound. 527 00:30:16,630 --> 00:30:21,160 Now the thing is, the word asymptotically hides many sins. 528 00:30:21,160 --> 00:30:23,490 Lots of things happen at infinity that 529 00:30:23,490 --> 00:30:25,802 are not supposed to happen. 530 00:30:25,802 --> 00:30:27,760 Or lots of things happen at infinity that never 531 00:30:27,760 --> 00:30:29,290 happen when you're watching. 532 00:30:29,290 --> 00:30:31,870 So the theoretical performance perhaps 533 00:30:31,870 --> 00:30:34,720 is not as important as the fact that it works exceedingly well 534 00:30:34,720 --> 00:30:36,565 in practice. 535 00:30:36,565 --> 00:30:38,440 So we're going to talk a little bit about it. 536 00:30:38,440 --> 00:30:39,910 You've got a lab on it as well. 537 00:30:44,280 --> 00:30:48,050 So let me just say a little bit about what these words mean. 538 00:30:48,050 --> 00:30:50,150 So this is universal in the sense 539 00:30:50,150 --> 00:30:51,830 that it doesn't necessarily-- it doesn't 540 00:30:51,830 --> 00:30:54,230 need any knowledge of the particular statistics 541 00:30:54,230 --> 00:30:55,730 of the source that it's compressing. 542 00:30:55,730 --> 00:30:59,830 It's willing to try its hand at anything. 543 00:30:59,830 --> 00:31:01,790 OK? 544 00:31:01,790 --> 00:31:04,040 So it doesn't need to know the source statistics. 545 00:31:04,040 --> 00:31:05,990 It actually learns the source statistics 546 00:31:05,990 --> 00:31:09,650 in the course of implementing the algorithm. 547 00:31:09,650 --> 00:31:11,870 And it does that by actually building up 548 00:31:11,870 --> 00:31:14,300 a dictionary for strings of symbols 549 00:31:14,300 --> 00:31:16,790 that it discovers in the source text. 550 00:31:16,790 --> 00:31:21,500 So it's built around construction of a dictionary. 551 00:31:21,500 --> 00:31:23,960 What it then does is it compresses the text, 552 00:31:23,960 --> 00:31:27,950 not to things that we've seen here in Huffman, 553 00:31:27,950 --> 00:31:29,800 but actually to dictionary entries. 554 00:31:29,800 --> 00:31:32,210 So it's sort of like Morse's original idea, 555 00:31:32,210 --> 00:31:34,850 which was communicate the address in the dictionary 556 00:31:34,850 --> 00:31:36,660 rather than communicating the word itself 557 00:31:36,660 --> 00:31:38,900 or some compressed version of the word. 558 00:31:38,900 --> 00:31:40,712 So it compresses the text to sequences 559 00:31:40,712 --> 00:31:42,920 of dictionary addresses, and those are the code words 560 00:31:42,920 --> 00:31:46,650 that it sends to the receiver. 561 00:31:46,650 --> 00:31:49,380 It's also a variable length compression scheme.
562 00:31:49,380 --> 00:31:51,660 But it's interesting that it doesn't 563 00:31:51,660 --> 00:31:56,150 map fixed-length blocks of symbols to variable-length code 564 00:31:56,150 --> 00:31:56,650 words. 565 00:31:56,650 --> 00:31:58,140 It actually maps variable-length strings 566 00:31:58,140 --> 00:32:00,030 of symbols to fixed-length code words. 567 00:32:00,030 --> 00:32:01,680 So it's a little bit backwards. 568 00:32:01,680 --> 00:32:05,700 But it's still a variable length code in that sense. 569 00:32:05,700 --> 00:32:09,150 So the way this works is that the sender and the receiver 570 00:32:09,150 --> 00:32:12,210 start off with a core dictionary that they've both agreed on. 571 00:32:12,210 --> 00:32:17,790 And for our illustrations, we might 572 00:32:17,790 --> 00:32:20,100 say that they've agreed on the letters A 573 00:32:20,100 --> 00:32:30,390 through Z, lowercase A through Z. 574 00:32:30,390 --> 00:32:33,390 So what they have is these letters or this core dictionary 575 00:32:33,390 --> 00:32:35,340 stored in some register. 576 00:32:35,340 --> 00:32:39,280 Well, actually let me show you what it might look like. 577 00:32:39,280 --> 00:32:42,690 So there's the register-- let's say 578 00:32:42,690 --> 00:32:45,210 you have an 8-bit table. 579 00:32:45,210 --> 00:32:47,490 This is the dictionary that you have at both ends. 580 00:32:47,490 --> 00:32:50,550 So you can store 256 different things. 581 00:32:50,550 --> 00:32:54,000 And you've both agreed on what's going to go into those slots. 582 00:32:54,000 --> 00:32:56,700 So somewhere-- I think it's slot 97 in one 583 00:32:56,700 --> 00:32:59,820 of these particular codes-- you've got the letter A. 584 00:32:59,820 --> 00:33:01,920 And somewhere else you've got B, and so on. 585 00:33:01,920 --> 00:33:03,570 Or in the next position you've got B. 586 00:33:03,570 --> 00:33:06,840 You can store a bunch of standard symbols. 587 00:33:06,840 --> 00:33:09,090 So we'll consider that all the single letter 588 00:33:09,090 --> 00:33:14,580 symbols are already stored in designated positions 589 00:33:14,580 --> 00:33:17,880 in this dictionary that's known to the sender and the receiver. 590 00:33:17,880 --> 00:33:24,180 So if the sender just sends 252, the receiver 591 00:33:24,180 --> 00:33:27,030 knows what 252 refers to because they've 592 00:33:27,030 --> 00:33:29,963 got that core dictionary that they've agreed on. 593 00:33:29,963 --> 00:33:31,380 Some of the text here, by the way, 594 00:33:31,380 --> 00:33:34,710 is stuff I've said already. 595 00:33:34,710 --> 00:33:35,850 So I'll actually go back.
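In code, that agreed-on core dictionary is just a 256-slot table with one single-character string per slot; a minimal sketch, assuming ASCII slot positions:

```python
# Both sender and receiver start from the same 256-entry core dictionary,
# one single-character string per slot (under ASCII, 'a' lands in slot 97).
encode_table = {chr(i): i for i in range(256)}   # string -> address
decode_table = {i: chr(i) for i in range(256)}   # address -> string
next_code = 256                                  # first free slot for learned strings

print(encode_table['a'], decode_table[97])       # 97 a
```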
608 00:34:30,060 --> 00:34:34,400 So the transmitter or the source is building up the dictionary, 609 00:34:34,400 --> 00:34:39,380 looking at strings in the input sequence, 610 00:34:39,380 --> 00:34:42,080 communicating the address-- 611 00:34:42,080 --> 00:34:46,250 the addresses of the appropriate strings to the receiver, 612 00:34:46,250 --> 00:34:48,843 and the receiver is building up a dictionary in parallel. 613 00:34:48,843 --> 00:34:50,510 Now I think the easiest way to do this-- 614 00:34:50,510 --> 00:34:52,760 there's discussion in the text. 615 00:34:52,760 --> 00:34:54,050 There's also code fragments. 616 00:34:54,050 --> 00:34:55,467 But I think the easiest way for me 617 00:34:55,467 --> 00:34:57,980 to try and do this is to actually just show you 618 00:34:57,980 --> 00:35:01,900 how it works on a particular sequence. 619 00:35:04,450 --> 00:35:07,290 And you may not get all the details all at once. 620 00:35:07,290 --> 00:35:09,810 I do have a little animation that I need to tweak a bit, 621 00:35:09,810 --> 00:35:10,780 and I'll-- 622 00:35:10,780 --> 00:35:12,930 well, it's not an animation, but a set of slides 623 00:35:12,930 --> 00:35:15,030 that'll help you understand, actually, 624 00:35:15,030 --> 00:35:16,990 this particular example. 625 00:35:16,990 --> 00:35:18,510 So I'll have that posted as well. 626 00:35:22,022 --> 00:35:23,730 But for now, let's just work through this 627 00:35:23,730 --> 00:35:27,520 and see what it looks like. 628 00:35:27,520 --> 00:35:30,250 And I hope I don't trip over myself in the process. 629 00:35:30,250 --> 00:35:31,617 I hope you'll be forgiving. 630 00:35:41,150 --> 00:35:42,900 And I need these two blackboards to do it. 631 00:35:48,020 --> 00:35:48,520 OK. 632 00:35:51,570 --> 00:35:52,820 And I need some colored chalk. 633 00:35:56,320 --> 00:36:01,670 So what I'm going to have over here is the source. 634 00:36:01,670 --> 00:36:04,030 And over here is the receiver. 635 00:36:09,720 --> 00:36:16,590 And the source wants to send a message that I'll put here-- 636 00:36:16,590 --> 00:36:27,955 A-B-C. This is going to look incredibly boring. 637 00:36:31,030 --> 00:36:33,620 But the algorithm does different things at different stages, 638 00:36:33,620 --> 00:36:35,470 so that keeps it interesting. 639 00:36:35,470 --> 00:36:38,800 And let's see 1, 2, 3, 4, 5. 640 00:36:38,800 --> 00:36:42,937 And then we hit a special case somewhere near the end here 641 00:36:42,937 --> 00:36:44,020 that is worth sorting out. 642 00:36:44,020 --> 00:36:45,730 Because otherwise the fragment 643 00:36:45,730 --> 00:36:47,900 of the code that you see doesn't make sense. 644 00:36:47,900 --> 00:36:53,080 Gee, can you believe that I want to start this again here? 645 00:36:53,080 --> 00:36:53,740 Sorry. 646 00:36:53,740 --> 00:36:55,360 Let's start here. 647 00:36:55,360 --> 00:36:57,385 I want at least six replications of ABC. 648 00:37:04,823 --> 00:37:06,240 I want you to get comfortable also 649 00:37:06,240 --> 00:37:07,590 so you can settle into this. 650 00:37:11,900 --> 00:37:16,100 OK, here we go. 651 00:37:16,100 --> 00:37:18,630 All right. 652 00:37:18,630 --> 00:37:21,330 The receiver has no idea that this is the sequence. 653 00:37:21,330 --> 00:37:23,550 The source and the receiver both 654 00:37:23,550 --> 00:37:27,420 have A through Z sitting in their dictionaries 655 00:37:27,420 --> 00:37:30,970 at designated locations.
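Before the hand-worked trace, here is a minimal Python sketch of the whole scheme, encoder and decoder, including the special case that shows up near the end of the example; the function names are mine, and lowercase text stands in for the agreed core dictionary.

```python
def lzw_encode(text):
    # Shared core dictionary: every single character is pre-loaded.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    out = []
    s = ""
    for ch in text:
        if s + ch in table:
            s += ch                      # known string: keep extending it
        else:
            out.append(table[s])         # ship the address of the known part
            table[s + ch] = next_code    # learn the longer string
            next_code += 1
            s = ch                       # restart from the current character
    if s:
        out.append(table[s])             # flush the final string
    return out

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Special case: the sender used an entry it created one step
            # ago, which we haven't learned yet.  That string must start
            # with the first character of the previous string.
            entry = prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]   # learn with a one-step delay
        next_code += 1
        prev = entry
    return "".join(out)

codes = lzw_encode("abcabcabcabcabcabc")
print(codes)               # 9 addresses for 18 characters
print(lzw_decode(codes))   # abcabcabcabcabcabc
```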
656 00:37:30,970 --> 00:37:37,649 So the source will first see the letter A 657 00:37:37,649 --> 00:37:40,160 and do nothing, because A is in its dictionary. 658 00:37:40,160 --> 00:37:43,010 It doesn't want to do anything yet. 659 00:37:43,010 --> 00:37:44,150 Then it looks at-- 660 00:37:44,150 --> 00:37:46,775 it pulls in B. So now it's looking at AB. 661 00:37:46,775 --> 00:37:49,700 AB is not in its dictionary because it's a symbol of-- 662 00:37:49,700 --> 00:37:52,410 it's a string of two symbols. 663 00:37:52,410 --> 00:37:55,550 So now it knows it needs to make a dictionary entry. 664 00:37:55,550 --> 00:37:58,560 I'm going to indicate dictionary entry with this. 665 00:37:58,560 --> 00:38:02,660 So the source is going to make a dictionary entry of AB. 666 00:38:02,660 --> 00:38:05,420 So what this means is somewhere in that register 667 00:38:05,420 --> 00:38:07,760 in a particular position, or in the next position 668 00:38:07,760 --> 00:38:13,490 actually from the agreed on table, it sticks in this. 669 00:38:13,490 --> 00:38:18,980 And then what it transmits to the receiver is not this, 670 00:38:18,980 --> 00:38:25,260 but the code for A. OK? 671 00:38:25,260 --> 00:38:29,280 So it enters the longer fragment here as a new dictionary word 672 00:38:29,280 --> 00:38:34,580 and sends the address for the piece that the receiver already knows. 673 00:38:34,580 --> 00:38:36,120 So what does the receiver get? 674 00:38:36,120 --> 00:38:40,140 The receiver sees A coming in and says, OK, 675 00:38:40,140 --> 00:38:44,530 that's the sequence A. That's the symbol. 676 00:38:44,530 --> 00:38:46,190 A, I'm all set. 677 00:38:46,190 --> 00:38:48,930 All right? 678 00:38:48,930 --> 00:38:54,420 Now what happens is that the source 679 00:38:54,420 --> 00:38:56,190 pulls in the next letter. 680 00:38:56,190 --> 00:38:59,040 It's done with the A, so you can essentially forget about that. 681 00:38:59,040 --> 00:39:00,900 It pulls in the next letter. 682 00:39:00,900 --> 00:39:05,280 Looks to see if it's got B-C in its dictionary. 683 00:39:05,280 --> 00:39:08,880 It doesn't have BC because it only has single letter entries, 684 00:39:08,880 --> 00:39:10,280 and it has AB. 685 00:39:10,280 --> 00:39:12,660 So it's got to put in BC. 686 00:39:12,660 --> 00:39:14,670 So it's going to put in an entry for BC. 687 00:39:19,880 --> 00:39:28,170 And then what it's going to transmit is the B. 688 00:39:28,170 --> 00:39:33,900 The receiver gets the B. Oh, sorry-- 689 00:39:33,900 --> 00:39:37,620 the dictionary entry for B. And so it knows that's the letter 690 00:39:37,620 --> 00:39:42,990 B. And now it enters AB 691 00:39:47,670 --> 00:39:51,010 in its dictionary, OK, in the next location. 692 00:39:51,010 --> 00:39:52,920 So you see, with a one-step delay, 693 00:39:52,920 --> 00:39:55,350 the AB that was in the dictionary here 694 00:39:55,350 --> 00:39:57,590 has ended up in the dictionary of the receiver. 695 00:40:00,740 --> 00:40:03,770 OK, we're done with this. 696 00:40:03,770 --> 00:40:06,830 We now pull in the next letter here. 697 00:40:06,830 --> 00:40:09,140 That's A. We haven't seen 698 00:40:09,140 --> 00:40:10,950 CA in our dictionary. 699 00:40:10,950 --> 00:40:19,550 So we make an entry for CA, ship out C. C comes here. 700 00:40:23,700 --> 00:40:29,100 I should say that this was done with the A. The C comes here, 701 00:40:29,100 --> 00:40:34,020 and the receiver knows to make an entry for BC. 702 00:40:38,630 --> 00:40:39,890 So with one delay it's got it.
703 00:40:43,160 --> 00:40:44,675 OK, we're done with this. 704 00:40:48,910 --> 00:40:51,670 We pull in the next letter, AB. 705 00:40:51,670 --> 00:40:52,910 That's in our dictionary. 706 00:40:52,910 --> 00:40:54,880 So we keep going, all right? 707 00:40:54,880 --> 00:40:59,740 So this algorithm doesn't look to ship out the dictionary 708 00:40:59,740 --> 00:41:02,718 address every time it sees a sequence that it recognizes. 709 00:41:02,718 --> 00:41:04,510 If it's got this already in its dictionary, 710 00:41:04,510 --> 00:41:07,420 it keeps going to try and learn a new word. 711 00:41:07,420 --> 00:41:09,430 So it's already got AB there, so it keeps going 712 00:41:09,430 --> 00:41:13,610 and it pulls in C. And now that's a new word. 713 00:41:13,610 --> 00:41:20,170 So it's got ABC as a new entry. 714 00:41:20,170 --> 00:41:23,050 It ships out AB-- 715 00:41:23,050 --> 00:41:24,280 the address for AB rather. 716 00:41:30,770 --> 00:41:37,650 This gets the address for AB, which is in its dictionary. 717 00:41:37,650 --> 00:41:39,105 It puts the AB down there. 718 00:41:41,760 --> 00:41:43,650 It takes the first letter of the string that 719 00:41:43,650 --> 00:41:45,280 came in and appends it to the last one 720 00:41:45,280 --> 00:41:46,905 that it had there and gives you the CA. 721 00:41:50,730 --> 00:41:53,340 So you see, it's keeping up but with a one-step delay. 722 00:41:56,220 --> 00:41:57,120 Let's keep going. 723 00:41:57,120 --> 00:42:00,590 So the AB is done with. 724 00:42:00,590 --> 00:42:04,234 We pull in A. We've got CA. 725 00:42:04,234 --> 00:42:08,260 We pull in the B. We don't have CAB, 726 00:42:08,260 --> 00:42:09,590 so let's enter that as well. 727 00:42:12,120 --> 00:42:14,120 By the time we've done this example, by the way, 728 00:42:14,120 --> 00:42:17,030 I'm hoping you'll know Lempel-Ziv. 729 00:42:17,030 --> 00:42:18,770 So bear with me. 730 00:42:21,680 --> 00:42:23,260 All right, dictionary entry-- and now 731 00:42:23,260 --> 00:42:25,895 what does it send out to the receiver? 732 00:42:25,895 --> 00:42:26,770 AUDIENCE: [INAUDIBLE] 733 00:42:26,770 --> 00:42:27,478 PROFESSOR: Sorry. 734 00:42:27,478 --> 00:42:28,120 AUDIENCE: C2 735 00:42:28,120 --> 00:42:35,030 PROFESSOR: CA-- the address for CA, right? 736 00:42:35,030 --> 00:42:36,055 The address for CA. 737 00:42:36,055 --> 00:42:37,700 So the address for CA comes in. 738 00:42:40,330 --> 00:42:42,160 It decodes the CA. 739 00:42:48,450 --> 00:42:49,740 And so let's see. 740 00:42:49,740 --> 00:42:51,780 We're done with these pieces, but this one 741 00:42:51,780 --> 00:42:56,670 has to build up its new dictionary entry. 742 00:42:56,670 --> 00:42:59,460 And so what it's got is the AB sitting there from before, 743 00:42:59,460 --> 00:43:01,650 and it pulls in the first letter. 744 00:43:01,650 --> 00:43:03,240 Instead of wrapping to the next board, 745 00:43:03,240 --> 00:43:05,970 let me start winding up again-- 746 00:43:05,970 --> 00:43:06,690 winding upwards. 747 00:43:09,650 --> 00:43:13,240 OK, so that's the new entry there, the receiver-- 748 00:43:13,240 --> 00:43:14,470 one step delayed from here. 749 00:43:18,660 --> 00:43:22,210 OK, I pull in the C. I have BC. 750 00:43:22,210 --> 00:43:24,140 I keep going. 751 00:43:24,140 --> 00:43:28,240 I pull in the A. I don't see that. 752 00:43:28,240 --> 00:43:29,015 So I need BCA. 753 00:43:33,650 --> 00:43:35,480 I ship out the address for BC. 754 00:43:38,250 --> 00:43:40,580 So I'm done with these.
755 00:43:40,580 --> 00:43:42,260 I get the address for BC here. 756 00:43:46,130 --> 00:43:50,120 I decode and get BC. 757 00:43:50,120 --> 00:43:53,405 I combine the first letter of the new fragment 758 00:43:53,405 --> 00:43:54,530 with what was sitting here. 759 00:43:54,530 --> 00:44:03,300 So I get CAB as my dictionary entry. 760 00:44:06,720 --> 00:44:08,400 And I keep going. 761 00:44:08,400 --> 00:44:10,565 All right, it's very systematic. 762 00:44:10,565 --> 00:44:12,190 I'm going to keep going because there's 763 00:44:12,190 --> 00:44:15,820 a special case that will trip you up if you don't get to it. 764 00:44:15,820 --> 00:44:19,300 And we need to proceed a couple more steps here. 765 00:44:19,300 --> 00:44:23,680 OK, I pull in the B. I've got AB. 766 00:44:23,680 --> 00:44:25,525 I pull in the C. I've got ABC. 767 00:44:29,580 --> 00:44:32,920 I pull in the A. I don't have ABCA. 768 00:44:32,920 --> 00:44:40,327 So I enter that in my dictionary. 769 00:44:43,430 --> 00:44:44,430 And then I ship out ABC. 770 00:44:54,200 --> 00:44:57,980 OK, so you're always building a new word, entering it 771 00:44:57,980 --> 00:45:00,643 in your dictionary, and then the part that's already 772 00:45:00,643 --> 00:45:02,060 known you're shipping out and then 773 00:45:02,060 --> 00:45:04,367 hanging onto the end of this to start 774 00:45:04,367 --> 00:45:05,450 building the new fragment. 775 00:45:08,150 --> 00:45:09,350 ABC arrives here. 776 00:45:17,140 --> 00:45:18,710 I had the BC from before. 777 00:45:18,710 --> 00:45:21,890 I pull in the first letter of that, 778 00:45:21,890 --> 00:45:28,160 and I get a BCA as my new entry, which is this one. 779 00:45:32,080 --> 00:45:33,220 OK. 780 00:45:33,220 --> 00:45:35,140 Now we pull in the AB. 781 00:45:35,140 --> 00:45:36,940 I mean, we pull in the B. We have AB. 782 00:45:36,940 --> 00:45:40,000 We pull in the C. We have ABC. 783 00:45:40,000 --> 00:45:47,770 We pull in the A, we have ABCA, so we pull in the B. 784 00:45:47,770 --> 00:45:50,135 We ship out ABCA-- 785 00:45:50,135 --> 00:45:55,950 A-B-C-A. Right? 786 00:45:55,950 --> 00:45:59,620 And now we're done with all those guys. 787 00:45:59,620 --> 00:46:00,650 And here comes ABCA. 788 00:46:06,370 --> 00:46:11,910 And I go to my dictionary, and I don't have ABCA-- 789 00:46:14,840 --> 00:46:15,370 big hiccup. 790 00:46:18,840 --> 00:46:22,190 So the reason that happened is that I'm 791 00:46:22,190 --> 00:46:27,080 discovering I need to send ABCA on the very next step 792 00:46:27,080 --> 00:46:30,230 after entering it in my dictionary 793 00:46:30,230 --> 00:46:32,080 on the transmitter side. 794 00:46:32,080 --> 00:46:35,990 And so the receiver hasn't yet had a chance to catch up. 795 00:46:35,990 --> 00:46:37,700 Now if you analyze this, it turns out 796 00:46:37,700 --> 00:46:43,010 that whenever this happens, the sequence involved 797 00:46:43,010 --> 00:46:46,980 has its last character equal to its first character. 798 00:46:46,980 --> 00:46:50,510 So looking at this, the dictionary 799 00:46:50,510 --> 00:46:52,340 here is waiting to build up. 800 00:46:52,340 --> 00:46:54,050 It's got the ABC here, and it's waiting 801 00:46:54,050 --> 00:46:58,370 to pull in the first letter from the sequence-- 802 00:46:58,370 --> 00:47:00,770 the sequence associated with this dictionary entry. 803 00:47:00,770 --> 00:47:02,390 It doesn't have that dictionary entry. 804 00:47:02,390 --> 00:47:05,300 So it can't pull in the A like it was doing all along.
805 00:47:05,300 --> 00:47:07,740 But if you analyze the cases under which this happens, 806 00:47:07,740 --> 00:47:09,980 it turns out that whenever the address you receive 807 00:47:09,980 --> 00:47:12,340 isn't in your dictionary, the missing letter 808 00:47:12,340 --> 00:47:14,090 that you want to pull into your dictionary 809 00:47:14,090 --> 00:47:16,760 is the same as the first letter of the string that's 810 00:47:16,760 --> 00:47:18,320 waiting to be built up. 811 00:47:18,320 --> 00:47:22,500 So it completes it with an A, and it's all set. 812 00:47:22,500 --> 00:47:28,010 Now it has ABCA, and it continues. 813 00:47:28,010 --> 00:47:30,200 So this happens under very particular conditions. 814 00:47:30,200 --> 00:47:31,250 It's a special case. 815 00:47:31,250 --> 00:47:36,350 If you actually look at the code that's in the notes, you'll see this. 816 00:47:36,350 --> 00:47:38,700 The encoding is straightforward, 817 00:47:38,700 --> 00:47:41,630 and it's really remarkable that a code fragment 818 00:47:41,630 --> 00:47:43,100 that short can do the encoding. 819 00:47:48,600 --> 00:47:49,440 Let's see here. 820 00:47:51,830 --> 00:47:52,830 I don't want to do this. 821 00:47:52,830 --> 00:47:53,747 I did another example. 822 00:48:00,210 --> 00:48:03,410 Let me just say what's on this before I dispense with it. 823 00:48:03,410 --> 00:48:05,930 Sorry. 824 00:48:05,930 --> 00:48:06,530 OK. 825 00:48:06,530 --> 00:48:12,390 So look at what's happened. 826 00:48:12,390 --> 00:48:15,090 In terms of the number of things we've sent, 827 00:48:15,090 --> 00:48:16,447 we've only sent these addresses. 828 00:48:16,447 --> 00:48:18,030 And there are fewer of them than there 829 00:48:18,030 --> 00:48:19,462 were symbols in the original. 830 00:48:19,462 --> 00:48:21,170 So that's where the compression comes in. 831 00:48:21,170 --> 00:48:25,595 And as you get longer strings, the benefit is higher. 832 00:48:25,595 --> 00:48:27,720 Actually, I'm going to pass over this and just tell you, 833 00:48:27,720 --> 00:48:33,060 when you look through the code fragment for decoding, 834 00:48:33,060 --> 00:48:35,070 this is the special case that we talked about. 835 00:48:35,070 --> 00:48:37,020 If the code is not in your dictionary, 836 00:48:37,020 --> 00:48:38,260 then do such and such. 837 00:48:38,260 --> 00:48:40,020 So that's the explanation. 838 00:48:40,020 --> 00:48:41,142 All right. 839 00:48:41,142 --> 00:48:42,600 And that's described in the slides. 840 00:48:42,600 --> 00:48:43,980 We'll put that up. 841 00:48:43,980 --> 00:48:47,500 I just wanted to end with a couple of things. 842 00:48:47,500 --> 00:48:50,160 One is actually-- LZW is a good example of something 843 00:48:50,160 --> 00:48:53,280 that you see in other contexts as well, where you're 844 00:48:53,280 --> 00:48:55,950 faced with transmitting data and you decide instead 845 00:48:55,950 --> 00:48:58,260 that you'll transmit the model, or your best model, 846 00:48:58,260 --> 00:48:59,550 for what generates that data. 847 00:48:59,550 --> 00:49:02,220 That can often be a much more efficient way to do things. 848 00:49:02,220 --> 00:49:04,650 And in fact, when you speak into your cell phone, 849 00:49:04,650 --> 00:49:07,290 you're not transmitting a raw speech waveform. 850 00:49:07,290 --> 00:49:10,890 There's actually a very sophisticated code there 851 00:49:10,890 --> 00:49:13,380 that's modeling your speech as the output 852 00:49:13,380 --> 00:49:14,840 of an autoregressive filter.
853 00:49:14,840 --> 00:49:19,470 And then it sends the filter tap weights to the receiver. 854 00:49:19,470 --> 00:49:21,690 So this kind of thing arises again and again. 855 00:49:21,690 --> 00:49:24,330 Sending the model, and the little bit of information 856 00:49:24,330 --> 00:49:26,430 you need to run the model at the receiving end, 857 00:49:26,430 --> 00:49:28,572 can be much more efficient than sending the data. 858 00:49:28,572 --> 00:49:30,030 The other thing is, everything we've 859 00:49:30,030 --> 00:49:32,590 talked about has been lossless compression-- 860 00:49:32,590 --> 00:49:33,930 Huffman and LZW. 861 00:49:33,930 --> 00:49:37,940 You can completely recover what was compressed. 862 00:49:37,940 --> 00:49:40,700 But there's a whole world of lossy compression, 863 00:49:40,700 --> 00:49:41,700 which is very important. 864 00:49:41,700 --> 00:49:44,760 And we'll find ways to sneak in a discussion of that as well. 865 00:49:44,760 --> 00:49:46,820 All right, thank you.