Description: Can multiple weak classifiers be used to make a strong one? We examine the boosting algorithm, which adjusts the weight of each classifier, and work through the math. We end with how boosting doesn't seem to overfit, and mention some applications.
Instructor: Patrick H. Winston
Specifically developed for this lecture are the boosting notes (PDF), which are also listed among the readings for this course.
PATRICK WINSTON: We've now almost completed our journey.
This will be it for talking about several kinds of learning-- the venerable kind, that's the nearest neighbors and identification tree types of learning.
Still useful, still the right thing to do if there's no reason not to do the simple thing.
Then we have the biologically-inspired approaches.
All kinds of problems with local maxima and overfitting and oscillation, if you get the rate constant too big.
Like neural nets, both are very naive in their attempt to mimic nature.
So maybe they work on a class of problems.
They surely do each have a class of problems for which they're good.
But as a general purpose first resort, I don't recommend it.
But now the theorists have come out and done some things are very remarkable.
And in the end, you have to say, wow, these are such powerful ideas.
I wonder if nature has discovered them, too?
Is there good engineering in the brain, based on good science?
Or given the nature of evolution, is it just random junk that is the best ways for doing anything?
But today, we're going to talk about an idea that I'll bet is in there somewhere, because it's easy to implement, and it's extremely powerful in what it does, and it's the essential item in anybody's repertoire of learning mechanisms.
It's also a mechanism which, if you understand only by formula, you will never be able to work the problems on the quiz, that's for sure.
Because on the surface, it looks like it'd be very complicated to simulate this approach.
But once you understand how it works and look at a little bit of the math and let it sing songs to you, it turns out to be extremely easy.
So it's about letting multiple methods work in your behalf.
So far, we've been talking about using just one method to do something.
And what we're going to do now is we're looking to see if a crowd can be smarter than the individuals in the crowd.
But before we get too far down that abstract path, let me just say that the whole works has to do with classification, and binary classification.
Am I holding a piece of chalk in my hand, or a hand grenade?
Is that a cup of coffee or tea?
Those are binary classification problems.
And so we're going to be talking today strictly about binary classification.
We're not going to be talking about finding the right letter in the alphabet that's written on the page.
That's a 26-way choice.
We're talking about binary choices.
So we assume that there's a set of classifiers that we can draw on.
Here's one-- h.
And it produces either a minus 1 or a plus 1.
So that's how the classification is done.
If it's coffee, plus 1.
If it's tea, minus 1.
Is this chalk, plus one.
If it's a hand grenade, minus 1.
So that's how the classification works.
Now, too bad for us, normally the world doesn't give us very good classifiers.
So if we look at the error rate of this classifier or any other classifier, that error rate will range from 0 to 1 in terms of the fraction of the cases got wrong on a sample set.
So you'd like your error rate to be way down here.
You're dead if it's over there.
But what about in the middle?
What if it's, say, right there.
Just a little bit better than flipping a coin.
If it's just a little bit better than flipping a coin, that's a weak classifier.
And the question is, can you make a classifier that's way over here, like there, a strong classifier, by combining several of these weak classifiers, and letting them vote?
So how would you do that?
You might say, well, let us make a big classifier capital H, that works on some sample x, and has its output produces something that depends on the sum of the outputs of the individual classifiers.
So we have H1 working on x.
We have H2 working on x.
And we have H3 also working on x.
Let's say three of them, just to start us off.
And now let's add those guys up, and take the sign of the output.
So if two out of the three of those guys agree, then we'll get an either plus 1 or minus 1.
If all three agree, we'll get plus 1 or minus 1.
Because we're just taking the sign.
We're just taking the sign of the sum of these guys.
So this means that one guy can be wrong, as long as the other two guys are right.
But I think it's easier to see how this all works if you think of some space of samples, you say, well, let's let that area here be where H1 is wrong, and this area over here is where H2 is wrong.
And then this area over here is where H3 is wrong.
So if the situation is like that, then this formula always gives you the right answers on the samples.
I'm going to stop saying that right now, because I want to be kind of a background thing on the samples set.
We're talking about wrapping this stuff over the sample set.
Later on, we'll ask, OK, given that you trained this thing on a sample set, how well does it do on some new examples?
Because we want to ask ourselves about overfitting questions.
But for now, we just want to look and see if we believe that this arrangement, where each of these H's is producing plus 1 or minus 1, we're adding them up and taking the sign, is that going to give us a better result than the tests individually?
And if they look like this when draped over a sample set, then it's clear that we're going to get the right answer every time, because there's no area here where any two of those tests are giving us the wrong answer.
So the two that are getting the right answer, in this little circle here for H1, these other two are getting the right answer.
So they'll outvote it, and you'll get the right answer every time.
But it doesn't have to be that simple.
It could look like this.
There could be a situation where this is H1, wrong answer.
This is H2, wrong answer.
And this is H3, wrong answer.
And now the situation gets a little bit more murky, because we have to ask ourselves whether that area where three out of the three get it wrong is sufficiently big so as to be worse than 1 of the individual tests.
So if you look at that Venn diagram, and stare at it long enough, and try some things, you can say, well, there is no case where this will give a worse answer.
Or, you might end up with the conclusion that there are cases where we can arrange those circles such that the voting scheme will give an answer that's worst than an individual test, but I'm not going to tell you the answer, because I think we'll make that a quiz question.
So we'll make that a quiz question.
So that looks like a good idea.
And we can construct a little algorithm that will help us pick the particular weak classifiers to plug in here.
We've got a whole bag of classifiers.
We've got H1, we've got H2, we've got H55.
We've got a lot of them we can choose from.
So what we're going to do is we're going to use the data, undisturbed, to produce H1.
We're just going to try all the tests on the data and see which one gives us the smallest error rate.
And that's the good guy, so we're going to use that.
Then we're going to use the data with an exaggeration of H1 errors.
In other words-- this is a critical idea.
What we're going to do is we're going to run this algorithm again, but instead of just looking at the number of samples that are got wrong, what we're going to do is we're going to look at a distorted set of samples, where the ones we're not doing well on has exaggerated effect on the result.
So we're going to weight them or multiply them, or do something so that we're going to pay more attention to the samples on which H1 produces an error, and that's going to give us H2.
And then we're going to do it one more time, because we've got three things to go with here in this particular little exploratory scheme.
And this time, we're going to have an exaggeration of those samples-- which samples are we going to exaggerate now?
We might as well look for the ones where H1 gives us a different answer from H2, because we want to be on the good guy's side.
So we can say we're going to exaggerate those samples four which H1 gives us a different result from H2.
And that's going to give us H3.
So we can think of this whole works here as part one of a multi-part idea.
So let's see.
I don't know, what might be step two?
Well, this is a good idea.
Then what we've got that we can easily derive from that is a little tree looked like this.
And we can say that H of x depends on H1, H2, and H3.
But now, if that that's a good idea, and that gives a better answer than any of the individual tests, maybe we can make this idea a little bit recursive, and say, well, maybe H1 is actually not an atomic test.
But maybe it's the vote of three other tests.
So you can make a tree structure that looks like this.
So this is H11, H12, H13, and then 3 here.
And then this will be H31, H32, H33.
And so that's a sort of get out the vote idea.
We're trying to get a whole bunch of individual tests into the act.
So I guess the reason this wasn't discovered until about '10 years ago was because you've got to get so many of these desks all lined up before the idea gets through that long filter of ideas.
So that's the only idea number two of quite a few.
Well, next thing we might think is, well, we keep talking about these classifiers.
What kind of classifiers are we talking about?
I've got-- oh, shoot, I've spent my last nickel.
I don't have a coin to flip.
But that's one classifier, right?
The trouble with that classifier is it's a weak classifier, because it gives me a 50/50 chance of being right.
I guess there are conditions in which a coin flip is better than a-- it is a weak classifier.
If the two outcomes are not equally probable, than a coin flip is a perfectly good weak classifier.
But what we're going to do is we're going to think in terms of a different set of classifiers.
And we're going to call them decision tree.
Now, you remember decision trees, right?
But we're not going to build decision trees.
We're going to use decision tree stumps.
So if we have a two-dimensional space that looks like this, then a decision tree stump is a single test.
It's not a complete tree that will divide up the samples into homogeneous groups.
It's just what you can do with one test.
So each possible test is a classifier.
How many tests do we get out of that?
It doesn't look like 12 to me, either.
But here's how you get to 12.
One decision tree test you can stick in there would be that test right there.
And that would be a complete decision tree stump.
But, of course, you can also put in this one.
That would be another decision tree stump.
Now, for this one on the right, I could say, everything on the right is a minus.
Or, I could say, everything on the right is a plus.
It would happen to be wrong, but it's a valid test with a valid outcome.
So that's how we double the number of test that we have lines for.
And you know what?
can even have a kind of test out here that says everything is plus, or everything is wrong.
So for each dimension, the number of decision tree stumps is the number of lines I can put in times 2.
And then I've got two dimensions here, that's how I got to twelve.
So there are three lines.
I can have the pluses on either the left or the right side.
So that's six.
And then I've got two dimensions, so that gives me 12.
So that's the decision tree stump idea.
And here are the other decision tree boundaries, obviously just like that.
So that's one way can generate a batch of tests to try out with this idea of using a lot of tests to help you get the job done.
STUDENT: Couldn't you also have a decision tree on the right side?
PATRICK WINSTON: The question is, can you also have a test on the right side?
See, this is just a stand-in for saying, everything's plus or everything's minus.
So it doesn't matter where you put the line.
It can be on the right side, or the left side, or the bottom, or the top.
Or you don't have to put the line anywhere.
It's just an extra test, an additional to the ones you put between the samples.
So this whole idea of boosting, the main idea of the day.
Does it depend on using decision tree stumps?
The answer is no.
Do not be confused.
You can use boosting with any kind of classifier.
so why do I use decision tree stumps today?
Because it makes my life easy.
We can look at it, we can see what it's doing.
But we could put bunch of neural nets in there.
We could put a bunch of real decision trees in there.
We could put a bunch of nearest neighbor things in there.
The boosting idea doesn't care.
I just used these decision tree stumps because I and everybody else use them for illustration.
We're making progress.
Now, what's the error rate for any these tests and lines we drew?
Well, I guess it'll be the error rate is equal to the sum of 1 over n-- That's the total number of points, the number of samples-- summed over the cases where we are wrong.
So gee, we're going to work on combining some of these ideas.
And we've got this notion of exaggeration.
At some stage in what we're doing here, we're going to want to be able to exaggerate the effect of some errors relative to other errors.
So one thing we can do is we can assume, or we can stipulate, or we can assert that each of these samples has a weight associated with it.
That's W1, this is W2, and that's W3.
And in the beginning, there's no reason to suppose that any one of these is more or less important than any of the other.
So in the beginning, W sub i at time [? stub ?] one is equal to 1 over n.
So the error is just adding up the number of samples that were got wrong.
And that'll be the fraction of samples to that you didn't get right.
And that will be the error rate.
So what we want to do is we want to say, instead of using this as the error rate for all time, what we want to do is we want to move that over, and say that the error rate is equal to the sum over the things you got wrong in the current step, times the weights of those that were got wrong.
So in step one, everything's got the same weight, it doesn't matter.
But if we find a way to change their weights going downstream-- so as to, for example, highly exaggerate that third sample, then W3 will go up relative to W1 and W2.
The one thing we want to be sure of is there is no matter how we adjust the weights, that the sum of the weights over the whole space is equal to 1.
So in other words, we want to choose the weights so that they emphasize some of the samples, but we also want to put a constraint on the weights such that all of them added together is summing to one.
And we'll say that that enforces a distribution.
A distribution is a set of weights that sum to one.
Well, that's just a nice idea.
So we're make a little progress.
We've got this idea that we can add some plus/minus 1 classifiers together, you get a better classifier.
We got some idea about how to do that.
It occurs to us that maybe we want to get a lot of classifiers into the act somehow or another.
And maybe we want to think about using decision tree stumps so as to ground out thinking about all this stuff.
So the next step is to say, well, how actually should we combine this stuff?
And you will find, in the literature libraries, full of papers that do stuff like that.
And that was state of the art for quite a few years.
But then people began to say, well, maybe we can build up this classifier, H of x, in multiple steps and get a lot of classifiers into the act.
So maybe we can say that the classifier is the sign of H-- that's the one we picked first.
That's the classifier we picked first.
That's looking at samples.
And then we've got H2.
And then we've got H3.
And then we've got how many other classifiers we might want, or how many classifiers we might need in order to correctly classify everything in our sample set.
So people began to think about whether there might be an algorithm that would develop a classifier that way, one step at a time.
That's why I put that step number in the exponent, because we're picking this one at first, then we're expanding it to have two, and then we're expanding it to have three, and so on.
And each of those individual classifiers are separately looking at the sample.
But of course, it would be natural to suppose that just adding things up wouldn't be enough.
And it's not.
So it isn't too hard to invent the next idea, which is to modify this thing just a little bit by doing what?
It looks almost like a scoring polynomial, doesn't it?
So what would we do to tart this up a little bit?
PATRICK WINSTON: Come again?
PATRICK WINSTON: Somewhere out there someone's murmuring.
STUDENT: Add-- PATRICK WINSTON: Add weights!
PATRICK WINSTON: Excellent.
So what we're going to do is we're going to have alphas associated with each of these classifiers, and we're going to determine if somebody can build that kind formula to do the job.
So maybe I ought to modify this gold star idea before I get too far downstream.
And we're not going to treat everybody in a crowd equally.
We're going to wait some of the opinions more than others.
And by the way, they're all going to make errors in different parts of the space.
So maybe it's not the wisdom of even a weighted crowd, but a crowd of experts.
Each of which is good at different parts of the space.
So anyhow, we've got this formula, and there are a few things that one can say turn out.
But first, let's write down the an algorithm for what this ought to look like.
Before I run out of space, I think I'll exploit the right hand board here, and put the overall algorithm right here.
So we're going to start out by letting of all the weights at time 1 be equal to 1 over n.
That's just saying that they're all equal in the beginning, and they're equal to 1 over n.
And n is the number of samples.
And then, when I've got that, I want to compute alpha, somehow.
No, I don't want to do that.
I want to I want to pick a classifier the minimizes the error rate.
And then m, i, zes, error at time t.
And that's going to be at time t.
And we're going to come back in here.
That's why we put a step index in there.
So once we've picked a classifier that produces an error rate, then we can use the error rate to determine the alpha.
So I want the alpha over here.
That'll be sort of a byproduct of picking that test.
And with all that stuff in hand, maybe that will be enough to calculate Wt plus 1.
So we're going to use that classifier that we just picked to get some revised weights, and then we're going to go around that loop until this classifier produces a perfect set of conclusions on all the sample data.
So that's going to be our overall strategy.
Maybe we've got, if we're going to number these things, that's the fourth big idea.
And this arrangement here is the fifth big idea.
Then we've got the sixth big idea.
And the sixth big idea says this.
Suppose that the weight on it ith sample at time t plus 1 is equal to the weight at time t on that same sample, divided by some normalizing factor, times e to the minus alpha at time t, times h at time t, times some function y which is a function of x, But not a function of time.
Now you say, where did this come from?
And the answer is, it did not spring from the heart of mathematician in the first 10 minutes that he looked at this problem.
In fact, when I asked [INAUDIBLE] how this worked, he said, well, he was thinking about this on the couch every Saturday for about a year, and his wife was getting pretty sore, but he finally found it and saved their marriage.
So where does stuff like this come from?
Really, it comes from knowing a lot of mathematics, and seeing a lot of situations, and knowing that something like this might be mathematically convenient.
Something like this might be mathematically convenient.
But we've got to back up a little and let it sing to us.
We saw y last time.
The support vector machines.
That's just a function.
That's plus 1 or minus 1, depending on whether the output ought to be plus 1 or minus 1.
So if this guy is giving the correct answer, and the correct answer is plus, and then this guy will be plus 1 too, because it always gives you the correct answer.
So in that case, where this guy is giving the right answer, these will have the same sign, so that will be a plus 1 combination.
On the other hand, if that guy's giving the wrong answer, you're going to get a minus 1 out of that combination.
So it's true even if the right answer should be minus, right?
So if the right answer should be minus, and this is plus, then this will be minus 1, and the whole combination well give you minus 1 again.
In other words, the y just flips the sign if you've got the wrong answer, no matter whether the wrong answer is plus 1 or minus 1.
These alphas-- shoot, those are the same alphas that are in this formula up here, somehow.
And then that z, what's that for?
Well, if you just look at the previous weights, and its exponential function to produce these W's for the next generation, that's not going to be a distribution, because they won't sum up to 1.
So what this thing here, this z is, that's a sort of normalizer.
And that makes that whole combination of new weights add up to 1.
So it's whatever you got by adding up all those guys, and then dividing by that number.
I don't know.
Now there's some it-turns-out-thats.
We're going to imagine that somebody's done the same sort of thing we did to the support vector machines.
We're going to find a way to minimize the error.
And the error we're going to minimize is the error produced by that whole thing up there in 4.
We're going to minimize the error of that entire expression as we go along.
And what we discover when we do the appropriate differentiations and stuff-- you know, that's what we do in calculus-- what we discover is that you get minimum error for the whole thing if alpha is equal to 1 minus the error rate at time t, divided by the error rate at time t.
Now let's take the logarithm of that, and multiply it by half.
And that's what [INAUDIBLE] was struggling to find.
But we haven't quite got it right.
And so let me add this in separate chunks, so we don't get confused about this.
It's a bound on that expression up there.
It's a bound on the error rate produced by that expression.
So interestingly enough, this means that the error rate can actually go up as you add terms to this formula.
all you know is that the error rate is going to be bounded by an exponentially decaying function.
So it's eventually guaranteed to converge on zero.
So it's a minimal error bound.
It turns out to be exponential.
Well, there it is.
Would you like to see a demonstration?
Because you look at that, and you say, well, how could anything like that possibly work?
And the answer is, surprisingly enough, here's what happens.
There's a simple little example.
So that's the first test chosen.
the greens are pluses and the reds are minuses, so it's still got an error.
Still got an error-- boom.
There, in two steps.
It now has-- we can look in the upper right hand corner-- we see its used three classifiers, and we see that one of those classifiers says that everybody belongs to a particular class, three different weights.
And the error rate has converged to 0.
So let's look at a couple of other ones.
Here is the one I use for debugging this thing.
We'll let that run.
See how fast it is?
It converges to getting all the samples right very fast.
Here's another one.
This is one we gave on an exam a few years back.
Oh, I let it run, so it got everything instantaneously right.
Let's take that through step at a time.
There's the first one, second one.
Still got a lot of errors.
Ah, the error rate's dropping.
And then flattened, flattened, and it goes to 0.
Cool, don't you think?
But you say to me, bah, who cares about that stuff?
Let's try something more interesting.
That was pretty fast, too.
Well, there's not too many samples here.
So we can try this.
So there's an array of pluses and minuses.
You can see how that error rate is bounded by an exponential?
So in a bottom graph, you've got the number of classifiers involved, and that goes up to a total, eventually, of 10.
You can see how positive or negative each of the classifiers that's added is by looking at this particular tab.
And this just shows how they evolve over time.
But the progress thing here is the most interesting.
And now you say to me, well, how did the machine do that?
And it's all right here.
We use an alpha that looks like this.
And that allows us to compute the new weights.
It says we've got a preliminary calculation.
We've got to find a z that does the normalization.
And we sure better bring our calculator, because we've got, first of all, to calculate the error rate.
Then we've got to take its logarithm, divide by 2, plug it into that formula, take the exponent, and that gives us the new weight.
And that's how the program works.
And if you try that, I guarantee you will flunk the exam.
Now, I don't care about my computer.
I really don't.
It's a slave, and it can calculate these logarithm and exponentials till it turns blue, and I don't care.
Because I've got four cores or something, and who cares.
Might as well do this, than sit around just burning up heat.
But you don't want to do that.
So what you want to do is you want to know how to do this sort of thing more expeditiously.
So we're going to have to let them the math sing to us a little bit, with a view towards finding better ways of doing this sort of thing.
So let's do that.
And we're going to run out of space here before long, so let me reclaim as much of this board as I can.
So what I'm going to do is I'm going to say, well, now that we've got this formula for alpha that relates alpha t to the error, then I can plug that into this formula up here, number 6.
And what I'll get is that the weight of t plus 1 is equal to the weight at t divided by that normalizing factor, multiplied times something that depends on whether it's categorized correctly or not.
That's what that y's in their for, right?
So we've got a logarithm here, and we got a sign flipper up there in terms of that H of x and y combination.
So if the sign of that whole thing at minus alpha and that y H combination turns out to be negative, then we're going to have to flip the numerator and denominator here in this logarithm, right?
And oh, by the way, since we've got a half out here, that turns out to be the square root of that term inside the logarithm.
So when we carefully do that, what we discover is that it depends on whether it's the right thing or not.
But what it turns out to be is something like a multiplier of the square root.
Better be careful, here.
The square root of what?
PATRICK WINSTON: Well, let's see.
But we have to be careful.
So let's suppose that this is 4 things that we get correct.
So if we get it correct, then we're going to get the same sign out of H of x and y.
We've get a minus sign out there, so we're going to flip the numerator and denominator.
So we're going to get the square root of e of t over 1 minus epsilon of t if that's correct.
If it's wrong, it'll just be the flip of that.
So it'll be the square root of 1 minus the error rate over the error rate.
Everybody with me on that?
I think that's right.
If it's wrong, I'll have to hang myself and wear a paper bag over my head like I did last year.
But let's see if we can make this go correctly this time.
So now, we've got this guy here, we've got everything plugged in all right, and we know that now this z ought to be selected so that it's equal to the sum of this guy multiplied by these things as appropriate for whether it's correct or not.
Because we want, in the end, for all of these w's to add up to 1.
So let's see what they add up to without the z there.
So what we know is that it must be the case that if we add over the correct ones, we get the square root of the error rate over 1 minus the rate of the Wt plus 1.
Plus now we've got the sum of 1 minus the error rate over the error rate, times the sum of the Wi at time t for wrong.
So that's what we get if we added all these up without the z.
So since everything has to add up to 1, then z ought to be equal to this sum.
That looks pretty horrible, until we realize that if we add these guys up over the weights that are wrong, that is the error rate.
This is e.
So therefore, z is equal the square root of the error rate times 1 minus the error rate.
That's the contribution of this term.
Now, let's see.
What is the sum of the weights over the ones that are correct?
Well, that must be 1 minus the error rate.
Ah, so this thing gives you the same result as this one.
So z is equal to 2 times that.
And that's a good thing.
Now we are getting somewhere.
Because now, it becomes a little bit easier to write some things down.
Well, we're way past this, so let's get rid of this.
And now we can put some things together.
Let me point out what I'm putting together.
I've got an expression for z right here.
And I've got an expression for the new w's here.
So let's put those together and say that w of t plus 1 is equal to w of t.
I guess we're going to divide that by 2.
And then we've got this square root times that expression.
So if we take that correct one, and divide by that one, then the [INAUDIBLE] cancel out, and I get 1 over 1 minus the error rate.
And if it's not correct, then it's Wt over 2-- and working through the math-- 1 over epsilon, if wrong.
Do we feel like we're making any progress?
Because we haven't let it sing to us enough yet.
So I want to draw your attention to what happens to amateur rock climbers when they're halfway up a difficult cliff.
They're usually [INAUDIBLE], sometimes they're not.
If they're not, they're scared to death.
And every once in a while, as they're just about to fall, they find some little tiny hole to stick a fingernail in, and that keeps them from falling.
That's called a thank-god hole.
So what I'm about to introduce is the analog of those little places where you can stick your fingernail in.
It's the thank-god hole for dealing with boosting problems.
So what happens if I add all these [? Wi ?] up for the ones that the classifier where produces a correct answer on?
Well, it'll be 1 over 2, and 1 over 1 minus epsilon, times the sum of the Wt for which the answer was correct.
What's this sum?
1 minus epsilon.
So what I've just discovered is that if I sum new w's over those samples for which I got a correct answer, it's equal to 1/2.
And guess what?
That means that if I sum them over wrong, it's equal to 1/2 half as well.
So that means that I take all of the weight for which I got the right answer with the previous test, and those ways will add up to something.
And to get the weights for the next generation, all I have to do is scale them so that they equal half.
This was not noticed by the people who developed this stuff.
This was noticed by Luis Ortiz, who was a 6.034 instructor a few years ago.
The sum of those weights is going to be a scaled version of what they were before.
So you take all the weights for which this new classifier-- this one you selected to give you the minimum weight on the re-weighted stuff-- you take the ones that it gives a correct answer for, and you take all of those weights, and you just scale them so they add up to 1/2.
So do you have to compute any logarithms?
Do you have to compute any exponentials?
Do you have to calculate z?
Do you have to calculate alpha to get the new weights?
All you have to do is scale them.
And that's a pretty good thank-god hole.
So that's thank-god hole number one.
Now, for thank-god hole number two, we need to go back and think about the fact that were going to give you problems in probability that involve decision tree stumps.
And there are a lot of decision tree stumps that you might have to pick from.
So we need a thank-god hole for deciding how to deal with that.
Where can I find some room?
How about right here.
Suppose you've got a space that looks like this.
I'm just makings this up at random.
So how many-- let's see.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
How many tests do I have to consider in that dimension?
It's 1 plus the number of samples.
That would be horrible.
I don't know.
Do I have actually calculate this one?
How could that possibly be better than that one?
It's got one more thing wrong.
So that one makes sense.
The other one doesn't make sense.
So in the end, no test that lies between two correctly classified samples will ever be any good.
So that one's a good guy, and that one's a good guy.
And this one's a bad guy.
Bad guy, bad guy bad guy, bad guy.
Bad guy, bad guy, bad buy.
So the actual number of tests you've got is three.
And likewise, in the other dimension-- well, I haven't drawn it so well here, but would this test be a good one?
Actually, I'd better look over here on the right and see what I've got before I draw too many conclusions.
Let's look over this, since I don't want to think too hard about what's going on in the other dimension.
But the idea is that very few of those tests actually matter.
Now, you say to me, there's one last thing.
What about overfitting?
Because all this does is drape a solution over the samples.
And like support vector machines overfit, neural maps overfit, identification trees overfit.
This doesn't seem to overfit.
That's an experimental result for which the literature is confused.
It goes back to providing an explanation.
So this stuff is tried on all sorts of problems, like handwriting recognition, understanding speech, all sorts of stuff uses boosting.
And unlike other methods, for some reason as yet imperfectly understood, it doesn't seem to overfit.
But in the end, they leave no stone unturned in 6.034.
Every time we do this, we do some additional experiments.
So here's a sample that I'll leave you with.
Here's a situation in which we have a 10-dimensional space.
We've made a fake distribution, and then we put in that boxed outlier.
That was just put into the space at random, so it can be viewed as an error point.
So now what we're going to do is we're going to see what happens when we run that guy.
And sure enough, in 17 steps, it finds a solution.
But maybe it's overfit that little guy who's an error.
But one thing you can do is you can say, well, all of these classifiers are dividing this space up into chunks, and we can compute the size of the space occupied by any sample.
So one thing we can do-- alas, I'll have to get up a new demonstration.
One thing we can do, now that this guy's over here, we can switch the volume tab and watch how the volume occupied by that error point evolves as we solve the problem.
So look what happens.
This is, of course, randomly generated.
I'm counting on this working.
Never failed before.
So it originally starts out as occupying 26% of the total volume.
It ends up occupying 1.4 times 10 to the minus 3rd% of the volume.
So what tends to happen is that these decision tree stumps tend to wrap themselves so tightly around the error points, there's no room for overfitting, because nothing else will fit in that same volume.
So that's why I think that this thing tends to produce solutions which don't overfit.
So in conclusion, this is magic.
You always want to use it.
It'll work with any kind of [? speed ?] of classifiers you want.
And you should understand it very thoroughly, because of anything is useful in the subject in dimension learning, this is it.
This is one of over 2,200 courses on OCW. Find materials for this course in the pages linked along the left.
MIT OpenCourseWare is a free & open publication of material from thousands of MIT courses, covering the entire MIT curriculum.
No enrollment or registration. Freely browse and use OCW materials at your own pace. There's no signup, and no start or end dates.
Knowledge is your reward. Use OCW to guide your own life-long learning, or to teach others. We don't offer credit or certification for using OCW.
Made for sharing. Download files for later. Send to friends and colleagues. Modify, remix, and reuse (just remember to cite OCW as the source.)
Learn more at Get Started with MIT OpenCourseWare