WEBVTT
00:00:00.530 --> 00:00:02.960
The following content is
provided under a Creative
00:00:02.960 --> 00:00:04.370
Commons license.
00:00:04.370 --> 00:00:07.410
Your support will help MIT
OpenCourseWare continue to
00:00:07.410 --> 00:00:11.060
offer high quality educational
resources for free.
00:00:11.060 --> 00:00:13.960
To make a donation or view
additional materials from
00:00:13.960 --> 00:00:17.890
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:17.890 --> 00:00:19.140
ocw.mit.edu.
00:00:24.220 --> 00:00:24.840
PROFESSOR: OK.
00:00:24.840 --> 00:00:30.230
Today we're going to finish
up with Markov chains.
00:00:30.230 --> 00:00:34.570
And the last topic will be
dynamic programming.
00:00:34.570 --> 00:00:39.900
I'm not going to say an awful
lot about dynamic programming.
00:00:39.900 --> 00:00:43.530
It's a topic that was enormously
important in
00:00:43.530 --> 00:00:49.600
research for probably 20
years from 1960 until
00:00:49.600 --> 00:00:53.540
about 1980, or 1990.
00:00:53.540 --> 00:01:00.300
And it seemed as if half the
Ph.D. theses done in the
00:01:00.300 --> 00:01:03.920
control area and in operations research
00:01:03.920 --> 00:01:07.630
were in this area.
00:01:07.630 --> 00:01:11.950
Suddenly, everything seemed
to be done, could be done.
00:01:11.950 --> 00:01:15.310
And strangely enough, not
many people seem to
00:01:15.310 --> 00:01:16.760
know about it anymore.
00:01:16.760 --> 00:01:20.760
It's an enormously useful
algorithm for solving an awful
00:01:20.760 --> 00:01:23.000
lot of different problems.
00:01:23.000 --> 00:01:25.420
It's quite a simple algorithm.
00:01:25.420 --> 00:01:28.780
You don't need the full power
of Markov chains in order to
00:01:28.780 --> 00:01:30.470
understand it.
00:01:30.470 --> 00:01:34.250
So I do want to at least talk
about it a little bit.
00:01:34.250 --> 00:01:38.070
And we will use what we've done
so far with Markov chains
00:01:38.070 --> 00:01:40.940
in order to understand it.
00:01:40.940 --> 00:01:44.200
I want to start out today by
reviewing a little bit of what
00:01:44.200 --> 00:01:49.040
we did last time about
eigenvalues and eigenvectors.
00:01:49.040 --> 00:01:56.320
This was a somewhat awkward
topic to talk about, because
00:01:56.320 --> 00:01:59.970
you people have very different
backgrounds in linear algebra.
00:01:59.970 --> 00:02:03.450
Some of you have a very strong
background, some of you have
00:02:03.450 --> 00:02:05.240
almost no background.
00:02:05.240 --> 00:02:10.509
So it was a lot of material for
those of you who know very
00:02:10.509 --> 00:02:14.190
little about linear algebra.
00:02:14.190 --> 00:02:16.620
And probably somewhat boring
for those of you
00:02:16.620 --> 00:02:18.690
who use it all the time.
00:02:18.690 --> 00:02:22.670
At any rate, if you don't know
anything about it, linear
00:02:22.670 --> 00:02:28.820
algebra is a topic that you
ought to understand for almost
00:02:28.820 --> 00:02:30.270
anything you do.
00:02:30.270 --> 00:02:35.230
If you've gotten to this point
without having to study it,
00:02:35.230 --> 00:02:37.460
it's very strange.
00:02:37.460 --> 00:02:41.720
So you should probably take
some extra time out, not
00:02:41.720 --> 00:02:43.900
because you need it so
much for this course.
00:02:43.900 --> 00:02:46.670
We won't use it enormously
in many of the
00:02:46.670 --> 00:02:48.500
things we do later.
00:02:48.500 --> 00:02:51.930
But you will use it so many
times in the future that you
00:02:51.930 --> 00:02:56.870
ought to just sit down, not to
learn abstract linear algebra,
00:02:56.870 --> 00:03:00.150
which is very useful also, but
just to learn how to use the
00:03:00.150 --> 00:03:03.280
topic of solving linear
equations.
00:03:03.280 --> 00:03:06.450
Being able to express them
in terms of matrices.
00:03:06.450 --> 00:03:09.310
Being able to use the
eigenvalues and eigenvectors,
00:03:09.310 --> 00:03:12.220
of matrices as a way of
understanding these things.
00:03:12.220 --> 00:03:16.440
So I want to say a little more
about that today, which is why
00:03:16.440 --> 00:03:19.720
I've called this a review
plus of eigenvalues and
00:03:19.720 --> 00:03:21.020
eigenvectors.
00:03:21.020 --> 00:03:25.930
It's a review of the topics
we did last time, but it's
00:03:25.930 --> 00:03:28.250
looking at it in a somewhat
different way.
00:03:28.250 --> 00:03:32.150
So let's proceed with that.
00:03:32.150 --> 00:03:36.810
We said that the determinant of
an M by M matrix is given
00:03:36.810 --> 00:03:38.530
by this strange formula.
00:03:38.530 --> 00:03:44.340
The determinant of a is the sum
over all permutations of
00:03:44.340 --> 00:03:51.260
the integers 1 to M of the
product from i equals 1 to M
00:03:51.260 --> 00:03:56.080
of the matrix element
a sub i mu of i.
00:03:56.080 --> 00:04:01.670
Mu of i is the permutation of
the number i. i is between one
00:04:01.670 --> 00:04:05.510
and M, and mu of i is a
permutation of that.
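In symbols, the formula being read from the slide is the standard permutation expansion of the determinant, which can be typeset as

\det A = \sum_{\mu} \operatorname{sgn}(\mu) \prod_{i=1}^{M} a_{i,\mu(i)},

where the sum runs over all permutations mu of the integers 1 to M, and sgn(mu) is the plus or minus sign attached to each permutation.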
00:04:05.510 --> 00:04:17.529
Now if you look at the matrix,
which has the form, which is
00:04:17.529 --> 00:04:19.600
block upper diagonal.
00:04:19.600 --> 00:04:22.990
In other words, there's a matrix
here, a square matrix a
00:04:22.990 --> 00:04:26.390
sub t, which is a transient
matrix.
00:04:26.390 --> 00:04:31.610
There's a recurrent matrix here,
and there's some way of
00:04:31.610 --> 00:04:33.900
getting from the transient
states to
00:04:33.900 --> 00:04:36.730
the recurrent states.
00:04:36.730 --> 00:04:41.630
And this is the general form
that a unit chain has to have.
00:04:41.630 --> 00:04:44.970
There are a bunch of transient
states, there are a bunch of
00:04:44.970 --> 00:04:47.230
recurrent states.
00:04:47.230 --> 00:04:52.630
And the interesting thing here
is that the determinant of a
00:04:52.630 --> 00:04:57.620
is exactly the determinant
of a sub t times the
00:04:57.620 --> 00:04:59.410
determinant of a sub r.
00:04:59.410 --> 00:05:03.210
I'm calling this a instead of
the transition matrix p
00:05:03.210 --> 00:05:08.840
because I want to replace a by
p minus lambda i, so I can
00:05:08.840 --> 00:05:11.820
talk about the eigenvalues
of p.
00:05:11.820 --> 00:05:15.690
So when I do that replacement
here, if I know that the
00:05:15.690 --> 00:05:20.140
determinant of a is this product
of determinants, then
00:05:20.140 --> 00:05:24.130
the determinant of p minus
lambda i is the determinant of
00:05:24.130 --> 00:05:32.160
p sub t minus lambda i sub t, times the determinant of p sub r minus lambda i sub r, where i sub t is just a crazy way of saying
00:05:32.160 --> 00:05:35.120
a diagonal matrix.
00:05:35.120 --> 00:05:40.070
A diagonal t by t matrix,
because this is a t by t
00:05:40.070 --> 00:05:41.740
matrix, also.
00:05:41.740 --> 00:05:48.580
i sub r is an r by r matrix,
where this is a square r by r
00:05:48.580 --> 00:05:50.260
matrix also.
00:05:50.260 --> 00:05:53.970
Now, why is it that this
determinant is equal to this
00:05:53.970 --> 00:05:56.670
product of determinants here?
00:05:56.670 --> 00:06:02.010
Well, before explaining why this
is true, why do you care?
00:06:02.010 --> 00:06:08.180
Well, because we know that if
we have a recurring matrix
00:06:08.180 --> 00:06:11.630
here, we know that it has--
00:06:11.630 --> 00:06:13.790
I mean, we know a great
deal about it.
00:06:13.790 --> 00:06:21.150
We know that any square matrix,
an r by r matrix, has r
00:06:21.150 --> 00:06:22.750
eigenvalues.
00:06:22.750 --> 00:06:26.330
Some of them might be repeated,
but there are always r
00:06:26.330 --> 00:06:27.480
eigenvalues.
00:06:27.480 --> 00:06:31.420
This matrix here has
t eigenvalues.
00:06:31.420 --> 00:06:32.520
OK.
00:06:32.520 --> 00:06:37.730
This matrix here, we know has
r plus t eigenvalues.
00:06:37.730 --> 00:06:42.060
You look at this formula here
and you say aha, I can take
00:06:42.060 --> 00:06:46.670
all the eigenvalues here, add
them to all the eigenvalues
00:06:46.670 --> 00:06:50.280
here, and I have every one
of the eigenvalues here.
00:06:50.280 --> 00:06:54.780
In other words, if I want to
find all of the eigenvalues of
00:06:54.780 --> 00:06:59.620
p, all I have to do is find
the eigenvalues of p sub t,
00:06:59.620 --> 00:07:04.710
add them to the eigenvalues of
p sub r, and I'm all done.
00:07:04.710 --> 00:07:08.640
So that really has simplified
things a good deal.
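As a rough numerical sketch of that claim (not from the lecture; the matrix entries below are made up for illustration), one can check it with numpy:

import numpy as np

# A small unit chain in block upper triangular form: states 0 and 1 are
# transient (block P_T), states 2 and 3 are recurrent (block P_R), and
# P_TR carries the transitions from transient to recurrent states.
P_T = np.array([[0.5, 0.2],
                [0.1, 0.6]])
P_TR = np.array([[0.2, 0.1],
                 [0.1, 0.2]])
P_R = np.array([[0.7, 0.3],
                [0.4, 0.6]])
P = np.block([[P_T, P_TR],
              [np.zeros((2, 2)), P_R]])

# The spectrum of P is the union of the spectra of the two diagonal blocks.
eig_P = np.sort_complex(np.linalg.eigvals(P))
eig_blocks = np.sort_complex(np.concatenate(
    [np.linalg.eigvals(P_T), np.linalg.eigvals(P_R)]))
print(np.allclose(eig_P, eig_blocks))  # True

Changing the entries of P_TR alone leaves the printed eigenvalues unchanged, which is the point made a few minutes later about the coupling terms (although the result may then no longer be a stochastic matrix).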
00:07:08.640 --> 00:07:14.270
And it also really says
explicitly that if you
00:07:14.270 --> 00:07:20.060
understand how to deal with
recurrent Markov chains, you
00:07:20.060 --> 00:07:22.620
really know everything.
00:07:22.620 --> 00:07:25.840
Well, you also have to know how
to deal with a transient
00:07:25.840 --> 00:07:29.880
chain, but the main part of it
is dealing with this chain.
00:07:29.880 --> 00:07:34.870
This has little r different
eigenvalues, and all of those
00:07:34.870 --> 00:07:41.860
are eigenvalues, excuse me,
p sub r has little r
00:07:41.860 --> 00:07:42.710
eigenvalues.
00:07:42.710 --> 00:07:46.860
They're given by the roots
of this determinant here.
00:07:46.860 --> 00:07:49.530
And all of those
are roots here.
00:07:49.530 --> 00:07:51.580
OK, so why is this true?
00:07:51.580 --> 00:07:57.990
Well, the reason for it is that
this product up here,
00:07:57.990 --> 00:07:59.200
look at this.
00:07:59.200 --> 00:08:02.490
We're taking the sum over
all permutations.
00:08:02.490 --> 00:08:05.315
But which one of those
permutations can be non-zero?
00:08:12.940 --> 00:08:18.740
If I start out by saying that a
sub t is t by t, then I know
00:08:18.740 --> 00:08:21.440
that this might be anything.
00:08:21.440 --> 00:08:24.050
These have to be zeroes here.
00:08:24.050 --> 00:08:30.450
If I choose some permutation down here, of some i, which
00:08:30.450 --> 00:08:31.530
is greater than t.
00:08:31.530 --> 00:08:35.030
In other words, if I choose
mu of i to be some
00:08:35.030 --> 00:08:36.130
element over here.
00:08:36.130 --> 00:08:42.309
If I choose mu of i to be less
than or equal to t, and i to
00:08:42.309 --> 00:08:45.500
be greater than t,
what happens?
00:08:45.500 --> 00:08:47.790
I get a term which
is equal to zero.
00:08:47.790 --> 00:08:51.210
That term in this
product is zero.
00:08:51.210 --> 00:08:55.670
And so that whole product is zero.
00:08:55.670 --> 00:09:00.830
So the only way I can get non
zeros here is when I'm dealing
00:09:00.830 --> 00:09:03.730
with an i which is less
than or equal to t.
00:09:03.730 --> 00:09:06.100
Namely an i here.
00:09:06.100 --> 00:09:09.440
I have to choose a mu of
i, a column which is
00:09:09.440 --> 00:09:10.870
less than t, also.
00:09:10.870 --> 00:09:17.540
If I'm dealing with an i which
is greater than t, namely and
00:09:17.540 --> 00:09:23.410
i up here, then, well, it
looks like I can choose
00:09:23.410 --> 00:09:24.950
anything there.
00:09:24.950 --> 00:09:25.630
But look.
00:09:25.630 --> 00:09:31.180
I've already used up all of
these columns here
00:09:31.180 --> 00:09:33.470
by the non-zero terms here.
00:09:33.470 --> 00:09:37.360
So I can't do anything
but use a smaller i,
00:09:37.360 --> 00:09:40.080
smaller than t up here.
00:09:40.080 --> 00:09:44.703
So when I look at the
permutations that are non
00:09:44.703 --> 00:09:49.010
zero, the only permutations that
are non zero are those
00:09:49.010 --> 00:09:55.610
where mu of i is less than t if
i less than t, and mu of i
00:09:55.610 --> 00:10:01.960
is less than or equal to t if i
is less than or equal to t.
00:10:01.960 --> 00:10:06.100
And mu of i is greater than
t if i is greater than t.
00:10:06.100 --> 00:10:11.580
Now, how does that show that
this is equal here?
00:10:11.580 --> 00:10:16.480
Well, let's look at
that a little bit.
00:10:16.480 --> 00:10:19.740
I didn't even try to do it on
the slide because the notation
00:10:19.740 --> 00:10:20.970
is kind of horrifying.
00:10:20.970 --> 00:10:24.850
But let's try to write this
the following way.
00:10:24.850 --> 00:10:36.910
Determinant of a is equal to the
sum, and now I'll write it
00:10:36.910 --> 00:10:48.040
as a sum over mu of 1 up to t.
00:10:48.040 --> 00:10:59.690
And the sum over mu of t
plus 1 up to, well, t
00:10:59.690 --> 00:11:02.460
plus r, let's say.
00:11:02.460 --> 00:11:06.210
OK, so here I have all of
the permutations of the
00:11:06.210 --> 00:11:08.870
numbers 1 to t.
00:11:08.870 --> 00:11:11.350
And here I have all the
permutations of the
00:11:11.350 --> 00:11:14.010
numbers t plus 1 up.
00:11:14.010 --> 00:11:16.760
And for all of those,
I'm going to
00:11:16.760 --> 00:11:18.190
ignore this plus minus.
00:11:18.190 --> 00:11:21.420
You can sort that out
for yourselves.
00:11:21.420 --> 00:11:27.620
And then I have a product
from i equals 1 to t.
00:11:27.620 --> 00:11:37.950
And then a product from i
equals t plus 1 up to m.
00:11:37.950 --> 00:11:39.200
Excuse me.
00:11:42.410 --> 00:11:53.000
a sub i, mu of i, times the product of a sub i,
00:11:53.000 --> 00:12:07.130
mu of i, for i equals t plus
1 up to t plus r.
00:12:07.130 --> 00:12:09.300
OK?
00:12:09.300 --> 00:12:14.740
So I'm separating this product
here into a product first of
00:12:14.740 --> 00:12:19.070
the terms i less than or equal
to t, and then for the terms i
00:12:19.070 --> 00:12:20.180
greater than t.
00:12:20.180 --> 00:12:24.620
For every permutation I choose
using the i's that are less
00:12:24.620 --> 00:12:29.090
than or equal to t, I can choose
any of the permutation
00:12:29.090 --> 00:12:33.520
using mu of i greater than
t that I choose to use.
00:12:33.520 --> 00:12:35.570
So this breaks up in this way.
00:12:35.570 --> 00:12:37.960
I have this sum, I
have this sum.
00:12:37.960 --> 00:12:43.120
I have these two products, so
I can break this up as a sum
00:12:43.120 --> 00:12:55.270
over mu of 1 to t of plus minus
product from i equals 1
00:12:55.270 --> 00:13:08.752
to t of a sub i, mu of i, times the
sum over mu of t plus 1 up to
00:13:08.752 --> 00:13:15.072
t plus r of a sub i, mu of i.
00:13:20.160 --> 00:13:22.380
Product.
00:13:22.380 --> 00:13:23.300
OK.
00:13:23.300 --> 00:13:26.030
So I've separated that into
two different terms.
00:13:26.030 --> 00:13:27.000
STUDENT: T equals [INAUDIBLE].
00:13:27.000 --> 00:13:27.570
PROFESSOR: What?
00:13:27.570 --> 00:13:30.680
STUDENT: T plus r
equals big m?
00:13:30.680 --> 00:13:33.230
PROFESSOR: T plus
r is big m, yes.
00:13:33.230 --> 00:13:40.060
Because I have t terms here,
and I have r terms here.
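Collecting the board computation into one display, and keeping the signs (they factor, since sgn(mu) = sgn(mu_T) sgn(mu_R)), the surviving permutations give

\det A = \Big( \sum_{\mu_T} \operatorname{sgn}(\mu_T) \prod_{i=1}^{t} a_{i,\mu_T(i)} \Big)
\Big( \sum_{\mu_R} \operatorname{sgn}(\mu_R) \prod_{i=t+1}^{t+r} a_{i,\mu_R(i)} \Big)
= \det(A_T)\,\det(A_R),

where mu_T runs over the permutations of 1 to t and mu_R over the permutations of t plus 1 to t plus r.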
00:13:40.060 --> 00:13:44.710
OK, so the interesting thing
here is having this non-zero
00:13:44.710 --> 00:13:48.400
term here doesn't make
any difference here.
00:13:48.400 --> 00:13:52.430
I mean, this is more
straightforward if you have a
00:13:52.430 --> 00:13:54.020
block diagonal matrix.
00:13:54.020 --> 00:13:58.330
It's clear that the eigenvalues
of a block
00:13:58.330 --> 00:14:03.700
diagonal matrix are going to be
the eigenvalues of one block plus
00:14:03.700 --> 00:14:05.560
the eigenvalues of the other.
00:14:05.560 --> 00:14:09.980
Here we have the eigenvalues
of this, and the
00:14:09.980 --> 00:14:11.450
eigenvalues of this.
00:14:11.450 --> 00:14:14.910
And what's surprising is that as
far as the eigenvalues are
00:14:14.910 --> 00:14:19.950
concerned, this has nothing
whatsoever to do with it.
00:14:19.950 --> 00:14:20.690
OK.
00:14:20.690 --> 00:14:24.480
The only thing that this has
to do with it is it says
00:14:24.480 --> 00:14:28.780
something about the sums of this
matrix here, because the
00:14:28.780 --> 00:14:31.500
sums of these rows can now be less than 1.
00:14:31.500 --> 00:14:34.660
They all have to be less than or equal to 1, and some of them, at least, have to be
00:14:34.660 --> 00:14:36.760
strictly less than 1.
00:14:36.760 --> 00:14:40.090
Because you do have this way of
getting from the transient
00:14:40.090 --> 00:14:43.470
elements to the non transient
elements.
00:14:43.470 --> 00:14:48.060
But it's very surprising that
these elements, which are
00:14:48.060 --> 00:14:52.100
critically important, because
those are the things that get
00:14:52.100 --> 00:14:55.800
you from the transient states
to the recurrent states have
00:14:55.800 --> 00:14:59.540
nothing to do with the
whatsoever.
00:14:59.540 --> 00:15:00.105
I don't know why.
00:15:00.105 --> 00:15:04.310
I can't give you any insights
about that, but
00:15:04.310 --> 00:15:06.810
that's the way it is.
00:15:06.810 --> 00:15:12.030
That's an interesting thing,
because if you take this
00:15:12.030 --> 00:15:19.930
transition matrix, and you keep
a sub t and a sub r fixed, and
00:15:19.930 --> 00:15:23.250
you play any kind of funny game
you want to with those
00:15:23.250 --> 00:15:28.780
terms going from the transient
states to the non transient
00:15:28.780 --> 00:15:33.370
states, it won't change
any eigenvalues.
00:15:33.370 --> 00:15:35.490
Don't know why it doesn't.
00:15:35.490 --> 00:15:39.400
OK, so where do we
go with that?
00:15:39.400 --> 00:15:45.440
Well, that's what it says.
00:15:45.440 --> 00:15:50.580
The eigenvalues of p are the t eigenvalues of p sub t, and the r
00:15:50.580 --> 00:15:52.200
eigenvalues of p sub r.
00:15:52.200 --> 00:15:56.180
It also tells you something
about simple eigenvalues, and
00:15:56.180 --> 00:15:59.800
these crazy eigenvalues, which
don't have enough eigenvectors
00:15:59.800 --> 00:16:01.230
to go along with them.
00:16:01.230 --> 00:16:06.420
Because it tells you that if p sub r has all of its
00:16:06.420 --> 00:16:11.880
eigenvectors, and p sub t
has all of its eigenvectors.
00:16:11.880 --> 00:16:14.550
Then you don't have any of this crazy Jordan form
00:16:14.550 --> 00:16:16.520
thing, or anything.
00:16:16.520 --> 00:16:29.670
OK. If pi is a left eigenvector
of this recurrent matrix, then
00:16:29.670 --> 00:16:35.550
if you look at the vector,
starting with zeros, and then I
00:16:35.550 --> 00:16:42.390
guess I should really say, well,
if pi sub 1 up to pi sub
00:16:42.390 --> 00:16:47.910
r is a left eigenvector of this r
by r matrix, then if I start
00:16:47.910 --> 00:16:52.620
out with t zeroes, and then
put in pi 1 to pi r, this
00:16:52.620 --> 00:16:57.310
vector here has to be a left
eigenvector of all of p.
00:16:57.310 --> 00:16:58.310
Why is that?
00:16:58.310 --> 00:17:01.610
Well, if I look at a vector,
which starts out with zeroes,
00:17:01.610 --> 00:17:06.900
and then has this eigenvector
pi, and I multiply that vector
00:17:06.900 --> 00:17:10.210
by this matrix here, I'm
taking these terms,
00:17:10.210 --> 00:17:16.260
multiplying them by the columns
of this matrix, these
00:17:16.260 --> 00:17:22.310
zeros knock out all of
these elements here.
00:17:22.310 --> 00:17:25.470
These zeroes knock out all
of these elements.
00:17:25.470 --> 00:17:28.410
So I start out with zeroes
everywhere here.
00:17:28.410 --> 00:17:30.480
That's what this says.
00:17:30.480 --> 00:17:34.660
And then when I'm dealing with
this part of the matrix, the
00:17:34.660 --> 00:17:39.750
zeros knock out all of this, and
I just have pi multiplying
00:17:39.750 --> 00:17:40.820
p sub r.
00:17:40.820 --> 00:17:45.220
So if I have an eigenvalue
lambda, it says I have the
00:17:45.220 --> 00:17:50.170
eigenvalue lambda times the vector zero followed by pi.
00:17:50.170 --> 00:17:54.760
It says that if I have an
eigenvector, a left
00:17:54.760 --> 00:18:01.010
eigenvector of this recurrent
matrix, then that turns into,
00:18:01.010 --> 00:18:05.670
if you put some zeroes up in
front of it, it turns into an
00:18:05.670 --> 00:18:07.790
eigenvector of the
whole matrix.
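In block matrix notation, the argument just given is the one-line computation

\begin{pmatrix} 0 & \pi \end{pmatrix}
\begin{pmatrix} P_T & P_{TR} \\ 0 & P_R \end{pmatrix}
= \begin{pmatrix} 0 & \pi P_R \end{pmatrix}
= \begin{pmatrix} 0 & \lambda \pi \end{pmatrix}
= \lambda \begin{pmatrix} 0 & \pi \end{pmatrix},

valid whenever pi P_R = lambda pi, so the zero-padded vector is a left eigenvector of the whole matrix with the same eigenvalue.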
00:18:07.790 --> 00:18:11.580
If we look at the eigenvalue 1,
which is the most important
00:18:11.580 --> 00:18:14.350
thing, this is the thing that
gives you the steady state
00:18:14.350 --> 00:18:16.930
vector, this is sort
of obvious.
00:18:16.930 --> 00:18:19.630
Because the steady state
vector is where you go
00:18:19.630 --> 00:18:23.960
eventually, and eventually where
you go is you have to be
00:18:23.960 --> 00:18:27.290
in one of these recurrent
states, eventually.
00:18:27.290 --> 00:18:30.610
And the probabilities within
the recurrent set of states
00:18:30.610 --> 00:18:33.400
are the same as the
probabilities if you didn't
00:18:33.400 --> 00:18:36.590
have these transient
states at all.
00:18:36.590 --> 00:18:40.490
So this is all sort of obvious,
as far as the steady
00:18:40.490 --> 00:18:43.020
state vector pi.
00:18:43.020 --> 00:18:47.480
But it's a little less obvious
as far as the other vectors.
00:18:47.480 --> 00:18:52.300
The left eigenvectors of p sub t, I don't
00:18:52.300 --> 00:18:53.610
understand them at all.
00:18:53.610 --> 00:18:59.660
They aren't the same as the left
eigenvectors of, well,
00:18:59.660 --> 00:19:04.670
the left eigenvectors of the
eigenvalues of p sub t.
00:19:08.040 --> 00:19:10.270
I didn't say this right here.
00:19:10.270 --> 00:19:15.870
The left eigenvectors of p
corresponding to the left
00:19:15.870 --> 00:19:18.700
eigenvectors of p sub t.
00:19:18.700 --> 00:19:22.010
I don't understand how they
work, and I don't understand
00:19:22.010 --> 00:19:24.350
anything you can derive
from them.
00:19:24.350 --> 00:19:26.740
They're just kind of crazy
things, which are what they
00:19:26.740 --> 00:19:27.780
happen to be.
00:19:27.780 --> 00:19:29.350
And I don't care about them.
00:19:29.350 --> 00:19:32.200
I don't know anything
to do with them.
00:19:32.200 --> 00:19:35.200
But these other eigenvectors
are very useful.
00:19:35.200 --> 00:19:38.130
OK.
00:19:38.130 --> 00:19:45.040
We can extend this to as many
different recurrent sets of
00:19:45.040 --> 00:19:47.080
states as you choose.
00:19:47.080 --> 00:19:53.100
Here I'm doing it with a Markov
chain, which has two
00:19:53.100 --> 00:19:56.550
different sets of recurrent
states.
00:19:56.550 --> 00:20:00.010
They might be periodic, they
might be ergodic, it doesn't
00:20:00.010 --> 00:20:01.340
make any difference.
00:20:01.340 --> 00:20:07.730
So the matrix p has these
transient states up here.
00:20:07.730 --> 00:20:11.990
Here we have how those transient states just go to each
00:20:11.990 --> 00:20:16.320
other, with the transition probabilities starting in
00:20:16.320 --> 00:20:19.140
a transient state and going to a transient state.
00:20:19.140 --> 00:20:24.090
Here we have the transitions,
which go from transient states
00:20:24.090 --> 00:20:26.500
to this first set of
recurrent states.
00:20:26.500 --> 00:20:30.810
Here we have the transitions,
which go from a transient
00:20:30.810 --> 00:20:35.480
state to the second set
of recurrent states.
00:20:35.480 --> 00:20:36.180
OK.
00:20:36.180 --> 00:20:39.330
The same way as before, the
determinant of this whole
00:20:39.330 --> 00:20:44.790
thing here, and this
determinant, the roots of that
00:20:44.790 --> 00:20:49.300
are in fact the eigenvalues of
p, are the product of the
00:20:49.300 --> 00:20:54.930
determinant of p sub t minus lambda i sub t, times the product of this,
00:20:54.930 --> 00:20:58.030
times this determinant here.
00:20:58.030 --> 00:21:02.180
This has little t eigenvalues.
00:21:02.180 --> 00:21:05.220
This has little r eigenvalues.
00:21:05.220 --> 00:21:08.690
This has little r prime
eigenvalues, and if you add up
00:21:08.690 --> 00:21:11.880
t plus little r plus little
r prime, what do you get?
00:21:11.880 --> 00:21:17.790
You get capital
M, which is the total number
00:21:17.790 --> 00:21:21.470
of states in the Markov chain.
00:21:21.470 --> 00:21:27.110
So the eigenvalues here are
exactly the eigenvalues here
00:21:27.110 --> 00:21:33.300
plus the eigenvalues here, plus
the eigenvalues here.
00:21:33.300 --> 00:21:36.720
And you can find the
eigenvectors, the left
00:21:36.720 --> 00:21:40.810
eigenvectors for these
states in exactly
00:21:40.810 --> 00:21:43.450
the same way as before.
00:21:43.450 --> 00:21:44.570
OK.
00:21:44.570 --> 00:21:45.772
Yeah?
00:21:45.772 --> 00:21:48.628
STUDENT: So again, the
eigenvalues can be repeated
00:21:48.628 --> 00:21:51.960
both within t, r, r prime,
and in between the--
00:21:51.960 --> 00:21:52.436
PROFESSOR: Yes.
00:21:52.436 --> 00:21:54.340
STUDENT: There's nothing
that says [INAUDIBLE].
00:21:54.340 --> 00:21:54.610
PROFESSOR: No.
00:21:54.610 --> 00:21:58.440
There's nothing that says they
can't, except you can always
00:21:58.440 --> 00:22:05.980
find the left eigenvectors, anyway, of this, and in fact,
00:22:05.980 --> 00:22:08.680
they have this form.
00:22:08.680 --> 00:22:15.840
If pi is a left eigenvector of p
sub r, then zero followed by
00:22:15.840 --> 00:22:17.460
pi followed by zero.
00:22:17.460 --> 00:22:26.480
In other words, little t zeros
followed by the
00:22:26.480 --> 00:22:32.060
eigenvector pi, followed by
little r prime zeroes here,
00:22:32.060 --> 00:22:34.490
this has to be a left
eigenvector of p.
00:22:34.490 --> 00:22:37.280
So this tells you something
about whether you're going to
00:22:37.280 --> 00:22:40.140
have a Jordan form or not,
one of these really
00:22:40.140 --> 00:22:41.240
ugly things in it.
00:22:41.240 --> 00:22:44.590
And it tells you that
in many cases, you
00:22:44.590 --> 00:22:46.370
just can't have them.
00:22:46.370 --> 00:22:48.850
If you have them, they're
usually tied up with this
00:22:48.850 --> 00:22:50.730
matrix here.
00:22:50.730 --> 00:22:53.140
OK, so that, I don't know.
00:22:53.140 --> 00:22:53.950
Was this useful?
00:22:53.950 --> 00:22:55.550
Does this clarify anything?
00:22:55.550 --> 00:22:58.830
Or if it doesn't,
it's too bad.
00:23:01.810 --> 00:23:02.330
OK.
00:23:02.330 --> 00:23:05.080
So now we want to start
talking about rewards.
00:23:07.580 --> 00:23:09.150
Some people call these costs.
00:23:09.150 --> 00:23:11.230
If you're an optimist,
you call it rewards.
00:23:11.230 --> 00:23:13.870
If you're a pessimist,
you call it costs.
00:23:13.870 --> 00:23:15.520
They're both the same thing.
00:23:15.520 --> 00:23:18.180
If you're dealing with rewards,
you maximize them.
00:23:18.180 --> 00:23:20.470
If you're dealing with costs,
you minimize them.
00:23:20.470 --> 00:23:24.800
So mathematically, who cares?
00:23:24.800 --> 00:23:30.590
OK, so suppose that each state
i of a Markov chain is
00:23:30.590 --> 00:23:33.280
associated with a given
reward r sub i.
00:23:33.280 --> 00:23:36.350
In other words, you think of
this Markov chain, which is
00:23:36.350 --> 00:23:37.180
running along.
00:23:37.180 --> 00:23:41.320
You go from one state to
another over time.
00:23:41.320 --> 00:23:45.930
And while this is happening,
you're pocketing some reward
00:23:45.930 --> 00:23:47.250
all the time.
00:23:47.250 --> 00:23:47.650
OK.
00:23:47.650 --> 00:23:50.890
You invest in a stock.
00:23:50.890 --> 00:23:53.470
Strangely enough, these
particular stocks we're
00:23:53.470 --> 00:23:57.270
thinking about here have this
Markov property.
00:23:57.270 --> 00:23:59.970
Stocks really don't have a
Markov property, but we'll
00:23:59.970 --> 00:24:02.130
assume they do.
00:24:02.130 --> 00:24:06.200
And since they have this Markov
property, you win for a
00:24:06.200 --> 00:24:07.840
while, and you lose
for a while.
00:24:07.840 --> 00:24:10.060
You win for a while, you
lose for a while.
00:24:10.060 --> 00:24:12.770
But we have something
extra, other than
00:24:12.770 --> 00:24:15.050
just the Markov chains.
00:24:15.050 --> 00:24:18.830
We can analyze this whole
situation, knowing how Markov
00:24:18.830 --> 00:24:20.670
chains behave.
00:24:20.670 --> 00:24:24.980
There's not much left besides
that, but there are an
00:24:24.980 --> 00:24:29.860
extraordinary number of
applications of this idea, and
00:24:29.860 --> 00:24:31.900
dynamic programming
is one of them.
00:24:31.900 --> 00:24:35.380
Because that's just one
added extension beyond
00:24:35.380 --> 00:24:37.880
this idea of rewards.
00:24:37.880 --> 00:24:38.380
OK.
00:24:38.380 --> 00:24:40.770
The random variable x of n.
00:24:40.770 --> 00:24:43.240
That's a random quantity.
00:24:43.240 --> 00:24:45.840
It's the state at time n.
00:24:45.840 --> 00:24:50.010
And the random reward of time n
is then the random variable
00:24:50.010 --> 00:24:55.680
r of xn that maps xn equals
i into ri for each i.
00:24:55.680 --> 00:24:59.140
This is the same idea of taking
one random variable,
00:24:59.140 --> 00:25:02.030
which is a function of another
random variable.
00:25:02.030 --> 00:25:06.000
The one random variable takes
on the values one up to
00:25:06.000 --> 00:25:07.740
capital M.
00:25:07.740 --> 00:25:11.080
And then the other random
variable takes on a value
00:25:11.080 --> 00:25:14.680
which is determined by the state
that you happen to be
00:25:14.680 --> 00:25:16.600
in, which is this
random state.
00:25:16.600 --> 00:25:21.700
So specifying r sub i
specifies what the set of
00:25:21.700 --> 00:25:25.380
rewards are, what the reward
is in each given state.
00:25:25.380 --> 00:25:28.520
Again, we have this awful
problem, which I wish we could
00:25:28.520 --> 00:25:32.760
avoid in Markov chains, of using
the same word state to
00:25:32.760 --> 00:25:35.900
talk about the set of
different states.
00:25:35.900 --> 00:25:38.120
And also to talk about
the random state
00:25:38.120 --> 00:25:39.170
at any given time.
00:25:39.170 --> 00:25:43.560
But hopefully by now you're
used to that.
00:25:43.560 --> 00:25:47.700
In our discussion here, the only
thing we're going to talk
00:25:47.700 --> 00:25:50.670
about are expected rewards.
00:25:50.670 --> 00:25:55.810
Now, you know that expected
rewards, or expectations are a
00:25:55.810 --> 00:25:58.310
little more general than you
would think they would be,
00:25:58.310 --> 00:26:02.060
because you're going to take the
expected value of any sort
00:26:02.060 --> 00:26:04.300
of crazy thing.
00:26:04.300 --> 00:26:07.870
If you want to talk about any
event, you can take the
00:26:07.870 --> 00:26:11.310
indicator function of that
event, and find the expected
00:26:11.310 --> 00:26:13.890
value of that indicator
function.
00:26:13.890 --> 00:26:16.920
And that's just the probability
of that event.
00:26:16.920 --> 00:26:22.660
So by understanding how to deal
with expectations, you
00:26:22.660 --> 00:26:25.560
really have the capability
of finding distribution
00:26:25.560 --> 00:26:28.480
functions, or anything else
you want to find.
00:26:28.480 --> 00:26:28.970
OK.
00:26:28.970 --> 00:26:31.490
But anyway, since we're
interested only in expected
00:26:31.490 --> 00:26:37.555
rewards, the expected reward at
time n, given that x zero
00:26:37.555 --> 00:26:44.950
is i is the expected value of r
of xn given x zero equals i,
00:26:44.950 --> 00:26:49.840
which is the sum over j of the
reward you get if you're in
00:26:49.840 --> 00:26:55.700
state j at time n times p sub
ij, super n, which we've
00:26:55.700 --> 00:27:00.850
talked about ad nauseum for the
last four lectures now.
00:27:00.850 --> 00:27:06.900
And this is the probability that
the state at time n is j,
00:27:06.900 --> 00:27:09.910
given that the state
at time zero is i.
00:27:09.910 --> 00:27:13.650
So you can just automatically
find the expected
00:27:13.650 --> 00:27:17.570
value of r of xn.
00:27:17.570 --> 00:27:20.610
And it's by that formula.
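As a rough numerical sketch of that formula (not from the lecture; the chain and the rewards below are made up for illustration):

import numpy as np

# Hypothetical 3-state chain and per-state rewards.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -2.0])      # r_j = reward collected in state j

n = 5
Pn = np.linalg.matrix_power(P, n)   # n-step transition probabilities P^n
expected_reward = Pn @ r            # entry i is E[R(X_n) | X_0 = i]
print(expected_reward)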
00:27:20.610 --> 00:27:24.230
Now, recall that this quantity
here is not all that simple.
00:27:24.230 --> 00:27:28.680
This is the ij element of the
product of the matrix, of the
00:27:28.680 --> 00:27:31.010
nth product of the matrix p.
00:27:31.010 --> 00:27:32.370
But, so what?
00:27:32.370 --> 00:27:36.130
We can at least write a nice
formula for it now.
00:27:36.130 --> 00:27:40.140
The expected aggregate reward
over the n steps from m to m
00:27:40.140 --> 00:27:43.080
plus n minus 1.
00:27:43.080 --> 00:27:44.900
What is m doing in here?
00:27:44.900 --> 00:27:48.970
It's just reminding us that
Markov chains are
00:27:48.970 --> 00:27:51.890
homogeneous over time.
00:27:51.890 --> 00:27:56.370
So, when I talk about the
aggregate reward from time m
00:27:56.370 --> 00:28:01.200
to m plus n minus 1, it's the
same as the aggregate reward
00:28:01.200 --> 00:28:04.500
from time 0 up to
time n minus 1.
00:28:04.500 --> 00:28:06.270
The expected values
are the same.
00:28:06.270 --> 00:28:09.550
The actual sample functions
are different.
00:28:09.550 --> 00:28:14.290
OK, so if I try to calculate
this aggregate reward
00:28:14.290 --> 00:28:18.880
conditional on xm equals i,
namely conditional on starting
00:28:18.880 --> 00:28:23.660
in state i, then this expected
aggregate reward, I use that
00:28:23.660 --> 00:28:28.610
as a symbol for it, is the
expected value of r of xm,
00:28:28.610 --> 00:28:30.310
given xm equals i.
00:28:30.310 --> 00:28:30.890
What is that?
00:28:30.890 --> 00:28:33.030
Well, that's ri.
00:28:33.030 --> 00:28:35.220
I mean, given that xm
is equal to i, this
00:28:35.220 --> 00:28:36.490
isn't random anymore.
00:28:36.490 --> 00:28:38.500
It's just r sub i.
00:28:38.500 --> 00:28:45.350
Plus the expected value of r of
xm plus 1, which is the sum
00:28:45.350 --> 00:28:49.490
over j, of pij times r sub j.
00:28:49.490 --> 00:28:54.305
That's the time m plus 1 given
that you're in state i at time
00:28:54.305 --> 00:29:00.370
m, and so forth, up until time
m plus n minus 1, where the expected
00:29:00.370 --> 00:29:03.240
reward, then, is p sub ij to the n minus 1.
00:29:06.180 --> 00:29:10.860
Probability of being in state j
at time n minus 1 given that
00:29:10.860 --> 00:29:16.190
you started off in state i
at time 0 times r sub j.
00:29:16.190 --> 00:29:20.790
And since expectations add, we
have this nice, convenient
00:29:20.790 --> 00:29:22.040
formula here.
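Using the notation v sub i of n that comes up a little later, the formula being assembled here is

v_i(n) = \mathbb{E}\Big[ \sum_{k=0}^{n-1} R(X_{m+k}) \,\Big|\, X_m = i \Big]
= \sum_{k=0}^{n-1} \sum_{j} P^{k}_{ij}\, r_j ,

with the convention P^{0}_{ij} = \delta_{ij}, so the k = 0 term is just r_i.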
00:29:26.180 --> 00:29:30.580
We're doing something I normally
hate doing, which is
00:29:30.580 --> 00:29:35.290
building up a lot of notation,
and then using that notation
00:29:35.290 --> 00:29:40.470
to write extremely complicated
formulas in a way that looks
00:29:40.470 --> 00:29:41.200
very simple.
00:29:41.200 --> 00:29:44.480
And therefore you will get some
sense of what we're doing
00:29:44.480 --> 00:29:45.840
is very simple.
00:29:45.840 --> 00:29:48.160
These quantities in
here, again, are
00:29:48.160 --> 00:29:49.790
not all that simple.
00:29:49.790 --> 00:29:52.550
But at least we can write
it in a simple way.
00:29:52.550 --> 00:29:56.260
And since we can write it in a
simple way, it turns out we
00:29:56.260 --> 00:29:59.160
can do some nice
things with it.
00:29:59.160 --> 00:29:59.420
OK.
00:29:59.420 --> 00:30:00.970
So where do we go from
all of this?
00:30:04.860 --> 00:30:12.280
We have just said that the
expected reward we get,
00:30:12.280 --> 00:30:18.550
expected aggregate reward over n
steps, namely from m up to m
00:30:18.550 --> 00:30:20.210
plus n minus 1.
00:30:20.210 --> 00:30:25.660
We're assuming that if we start
at time m, we pick up a
00:30:25.660 --> 00:30:27.660
reward at time m.
00:30:27.660 --> 00:30:30.530
I mean, that's just an
arbitrary decision.
00:30:30.530 --> 00:30:33.960
We might as well do that,
because otherwise we just have
00:30:33.960 --> 00:30:36.840
one more transition matrix
sitting here.
00:30:36.840 --> 00:30:38.660
OK, so we start at time m.
00:30:38.660 --> 00:30:42.640
We pick up a reward, which
is conditional on the
00:30:42.640 --> 00:30:45.030
state we start in.
00:30:45.030 --> 00:30:53.040
And then we look at the expected
reward for time m and
00:30:53.040 --> 00:30:58.420
time m plus 1, m plus 2,
up to m plus n minus 1.
00:30:58.420 --> 00:31:00.610
Since we started at
m, we're picking
00:31:00.610 --> 00:31:02.620
up n different rewards.
00:31:02.620 --> 00:31:07.490
We have to stop at time
m plus n minus 1.
00:31:07.490 --> 00:31:14.040
OK, so that's this expected
aggregate reward.
00:31:14.040 --> 00:31:17.890
Why do I care about expected
aggregate reward?
00:31:17.890 --> 00:31:22.220
Because the rewards at any time
n are sort of trivial.
00:31:22.220 --> 00:31:24.640
What we're are interested
in is how does this
00:31:24.640 --> 00:31:27.320
build up over time?
00:31:27.320 --> 00:31:29.150
You start to invest
in a stock.
00:31:29.150 --> 00:31:34.480
You don't much care what
it's worth at time 10.
00:31:34.480 --> 00:31:35.785
You care how it grows.
00:31:38.390 --> 00:31:41.040
You care about its value when
you want to sell it, and you
00:31:41.040 --> 00:31:44.880
don't know when you're going to
sell it, most of the time.
00:31:44.880 --> 00:31:48.150
So you're really interested
in these aggregate
00:31:48.150 --> 00:31:49.400
rewards that you get.
00:31:52.260 --> 00:31:54.590
You'll see when we get to
dynamic programming why
00:31:54.590 --> 00:31:56.780
you're interested
in that, also.
00:31:56.780 --> 00:31:57.430
OK.
00:31:57.430 --> 00:32:01.340
If the Markov chain is an
ergodic unit chain, then
00:32:01.340 --> 00:32:04.710
successive terms of this
expression tend to a steady
00:32:04.710 --> 00:32:06.450
state gain per step.
00:32:06.450 --> 00:32:11.520
In other words, these terms here, when n gets very large,
00:32:11.520 --> 00:32:17.070
if I run this process for a very
long time, what happens to p
00:32:17.070 --> 00:32:20.640
sub ij to the n minus 1?
00:32:20.640 --> 00:32:27.920
This tends towards the steady
state probability pi sub j.
00:32:27.920 --> 00:32:31.710
And it doesn't matter
where we started.
00:32:31.710 --> 00:32:34.690
The only thing of importance
is where we end up.
00:32:34.690 --> 00:32:37.180
It doesn't matter how
high this is.
00:32:37.180 --> 00:32:42.670
So we have a sum over j, of
pi sub j times r sub j.
00:32:42.670 --> 00:32:48.745
After a very long time, the
expected gain per step is just
00:32:48.745 --> 00:32:51.930
a sum of pi j times
r sub j.
00:32:51.930 --> 00:32:56.000
That's what's important
after a long time.
00:32:56.000 --> 00:32:58.290
And that's independent of
the starting state.
00:32:58.290 --> 00:33:02.670
So what we have here is a big,
messy transient, which is a
00:33:02.670 --> 00:33:04.780
sum of a whole bunch
of things.
00:33:04.780 --> 00:33:08.090
And then eventually it just
settles down, and every extra
00:33:08.090 --> 00:33:15.190
step you do, you just pick up
an extra factor of g as an
00:33:15.190 --> 00:33:16.970
extra reward.
00:33:16.970 --> 00:33:19.960
The reward might, of course, be
negative, like in the stock
00:33:19.960 --> 00:33:25.100
market over the last 10 years,
or up until the last year or
00:33:25.100 --> 00:33:27.980
so, which was negative
for a long time.
00:33:27.980 --> 00:33:30.800
But that doesn't make
any difference.
00:33:30.800 --> 00:33:34.480
This is just a number, and
this is independent of
00:33:34.480 --> 00:33:36.590
starting state.
00:33:36.590 --> 00:33:41.740
And v sub i of n can be viewed as a transient part, which is all
00:33:41.740 --> 00:33:43.330
this stuff at the beginning.
00:33:43.330 --> 00:33:47.010
The sum of all these terms at
the beginning plus something
00:33:47.010 --> 00:33:50.290
that settles down over a
long period of time.
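In symbols, the decomposition being described for an ergodic unit chain is, roughly,

v_i(n) = \sum_{k=0}^{n-1} \sum_{j} P^{k}_{ij} r_j \;\approx\; n\,g + (\text{a transient term depending on } i),
\qquad g = \sum_{j} \pi_j r_j ,

since P^{k}_{ij} \to \pi_j as k grows.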
00:33:50.290 --> 00:33:54.200
How to calculate that transient,
how to combine it
00:33:54.200 --> 00:33:56.230
with the steady state gain.
00:33:56.230 --> 00:33:59.920
The notes talk a great
deal about that.
00:33:59.920 --> 00:34:03.970
What we're trying to do today
is to talk about dynamic
00:34:03.970 --> 00:34:09.080
programming without going into
all of this terrible mess
00:34:09.080 --> 00:34:12.250
about dealing with rewards in a very
00:34:12.250 --> 00:34:14.239
systematic and simple way.
00:34:14.239 --> 00:34:16.199
You can read about that later.
00:34:16.199 --> 00:34:19.610
What we're aiming at is to talk
about dynamic programming
00:34:19.610 --> 00:34:23.340
a little bit, and then get
off to other things.
00:34:23.340 --> 00:34:23.870
OK.
00:34:23.870 --> 00:34:27.239
So anyway, we have a transient,
plus we have a
00:34:27.239 --> 00:34:29.330
steady state gain.
00:34:29.330 --> 00:34:31.470
The transient is important.
00:34:31.470 --> 00:34:34.520
And it's particularly important
if g equals zero.
00:34:34.520 --> 00:34:40.090
Namely if your average gain per
step is nothing, then what
00:34:40.090 --> 00:34:47.980
you're primarily interested in
is how valuable is it to start
00:34:47.980 --> 00:34:49.360
in a particular state?
00:34:49.360 --> 00:34:53.000
If you start in one state versus
another state, you
00:34:53.000 --> 00:34:56.600
might get a great deal of reward
in this one state,
00:34:56.600 --> 00:34:59.120
whereas you make a loss
in some other state.
00:34:59.120 --> 00:35:03.200
So it's important to know which
state is worth being in.
00:35:03.200 --> 00:35:07.960
So that's the next thing
we try to look at.
00:35:07.960 --> 00:35:12.410
How does the state
affect things?
00:35:12.410 --> 00:35:17.760
This brings us to one example
which is particularly useful.
00:35:17.760 --> 00:35:22.360
And along with being a useful
example, well, it's a nice
00:35:22.360 --> 00:35:25.840
illustration of Markov
rewards.
00:35:25.840 --> 00:35:30.980
It's also something which
you often want to find.
00:35:30.980 --> 00:35:35.800
And when we start talking about
renewal processes, you
00:35:35.800 --> 00:35:40.890
will find that this idea here
is a nice connection between
00:35:40.890 --> 00:35:43.340
Markov chains and
renewal processes.
00:35:43.340 --> 00:35:47.240
So it's important for a whole
bunch of different reasons.
00:35:47.240 --> 00:35:48.220
OK.
00:35:48.220 --> 00:35:52.470
Suppose for some arbitrary
unit chain, namely we're
00:35:52.470 --> 00:35:56.060
saying one set of recurrent
states.
00:35:56.060 --> 00:35:59.710
We want to find the expected
number of steps, starting from
00:35:59.710 --> 00:36:04.260
a given state i, until
some particular
00:36:04.260 --> 00:36:06.560
state 1 is first entered.
00:36:06.560 --> 00:36:09.070
So you start at one state.
00:36:09.070 --> 00:36:12.090
There's this other state
way over here.
00:36:12.090 --> 00:36:15.690
This state is recurrent, so
presumably, eventually you're
00:36:15.690 --> 00:36:17.580
going to enter it.
00:36:17.580 --> 00:36:20.170
And you want to find out, what's
the expected time that
00:36:20.170 --> 00:36:23.810
it takes to get to that
particular state?
00:36:23.810 --> 00:36:26.110
OK?
00:36:26.110 --> 00:36:30.160
If you're a Ph.D. student, you
have this Markov chain of
00:36:30.160 --> 00:36:32.310
doing your research.
00:36:32.310 --> 00:36:36.180
And at some point, you're going
to get a Ph.D. So we can
00:36:36.180 --> 00:36:39.900
think of this as the first passage time to your first
00:36:39.900 --> 00:36:44.500
Ph.D. I mean, if you want to
get more Ph.D.'s, fine, but
00:36:44.500 --> 00:36:47.560
that's probably a different
Markov chain.
00:36:47.560 --> 00:36:48.550
OK.
00:36:48.550 --> 00:36:53.110
So anyway, that's the problem
we're trying to solve here.
00:36:53.110 --> 00:36:56.690
We can view this problem
as a reward problem.
00:36:56.690 --> 00:36:59.750
We have to go through a number
of steps if we want to view it
00:36:59.750 --> 00:37:01.940
as a reward problem.
00:37:01.940 --> 00:37:07.390
The first one, first step is to
assign one unit of reward
00:37:07.390 --> 00:37:11.430
to each successive state until
you enter state 1.
00:37:11.430 --> 00:37:15.040
So you're bombing through this
Markov chain, a frog jumping
00:37:15.040 --> 00:37:17.120
from lily pad to lily pad.
00:37:17.120 --> 00:37:19.590
And finally, the frog
gets to the lily pad
00:37:19.590 --> 00:37:21.500
with the food on it.
00:37:21.500 --> 00:37:25.780
And the frog wants to know, is
it going to starve before he
00:37:25.780 --> 00:37:28.830
gets to this lily pad
with the food on it?
00:37:28.830 --> 00:37:32.940
So, if we're trying to find
the expected time to get
00:37:32.940 --> 00:37:35.850
there, here what we're really
interested in is a cost,
00:37:35.850 --> 00:37:39.920
because the frog is in
danger of starving.
00:37:39.920 --> 00:37:42.220
Or on the other hand, there
might be a snake lying under
00:37:42.220 --> 00:37:44.470
this one lily pad.
00:37:44.470 --> 00:37:47.770
And then he's getting a reward
for staying alive.
00:37:47.770 --> 00:37:51.390
You can look at these things
whichever way you want to.
00:37:51.390 --> 00:37:51.880
OK.
00:37:51.880 --> 00:37:55.020
We're going to assign one unit
of reward to each successive state
00:37:55.020 --> 00:37:56.800
until state 1 is entered.
00:37:56.800 --> 00:38:01.430
1 is just an arbitrary state
that we've selected.
00:38:01.430 --> 00:38:04.760
That's where the snake is
underneath a lily pad, or
00:38:04.760 --> 00:38:08.130
that's where the food is,
or what have you.
00:38:08.130 --> 00:38:10.450
Now, there's something
else we have to do.
00:38:10.450 --> 00:38:17.010
Because if we're starting out at
some arbitrary state i, and
00:38:17.010 --> 00:38:19.910
we're trying to look for the
first time that we enter state
00:38:19.910 --> 00:38:23.695
1, what do you do after
you enter state 1?
00:38:26.670 --> 00:38:32.400
Well eventually, normally you're
going to go away from
00:38:32.400 --> 00:38:34.110
state 1, and you're
going to start
00:38:34.110 --> 00:38:36.380
picking up rewards again.
00:38:36.380 --> 00:38:38.990
You don't want that to happen.
00:38:38.990 --> 00:38:42.020
So you do something we do all
the time when we're dealing
00:38:42.020 --> 00:38:45.510
with Markov chains, which is
we start with one Markov
00:38:45.510 --> 00:38:49.070
chain, and we say, to solve this
problem I'm interested
00:38:49.070 --> 00:38:52.110
in, I've got to change
the Markov chain.
00:38:52.110 --> 00:38:54.350
So how are we going
to change it?
00:38:54.350 --> 00:38:58.160
We're going to change it to say,
once we get in state 1,
00:38:58.160 --> 00:38:59.455
we're going to stay
there forever.
00:39:02.070 --> 00:39:04.600
Or in other words, the frog gets
eaten by the snake, and
00:39:04.600 --> 00:39:09.650
therefore its remains always
stay at that one lily pad.
00:39:09.650 --> 00:39:11.750
So we change the Markov
chain again.
00:39:11.750 --> 00:39:14.450
The frog can't jump anymore.
00:39:14.450 --> 00:39:18.290
And the way we change it is
to change the transition
00:39:18.290 --> 00:39:23.910
probabilities out of state 1
to p sub 1, 1, namely the
00:39:23.910 --> 00:39:27.010
probability given you're in
state 1, of going back to
00:39:27.010 --> 00:39:30.320
state 1 in the next transition
is equal to 1.
00:39:30.320 --> 00:39:32.670
So whenever you get
to state 1, you
00:39:32.670 --> 00:39:35.270
just stay there forever.
00:39:35.270 --> 00:39:39.210
We're going to say r1 equal to
zero, namely the reward you
00:39:39.210 --> 00:39:42.240
get in state 1 will be zero.
00:39:42.240 --> 00:39:46.070
So you keep getting rewards
until you go to state 1.
00:39:46.070 --> 00:39:49.840
And then when you go to state
1, you don't get any reward.
00:39:49.840 --> 00:39:54.150
You don't get any reward from
any time after that.
00:39:54.150 --> 00:39:56.600
So in fact, we've converted
the problem.
00:39:56.600 --> 00:39:59.970
We've converted the Markov chain
to be able to solve the
00:39:59.970 --> 00:40:03.160
problem that we want to solve.
00:40:03.160 --> 00:40:07.660
Now, how do we know that we
haven't changed the problem in
00:40:07.660 --> 00:40:10.330
some awful way?
00:40:10.330 --> 00:40:13.710
I mean, any time you start out
with a Markov chain and you
00:40:13.710 --> 00:40:16.510
modify it, and you solve a
problem for the modified
00:40:16.510 --> 00:40:20.410
chain, you have to really think
through whether you
00:40:20.410 --> 00:40:23.550
changed the problem that
you started to solve.
00:40:23.550 --> 00:40:27.790
Well, think of any sample path
which starts in some state i,
00:40:27.790 --> 00:40:29.610
which is not equal to 1.
00:40:29.610 --> 00:40:33.930
Think of the sample path
as going forever.
00:40:33.930 --> 00:40:38.430
In the original Markov chain,
that sample path at some
00:40:38.430 --> 00:40:43.050
point, presumably, is going
to get to state 1.
00:40:43.050 --> 00:40:47.100
After it gets to state 1, we
don't care what happens,
00:40:47.100 --> 00:40:51.520
because we then know how long
it's taken to get to state 1.
00:40:51.520 --> 00:40:54.550
And after it gets to state
1, the transition
00:40:54.550 --> 00:40:56.410
probabilities change.
00:40:56.410 --> 00:40:58.410
We don't care about that.
00:40:58.410 --> 00:41:03.570
So for every sample path, the
time that it takes the first
00:41:03.570 --> 00:41:08.370
passage time to state 1 is the same in the modified chain
00:41:08.370 --> 00:41:10.920
as it is in the actual chain.
00:41:10.920 --> 00:41:15.750
The transition probabilities are
the same up until the time
00:41:15.750 --> 00:41:17.770
when you first get to state 1.
00:41:17.770 --> 00:41:22.300
So for first passage time
problems, it doesn't make any
00:41:22.300 --> 00:41:26.550
difference what you do after
you get to state 1.
00:41:26.550 --> 00:41:30.590
So to make the problem easy,
we're going to set these
00:41:30.590 --> 00:41:34.450
transition probabilities in
state 1 to 1, and we're going
00:41:34.450 --> 00:41:38.830
to set the reward
equal to zero.
00:41:38.830 --> 00:41:46.710
What do you call a state which
has p sub i, i equal to 1?
00:41:46.710 --> 00:41:48.700
You call it a trapping state.
00:41:48.700 --> 00:41:51.080
It's a trapping state because
once you get there,
00:41:51.080 --> 00:41:52.330
you can't get out.
00:41:55.500 --> 00:41:59.710
And since we started out with
a unit chain, and since
00:41:59.710 --> 00:42:03.650
presumably state 1 is a
recurrent state in that unit
00:42:03.650 --> 00:42:06.500
chain, eventually you're going
to get to state 1.
00:42:06.500 --> 00:42:08.560
But once you get there,
you can't get out.
00:42:08.560 --> 00:42:11.690
So what you've done is you've
turned the unit chain into
00:42:11.690 --> 00:42:15.200
another unit chain where the
recurrent set of states has
00:42:15.200 --> 00:42:17.900
only this one state,
state 1 in it.
00:42:17.900 --> 00:42:19.690
So it's a trapping state.
00:42:19.690 --> 00:42:23.920
Everything eventually
leads to state 1.
00:42:23.920 --> 00:42:26.600
All roads lead to Rome, but it's
not obvious that they're
00:42:26.600 --> 00:42:28.350
leading to Rome.
00:42:28.350 --> 00:42:31.480
And all of these states
eventually lead to state 1,
00:42:31.480 --> 00:42:34.420
but not for quite a
while sometimes.
00:42:34.420 --> 00:42:35.050
OK.
00:42:35.050 --> 00:42:37.710
So the probability of an initial
segment until 1 is
00:42:37.710 --> 00:42:41.960
entered is unchanged, and
expected first pass each time
00:42:41.960 --> 00:42:43.210
is unchanged.
00:42:45.630 --> 00:42:45.770
OK.
00:42:45.770 --> 00:42:50.430
The modified Markov chain is now an ergodic unit chain.
00:42:50.430 --> 00:42:53.580
It has a single recurrent
state.
00:42:53.580 --> 00:42:57.150
State 1 is a trapping
state, we call it.
00:42:57.150 --> 00:43:03.730
ri is equal to 1 for i unequal
to 1, and r1 is equal to zero.
00:43:03.730 --> 00:43:08.480
This says that if state 1 is first entered at time l, then
00:43:08.480 --> 00:43:13.770
the aggregate reward from 0 to
n is l for all n greater than
00:43:13.770 --> 00:43:14.335
or equal to l.
00:43:14.335 --> 00:43:16.780
In other words, after you get
to the trapping state, you
00:43:16.780 --> 00:43:19.410
stay there, and you don't
pick up any more
00:43:19.410 --> 00:43:21.250
reward from then on.
00:43:21.250 --> 00:43:23.970
One of the things that's
maddening about problems like
00:43:23.970 --> 00:43:26.720
this, at least that's maddening
for me, because I
00:43:26.720 --> 00:43:30.710
can't keep those things
straight, is the difference
00:43:30.710 --> 00:43:34.290
between n and n plus 1,
or n and n minus 1.
00:43:34.290 --> 00:43:37.280
There's always that strange
thing, we've started at time
00:43:37.280 --> 00:43:40.270
m, we get reward at time m.
00:43:40.270 --> 00:43:43.600
So if we're looking at n
transitions, as we go from m
00:43:43.600 --> 00:43:46.860
to m plus n minus 1.
00:43:46.860 --> 00:43:50.150
And that's just life.
00:43:50.150 --> 00:43:52.910
If you try to do it in a
different way, you wind up
00:43:52.910 --> 00:43:54.800
with a similar problem.
00:43:54.800 --> 00:43:56.220
You can't avoid it.
00:43:56.220 --> 00:44:02.130
OK, so what we're trying to find
is the expected value of
00:44:02.130 --> 00:44:06.470
v sub i of n, and the limit as n
goes to infinity, we'll just
00:44:06.470 --> 00:44:10.640
call that v sub i without
the n on it.
00:44:10.640 --> 00:44:14.620
And what we want to do is to
calculate this expected time
00:44:14.620 --> 00:44:18.040
until we first enter
state one.
00:44:18.040 --> 00:44:22.900
We want to calculate that for
all of the other states i.
00:44:22.900 --> 00:44:26.980
Well fortunately, there's a
sneaky way to calculate this.
00:44:26.980 --> 00:44:29.170
For most of these problems,
there's a sneaky way to
00:44:29.170 --> 00:44:30.680
calculate these limits.
00:44:30.680 --> 00:44:34.640
And you don't have to worry
about the limit.
00:44:34.640 --> 00:44:37.010
So the next thing I'm going to
do is to explain what this
00:44:37.010 --> 00:44:39.760
sneaky way is.
00:44:39.760 --> 00:44:44.710
You will see the same sneaky
method done about 100 times
00:44:44.710 --> 00:44:46.460
from now on until the
end of the course.
00:44:46.460 --> 00:44:48.760
We use it all the time.
00:44:48.760 --> 00:44:52.250
And each time we do it, we'll
get a better sense of what it
00:44:52.250 --> 00:44:53.710
really amounts to.
00:44:53.710 --> 00:44:59.150
So for each state unequal to
the trapping state, let's
00:44:59.150 --> 00:45:02.290
start out by assuming that
we start at time
00:45:02.290 --> 00:45:04.470
zero, in state i.
00:45:04.470 --> 00:45:08.580
In other words, what this means
is first we're going to
00:45:08.580 --> 00:45:12.490
assume that x sub 0 equals
i for some given i.
00:45:12.490 --> 00:45:14.300
We're going to go through
whatever we're going to go
00:45:14.300 --> 00:45:17.620
through, then we'll go back
and assume that x sub 0 is
00:45:17.620 --> 00:45:18.890
some other i.
00:45:18.890 --> 00:45:21.800
And we don't have to worry about
that, because i is just
00:45:21.800 --> 00:45:22.900
a generic state.
00:45:22.900 --> 00:45:26.320
So we'll do it for everything
at once.
00:45:26.320 --> 00:45:30.630
There's a unit reward
at time 0.
00:45:30.630 --> 00:45:32.970
r sub i is equal to 1.
00:45:32.970 --> 00:45:37.270
So we start out at time
zero in state i.
00:45:37.270 --> 00:45:41.070
We pick up our reward of 1, and
then we go on from there
00:45:41.070 --> 00:45:46.370
to see how much longer it
takes to get to state 1.
00:45:46.370 --> 00:45:53.170
In addition to this unit reward
at time zero, which
00:45:53.170 --> 00:45:56.430
means it's already taken us one
unit of time to get the
00:45:56.430 --> 00:46:02.120
state 1, given that x sub 1
equals j, namely, given that
00:46:02.120 --> 00:46:07.910
we go from state i to state j,
the remaining expected reward
00:46:07.910 --> 00:46:10.380
is v sub j.
00:46:10.380 --> 00:46:15.830
In other words, if it's times
0, I'm in some state i.
00:46:15.830 --> 00:46:21.110
Given that I go to some state
j, the next unit of time,
00:46:21.110 --> 00:46:24.930
what's the remaining expected time to
00:46:24.930 --> 00:46:27.560
get to state 1?
00:46:27.560 --> 00:46:32.830
The remaining expected time is
just v sub j, because that's
00:46:32.830 --> 00:46:34.050
the expected time.
00:46:34.050 --> 00:46:37.550
I mean, if v sub j is something
where it's very hard
00:46:37.550 --> 00:46:41.560
to get to state 1, then
we really lost out.
00:46:41.560 --> 00:46:44.370
If it's something which is
closer to state 1 in some
00:46:44.370 --> 00:46:45.730
sense, then we've gained.
00:46:45.730 --> 00:46:51.180
But what we wind up with is the
expected time to get to
00:46:51.180 --> 00:46:55.370
state 1 from state i is one.
00:46:55.370 --> 00:46:59.450
That's the instant reward that
we get, or the instant cost
00:46:59.450 --> 00:47:04.880
that we pay, plus each of
the possible states
00:47:04.880 --> 00:47:06.420
we might get to.
00:47:06.420 --> 00:47:11.290
There's a cost to go, or
reward to go from that
00:47:11.290 --> 00:47:12.470
particular j.
00:47:12.470 --> 00:47:15.320
So this is the formula
we have to solve.
00:47:15.320 --> 00:47:16.190
What's this mean?
00:47:16.190 --> 00:47:20.280
It means we have to solve
this formula for all i.
00:47:20.280 --> 00:47:24.870
If I solve it for all i, and
I've solved this for all i,
00:47:24.870 --> 00:47:28.910
then that's a set of linear equations in the variables v
00:47:28.910 --> 00:47:40.010
sub 2 up to v sub M, one equation for each i from 2 up to M.
00:47:40.010 --> 00:47:44.660
We also have decided that
v sub 1 is equal to 0.
00:47:44.660 --> 00:47:48.350
In other words, if we start out
in state 1, you expect the
00:47:48.350 --> 00:47:50.670
time to get to state 1 is 0.
00:47:50.670 --> 00:47:53.260
We're already there.
00:47:53.260 --> 00:47:53.730
OK.
00:47:53.730 --> 00:47:57.300
So we have to solve these
linear equations.
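For reference, the equations being described can be written out as follows, assuming the states are numbered 1 through M with state 1 as the trapping state and P as the transition matrix being used here:

```latex
% Expected first-passage time to state 1, as described above
v_1 = 0, \qquad
v_i \;=\; 1 \;+\; \sum_{j=1}^{M} P_{ij}\, v_j
\quad\text{for } i = 2,\dots,M .
```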
00:47:57.300 --> 00:48:03.130
And if your philosophy on
solving linear equations is
00:48:03.130 --> 00:48:08.930
that of, I shouldn't say a
computer scientist because I
00:48:08.930 --> 00:48:11.830
don't want to indicate that they
are any different from
00:48:11.830 --> 00:48:16.960
any of the rest of us, but for
many people, your philosophy
00:48:16.960 --> 00:48:20.720
of solving linear equations
is to try to solve it.
00:48:20.720 --> 00:48:24.440
If you can't solve it, it
doesn't have any solution.
00:48:24.440 --> 00:48:28.020
And if you're happy with
doing that, fine.
00:48:28.020 --> 00:48:33.480
Some people would rather spend
10 hours asking whether in
00:48:33.480 --> 00:48:37.030
general it has any solution,
rather than spending five
00:48:37.030 --> 00:48:38.806
minutes solving it.
00:48:38.806 --> 00:48:48.420
So either way, this expected
first passage time, we've just
00:48:48.420 --> 00:48:50.390
stated what it is.
00:48:50.390 --> 00:48:57.710
Starting in state i, it's 1 plus
the expected time to go from
00:48:57.710 --> 00:48:59.840
whichever state you happen
to go to next.
00:48:59.840 --> 00:49:03.910
If we put this in vector form,
you put things in vector form
00:49:03.910 --> 00:49:06.670
because you want to spend two
hours finding the general
00:49:06.670 --> 00:49:09.685
solution, rather than five
minutes solving the problem.
00:49:14.240 --> 00:49:18.660
If you have 1,000 states, then
it works the other way.
00:49:18.660 --> 00:49:22.300
It takes you multiple hours to
work it out by hand, and it
00:49:22.300 --> 00:49:25.430
takes you five minutes by
looking at the equation.
00:49:25.430 --> 00:49:29.240
So sometimes you win, and
sometimes you lose by looking
00:49:29.240 --> 00:49:30.780
at the general solution.
00:49:30.780 --> 00:49:37.360
If you look at this in vector
form, the vector v,
00:49:37.360 --> 00:49:43.080
where v1 is equal to zero and
the other v's are unknowns, satisfies
00:49:43.080 --> 00:49:47.590
v equals r plus the
matrix P times v.
00:49:47.590 --> 00:49:50.030
The vector r has zero reward in state 1.
00:49:50.030 --> 00:49:53.020
Unit reward in all other states,
because we're trying
00:49:53.020 --> 00:49:55.860
to get to state 1.
00:49:55.860 --> 00:50:00.780
And then we have the matrix
P here, multiplying v.
00:50:00.780 --> 00:50:04.780
So we want to solve this set of
linear equations, and what
00:50:04.780 --> 00:50:08.720
do we know about this set
of linear equations?
00:50:08.720 --> 00:50:11.890
We have an ergodic unit chain.
00:50:11.890 --> 00:50:16.410
We know that p has
an eigenvalue,
00:50:16.410 --> 00:50:18.700
which is equal to 1.
00:50:18.700 --> 00:50:22.040
We know that's a simple
eigenvalue.
00:50:22.040 --> 00:50:37.130
So that in fact, we can write
v equals r plus P v as 0
00:50:37.130 --> 00:50:50.070
equals r plus (P minus
I) times v.
00:50:50.070 --> 00:50:52.190
And we try to ask whether
v has any
00:50:52.190 --> 00:50:55.040
solution, what's the answer?
00:50:55.040 --> 00:50:59.140
Well, this matrix here has
an eigenvalue of 1.
00:50:59.140 --> 00:51:02.030
Since it has an eigenvalue of
one, and since it's a simple
00:51:02.030 --> 00:51:06.160
eigenvalue, there's a space of
solutions to this equation.
00:51:06.160 --> 00:51:11.330
The space of solutions is spanned
by the vector of all ones, the
00:51:11.330 --> 00:51:12.850
vector e.
00:51:12.850 --> 00:51:17.650
In other words, it's the vector
e times any constant alpha.
00:51:17.650 --> 00:51:21.460
Now we've stuck this in here,
so now we want to find out
00:51:21.460 --> 00:51:25.200
what's the set of
solutions now.
00:51:25.200 --> 00:51:31.730
We observe v plus alpha e also
satisfies this equation whenever v
00:51:31.730 --> 00:51:33.500
does -- we've found another solution.
00:51:33.500 --> 00:51:37.450
So if we found a solution, we
have a one dimensional family
00:51:37.450 --> 00:51:40.110
of solutions.
00:51:40.110 --> 00:51:47.520
Well, since this eigenvalue is a
simple eigenvalue, the space
00:51:47.520 --> 00:51:56.040
of vectors v for which 0 equals
r plus (P minus I) times v is
00:51:56.040 --> 00:51:59.390
a one dimensional family, and
therefore, with v1 fixed at 0,
00:51:59.390 --> 00:52:02.350
there has to be a unique
solution to this equation.
00:52:02.350 --> 00:52:03.490
OK.
00:52:03.490 --> 00:52:07.460
So in fact, in only 15 minutes,
we've solved the
00:52:07.460 --> 00:52:13.710
problem in general, so that you
can deal with matrices of
00:52:13.710 --> 00:52:17.990
1,000 states, as opposed
to two states.
00:52:17.990 --> 00:52:20.170
And you still have
the same answer.
00:52:20.170 --> 00:52:21.840
OK.
00:52:21.840 --> 00:52:26.970
So this equation has a simple
solution, which says that you
00:52:26.970 --> 00:52:29.850
can program your computer to
solve this set of linear
00:52:29.850 --> 00:52:33.270
equations, and you're bound
to get an answer.
00:52:33.270 --> 00:52:35.740
And the answer will tell you
how long it takes to get to
00:52:35.740 --> 00:52:39.958
this particular state.
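As a rough illustration of the computation just described (not code from the course), here is a small Python/NumPy sketch that solves v = r + P v with v fixed at 0 in the trapping state; the transition matrix below is a made-up three-state example with state 0 as the trapping state:

```python
import numpy as np

# Hypothetical 3-state chain; state 0 is the trapping state (P[0, 0] = 1).
P = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5]])
M = P.shape[0]

# r has 0 reward in the trapping state and unit reward elsewhere.
r = np.ones(M)
r[0] = 0.0

# Solve v = r + P v subject to v[0] = 0.  Equivalently, solve
# (I - P) v = r restricted to the non-trapping states.
idx = np.arange(1, M)
A = np.eye(M - 1) - P[np.ix_(idx, idx)]
v = np.zeros(M)
v[idx] = np.linalg.solve(A, r[idx])

print(v)   # expected number of steps to reach state 0 from each state
```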
00:52:39.958 --> 00:52:40.390
OK.
00:52:40.390 --> 00:52:46.705
Let's go on to aggregate
rewards with a final reward.
00:52:51.420 --> 00:52:53.560
Starting to sound like-- yes?
00:52:53.560 --> 00:52:56.990
STUDENT: I'm sorry, for the
last example, how are we
00:52:56.990 --> 00:52:57.970
guaranteed that it's ergodic?
00:52:57.970 --> 00:53:01.370
Like, isn't it possible you enter a
loop somewhere that can never
00:53:01.370 --> 00:53:05.670
go to your trapping
state, right?
00:53:05.670 --> 00:53:09.750
PROFESSOR: But I can't do that
because there always has to be
00:53:09.750 --> 00:53:12.520
a way of getting to the trapping
state, because
00:53:12.520 --> 00:53:14.770
there's only one recurrent
state.
00:53:14.770 --> 00:53:19.170
All these other states
are transient now.
00:53:19.170 --> 00:53:19.920
STUDENT: No, but I mean--
00:53:19.920 --> 00:53:21.467
OK, like, let's say you
start off with a
00:53:21.467 --> 00:53:22.655
general Markov chain.
00:53:22.655 --> 00:53:24.560
PROFESSOR: Oh, I start off with
a general Markov chain?
00:53:24.560 --> 00:53:27.060
You're absolutely right.
00:53:27.060 --> 00:53:30.060
Then there might be no way of
getting from some starting
00:53:30.060 --> 00:53:34.610
state to state 1, and therefore,
the amount of time
00:53:34.610 --> 00:53:36.890
that it takes you to get from
that state to the starting
00:53:36.890 --> 00:53:38.750
state is going to be infinite.
00:53:38.750 --> 00:53:40.250
You can't get there.
00:53:40.250 --> 00:53:43.960
So in fact, what you have to do
with a problem like this is
00:53:43.960 --> 00:53:48.730
to look at it first, and say,
are you in fact dealing with a
00:53:48.730 --> 00:53:49.760
unit chain?
00:53:49.760 --> 00:53:52.990
Or do you have multiple
recurrent sets?
00:53:52.990 --> 00:53:57.100
If you have multiple recurrent
sets, then the expected time
00:53:57.100 --> 00:54:00.770
to get into one of the recurrent
states, starting
00:54:00.770 --> 00:54:04.840
from either a transient state,
or from some other recurrent
00:54:04.840 --> 00:54:08.720
set is infinite.
00:54:08.720 --> 00:54:11.820
I mean, just like this business
we were going through
00:54:11.820 --> 00:54:13.480
at the beginning.
00:54:13.480 --> 00:54:16.050
What you would like to do is not
have to go through a lot
00:54:16.050 --> 00:54:20.750
of calculation when you have, or
a lot of thinking when you
00:54:20.750 --> 00:54:24.070
have multiple recurrent
sets of states.
00:54:24.070 --> 00:54:25.980
You just know what
happens there.
00:54:25.980 --> 00:54:28.540
There's no way to get from this
recurrent set to this
00:54:28.540 --> 00:54:30.020
recurrent set.
00:54:30.020 --> 00:54:31.440
So that's the end of it.
00:54:31.440 --> 00:54:31.888
STUDENT: OK.
00:54:31.888 --> 00:54:34.277
So like it works when you have
the unit chain, and then you
00:54:34.277 --> 00:54:36.585
choose your trapping state to
be one instance [INAUDIBLE].
00:54:36.585 --> 00:54:37.835
PROFESSOR: Yes.
00:54:39.700 --> 00:54:40.150
OK.
00:54:40.150 --> 00:54:41.400
Good.
00:54:44.220 --> 00:54:47.160
Now, yes?
00:54:47.160 --> 00:54:50.410
STUDENT: The previous equation
is true for any reward.
00:54:50.410 --> 00:54:51.692
But it's not necessary--
00:54:51.692 --> 00:54:53.950
PROFESSOR: Yeah, it is true for
any set of rewards, yes.
00:54:59.720 --> 00:55:02.090
Although what the interpretation
would be of any
00:55:02.090 --> 00:55:05.900
set of rewards is something you
have to sort out.
00:55:05.900 --> 00:55:06.590
But yes.
00:55:06.590 --> 00:55:10.200
For any r that you choose,
there's going to be one unique
00:55:10.200 --> 00:55:15.530
solution, so long as one is
actually a trapping state, and
00:55:15.530 --> 00:55:16.950
everything else leads to one.
00:55:20.600 --> 00:55:25.875
OK, so why do I want to
put a-- ah, good.
00:55:25.875 --> 00:55:27.537
STUDENT: I feel like there's a
lot of the rewards that are
00:55:27.537 --> 00:55:30.625
designed for it, designed with
respect to being in a
00:55:30.625 --> 00:55:31.575
particular state.
00:55:31.575 --> 00:55:32.060
PROFESSOR: Yes.
00:55:32.060 --> 00:55:34.340
STUDENT: But if the rewards are
actually in transition, so
00:55:34.340 --> 00:55:38.012
for example, if you go from i to
j, there are going to be a
00:55:38.012 --> 00:55:40.015
different number from j to j.
00:55:40.015 --> 00:55:41.580
How do you deal with that?
00:55:41.580 --> 00:55:42.400
PROFESSOR: How do I
deal with that?
00:55:42.400 --> 00:55:45.000
Well, then let's talk
about that.
00:55:45.000 --> 00:55:48.200
And in fact, it's fairly simple
so long as you're only
00:55:48.200 --> 00:55:50.750
talking about expected
rewards.
00:55:50.750 --> 00:55:54.450
Because if I have a reward
associated with--
00:55:57.096 --> 00:56:18.574
if I have a reward rij, which is
the reward for transition i
00:56:18.574 --> 00:56:36.600
to j, then if I take the sum
over j of r sub ij times p sub ij,
00:56:36.600 --> 00:56:51.768
what this gives me is the
expected reward associated
00:56:51.768 --> 00:56:53.940
with state i.
00:56:59.750 --> 00:57:02.680
Now, you have to be a little bit
careful with this because
00:57:02.680 --> 00:57:06.310
before we've been picking up
this reward as soon as we get
00:57:06.310 --> 00:57:09.860
to state i, and here suddenly
we have a slightly different
00:57:09.860 --> 00:57:14.560
situation where you have a
reward associated with state i
00:57:14.560 --> 00:57:17.230
but you don't pick it up
until the next step.
00:57:17.230 --> 00:57:23.780
So this is where this problem
of i or i plus 1 comes in.
00:57:23.780 --> 00:57:29.070
And you guys can do that much
better than I can, because at
00:57:29.070 --> 00:57:36.480
my age I start out with an age
of 60 and an age of 61 is the
00:57:36.480 --> 00:57:38.500
same thing.
00:57:38.500 --> 00:57:40.880
I mean, these are--
00:57:40.880 --> 00:57:42.130
OK.
00:57:44.790 --> 00:57:48.660
So anyway, the point of it is,
if you have rewards associated
00:57:48.660 --> 00:57:52.320
with transitions you can always
convert that to rewards
00:57:52.320 --> 00:57:53.570
associated with states.
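A tiny sketch of that conversion, assuming a reward matrix R whose entry R[i, j] is the reward for the transition from i to j (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical transition matrix and per-transition rewards R[i, j].
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([[0.0, 2.0],
              [5.0, 1.0]])

# Expected reward associated with each state i: r_i = sum_j P_ij * R_ij.
r = (P * R).sum(axis=1)
print(r)   # e.g. r[0] = 0.7*0 + 0.3*2 = 0.6
```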
00:57:58.320 --> 00:58:02.220
Oh, I didn't really
get to this.
00:58:02.220 --> 00:58:06.150
What I've been trying to say
now for a while is that
00:58:06.150 --> 00:58:13.120
sometimes, for some reason or
other, after you go through
00:58:13.120 --> 00:58:16.990
n steps of this Markov
chain, when you get to the
00:58:16.990 --> 00:58:21.340
end, you want to consider some
particularly large reward for
00:58:21.340 --> 00:58:24.540
having gotten to the end, or
some particularly large cost
00:58:24.540 --> 00:58:27.950
of getting to the end, or
something which depends on the
00:58:27.950 --> 00:58:30.190
state that you happen
to be in.
00:58:30.190 --> 00:58:34.630
So we will assign some final
reward which in general can be
00:58:34.630 --> 00:58:37.820
different from the reward that
we're picking up at each of
00:58:37.820 --> 00:58:38.840
the other states.
00:58:38.840 --> 00:58:41.105
We're going to do this
in a particular way.
00:58:47.740 --> 00:58:50.920
You would think that what we
would want to do is, if we
00:58:50.920 --> 00:58:55.210
went through in steps, we would
associate this final
00:58:55.210 --> 00:58:57.580
reward with the n-th step.
00:58:57.580 --> 00:58:59.220
We're going to do it
a different way.
00:58:59.220 --> 00:59:02.180
We're going to go through n
steps, and then the final
00:59:02.180 --> 00:59:05.980
reward is what happens on
the state after that.
00:59:05.980 --> 00:59:09.480
So we're really turning the
problem of looking at n steps
00:59:09.480 --> 00:59:13.490
into a problem of looking
at n plus 1 steps.
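In symbols, the quantity being set up here is, roughly, the expected aggregate reward over the n steps plus a final reward vector u applied to the state one step later; writing it, say, as v_i(n, u), this is just a restatement of the convention described above:

```latex
% Expected aggregate reward over n steps with final reward vector u
v_i(n, \mathbf{u}) \;=\;
\mathbb{E}\!\left[\, \sum_{l=0}^{n-1} r_{X_{m+l}} \;+\; u_{X_{m+n}}
\;\middle|\; X_m = i \right].
```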
00:59:13.490 --> 00:59:14.490
Why do we do that?
00:59:14.490 --> 00:59:16.320
Completely arbitrary.
00:59:16.320 --> 00:59:19.320
It turns out to be convenient
when we talk about dynamic
00:59:19.320 --> 00:59:24.720
programming, and you'll see
why in just a minute.
00:59:24.720 --> 00:59:29.770
So this extra final state is
just an arbitrary thing that
00:59:29.770 --> 00:59:34.230
you add, and we'll see
the main purpose for
00:59:34.230 --> 00:59:35.780
it in just a minute.
00:59:38.730 --> 00:59:39.380
OK.
00:59:39.380 --> 00:59:45.910
So we're going to now look at
what in principle is a much
00:59:45.910 --> 00:59:48.880
more complicated situation than
what we were looking at
00:59:48.880 --> 00:59:53.180
before, but you still have this
basic Markov condition
00:59:53.180 --> 00:59:56.310
which is making things
simple for you.
00:59:56.310 --> 01:00:00.990
So the idea is, you're looking
at a discrete time situation.
01:00:00.990 --> 01:00:04.260
Things happen in steps.
01:00:04.260 --> 01:00:07.655
There's a finite set of states
which don't change over time.
01:00:10.530 --> 01:00:13.690
At each unit of time, you're
going to be in one of the set
01:00:13.690 --> 01:00:20.420
of m states, and at each time l,
there's some decision maker
01:00:20.420 --> 01:00:24.520
sitting around who looks
at the state that
01:00:24.520 --> 01:00:26.530
you're in at time l.
01:00:26.530 --> 01:00:31.970
And the decision maker says I
have a choice between what
01:00:31.970 --> 01:00:38.570
reward I'm going to pick up
at this time and what the
01:00:38.570 --> 01:00:43.020
transition probabilities are for
going to the next state.
01:00:43.020 --> 01:00:46.110
OK, so it's kind of a
complicated thing.
01:00:46.110 --> 01:00:51.440
It's the same thing that
you face all the time.
01:00:51.440 --> 01:00:54.300
I mean, in the stock market for
example, you see that one
01:00:54.300 --> 01:00:57.010
stock is doing poorly,
so you have a choice.
01:00:57.010 --> 01:01:03.620
Should I sell it, eat my losses,
or should I keep on
01:01:03.620 --> 01:01:05.980
going and hope it'll
turn around?
01:01:05.980 --> 01:01:09.980
If you're doing a thesis, you
have the even worse problem.
01:01:09.980 --> 01:01:13.540
You go for three months without
getting the result
01:01:13.540 --> 01:01:19.120
that you need, and you say,
well, I don't have a thesis.
01:01:19.120 --> 01:01:21.960
I can't say something
about this.
01:01:21.960 --> 01:01:25.280
Should I go on for one more
month, or should I can it and
01:01:25.280 --> 01:01:27.400
go on to another topic?
01:01:27.400 --> 01:01:30.460
OK, it's exactly the
same situation.
01:01:30.460 --> 01:01:34.900
So this is really a very broad
set of situations.
01:01:34.900 --> 01:01:37.858
The only thing that makes it
really different from real
01:01:37.858 --> 01:01:42.260
life is this Markov property
sitting there and the fact
01:01:42.260 --> 01:01:46.190
that you actually understand
what the rewards are and you
01:01:46.190 --> 01:01:48.180
can predict them in advance.
01:01:48.180 --> 01:01:51.990
You can't predict what state
you're going to be in, but you
01:01:51.990 --> 01:01:54.230
know that if you're in a
particular state, you know
01:01:54.230 --> 01:01:58.560
what your choices are in the
future as well as now, and all
01:01:58.560 --> 01:02:03.360
you have to do at each unit of
time is to make this choice
01:02:03.360 --> 01:02:05.860
between various different
things.
01:02:05.860 --> 01:02:08.485
You see an interesting
example of that here.
01:02:13.890 --> 01:02:17.430
If you look at this Markov chain
here, it's a two state
01:02:17.430 --> 01:02:18.680
Markov chain.
01:02:21.770 --> 01:02:23.860
And what's the steady
state probability of
01:02:23.860 --> 01:02:25.150
being in state one?
01:02:32.420 --> 01:02:33.670
Anybody?
01:02:35.596 --> 01:02:37.050
It's a half, yes.
01:02:37.050 --> 01:02:40.480
Why is it a half, and
why don't you have
01:02:40.480 --> 01:02:41.930
to solve for this?
01:02:41.930 --> 01:02:45.150
Why can you look at it
and say it's a half?
01:02:45.150 --> 01:02:46.740
Because it's completely
symmetric.
01:02:46.740 --> 01:02:53.930
0.99 here, 0.99 here, 0.01
here, 0.01 here.
01:02:53.930 --> 01:02:56.450
These rewards here had nothing
to do with the
01:02:56.450 --> 01:02:58.290
Markov chain itself.
01:02:58.290 --> 01:03:02.210
The Markov chain is symmetric
between states one and two,
01:03:02.210 --> 01:03:05.200
and therefore, the steady state
probabilities have to be
01:03:05.200 --> 01:03:06.660
one half each.
01:03:06.660 --> 01:03:13.410
So here's something where, if
you happen to be in state two,
01:03:13.410 --> 01:03:15.090
you're going to stay
there typically
01:03:15.090 --> 01:03:17.100
for a very long time.
01:03:17.100 --> 01:03:20.080
And while you're staying there
for a very long time,
01:03:20.080 --> 01:03:24.020
you're going to be picking up
rewards one unit of reward
01:03:24.020 --> 01:03:26.930
every unit of time.
01:03:26.930 --> 01:03:30.720
You work for some very stable
employer who pays you very
01:03:30.720 --> 01:03:33.540
little, and that's a
situation you have.
01:03:33.540 --> 01:03:37.880
You're sitting here, you have
a job but you're not making
01:03:37.880 --> 01:03:42.710
much, but still you're making
something, and you have a lot
01:03:42.710 --> 01:03:45.510
of job security.
01:03:45.510 --> 01:03:49.760
Now, we have a different choice
when we're sitting here
01:03:49.760 --> 01:03:57.390
with a job in state two, we can,
for example, you can go
01:03:57.390 --> 01:04:00.300
to the cash register and take
all the money out of it and
01:04:00.300 --> 01:04:01.550
disappear from the company.
01:04:03.920 --> 01:04:07.170
I don't advocate doing
that, except,
01:04:07.170 --> 01:04:09.190
it's one of your choices.
01:04:09.190 --> 01:04:13.730
So you pick up a big reward of
50, and then for a long period
01:04:13.730 --> 01:04:18.820
of time you go back to this
state over here and you make
01:04:18.820 --> 01:04:21.720
nothing in reward for a
long period of time
01:04:21.720 --> 01:04:23.360
while you're in jail.
01:04:23.360 --> 01:04:28.650
And then eventually you pop back
here, and if we assume
01:04:28.650 --> 01:04:32.320
the judicial system is such
that it has no memory,
01:04:32.320 --> 01:04:33.670
[INAUDIBLE]
01:04:33.670 --> 01:04:40.020
you can cut into the cash
register, and, well, OK.
01:04:40.020 --> 01:04:43.850
So anyway, this decision two,
you're looking for instant
01:04:43.850 --> 01:04:45.410
gratification here.
01:04:45.410 --> 01:04:48.340
You're getting a big reward all
at once, but by getting a
01:04:48.340 --> 01:04:53.040
big reward with probability
one, you're going back to
01:04:53.040 --> 01:04:54.390
state 1.
01:04:54.390 --> 01:04:57.830
From state 1, it takes a long
time to get back to the
01:04:57.830 --> 01:05:02.940
point where you can get a big
reward again, so you wonder,
01:05:02.940 --> 01:05:07.020
is it better to use this policy
or is it better to use
01:05:07.020 --> 01:05:08.270
this policy?
01:05:10.670 --> 01:05:14.160
Now, there are two basic ways
to look at this problem.
01:05:14.160 --> 01:05:16.280
I think it's important to
understand what they are
01:05:16.280 --> 01:05:18.330
before we go further.
01:05:18.330 --> 01:05:24.660
One of the ways is to say, OK,
let's suppose that I work out
01:05:24.660 --> 01:05:30.440
which is the best policy
and I use it forever.
01:05:30.440 --> 01:05:34.140
Namely, I use this policy
forever or I
01:05:34.140 --> 01:05:36.910
use this policy forever.
01:05:36.910 --> 01:05:40.570
And if I use this policy
forever, I can pretty easily
01:05:40.570 --> 01:05:43.470
work out what the steady state
probabilities of these two
01:05:43.470 --> 01:05:44.680
states are.
01:05:44.680 --> 01:05:50.040
I can then work out what my
expected gain is per unit time
01:05:50.040 --> 01:05:52.690
and I can compare
this with that.
01:05:55.260 --> 01:05:58.140
And who thinks that this is
going to be better than that
01:05:58.140 --> 01:06:01.370
and who thinks that this is
going to be better than that?
01:06:01.370 --> 01:06:03.600
Well, you can work
it out easily.
01:06:03.600 --> 01:06:07.080
It's kind of interesting because
the steady state gain
01:06:07.080 --> 01:06:12.940
here and here are very
close to the same.
01:06:12.940 --> 01:06:17.480
It turns out that this is just
a smidgen better than this,
01:06:17.480 --> 01:06:19.610
only by a very small amount.
01:06:19.610 --> 01:06:19.950
OK.
01:06:19.950 --> 01:06:25.620
See, what happens here is that
here, you tend to go for about
01:06:25.620 --> 01:06:28.110
100 steps here.
01:06:28.110 --> 01:06:33.090
So you pick up a reward of
about 100 if you use this very
01:06:33.090 --> 01:06:34.890
simple minded analysis.
01:06:34.890 --> 01:06:37.810
Then for 100 steps, you're
sitting here, you're getting
01:06:37.810 --> 01:06:43.020
no reward, so you think we ought
to get a reward of
01:06:43.020 --> 01:06:46.310
one half on the average,
and that's exactly
01:06:46.310 --> 01:06:48.090
what you do get here.
01:06:48.090 --> 01:06:52.820
And here, you get this big
reward of 50, but then you go
01:06:52.820 --> 01:06:57.560
over here and you spend 100
units of time in purgatory and
01:06:57.560 --> 01:07:00.470
then you get back again, you get
another reward of 50 and
01:07:00.470 --> 01:07:03.830
then spend a hundred units
of time in purgatory.
01:07:03.830 --> 01:07:07.300
So again, you're getting pretty
close to a half of a
01:07:07.300 --> 01:07:10.690
unit of reward, but it turns
out, when you work it out,
01:07:10.690 --> 01:07:12.690
that here is just a smidgen.
01:07:12.690 --> 01:07:18.190
It's 1% less than a half, so
this is not as good as that.
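A quick back-of-the-envelope check of those two numbers, using the transition probabilities and rewards from this example (0.99 and 0.01, with rewards 0, 1, and 50); this is just an illustration of the steady-state calculation, not course code, and the state-1 reward of 0 is taken from the discussion above:

```python
import numpy as np

def steady_state(P):
    """Probability vector pi with pi P = pi, summing to 1."""
    M = P.shape[0]
    A = np.vstack([P.T - np.eye(M), np.ones(M)])
    b = np.zeros(M + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Policy 1: stay put in state 2 (reward 1 there, 0 in state 1).
P1 = np.array([[0.99, 0.01],
               [0.01, 0.99]])
r1 = np.array([0.0, 1.0])

# Policy 2: grab the reward of 50 in state 2 and go straight back to state 1.
P2 = np.array([[0.99, 0.01],
               [1.00, 0.00]])
r2 = np.array([0.0, 50.0])

for P, r in [(P1, r1), (P2, r2)]:
    pi = steady_state(P)
    print(pi, pi @ r)   # gains of roughly 0.5 and roughly 0.495
```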
01:07:18.190 --> 01:07:24.900
But suppose that you have
a shorter time horizon.
01:07:24.900 --> 01:07:28.380
Suppose you don't want to wait
for 1,000 steps to see what's
01:07:28.380 --> 01:07:32.280
going on, so you don't want
to look at the average.
01:07:32.280 --> 01:07:34.280
Suppose this was a
gambling game.
01:07:34.280 --> 01:07:38.230
You have your choice of these
two gambling options, and
01:07:38.230 --> 01:07:41.820
suppose you're only going to be
playing for a short time.
01:07:41.820 --> 01:07:43.180
Suppose you're going
to be only playing
01:07:43.180 --> 01:07:44.830
for one unit of time.
01:07:44.830 --> 01:07:47.180
You can only play for one unit
of time and then you have to
01:07:47.180 --> 01:07:50.780
stop, you have to go home, you
have to go back to work, or
01:07:50.780 --> 01:07:52.180
something else.
01:07:52.180 --> 01:07:54.870
And you happen to be sitting
in state two.
01:07:54.870 --> 01:07:57.180
What do you want to do
if you only have one
01:07:57.180 --> 01:07:58.730
unit of time to play.
01:07:58.730 --> 01:08:03.630
Well, obviously, you want to get
the reward of 50, because
01:08:03.630 --> 01:08:07.620
delayed gratification doesn't
work here, because you don't
01:08:07.620 --> 01:08:11.330
get any opportunity for that
gratification later.
01:08:11.330 --> 01:08:14.900
So you pick up the big
reward at first.
01:08:14.900 --> 01:08:18.630
So when you have this problem of
playing for a finite amount
01:08:18.630 --> 01:08:24.649
of time, whatever kind of
situation you're in, what you
01:08:24.649 --> 01:08:28.310
would like to do is say, for
this finite amount of time
01:08:28.310 --> 01:08:34.290
that I'm going to play, what's
my best strategy then?
01:08:34.290 --> 01:08:39.850
Dynamic programming is the
name of the
01:08:39.850 --> 01:08:43.600
algorithm which finds out what
the best thing to do is
01:08:43.600 --> 01:08:45.000
dynamically.
01:08:45.000 --> 01:08:48.670
Namely, if you're going to stop
in 10 steps, stop in 100
01:08:48.670 --> 01:08:52.710
steps, stop in one step, it
tells you what to do under all
01:08:52.710 --> 01:08:55.350
of those circumstances.
01:08:55.350 --> 01:08:59.340
And the stationary policy tells
you what to do if you're
01:08:59.340 --> 01:09:02.630
going to play forever.
01:09:02.630 --> 01:09:05.760
But in a situation like this
where things happen rather
01:09:05.760 --> 01:09:10.189
slowly, it might not be the
relevant thing to deal with.
01:09:10.189 --> 01:09:13.170
A lot of the notes deal with
comparing the stationary
01:09:13.170 --> 01:09:17.180
policy with this
dynamic policy.
01:09:17.180 --> 01:09:21.399
And I'm not going to do that
here because, well, we have
01:09:21.399 --> 01:09:23.290
too many other interesting
things that we
01:09:23.290 --> 01:09:24.170
want to deal with.
01:09:24.170 --> 01:09:26.939
So we're just going to skip
all of that stuff about
01:09:26.939 --> 01:09:28.470
stationary policies.
01:09:28.470 --> 01:09:30.670
You don't have to bother to
read it unless you're
01:09:30.670 --> 01:09:32.580
interested in it.
01:09:32.580 --> 01:09:35.029
I mean, if you're interested in
it, by all means, read it.
01:09:35.029 --> 01:09:38.950
It's a very interesting topic.
01:09:38.950 --> 01:09:41.580
It's not all that interesting
to find out what the best
01:09:41.580 --> 01:09:42.990
stationary policy is.
01:09:42.990 --> 01:09:45.210
That's kind of simple.
01:09:45.210 --> 01:09:48.729
What's the interesting topic
is what's the comparison
01:09:48.729 --> 01:09:53.100
between the dynamic policy and
the stationary policy.
01:09:53.100 --> 01:09:56.500
But all we're going to do
is worry about what the
01:09:56.500 --> 01:09:58.160
dynamic policy is.
01:09:58.160 --> 01:10:03.460
That seems like a hard problem,
and someone by the
01:10:03.460 --> 01:10:09.720
name of Bellman figured out what
the optimal solution to
01:10:09.720 --> 01:10:12.025
that dynamic policy was.
01:10:12.025 --> 01:10:16.900
And it turned out to be a
trivially simple algorithm,
01:10:16.900 --> 01:10:20.030
and Bellman became
famous forever.
01:10:20.030 --> 01:10:23.080
One of the things I want to
point out to you, again, I
01:10:23.080 --> 01:10:27.250
keep coming back to this because
you people are just
01:10:27.250 --> 01:10:29.970
starting a research career.
01:10:29.970 --> 01:10:34.490
Everyone in this class, given
the formulation of this
01:10:34.490 --> 01:10:38.670
dynamic programming problem,
could develop and would
01:10:38.670 --> 01:10:43.440
develop, I'm pretty sure, the
dynamic programming algorithm.
01:10:43.440 --> 01:10:47.020
Developing the algorithm,
once you understand what the problem
01:10:47.020 --> 01:10:50.210
is, is a trivial matter.
01:10:50.210 --> 01:10:52.390
Why is Bellman famous?
01:10:52.390 --> 01:10:56.270
Because he formulated
the problem.
01:10:56.270 --> 01:11:01.010
He said, aha, this dynamic
problem is interesting.
01:11:01.010 --> 01:11:04.710
I don't have to go through
the stationary problem.
01:11:04.710 --> 01:11:08.430
And in fact, my sense from
reading his book and from
01:11:08.430 --> 01:11:11.470
reading things he's written is
that he couldn't have solved
01:11:11.470 --> 01:11:14.240
the stationary problem because
he didn't understand
01:11:14.240 --> 01:11:16.750
probability that well.
01:11:16.750 --> 01:11:20.600
But he did understand how to
formulate what this really
01:11:20.600 --> 01:11:24.330
important problem was
and he solved it.
01:11:24.330 --> 01:11:27.880
So, all the more credit to him,
but when you're doing
01:11:27.880 --> 01:11:32.460
research, the time you spend
on formulating the right
01:11:32.460 --> 01:11:37.430
problem is far more important
than the time you spend
01:11:37.430 --> 01:11:38.390
solving it.
01:11:38.390 --> 01:11:41.490
If you start out with the right
problem, the solution is
01:11:41.490 --> 01:11:45.650
trivial and you're all done.
01:11:45.650 --> 01:11:49.930
It's hard to formulate the right
problem, and you learn
01:11:49.930 --> 01:11:57.810
to formulate the problem not
by playing around with all of this,
01:11:57.810 --> 01:12:01.570
calculating things, but by
sitting back and thinking
01:12:01.570 --> 01:12:04.480
about the problem and trying
to look at things in a more
01:12:04.480 --> 01:12:06.050
general way.
01:12:06.050 --> 01:12:07.660
So just another plug.
01:12:07.660 --> 01:12:10.440
I've been saying this, I will
probably say it every three or
01:12:10.440 --> 01:12:14.420
four lectures throughout
the term.
01:12:14.420 --> 01:12:14.860
OK.
01:12:14.860 --> 01:12:18.450
So let's go back and look
at what the problem is.
01:12:18.450 --> 01:12:21.330
We haven't quite formulated
it yet.
01:12:21.330 --> 01:12:24.940
We're going to assume this
process of random transitions
01:12:24.940 --> 01:12:27.790
combined with decisions based
on the current state.
01:12:27.790 --> 01:12:30.380
In other words, in this decision
maker, the decision
01:12:30.380 --> 01:12:34.960
maker at each unit of time sees
what state you're in at
01:12:34.960 --> 01:12:37.040
this unit of time.
01:12:37.040 --> 01:12:40.940
And seeing what state you're in
at this given unit of time,
01:12:40.940 --> 01:12:45.020
the decision maker has a choice
between how much reward
01:12:45.020 --> 01:12:51.740
is to be taken and along with
how much reward is to be
01:12:51.740 --> 01:12:54.940
taken, what the transition
probabilities are
01:12:54.940 --> 01:12:56.160
for the next state.
01:12:56.160 --> 01:13:00.150
If you rob the cash register,
your transition probabilities
01:13:00.150 --> 01:13:02.230
are going to be very different
than if you don't
01:13:02.230 --> 01:13:04.680
rob the cash register.
01:13:04.680 --> 01:13:08.190
By robbing the cash register,
your transition probabilities
01:13:08.190 --> 01:13:10.770
go into a rather high transition
probability that
01:13:10.770 --> 01:13:12.270
you're going to be caught.
01:13:12.270 --> 01:13:16.050
OK, so you don't want that.
01:13:16.050 --> 01:13:20.600
So you can't avoid the problem
of having the rewards at a
01:13:20.600 --> 01:13:24.290
given time locked into what the
transition probabilities
01:13:24.290 --> 01:13:27.990
are for going to the next state,
and that's the essence
01:13:27.990 --> 01:13:29.890
of this problem.
01:13:29.890 --> 01:13:30.500
OK.
01:13:30.500 --> 01:13:33.470
So, the decision maker observes
the state and
01:13:33.470 --> 01:13:36.530
chooses one of a finite
set of alternatives.
01:13:36.530 --> 01:13:39.790
Each alternative consists of
a current reward, which we'll
01:13:39.790 --> 01:13:44.030
call r sub j of k, the
alternative is k, and a set of
01:13:44.030 --> 01:13:45.980
transition probabilities.
01:13:45.980 --> 01:13:50.250
p sub jl of k, for 1 less than or
equal to l less than or
01:13:50.250 --> 01:13:52.750
equal to m for going
to the next state.
01:13:52.750 --> 01:13:56.450
OK, the notation here is
horrifying, but the idea is
01:13:56.450 --> 01:13:57.880
very simple.
01:13:57.880 --> 01:14:01.370
I mean, once you get used to the
notation, there's nothing
01:14:01.370 --> 01:14:04.880
complicated here at all.
01:14:04.880 --> 01:14:08.940
OK, so in this example here,
well, we already
01:14:08.940 --> 01:14:10.190
talked about that.
01:14:13.120 --> 01:14:14.990
We're going to start
out at time m.
01:14:17.960 --> 01:14:21.150
We're going to make a decision
at time m, pick up the
01:14:21.150 --> 01:14:28.090
associated reward for that
decision, and pick the
01:14:28.090 --> 01:14:30.970
transition probabilities that
we're going to use at that
01:14:30.970 --> 01:14:33.460
time m, and then go on
to the next state.
01:14:33.460 --> 01:14:36.380
We're going to continue doing
this until time m
01:14:36.380 --> 01:14:37.960
plus n minus 1.
01:14:37.960 --> 01:14:41.450
Mainly, we're going to do this
for n steps of time.
01:14:41.450 --> 01:14:43.690
After the n-th decision--
01:14:43.690 --> 01:14:47.140
you make the n-th decision
at m plus n minus 1--
01:14:47.140 --> 01:14:52.270
there's a final transition
based on that decision.
01:14:52.270 --> 01:14:55.490
The final transition is based
on that decision, but the
01:14:55.490 --> 01:14:58.345
final reward is fixed
ahead of time.
01:14:58.345 --> 01:15:01.500
You know what the final reward
is going to be, which happens
01:15:01.500 --> 01:15:03.480
at time m plus n.
01:15:03.480 --> 01:15:07.465
So the things which are variable
are how much reward
01:15:07.465 --> 01:15:14.070
you get at each of these first
n time units, and what
01:15:14.070 --> 01:15:17.870
probabilities you choose for
going to the next state.
01:15:17.870 --> 01:15:20.170
Is this still a Markov chain?
01:15:20.170 --> 01:15:21.420
Is this still Markov?
01:15:24.750 --> 01:15:26.560
You can talk about this
for a long time.
01:15:26.560 --> 01:15:30.460
You can think about it for
a long time because this
01:15:30.460 --> 01:15:34.770
decision maker might or
might not be Markov.
01:15:34.770 --> 01:15:38.870
What is Markov is the transition
probabilities that
01:15:38.870 --> 01:15:41.380
are taking place in
each unit of time.
01:15:41.380 --> 01:15:46.410
After I make a decision, the
transition probabilities are
01:15:46.410 --> 01:15:51.370
fixed for that decision and
that initial state and had
01:15:51.370 --> 01:15:54.650
nothing to do with the decisions
that had been made
01:15:54.650 --> 01:15:58.650
before that or the states you've
been in before that.
01:15:58.650 --> 01:16:02.220
The Markov condition says that
what happens in the next unit
01:16:02.220 --> 01:16:06.020
of time is a function simply
of those transition
01:16:06.020 --> 01:16:10.370
probabilities that
had been chosen.
01:16:10.370 --> 01:16:13.530
We will see that when we look at
the algorithm, and then you
01:16:13.530 --> 01:16:16.670
can sort out for yourselves
whether there's something
01:16:16.670 --> 01:16:18.190
dishonest here or not.
01:16:18.190 --> 01:16:25.480
Turns out there isn't, but to
Bellman's credit he did sort
01:16:25.480 --> 01:16:28.740
out correctly that this worked,
and many people for a
01:16:28.740 --> 01:16:30.520
long time did not
think it worked.
01:16:34.080 --> 01:16:37.150
So the objective of dynamic
programming is both to
01:16:37.150 --> 01:16:41.540
determine the optimal decision
at each time and to determine
01:16:41.540 --> 01:16:45.040
the expected reward for each
starting state and for each
01:16:45.040 --> 01:16:47.690
number of steps.
01:16:47.690 --> 01:16:51.090
As one might suspect, now here's
the first thing that
01:16:51.090 --> 01:16:52.500
Bellman did.
01:16:52.500 --> 01:16:54.010
He said, here, I have
this problem.
01:16:54.010 --> 01:16:57.880
I want to find out what happens
after 1,000 steps.
01:16:57.880 --> 01:17:00.850
How do I solve the problem?
01:17:00.850 --> 01:17:04.330
Well, anybody with any sense
will tell you don't solve the
01:17:04.330 --> 01:17:06.740
problem with 1,000
steps first.
01:17:06.740 --> 01:17:10.220
Solve the problem with one step
first, and then see if
01:17:10.220 --> 01:17:13.330
you find out anything from it
and then maybe you can solve
01:17:13.330 --> 01:17:17.030
the problem with two steps and
then maybe something nice will
01:17:17.030 --> 01:17:20.600
happen, or maybe it won't.
01:17:20.600 --> 01:17:25.320
When we do this, it'll turn
out that what we're really
01:17:25.320 --> 01:17:30.820
doing is we're starting at the
end and working our way back,
01:17:30.820 --> 01:17:34.010
and this algorithm is due to
Richard Bellman, as I said.
01:17:34.010 --> 01:17:38.400
And he was the one who sorted
out how it worked.
01:17:38.400 --> 01:17:40.630
So what is the algorithm?
01:17:40.630 --> 01:17:45.250
We're going to start out making
a decision at time 1.
01:17:45.250 --> 01:17:50.500
So we're going to
start at time m.
01:17:50.500 --> 01:17:53.610
We're going to start
in a given state i.
01:17:53.610 --> 01:17:58.580
You make a decision, decision
k at time m.
01:17:58.580 --> 01:18:03.040
This provides a reward at time
m, and the selected transition
01:18:03.040 --> 01:18:06.240
probabilities lead to a
final expected reward.
01:18:06.240 --> 01:18:11.380
These are these final rewards
which occur at time m plus 1.
01:18:11.380 --> 01:18:13.710
It's nice to have that n because
it's what lets us
01:18:13.710 --> 01:18:15.550
generalize the problem.
01:18:15.550 --> 01:18:18.460
So this was another clever
thing that went on here.
01:18:18.460 --> 01:18:24.710
So the expected optimal
aggregate reward for a one
01:18:24.710 --> 01:18:32.230
step problem is the sum of the
reward that you get at time m
01:18:32.230 --> 01:18:37.260
plus this final reward you get
at time m plus 1, and you're
01:18:37.260 --> 01:18:40.290
maximizing over the different
policies you have
01:18:40.290 --> 01:18:41.490
available to you.
01:18:41.490 --> 01:18:44.970
So it looks like a trivial
problem, but the optimal
01:18:44.970 --> 01:18:47.980
reward with a one step
problem is just this.
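Written out, the one-step quantity being described is the following, with u the final reward vector and using the same v_i(n, u) shorthand as above:

```latex
v^*_i(1, \mathbf{u}) \;=\;
\max_{k}\;\Bigl[\, r_i(k) \;+\; \sum_{j} P_{ij}(k)\, u_j \,\Bigr].
```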
01:18:51.170 --> 01:18:54.820
OK, next you want to consider
the two step problem.
01:18:54.820 --> 01:18:58.900
What's the maximum expected
reward starting at x sub m equals i
01:18:58.900 --> 01:19:03.480
with decisions at times
m and m plus 1.
01:19:03.480 --> 01:19:05.400
You make two decisions.
01:19:05.400 --> 01:19:08.240
Now, before, we just made
one decision at time m.
01:19:08.240 --> 01:19:13.000
Now we make a decision at time
m and at time m plus 1, and
01:19:13.000 --> 01:19:17.750
finally we pick up a final
reward at time m plus 2.
01:19:17.750 --> 01:19:20.540
Knowing what that final reward
is going to be is going to
01:19:20.540 --> 01:19:26.230
affect the decision you make at
time m plus 1, but it's a
01:19:26.230 --> 01:19:29.770
fixed reward which is a
function of the state.
01:19:29.770 --> 01:19:32.720
You can adjust the transition
probabilities of getting to
01:19:32.720 --> 01:19:35.110
those different rewards.
01:19:35.110 --> 01:19:38.420
The key to dynamic programming
is that an optimal decision at time
01:19:38.420 --> 01:19:42.630
m plus 1 can be selected based
only on the state j
01:19:42.630 --> 01:19:45.060
at time n plus 1.
01:19:45.060 --> 01:19:48.960
This decision, given that you're
in state j at time m
01:19:48.960 --> 01:19:53.600
plus 1, is optimal independent
of what you did before that,
01:19:53.600 --> 01:19:55.770
which is why we're starting
out looking at what we're
01:19:55.770 --> 01:19:59.240
going to do at time m plus 1
before we even worry about
01:19:59.240 --> 01:20:02.630
what we're going to
do at time m.
01:20:02.630 --> 01:20:06.340
So, whatever decision you made
at time m, you observe what
01:20:06.340 --> 01:20:10.900
state you're in at time m plus
1, and the maximal expected
01:20:10.900 --> 01:20:15.510
reward over times m plus 1 and
m plus 2, given that you
01:20:15.510 --> 01:20:20.610
happen to be in state j, is just
the maximum over k of the
01:20:20.610 --> 01:20:26.430
reward you're going to get by
choosing policy k plus the
01:20:26.430 --> 01:20:30.670
expected value of the final
reward you get if you're using
01:20:30.670 --> 01:20:32.480
this policy k.
01:20:32.480 --> 01:20:36.850
This is just v sub j star of 1 and
u, as you just found.
01:20:36.850 --> 01:20:40.070
In other words, you have the
same situation at time m plus
01:20:40.070 --> 01:20:42.090
1 as you have at time m.
01:20:44.600 --> 01:20:49.785
Well, surprisingly, you've just
solved the whole problem.
01:20:52.810 --> 01:20:58.410
So we've seen that what we
should do at time n plus 1 is
01:20:58.410 --> 01:21:00.450
do this maximization.
01:21:00.450 --> 01:21:05.670
So the optimal reward, aggregate
reward over times m,
01:21:05.670 --> 01:21:11.815
m plus 1, and m plus 2 is what
we get maximizing over our
01:21:11.815 --> 01:21:18.110
choice at time m of the reward
we get at time m plus the
01:21:18.110 --> 01:21:21.750
decision plus the transition
probabilities which we've
01:21:21.750 --> 01:21:27.340
decided on which get us to this
reward at times m plus 1
01:21:27.340 --> 01:21:29.020
and m plus 2.
01:21:29.020 --> 01:21:33.070
We found out what the reward
is for times m plus 1 and m
01:21:33.070 --> 01:21:34.370
plus 2 together.
01:21:34.370 --> 01:21:38.060
That's the reward to go, And
we know what that is, so we
01:21:38.060 --> 01:21:40.210
have this same formula
we used before.
01:21:40.210 --> 01:21:46.965
Why do we want to look at
these final rewards now?
01:21:46.965 --> 01:21:50.980
Well, you can view this as a
final reward in state m.
01:21:50.980 --> 01:21:54.220
It's the final reward which
tells you what you get both
01:21:54.220 --> 01:21:57.930
at times m plus
1 and m plus 2.
01:21:57.930 --> 01:22:04.790
And, going quickly, if we look
at playing this game for three
01:22:04.790 --> 01:22:11.280
steps, the optimal reward for
the three step game is the
01:22:11.280 --> 01:22:16.600
immediate reward optimized over
k plus the rewards at m
01:22:16.600 --> 01:22:22.900
plus 1, m plus 2, and m plus 3,
which we've already found.
01:22:22.900 --> 01:22:28.450
And in general, the optimal
reward at time n--
01:22:28.450 --> 01:22:33.900
when you play the game for n
steps, the optimal reward is
01:22:33.900 --> 01:22:35.170
maximum here.
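And the general step, as just described, is the recursion below, again a restatement in symbols of what was said rather than anything new:

```latex
v^*_i(n, \mathbf{u}) \;=\;
\max_{k}\;\Bigl[\, r_i(k) \;+\; \sum_{j} P_{ij}(k)\, v^*_j(n-1, \mathbf{u}) \,\Bigr],
\qquad v^*_j(0, \mathbf{u}) = u_j .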
01:22:35.170 --> 01:22:39.980
So, all you do in the algorithm
is, for each value
01:22:39.980 --> 01:22:43.500
of n when you start with n
equal to 1, you solve the
01:22:43.500 --> 01:22:48.950
problem for all states and you
maximize over all policies you
01:22:48.950 --> 01:22:52.100
have a choice over, and then you
go on to the next larger
01:22:52.100 --> 01:22:56.140
value of n, you solve the
problem for all states and you
01:22:56.140 --> 01:22:56.950
keep on going.
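A small sketch of that loop in Python, assuming the alternatives for each state are given as lists of (reward, transition-probability-row) pairs. The particular numbers encode the two-state example from earlier in the lecture (reward 1 and probably stay, versus reward 50 and go back to state 1, with reward 0 assumed in state 1), so this is an illustration rather than anything from the problem sets:

```python
import numpy as np

# alternatives[i] is a list of (r_i(k), p_i(k)) pairs: the immediate reward
# and the row of transition probabilities if decision k is taken in state i.
alternatives = [
    [(0.0, np.array([0.99, 0.01]))],                 # state 1: only one choice
    [(1.0, np.array([0.01, 0.99])),                  # state 2, decision 1
     (50.0, np.array([1.00, 0.00]))],                # state 2, decision 2
]

u = np.zeros(2)        # final reward vector, taken as 0 here
v = u.copy()
n_steps = 10

for n in range(1, n_steps + 1):
    new_v = np.empty_like(v)
    decision = [None] * len(alternatives)
    for i, alts in enumerate(alternatives):
        values = [r + p @ v for (r, p) in alts]
        best = int(np.argmax(values))
        new_v[i] = values[best]
        decision[i] = best
    v = new_v
    print(n, decision, v)   # optimal decisions and aggregate rewards for n steps
```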
01:22:56.950 --> 01:22:59.820
If you don't have many
states, it's easy.
01:22:59.820 --> 01:23:05.100
If you have 100,000 states, it's
kind of tedious to run
01:23:05.100 --> 01:23:05.880
the algorithm.
01:23:05.880 --> 01:23:09.380
Today it's not bad, but today
we look at problems with
01:23:09.380 --> 01:23:11.770
millions and millions of states
or billions of states,
01:23:11.770 --> 01:23:18.100
and no matter how fast
computation gets, the
01:23:18.100 --> 01:23:22.280
ingenuity people to invent
harder problems always makes
01:23:22.280 --> 01:23:24.630
it hard to solve
these problems.
01:23:24.630 --> 01:23:29.060
So anyway, that's the dynamic
programming algorithm.
01:23:29.060 --> 01:23:31.320
And next time, we're going to
start on renewal processes.