1
00:00:00,530 --> 00:00:02,960
The following content is
provided under a Creative
2
00:00:02,960 --> 00:00:04,370
Commons license.
3
00:00:04,370 --> 00:00:07,410
Your support will help MIT
OpenCourseWare continue to
4
00:00:07,410 --> 00:00:11,060
offer high quality educational
resources for free.
5
00:00:11,060 --> 00:00:13,960
To make a donation or view
additional materials from
6
00:00:13,960 --> 00:00:17,890
hundreds of MIT courses, visit
MIT OpenCourseWare at
7
00:00:17,890 --> 00:00:19,140
ocw.mit.edu.
8
00:00:24,220 --> 00:00:24,840
PROFESSOR: OK.
9
00:00:24,840 --> 00:00:30,230
Today we're going to finish
up with Markov chains.
10
00:00:30,230 --> 00:00:34,570
And the last topic will be
dynamic programming.
11
00:00:34,570 --> 00:00:39,900
I'm not going to say an awful
lot about dynamic programming.
12
00:00:39,900 --> 00:00:43,530
It's a topic that was enormously
important in
13
00:00:43,530 --> 00:00:49,600
research for probably 20
years from 1960 until
14
00:00:49,600 --> 00:00:53,540
about 1980, or 1990.
15
00:00:53,540 --> 00:01:00,300
And it seemed as if half the
Ph.D. theses done in the
16
00:01:00,300 --> 00:01:03,920
control area and the
operations research
17
00:01:03,920 --> 00:01:07,630
were in this area.
18
00:01:07,630 --> 00:01:11,950
Suddenly, everything seemed
to be done, or could be done.
19
00:01:11,950 --> 00:01:15,310
And strangely enough, not
many people seem to
20
00:01:15,310 --> 00:01:16,760
know about it anymore.
21
00:01:16,760 --> 00:01:20,760
It's an enormously useful
algorithm for solving an awful
22
00:01:20,760 --> 00:01:23,000
lot of different problems.
23
00:01:23,000 --> 00:01:25,420
It's quite a simple algorithm.
24
00:01:25,420 --> 00:01:28,780
You don't need the full power
of Markov chains in order to
25
00:01:28,780 --> 00:01:30,470
understand it.
26
00:01:30,470 --> 00:01:34,250
So I do want to at least talk
about it a little bit.
27
00:01:34,250 --> 00:01:38,070
And we will use what we've done
so far with Markov chains
28
00:01:38,070 --> 00:01:40,940
in order to understand it.
29
00:01:40,940 --> 00:01:44,200
I want to start out today by
reviewing a little bit of what
30
00:01:44,200 --> 00:01:49,040
we did last time about
eigenvalues and eigenvectors.
31
00:01:49,040 --> 00:01:56,320
This was a somewhat awkward
topic to talk about, because
32
00:01:56,320 --> 00:01:59,970
you people have very different
backgrounds in linear algebra.
33
00:01:59,970 --> 00:02:03,450
Some of you have a very strong
background, some of you have
34
00:02:03,450 --> 00:02:05,240
almost no background.
35
00:02:05,240 --> 00:02:10,509
So it was a lot of material for
those of you who know very
36
00:02:10,509 --> 00:02:14,190
little about linear algebra.
37
00:02:14,190 --> 00:02:16,620
And probably somewhat boring
for those of you
38
00:02:16,620 --> 00:02:18,690
who use it all the time.
39
00:02:18,690 --> 00:02:22,670
At any rate, if you don't know
anything about it, linear
40
00:02:22,670 --> 00:02:28,820
algebra is a topic that you
ought to understand for almost
41
00:02:28,820 --> 00:02:30,270
anything you do.
42
00:02:30,270 --> 00:02:35,230
If you've gotten to this point
without having to study it,
43
00:02:35,230 --> 00:02:37,460
it's very strange.
44
00:02:37,460 --> 00:02:41,720
So you should probably take
some extra time out, not
45
00:02:41,720 --> 00:02:43,900
because you need it so
much for this course.
46
00:02:43,900 --> 00:02:46,670
We won't use it enormously
in many of the
47
00:02:46,670 --> 00:02:48,500
things we do later.
48
00:02:48,500 --> 00:02:51,930
But you will use it so many
times in the future that you
49
00:02:51,930 --> 00:02:56,870
ought to just sit down, not to
learn abstract linear algebra,
50
00:02:56,870 --> 00:03:00,150
which is very useful also, but
just to learn how to use the
51
00:03:00,150 --> 00:03:03,280
topic of solving linear
equations.
52
00:03:03,280 --> 00:03:06,450
Being able to express them
in terms of matrices.
53
00:03:06,450 --> 00:03:09,310
Being able to use the
eigenvalues and eigenvectors,
54
00:03:09,310 --> 00:03:12,220
and matrices as a way of
understanding these things.
55
00:03:12,220 --> 00:03:16,440
So I want to say a little more
about that today, which is why
56
00:03:16,440 --> 00:03:19,720
I've called this a review
plus of eigenvalues and
57
00:03:19,720 --> 00:03:21,020
eigenvectors.
58
00:03:21,020 --> 00:03:25,930
It's a review of the topics
we did last time, but it's
59
00:03:25,930 --> 00:03:28,250
looking at it in a somewhat
different way.
60
00:03:28,250 --> 00:03:32,150
So let's proceed with that.
61
00:03:32,150 --> 00:03:36,810
We said that the determinant of
an M by M matrix is given
62
00:03:36,810 --> 00:03:38,530
by this strange formula.
63
00:03:38,530 --> 00:03:44,340
The determinant of a is the sum
over all permutations of
64
00:03:44,340 --> 00:03:51,260
the integers 1 to M of the
product from i equals 1 to M
65
00:03:51,260 --> 00:03:56,080
of the matrix element
a sub i mu of i.
66
00:03:56,080 --> 00:04:01,670
Mu of i is the permutation of
the number i. i is between one
67
00:04:01,670 --> 00:04:05,510
and M, and mu of i is a
permutation of that.
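The permutation formula just stated can be checked numerically. Here is a small Python sketch (the helper name det_by_permutations and the test matrix are my own, not from the lecture); it also computes the plus-minus sign, which is the parity of the permutation.

```python
# det(A) = sum over permutations mu of sign(mu) * prod_i A[i, mu(i)],
# the formula stated in the lecture. Sign is the permutation's parity.
from itertools import permutations
import math
import numpy as np

def det_by_permutations(A):
    """Determinant of a square matrix via the permutation-sum formula."""
    M = A.shape[0]
    total = 0.0
    for mu in permutations(range(M)):
        # Parity: each even-length cycle of mu flips the sign once.
        sign = 1
        seen = [False] * M
        for start in range(M):
            if seen[start]:
                continue
            length, j = 0, start
            while not seen[j]:
                seen[j] = True
                j = mu[j]
                length += 1
            if length % 2 == 0:
                sign = -sign
        term = sign
        for i in range(M):
            term *= A[i, mu[i]]
        total += term
    return total

A = np.array([[0.5, 0.5], [0.2, 0.8]])
assert math.isclose(det_by_permutations(A), np.linalg.det(A))
```

This brute-force sum has M! terms, so it is only a sanity check for tiny matrices, not a practical algorithm.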
68
00:04:05,510 --> 00:04:17,529
Now if you look at the matrix,
which has the form, which is
69
00:04:17,529 --> 00:04:19,600
block upper triangular.
70
00:04:19,600 --> 00:04:22,990
In other words, there's a matrix
here, a square matrix a
71
00:04:22,990 --> 00:04:26,390
sub t, which is a transient
matrix.
72
00:04:26,390 --> 00:04:31,610
There's a recurrent matrix here,
and there's some way of
73
00:04:31,610 --> 00:04:33,900
getting from the transient
states to
74
00:04:33,900 --> 00:04:36,730
the recurrent states.
75
00:04:36,730 --> 00:04:41,630
And this is the general form
that a unit chain has to have.
76
00:04:41,630 --> 00:04:44,970
There are a bunch of transient
states, there are a bunch of
77
00:04:44,970 --> 00:04:47,230
recurrent states.
78
00:04:47,230 --> 00:04:52,630
And the interesting thing here
is that the determinant of a
79
00:04:52,630 --> 00:04:57,620
is exactly the determinant
of a sub t times the
80
00:04:57,620 --> 00:04:59,410
determinant of a sub r.
81
00:04:59,410 --> 00:05:03,210
I'm calling this a instead of
the transition matrix p
82
00:05:03,210 --> 00:05:08,840
because I want to replace a by
p minus lambda i, so I can
83
00:05:08,840 --> 00:05:11,820
talk about the eigenvalues
of p.
84
00:05:11,820 --> 00:05:15,690
So when I do that replacement
here, if I know that the
85
00:05:15,690 --> 00:05:20,140
determinant of a is this product
of determinants, then
86
00:05:20,140 --> 00:05:24,130
the determinant of p minus
lambda i is the determinant of
87
00:05:24,130 --> 00:05:32,160
pt minus lambda i sub t, where i sub t is
just a crazy way of saying
88
00:05:32,160 --> 00:05:35,120
a diagonal matrix.
89
00:05:35,120 --> 00:05:40,070
A diagonal t by t matrix,
because this is a t by t
90
00:05:40,070 --> 00:05:41,740
matrix, also.
91
00:05:41,740 --> 00:05:48,580
i sub r is an r by r matrix,
where this is a square r by r
92
00:05:48,580 --> 00:05:50,260
matrix also.
93
00:05:50,260 --> 00:05:53,970
Now, why is it that this
determinant is equal to this
94
00:05:53,970 --> 00:05:56,670
product of determinants here?
95
00:05:56,670 --> 00:06:02,010
Well, before explaining why this
is true, why do you care?
96
00:06:02,010 --> 00:06:08,180
Well, because we know that if
we have a recurring matrix
97
00:06:08,180 --> 00:06:11,630
here, we know that it has--
98
00:06:11,630 --> 00:06:13,790
I mean, we know a great
deal about it.
99
00:06:13,790 --> 00:06:21,150
We know that any square matrix,
r by r matrix has r
100
00:06:21,150 --> 00:06:22,750
different eigenvalues.
101
00:06:22,750 --> 00:06:26,330
Some of them might be repeated,
but they're always r
102
00:06:26,330 --> 00:06:27,480
eigenvalues.
103
00:06:27,480 --> 00:06:31,420
This matrix here has
t eigenvalues.
104
00:06:31,420 --> 00:06:32,520
OK.
105
00:06:32,520 --> 00:06:37,730
This matrix here, we know has
r plus t eigenvalues.
106
00:06:37,730 --> 00:06:42,060
You look at this formula here
and you say aha, I can take
107
00:06:42,060 --> 00:06:46,670
all the eigenvalues here, add
them to all the eigenvalues
108
00:06:46,670 --> 00:06:50,280
here, and I have every one
of the eigenvalues here.
109
00:06:50,280 --> 00:06:54,780
In other words, if I want to
find all of the eigenvalues of
110
00:06:54,780 --> 00:06:59,620
p, all I have to do is define
the eigenvalues of p sub t,
111
00:06:59,620 --> 00:07:04,710
add them to the eigenvalues of
p sub r, and I'm all done.
112
00:07:04,710 --> 00:07:08,640
So that really has simplified
things a good deal.
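That claim is easy to verify numerically: for a block upper triangular matrix, the eigenvalues of the whole matrix are the eigenvalues of the diagonal blocks combined. The matrices below are made-up examples, not from the lecture.

```python
# Check: eigenvalues of P = [[Pt, Ptr], [0, Pr]] are those of Pt
# together with those of Pr, as the lecture claims.
import numpy as np

Pt = np.array([[0.5, 0.2],
               [0.1, 0.4]])          # transient block
Ptr = np.array([[0.2, 0.1],
                [0.3, 0.2]])         # transient -> recurrent transitions
Pr = np.array([[0.7, 0.3],
               [0.4, 0.6]])          # recurrent block (stochastic)

P = np.block([[Pt, Ptr],
              [np.zeros((2, 2)), Pr]])

eigs_P = np.sort_complex(np.linalg.eigvals(P))
eigs_blocks = np.sort_complex(np.concatenate(
    [np.linalg.eigvals(Pt), np.linalg.eigvals(Pr)]))
assert np.allclose(eigs_P, eigs_blocks)
```

Note that the rows of [Pt, Ptr] sum to 1, so P is a valid transition matrix of a unit chain, and the eigenvalue 1 comes entirely from the recurrent block Pr.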
113
00:07:08,640 --> 00:07:14,270
And it also really says
explicitly that if you
114
00:07:14,270 --> 00:07:20,060
understand how to deal with
recurrent Markov chains, you
115
00:07:20,060 --> 00:07:22,620
really know everything.
116
00:07:22,620 --> 00:07:25,840
Well, you also have to know how
to deal with a transient
117
00:07:25,840 --> 00:07:29,880
chain, but the main part of it
is dealing with this chain.
118
00:07:29,880 --> 00:07:34,870
This has little r different
eigenvalues, and all of those
119
00:07:34,870 --> 00:07:41,860
are eigenvalues, excuse me,
p sub r has little r
120
00:07:41,860 --> 00:07:42,710
eigenvalues.
121
00:07:42,710 --> 00:07:46,860
They're given by the roots
of this determinant here.
122
00:07:46,860 --> 00:07:49,530
And all of those
are roots here.
123
00:07:49,530 --> 00:07:51,580
OK, so why is this true?
124
00:07:51,580 --> 00:07:57,990
Well, the reason for it is that
this product up here,
125
00:07:57,990 --> 00:07:59,200
look at this.
126
00:07:59,200 --> 00:08:02,490
We're taking the sum over
all permutations.
127
00:08:02,490 --> 00:08:05,315
But which one of those
permutations can be non-zero?
128
00:08:12,940 --> 00:08:18,740
If I start out by saying that a
sub t is t by t, then I know
129
00:08:18,740 --> 00:08:21,440
that this might be anything.
130
00:08:21,440 --> 00:08:24,050
These have to be zeroes here.
131
00:08:24,050 --> 00:08:30,450
If I choose some permutation
down here, of some i, which
132
00:08:30,450 --> 00:08:31,530
is greater than t.
133
00:08:31,530 --> 00:08:35,030
In other words, if I choose
mu of i to be some
134
00:08:35,030 --> 00:08:36,130
element over here.
135
00:08:36,130 --> 00:08:42,309
If I choose mu of i to be less
than or equal to t, and i to
136
00:08:42,309 --> 00:08:45,500
be greater than t,
what happens?
137
00:08:45,500 --> 00:08:47,790
I get a term which
is equal to zero.
138
00:08:47,790 --> 00:08:51,210
That term in this
product is zero.
139
00:08:51,210 --> 00:08:55,670
And none of those products
can be non-zero.
140
00:08:55,670 --> 00:09:00,830
So the only way I can get non
zeros here is when I'm dealing
141
00:09:00,830 --> 00:09:03,730
with an i which is less
than or equal to t.
142
00:09:03,730 --> 00:09:06,100
Namely an i here.
143
00:09:06,100 --> 00:09:09,440
I have to choose a mu of
i, a column which is
144
00:09:09,440 --> 00:09:10,870
less than or equal to t, also.
145
00:09:10,870 --> 00:09:17,540
If I'm dealing with an i which
is greater than t, namely an
146
00:09:17,540 --> 00:09:23,410
i up here, then, well, it
looks like I can choose
147
00:09:23,410 --> 00:09:24,950
anything there.
148
00:09:24,950 --> 00:09:25,630
But look.
149
00:09:25,630 --> 00:09:31,180
I've already used up all of
these columns here
150
00:09:31,180 --> 00:09:33,470
by the non-zero terms here.
151
00:09:33,470 --> 00:09:37,360
So I can't do anything
but use a smaller i,
152
00:09:37,360 --> 00:09:40,080
smaller than t up here.
153
00:09:40,080 --> 00:09:44,703
So when I look at the
permutations that are non
154
00:09:44,703 --> 00:09:49,010
zero, the only permutations that
are non zero are those
155
00:09:49,010 --> 00:09:55,610
where mu of i is less than t if
i is less than t, and mu of i
156
00:09:55,610 --> 00:10:01,960
is less than or equal to t if i
is less than or equal to t.
157
00:10:01,960 --> 00:10:06,100
And mu of i is greater than
t if i is greater than t.
158
00:10:06,100 --> 00:10:11,580
Now, how does that show that
this is equal here?
159
00:10:11,580 --> 00:10:16,480
Well, let's look at
that a little bit.
160
00:10:16,480 --> 00:10:19,740
I didn't even try to do it on
the slide because the notation
161
00:10:19,740 --> 00:10:20,970
is kind of horrifying.
162
00:10:20,970 --> 00:10:24,850
But let's try to write this
the following way.
163
00:10:24,850 --> 00:10:36,910
Determinant of a is equal to the
sum, and now I'll write it
164
00:10:36,910 --> 00:10:48,040
as a sum over mu of 1 up to t.
165
00:10:48,040 --> 00:10:59,690
And the sum over mu of t
plus 1 up to, well, t
166
00:10:59,690 --> 00:11:02,460
plus r, let's say.
167
00:11:02,460 --> 00:11:06,210
OK, so here I have all of
the permutations of the
168
00:11:06,210 --> 00:11:08,870
numbers 1 to t.
169
00:11:08,870 --> 00:11:11,350
And here I have all the
permutations of the
170
00:11:11,350 --> 00:11:14,010
numbers t plus 1 up.
171
00:11:14,010 --> 00:11:16,760
And for all of those,
I'm going to
172
00:11:16,760 --> 00:11:18,190
ignore this plus minus.
173
00:11:18,190 --> 00:11:21,420
You can sort that out
for yourselves.
174
00:11:21,420 --> 00:11:27,620
And then I have a product
from i equals 1 to t.
175
00:11:27,620 --> 00:11:37,950
And then a product from i
equals t plus 1 up to m.
176
00:11:37,950 --> 00:11:39,200
Excuse me.
177
00:11:42,410 --> 00:11:53,000
a sub i, mu of i times
product of a of i.
178
00:11:53,000 --> 00:12:07,130
Mu of i for i equals t plus
1 up to t plus r.
179
00:12:07,130 --> 00:12:09,300
OK?
180
00:12:09,300 --> 00:12:14,740
So I'm separating this product
here into a product first of
181
00:12:14,740 --> 00:12:19,070
the terms i less than or equal
to t, and then for the terms i
182
00:12:19,070 --> 00:12:20,180
greater than t.
183
00:12:20,180 --> 00:12:24,620
For every permutation I choose
using the i's that are less
184
00:12:24,620 --> 00:12:29,090
than or equal to t, I can choose
any of the permutation
185
00:12:29,090 --> 00:12:33,520
using mu of i greater than
t that I choose to use.
186
00:12:33,520 --> 00:12:35,570
So this breaks up in this way.
187
00:12:35,570 --> 00:12:37,960
I have this sum, I
have this sum.
188
00:12:37,960 --> 00:12:43,120
I have these two products, so
I can break this up as a sum
189
00:12:43,120 --> 00:12:55,270
over mu of 1 to t of plus minus
product from i equals 1
190
00:12:55,270 --> 00:13:08,752
to t of ai, mu of i times the
sum over mu of t plus 1 up to
191
00:13:08,752 --> 00:13:15,072
t plus r ai mu of i.
192
00:13:20,160 --> 00:13:22,380
Product.
193
00:13:22,380 --> 00:13:23,300
OK.
194
00:13:23,300 --> 00:13:26,030
So I've separated that into
two different terms.
195
00:13:26,030 --> 00:13:27,000
STUDENT: T equals [INAUDIBLE].
196
00:13:27,000 --> 00:13:27,570
PROFESSOR: What?
197
00:13:27,570 --> 00:13:30,680
STUDENT: T plus r
equals big m?
198
00:13:30,680 --> 00:13:33,230
PROFESSOR: T plus
r is big m, yes.
199
00:13:33,230 --> 00:13:40,060
Because I have t terms here,
and I have r terms here.
200
00:13:40,060 --> 00:13:44,710
OK, so the interesting thing
here is having this non-zero
201
00:13:44,710 --> 00:13:48,400
term here doesn't make
any difference here.
202
00:13:48,400 --> 00:13:52,430
I mean, this is more
straightforward if you have a
203
00:13:52,430 --> 00:13:54,020
block diagonal matrix.
204
00:13:54,020 --> 00:13:58,330
It's clear that the eigenvalues
of a block
205
00:13:58,330 --> 00:14:03,700
diagonal matrix are going to be
the eigenvalues of 1 plus
206
00:14:03,700 --> 00:14:05,560
the eigenvalues of the other.
207
00:14:05,560 --> 00:14:09,980
Here we have the eigenvalues
of this, and the
208
00:14:09,980 --> 00:14:11,450
eigenvalues of this.
209
00:14:11,450 --> 00:14:14,910
And what's surprising is that as
far as the eigenvalues are
210
00:14:14,910 --> 00:14:19,950
concerned, this has nothing
whatsoever to do with it.
211
00:14:19,950 --> 00:14:20,690
OK.
212
00:14:20,690 --> 00:14:24,480
The only thing that this has
to do with it is it says
213
00:14:24,480 --> 00:14:28,780
something about the sums of this
matrix here, because the
214
00:14:28,780 --> 00:14:31,500
sums of these rows are
now less than 1.
215
00:14:31,500 --> 00:14:34,660
They all have to be, some of
them, at least, have to be
216
00:14:34,660 --> 00:14:36,760
less than or equal to 1.
217
00:14:36,760 --> 00:14:40,090
Because you do have this way of
getting from the transient
218
00:14:40,090 --> 00:14:43,470
elements to the non transient
elements.
219
00:14:43,470 --> 00:14:48,060
But it's very surprising that
these elements, which are
220
00:14:48,060 --> 00:14:52,100
critically important, because
those are the things that get
221
00:14:52,100 --> 00:14:55,800
you from the transition states
to the recurrent states have
222
00:14:55,800 --> 00:14:59,540
nothing to do in the eigenvalues
whatsoever.
223
00:14:59,540 --> 00:15:00,105
I don't know why.
224
00:15:00,105 --> 00:15:04,310
I can't give you any insights
about that, but
225
00:15:04,310 --> 00:15:06,810
that's the way it is.
226
00:15:06,810 --> 00:15:12,030
That's an interesting thing,
because if you take this
227
00:15:12,030 --> 00:15:19,930
transition matrix, and you keep
a sub t and a sub r fixed, and
228
00:15:19,930 --> 00:15:23,250
you play any kind of funny game
you want to with those
229
00:15:23,250 --> 00:15:28,780
terms going from the transient
states to the non transient
230
00:15:28,780 --> 00:15:33,370
states, it won't change
any eigenvalues.
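That surprising invariance is easy to see numerically: hold the transient and recurrent blocks fixed, swap in a completely different coupling block, and the eigenvalues come out identical. The matrices here are hypothetical examples.

```python
# Check: changing the transient-to-recurrent block Ptr, with Pt and Pr
# held fixed, leaves every eigenvalue of the full matrix unchanged.
import numpy as np

Pt = np.array([[0.5, 0.3],
               [0.2, 0.4]])          # transient block, held fixed
Pr = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # recurrent block, held fixed

def full_matrix(Ptr):
    return np.block([[Pt, Ptr], [np.zeros((2, 2)), Pr]])

Ptr1 = np.array([[0.1, 0.1], [0.2, 0.2]])
Ptr2 = np.array([[0.0, 0.2], [0.4, 0.0]])   # a different coupling block
# (both choices keep each row of [Pt, Ptr] summing to 1)

e1 = np.sort_complex(np.linalg.eigvals(full_matrix(Ptr1)))
e2 = np.sort_complex(np.linalg.eigvals(full_matrix(Ptr2)))
assert np.allclose(e1, e2)
```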
231
00:15:33,370 --> 00:15:35,490
Don't know why it doesn't.
232
00:15:35,490 --> 00:15:39,400
OK, so where do we
go with that?
233
00:15:39,400 --> 00:15:45,440
Well, that's what it says.
234
00:15:45,440 --> 00:15:50,580
The eigenvalues of p are the t
eigenvalues of pt, and the r
235
00:15:50,580 --> 00:15:52,200
eigenvalues of pr.
236
00:15:52,200 --> 00:15:56,180
It also tells you something
about simple eigenvalues, and
237
00:15:56,180 --> 00:15:59,800
these crazy eigenvalues, which
don't have enough eigenvectors
238
00:15:59,800 --> 00:16:01,230
to go along with them.
239
00:16:01,230 --> 00:16:06,420
Because it tells you that
p sub r has all of its
240
00:16:06,420 --> 00:16:11,880
eigenvectors, and p sub t
has all of its eigenvectors.
241
00:16:11,880 --> 00:16:14,550
Then you don't have any of this
crazy [INAUDIBLE] form
242
00:16:14,550 --> 00:16:16,520
thing, or anything.
243
00:16:16,520 --> 00:16:29,670
OK. If pi is a left eigenvector
of this recurrent matrix, then
244
00:16:29,670 --> 00:16:35,550
if you look at the vector,
starting with zeros, and then I
245
00:16:35,550 --> 00:16:42,390
guess I should really say, well,
if pi sub 1 up to pi sub
246
00:16:42,390 --> 00:16:47,910
r is a left eigenvector of this r
by r matrix, then if I start
247
00:16:47,910 --> 00:16:52,620
out with t zeroes, and then
put in pi 1 to pi r, this
248
00:16:52,620 --> 00:16:57,310
vector here has to be a left
eigenvector of all of p.
249
00:16:57,310 --> 00:16:58,310
Why is that?
250
00:16:58,310 --> 00:17:01,610
Well, if I look at a vector,
which starts out with zeroes,
251
00:17:01,610 --> 00:17:06,900
and then has this eigenvector
pi, and I multiply that vector
252
00:17:06,900 --> 00:17:10,210
by this matrix here, I'm
taking these terms,
253
00:17:10,210 --> 00:17:16,260
multiplying them by the columns
of this matrix, these
254
00:17:16,260 --> 00:17:22,310
zeros knock out all of
these elements here.
255
00:17:22,310 --> 00:17:25,470
These zeroes knock out all
of these elements.
256
00:17:25,470 --> 00:17:28,410
So I start out with zeroes
everywhere here.
257
00:17:28,410 --> 00:17:30,480
That's what this says.
258
00:17:30,480 --> 00:17:34,660
And then when I'm dealing with
this part of the matrix, the
259
00:17:34,660 --> 00:17:39,750
zeros knock out all of this, and
I just have pi multiplying
260
00:17:39,750 --> 00:17:40,820
p sub r.
261
00:17:40,820 --> 00:17:45,220
So if I have an eigenvalue
lambda, it says I have the
262
00:17:45,220 --> 00:17:50,170
eigenvalue lambda times the
vector zero followed by pi.
263
00:17:50,170 --> 00:17:54,760
It says that if I have an
eigenvector, a left
264
00:17:54,760 --> 00:18:01,010
eigenvector of this recurrent
matrix, then that turns into,
265
00:18:01,010 --> 00:18:05,670
if you put some zeroes up in
front of it, it turns into an
266
00:18:05,670 --> 00:18:07,790
eigenvector of the
whole matrix.
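The padding argument just described can be sketched in a few lines: take the steady-state left eigenvector of the recurrent block, prepend zeros, and verify it is a left eigenvector of the whole chain. The example matrices are invented for illustration.

```python
# Check: if pi Pr = pi, then (0, ..., 0, pi) times the full P gives
# (0, ..., 0, pi) back, i.e. it is a left eigenvector of all of P.
import numpy as np

Pt = np.array([[0.6, 0.1],
               [0.2, 0.3]])
Ptr = np.array([[0.2, 0.1],
                [0.3, 0.2]])
Pr = np.array([[0.7, 0.3],
               [0.4, 0.6]])
P = np.block([[Pt, Ptr], [np.zeros((2, 2)), Pr]])

# Steady-state vector of the recurrent block: left eigenvector of Pr
# for eigenvalue 1, found as a right eigenvector of Pr transposed.
evals, evecs = np.linalg.eig(Pr.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()                            # normalize to a probability vector

padded = np.concatenate([np.zeros(2), pi])    # (0, 0, pi_1, pi_2)
assert np.allclose(padded @ P, padded)        # left eigenvector of all of P
```

The zeros in front kill the Pt and Ptr columns, which is exactly the column-by-column argument made above.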
267
00:18:07,790 --> 00:18:11,580
If we look at the eigenvalue 1,
which is the most important
268
00:18:11,580 --> 00:18:14,350
thing, this is the thing that
gives you the steady state
269
00:18:14,350 --> 00:18:16,930
vector, this is sort
of obvious.
270
00:18:16,930 --> 00:18:19,630
Because the steady state
vector is where you go
271
00:18:19,630 --> 00:18:23,960
eventually, and eventually where
you go is you have to be
272
00:18:23,960 --> 00:18:27,290
in one of these recurrent
states, eventually.
273
00:18:27,290 --> 00:18:30,610
And the probabilities within
the recurrent set of states
274
00:18:30,610 --> 00:18:33,400
are the same as the
probabilities if you didn't
275
00:18:33,400 --> 00:18:36,590
have these transient
states at all.
276
00:18:36,590 --> 00:18:40,490
So this is all sort of obvious,
as far as the steady
277
00:18:40,490 --> 00:18:43,020
state vector pi.
278
00:18:43,020 --> 00:18:47,480
But it's a little less obvious
as far as the other vectors.
279
00:18:47,480 --> 00:18:52,300
The left eigenvectors,
of p sub t, I don't
280
00:18:52,300 --> 00:18:53,610
understand them at all.
281
00:18:53,610 --> 00:18:59,660
They aren't the same as the left
eigenvectors of, well,
282
00:18:59,660 --> 00:19:04,670
the left eigenvectors of the
eigenvalues of p sub t.
283
00:19:08,040 --> 00:19:10,270
I didn't say this right here.
284
00:19:10,270 --> 00:19:15,870
The left eigenvectors of p
corresponding to the left
285
00:19:15,870 --> 00:19:18,700
eigenvectors of p sub t.
286
00:19:18,700 --> 00:19:22,010
I don't understand how they
work, and I don't understand
287
00:19:22,010 --> 00:19:24,350
anything you can derive
from them.
288
00:19:24,350 --> 00:19:26,740
They're just kind of crazy
things, which are what they
289
00:19:26,740 --> 00:19:27,780
happen to be.
290
00:19:27,780 --> 00:19:29,350
And I don't care about them.
291
00:19:29,350 --> 00:19:32,200
I don't know anything
to do with them.
292
00:19:32,200 --> 00:19:35,200
But these other eigenvectors
are very useful.
293
00:19:35,200 --> 00:19:38,130
OK.
294
00:19:38,130 --> 00:19:45,040
We can extend this to as many
different recurrent sets of
295
00:19:45,040 --> 00:19:47,080
states as you choose.
296
00:19:47,080 --> 00:19:53,100
Here I'm doing it with a Markov
chain, which has two
297
00:19:53,100 --> 00:19:56,550
different sets of recurrent
states.
298
00:19:56,550 --> 00:20:00,010
They might be periodic, they
might be ergodic, it doesn't
299
00:20:00,010 --> 00:20:01,340
make any difference.
300
00:20:01,340 --> 00:20:07,730
So the matrix p has these
transient states up here.
301
00:20:07,730 --> 00:20:11,990
Here we have those transient
states which just go to each
302
00:20:11,990 --> 00:20:16,320
other, where the transition
probabilities start with
303
00:20:16,320 --> 00:20:19,140
the transient state and go
to a transient state.
304
00:20:19,140 --> 00:20:24,090
Here we have the transitions,
which go from transient states
305
00:20:24,090 --> 00:20:26,500
to this first set of
recurrent states.
306
00:20:26,500 --> 00:20:30,810
Here we have the transitions,
which go from a transient
307
00:20:30,810 --> 00:20:35,480
state to the second set
of recurrent states.
308
00:20:35,480 --> 00:20:36,180
OK.
309
00:20:36,180 --> 00:20:39,330
The same way as before, the
determinant of this whole
310
00:20:39,330 --> 00:20:44,790
thing here, and this
determinant, the roots of that
311
00:20:44,790 --> 00:20:49,300
are in fact the eigenvalues of
p, are the product of the
312
00:20:49,300 --> 00:20:54,930
determinant of pt minus lambda
i sub t times the product of this,
313
00:20:54,930 --> 00:20:58,030
times this determinant here.
314
00:20:58,030 --> 00:21:02,180
This has little t eigenvalues.
315
00:21:02,180 --> 00:21:05,220
This has little r eigenvalues.
316
00:21:05,220 --> 00:21:08,690
This has little r prime
eigenvalues, and if you add up
317
00:21:08,690 --> 00:21:11,880
t plus little r plus little
r prime, what do you get?
318
00:21:11,880 --> 00:21:17,790
You get, excuse me, capital
M, which is the total number
319
00:21:17,790 --> 00:21:21,470
of states in the Markov chain.
320
00:21:21,470 --> 00:21:27,110
So the eigenvalues here are
exactly the eigenvalues here
321
00:21:27,110 --> 00:21:33,300
plus the eigenvalues here, plus
the eigenvalues here.
322
00:21:33,300 --> 00:21:36,720
And you can find the
eigenvectors, the left
323
00:21:36,720 --> 00:21:40,810
eigenvectors for these
states in exactly
324
00:21:40,810 --> 00:21:43,450
the same way as before.
325
00:21:43,450 --> 00:21:44,570
OK.
326
00:21:44,570 --> 00:21:45,772
Yeah?
327
00:21:45,772 --> 00:21:48,628
STUDENT: So again, the
eigenvalues can be repeated
328
00:21:48,628 --> 00:21:51,960
both within t, r, r prime,
and in between the--
329
00:21:51,960 --> 00:21:52,436
PROFESSOR: Yes.
330
00:21:52,436 --> 00:21:54,340
STUDENT: There's nothing
that says [INAUDIBLE].
331
00:21:54,340 --> 00:21:54,610
PROFESSOR: No.
332
00:21:54,610 --> 00:21:58,440
There's nothing that says they
can't, except you can always
333
00:21:58,440 --> 00:22:05,980
find the left eigenvectors,
anyway, of this p sub r, in fact,
334
00:22:05,980 --> 00:22:08,680
these things in the form.
335
00:22:08,680 --> 00:22:15,840
If pi is a left eigenvector of p
sub r, then zero followed by
336
00:22:15,840 --> 00:22:17,460
pi followed by zero.
337
00:22:17,460 --> 00:22:26,480
In other words, little t zeros
followed by r, followed by the
338
00:22:26,480 --> 00:22:32,060
eigenvector pi, followed by
little r prime zeroes here,
339
00:22:32,060 --> 00:22:34,490
this has to be a left
eigenvector of t.
340
00:22:34,490 --> 00:22:37,280
So this tells you something
about whether you're going to
341
00:22:37,280 --> 00:22:40,140
have a Jordan form or not,
one of these really
342
00:22:40,140 --> 00:22:41,240
ugly things in it.
343
00:22:41,240 --> 00:22:44,590
And it tells you that
in many cases, you
344
00:22:44,590 --> 00:22:46,370
just can't have them.
345
00:22:46,370 --> 00:22:48,850
If you have them, they're
usually tied up with this
346
00:22:48,850 --> 00:22:50,730
matrix here.
347
00:22:50,730 --> 00:22:53,140
OK, so that, I don't know.
348
00:22:53,140 --> 00:22:53,950
Was this useful?
349
00:22:53,950 --> 00:22:55,550
Does this clarify anything?
350
00:22:55,550 --> 00:22:58,830
Or if it doesn't,
it's too bad.
351
00:23:01,810 --> 00:23:02,330
OK.
352
00:23:02,330 --> 00:23:05,080
So now we want to start
talking about rewards.
353
00:23:07,580 --> 00:23:09,150
Some people call these costs.
354
00:23:09,150 --> 00:23:11,230
If you're an optimist,
you call it rewards.
355
00:23:11,230 --> 00:23:13,870
If you're a pessimist,
you call it costs.
356
00:23:13,870 --> 00:23:15,520
They're both the same thing.
357
00:23:15,520 --> 00:23:18,180
If you're dealing with rewards,
you maximize them.
358
00:23:18,180 --> 00:23:20,470
If you're dealing with costs,
you minimize them.
359
00:23:20,470 --> 00:23:24,800
So mathematically, who cares?
360
00:23:24,800 --> 00:23:30,590
OK, so suppose that each state
i of a Markov chain is
361
00:23:30,590 --> 00:23:33,280
associated with a given
reward, r sub i.
362
00:23:33,280 --> 00:23:36,350
In other words, you think of
this Markov chain, which is
363
00:23:36,350 --> 00:23:37,180
running along.
364
00:23:37,180 --> 00:23:41,320
You go from one state to
another over time.
365
00:23:41,320 --> 00:23:45,930
And while this is happening,
you're pocketing some reward
366
00:23:45,930 --> 00:23:47,250
all the time.
367
00:23:47,250 --> 00:23:47,650
OK.
368
00:23:47,650 --> 00:23:50,890
You invest in a stock.
369
00:23:50,890 --> 00:23:53,470
Strangely enough, these
particular stocks we're
370
00:23:53,470 --> 00:23:57,270
thinking about here have this
Markov property.
371
00:23:57,270 --> 00:23:59,970
Stocks really don't have a
Markov property, but we'll
372
00:23:59,970 --> 00:24:02,130
assume they do.
373
00:24:02,130 --> 00:24:06,200
And since they have this Markov
property, you win for a
374
00:24:06,200 --> 00:24:07,840
while, and you lose
for a while.
375
00:24:07,840 --> 00:24:10,060
You win for a while, you
lose for a while.
376
00:24:10,060 --> 00:24:12,770
But we have something
extra, other than
377
00:24:12,770 --> 00:24:15,050
just the Markov chains.
378
00:24:15,050 --> 00:24:18,830
We can analyze this whole
situation, knowing how Markov
379
00:24:18,830 --> 00:24:20,670
chains behave.
380
00:24:20,670 --> 00:24:24,980
There's not much left besides
that, but there are an
381
00:24:24,980 --> 00:24:29,860
extraordinary number of
applications of this idea, and
382
00:24:29,860 --> 00:24:31,900
dynamic programming
is one of them.
383
00:24:31,900 --> 00:24:35,380
Because that's just one
added extension beyond
384
00:24:35,380 --> 00:24:37,880
this idea of rewards.
385
00:24:37,880 --> 00:24:38,380
OK.
386
00:24:38,380 --> 00:24:40,770
The random variable x of n.
387
00:24:40,770 --> 00:24:43,240
That's a random quantity.
388
00:24:43,240 --> 00:24:45,840
It's the state at time n.
389
00:24:45,840 --> 00:24:50,010
And the random reward of time n
is then the random variable
390
00:24:50,010 --> 00:24:55,680
r of xn that maps xn equals
i into ri for each i.
391
00:24:55,680 --> 00:24:59,140
This is the same idea of taking
one random variable,
392
00:24:59,140 --> 00:25:02,030
which is a function of another
random variable.
393
00:25:02,030 --> 00:25:06,000
The one random variable takes
on the values one up to
394
00:25:06,000 --> 00:25:07,740
capital M.
395
00:25:07,740 --> 00:25:11,080
And then the other random
variable takes on a value
396
00:25:11,080 --> 00:25:14,680
which is determined by the state
that you happen to be
397
00:25:14,680 --> 00:25:16,600
in, which is this
random states.
398
00:25:16,600 --> 00:25:21,700
So specifying r sub i
specifies what the set of
399
00:25:21,700 --> 00:25:25,380
rewards are, what the reward
is in each given state.
400
00:25:25,380 --> 00:25:28,520
Again, we have this awful
problem, which I wish we could
401
00:25:28,520 --> 00:25:32,760
avoid in Markov chains, of using
the same word state to
402
00:25:32,760 --> 00:25:35,900
talk about the set of
different states.
403
00:25:35,900 --> 00:25:38,120
And also to talk about
the random state
404
00:25:38,120 --> 00:25:39,170
at any given time.
405
00:25:39,170 --> 00:25:43,560
But hopefully by now you're
used to that.
406
00:25:43,560 --> 00:25:47,700
In our discussion here, the only
thing we're going to talk
407
00:25:47,700 --> 00:25:50,670
about are expected rewards.
408
00:25:50,670 --> 00:25:55,810
Now, you know that expected
rewards, or expectations are a
409
00:25:55,810 --> 00:25:58,310
little more general than you
would think they would be,
410
00:25:58,310 --> 00:26:02,060
because you're going to take the
expected value of any sort
411
00:26:02,060 --> 00:26:04,300
of crazy thing.
412
00:26:04,300 --> 00:26:07,870
If you want to talk about any
event, you can take the
413
00:26:07,870 --> 00:26:11,310
indicator function of that
event, and find the expected
414
00:26:11,310 --> 00:26:13,890
value of that indicator
function.
415
00:26:13,890 --> 00:26:16,920
And that's just the probability
of that event.
416
00:26:16,920 --> 00:26:22,660
So by understanding how to deal
with expectations, you
417
00:26:22,660 --> 00:26:25,560
really have the capability
of finding distribution
418
00:26:25,560 --> 00:26:28,480
functions, or anything else
you want to find.
419
00:26:28,480 --> 00:26:28,970
OK.
420
00:26:28,970 --> 00:26:31,490
But anyway, since we're
interested only in expected
421
00:26:31,490 --> 00:26:37,555
rewards, the expected reward at
time n, given that x zero
422
00:26:37,555 --> 00:26:44,950
is i is the expected value of r
of xn given x zero equals i,
423
00:26:44,950 --> 00:26:49,840
which is the sum over j of the
reward you get if you're in
424
00:26:49,840 --> 00:26:55,700
state j at time n times p sub
ij, super n, which we've
425
00:26:55,700 --> 00:27:00,850
talked about ad nauseam for the
last four lectures now.
426
00:27:00,850 --> 00:27:06,900
And this is the probability that
the state at time n is j,
427
00:27:06,900 --> 00:27:09,910
given that the state
at time zero is i.
428
00:27:09,910 --> 00:27:13,650
So you can just automatically
find the expected
429
00:27:13,650 --> 00:27:17,570
value of r of xn.
430
00:27:17,570 --> 00:27:20,610
And it's by that formula.
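[Editor's note: a minimal numerical sketch of that formula. The three-state chain and reward vector below are invented for illustration; they are not from the lecture.]

```python
import numpy as np

# Hypothetical 3-state chain: transition matrix P and reward vector r.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
r = np.array([1.0, 0.0, 2.0])

def expected_reward(P, r, i, n):
    """E[R(X_n) | X_0 = i] = sum_j (P^n)_ij * r_j."""
    return np.linalg.matrix_power(P, n)[i] @ r

print(expected_reward(P, r, 0, 10))
```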
431
00:27:20,610 --> 00:27:24,230
Now, recall that this quantity
here is not all that simple.
432
00:27:24,230 --> 00:27:28,680
This is the ij element of the
product of the matrix, of the
433
00:27:28,680 --> 00:27:31,010
nth power of the matrix p.
434
00:27:31,010 --> 00:27:32,370
But, so what?
435
00:27:32,370 --> 00:27:36,130
We can at least write a nice
formula for it now.
436
00:27:36,130 --> 00:27:40,140
The expected aggregate reward
over the n steps from m to m
437
00:27:40,140 --> 00:27:43,080
plus n minus 1.
438
00:27:43,080 --> 00:27:44,900
What is m doing in here?
439
00:27:44,900 --> 00:27:48,970
It's just reminding us that
Markov chains are
440
00:27:48,970 --> 00:27:51,890
homogeneous over time.
441
00:27:51,890 --> 00:27:56,370
So, when I talk about the
aggregate reward from time m
442
00:27:56,370 --> 00:28:01,200
to m plus n minus 1, it's the
same as the aggregate reward
443
00:28:01,200 --> 00:28:04,500
from time 0 up to
time n minus 1.
444
00:28:04,500 --> 00:28:06,270
The expected values
are the same.
445
00:28:06,270 --> 00:28:09,550
The actual sample functions
are different.
446
00:28:09,550 --> 00:28:14,290
OK, so if I try to calculate
this aggregate reward
447
00:28:14,290 --> 00:28:18,880
conditional on xm equals i,
namely conditional on starting
448
00:28:18,880 --> 00:28:23,660
in state i, then this expected
aggregate reward, I use that
449
00:28:23,660 --> 00:28:28,610
as a symbol for it, is the
expected value of r of xm,
450
00:28:28,610 --> 00:28:30,310
given xm equals i.
451
00:28:30,310 --> 00:28:30,890
What is that?
452
00:28:30,890 --> 00:28:33,030
Well, that's ri.
453
00:28:33,030 --> 00:28:35,220
I mean, given that xm
is equal to i, this
454
00:28:35,220 --> 00:28:36,490
isn't random anymore.
455
00:28:36,490 --> 00:28:38,500
It's just r sub i.
456
00:28:38,500 --> 00:28:45,350
Plus the expected value of r of
xm plus 1, which is the sum
457
00:28:45,350 --> 00:28:49,490
over j, of pij times r sub j.
458
00:28:49,490 --> 00:28:54,305
That's the reward at time m plus 1 given
that you're in state i at time
459
00:28:54,305 --> 00:29:00,370
m, and so forth, up until time
n minus 1, where the expected
460
00:29:00,370 --> 00:29:03,240
reward, then, is
p sub ij super n minus 1.
461
00:29:06,180 --> 00:29:10,860
Probability of being in state j
at time n minus 1 given that
462
00:29:10,860 --> 00:29:16,190
you started off in state i
at time 0 times r sub j.
463
00:29:16,190 --> 00:29:20,790
And since expectations add, we
have this nice, convenient
464
00:29:20,790 --> 00:29:22,040
formula here.
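[Editor's note: the "nice, convenient formula" — the expected aggregate reward as a sum of the per-step expected rewards — can be sketched directly. The chain below is again an invented example.]

```python
import numpy as np

# Hypothetical 3-state chain and reward vector, for illustration only.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
r = np.array([1.0, 0.0, 2.0])

def aggregate_reward(P, r, n):
    """Vector of expected aggregate rewards over n steps:
    v(n) = sum_{h=0}^{n-1} P^h r, one component per starting state."""
    v = np.zeros_like(r)
    term = r.copy()          # P^0 r
    for _ in range(n):
        v += term
        term = P @ term      # advance to the next P^h r
    return v

print(aggregate_reward(P, r, 5))
```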
465
00:29:26,180 --> 00:29:30,580
We're doing something I normally
hate doing, which is
466
00:29:30,580 --> 00:29:35,290
building up a lot of notation,
and then using that notation
467
00:29:35,290 --> 00:29:40,470
to write extremely complicated
formulas in a way that looks
468
00:29:40,470 --> 00:29:41,200
very simple.
469
00:29:41,200 --> 00:29:44,480
And therefore you will get some
sense of what we're doing
470
00:29:44,480 --> 00:29:45,840
is very simple.
471
00:29:45,840 --> 00:29:48,160
These quantities in
here, again, are
472
00:29:48,160 --> 00:29:49,790
not all that simple.
473
00:29:49,790 --> 00:29:52,550
But at least we can write
it in a simple way.
474
00:29:52,550 --> 00:29:56,260
And since we can write it in a
simple way, it turns out we
475
00:29:56,260 --> 00:29:59,160
can do some nice
things with it.
476
00:29:59,160 --> 00:29:59,420
OK.
477
00:29:59,420 --> 00:30:00,970
So where do we go from
all of this?
478
00:30:04,860 --> 00:30:12,280
We have just said that the
expected reward we get,
479
00:30:12,280 --> 00:30:18,550
expected aggregate reward over n
steps, namely from m up to m
480
00:30:18,550 --> 00:30:20,210
plus n minus 1.
481
00:30:20,210 --> 00:30:25,660
We're assuming that if we start
at time m, we pick up a
482
00:30:25,660 --> 00:30:27,660
reward at time m.
483
00:30:27,660 --> 00:30:30,530
I mean, that's just an
arbitrary decision.
484
00:30:30,530 --> 00:30:33,960
We might as well do that,
because otherwise we just have
485
00:30:33,960 --> 00:30:36,840
one more transition matrix
sitting here.
486
00:30:36,840 --> 00:30:38,660
OK, so we start at time m.
487
00:30:38,660 --> 00:30:42,640
We pick up a reward, which
is conditional on the
488
00:30:42,640 --> 00:30:45,030
state we start in.
489
00:30:45,030 --> 00:30:53,040
And then we look at the expected
reward for time m and
490
00:30:53,040 --> 00:30:58,420
time m plus 1, m plus 2,
up to m plus n minus 1.
491
00:30:58,420 --> 00:31:00,610
Since we started at
m, we're picking
492
00:31:00,610 --> 00:31:02,620
up n different rewards.
493
00:31:02,620 --> 00:31:07,490
We have to stop at time
m plus n minus 1.
494
00:31:07,490 --> 00:31:14,040
OK, so that's this expected
aggregate reward.
495
00:31:14,040 --> 00:31:17,890
Why do I care about expected
aggregate reward?
496
00:31:17,890 --> 00:31:22,220
Because the rewards at any time
n are sort of trivial.
497
00:31:22,220 --> 00:31:24,640
What we're interested
in is how does this
498
00:31:24,640 --> 00:31:27,320
build up over time?
499
00:31:27,320 --> 00:31:29,150
You start to invest
in a stock.
500
00:31:29,150 --> 00:31:34,480
You don't much care what
it's worth at time 10.
501
00:31:34,480 --> 00:31:35,785
You care how it grows.
502
00:31:38,390 --> 00:31:41,040
You care about its value when
you want to sell it, and you
503
00:31:41,040 --> 00:31:44,880
don't know when you're going to
sell it, most of the time.
504
00:31:44,880 --> 00:31:48,150
So you're really interested
in these aggregate
505
00:31:48,150 --> 00:31:49,400
rewards that you get.
506
00:31:52,260 --> 00:31:54,590
You'll see when we get to
dynamic programming what
507
00:31:54,590 --> 00:31:56,780
you're interested
in that, also.
508
00:31:56,780 --> 00:31:57,430
OK.
509
00:31:57,430 --> 00:32:01,340
If the Markov chain is an
ergodic unit chain, then
510
00:32:01,340 --> 00:32:04,710
successive terms of this
expression tend to a steady
511
00:32:04,710 --> 00:32:06,450
state gain per step.
512
00:32:06,450 --> 00:32:11,520
In other words, these terms here,
when n gets very large,
513
00:32:11,520 --> 00:32:17,070
if I run this process for very
long time, what happens to p
514
00:32:17,070 --> 00:32:20,640
sub ij super n minus 1?
515
00:32:20,640 --> 00:32:27,920
This tends towards the steady
state probability pi sub j.
516
00:32:27,920 --> 00:32:31,710
And it doesn't matter
where we started.
517
00:32:31,710 --> 00:32:34,690
The only thing of importance
is where we end up.
518
00:32:34,690 --> 00:32:37,180
It doesn't matter how
high this is.
519
00:32:37,180 --> 00:32:42,670
So we have a sum over j, of
pi sub j times r sub j.
520
00:32:42,670 --> 00:32:48,745
After a very long time, the
expected gain per step is just
521
00:32:48,745 --> 00:32:51,930
a sum of pi j times
r sub j.
522
00:32:51,930 --> 00:32:56,000
That's what's important
after a long time.
523
00:32:56,000 --> 00:32:58,290
And that's independent of
the starting state.
524
00:32:58,290 --> 00:33:02,670
So what we have here is a big,
messy transient, which is a
525
00:33:02,670 --> 00:33:04,780
sum of a whole bunch
of things.
526
00:33:04,780 --> 00:33:08,090
And then eventually it just
settles down, and every extra
527
00:33:08,090 --> 00:33:15,190
step you do, you just pick up
an extra factor of g as an
528
00:33:15,190 --> 00:33:16,970
extra reward.
529
00:33:16,970 --> 00:33:19,960
The reward might, of course, be
negative, like in the stock
530
00:33:19,960 --> 00:33:25,100
market over the last 10 years,
or up until the last year or
531
00:33:25,100 --> 00:33:27,980
so, which was negative
for a long time.
532
00:33:27,980 --> 00:33:30,800
But that doesn't make
any difference.
533
00:33:30,800 --> 00:33:34,480
This is just a number, and
this is independent of
534
00:33:34,480 --> 00:33:36,590
starting state.
535
00:33:36,590 --> 00:33:41,740
And v sub i of n can be viewed as a
transient in i, which is all
536
00:33:41,740 --> 00:33:43,330
this stuff at the beginning.
537
00:33:43,330 --> 00:33:47,010
The sum of all these terms at
the beginning plus something
538
00:33:47,010 --> 00:33:50,290
that settles down over a
long period of time.
539
00:33:50,290 --> 00:33:54,200
How to calculate that transient,
how to combine it
540
00:33:54,200 --> 00:33:56,230
with the steady state gain.
541
00:33:56,230 --> 00:33:59,920
The notes talk a great
deal about that.
542
00:33:59,920 --> 00:34:03,970
What we're trying to do today
is to talk about dynamic
543
00:34:03,970 --> 00:34:09,080
programming without going into
all of this terrible mess
544
00:34:09,080 --> 00:34:12,250
about dealing with rewards
in a very
545
00:34:12,250 --> 00:34:14,239
systematic and simple way.
546
00:34:14,239 --> 00:34:16,199
You can read about that later.
547
00:34:16,199 --> 00:34:19,610
What we're aiming at is to talk
about dynamic programming
548
00:34:19,610 --> 00:34:23,340
a little bit, and then get
off to other things.
549
00:34:23,340 --> 00:34:23,870
OK.
550
00:34:23,870 --> 00:34:27,239
So anyway, we have a transient,
plus we have a
551
00:34:27,239 --> 00:34:29,330
steady state gain.
552
00:34:29,330 --> 00:34:31,470
The transient is important.
553
00:34:31,470 --> 00:34:34,520
And it's particularly important
if g equals zero.
554
00:34:34,520 --> 00:34:40,090
Namely if your average gain per
step is nothing, then what
555
00:34:40,090 --> 00:34:47,980
you're primarily interested in
is how valuable is it to start
556
00:34:47,980 --> 00:34:49,360
in a particular state?
557
00:34:49,360 --> 00:34:53,000
If you start in one state versus
another state, you
558
00:34:53,000 --> 00:34:56,600
might get a great deal of reward
in this one state,
559
00:34:56,600 --> 00:34:59,120
whereas you make a loss
in some other state.
560
00:34:59,120 --> 00:35:03,200
So it's important to know which
state is worth being in.
561
00:35:03,200 --> 00:35:07,960
So that's the next thing
we try to look at.
562
00:35:07,960 --> 00:35:12,410
How does the state
affect things?
563
00:35:12,410 --> 00:35:17,760
This brings us to one example
which is particularly useful.
564
00:35:17,760 --> 00:35:22,360
And along with being a useful
example, well, it's a nice
565
00:35:22,360 --> 00:35:25,840
illustration of Markov
rewards.
566
00:35:25,840 --> 00:35:30,980
It's also something which
you often want to find.
567
00:35:30,980 --> 00:35:35,800
And when we start talking about
renewal processes, you
568
00:35:35,800 --> 00:35:40,890
will find that this idea here
is a nice connection between
569
00:35:40,890 --> 00:35:43,340
Markov chains and
renewal processes.
570
00:35:43,340 --> 00:35:47,240
So it's important for a whole
bunch of different reasons.
571
00:35:47,240 --> 00:35:48,220
OK.
572
00:35:48,220 --> 00:35:52,470
Suppose for some arbitrary
unit chain, namely we're
573
00:35:52,470 --> 00:35:56,060
saying one set of recurrent
states.
574
00:35:56,060 --> 00:35:59,710
We want to find the expected
number of steps, starting from
575
00:35:59,710 --> 00:36:04,260
a given state i, until
some particular
576
00:36:04,260 --> 00:36:06,560
state 1 is first entered.
577
00:36:06,560 --> 00:36:09,070
So you start at one state.
578
00:36:09,070 --> 00:36:12,090
There's this other state
way over here.
579
00:36:12,090 --> 00:36:15,690
This state is recurrent, so
presumably, eventually you're
580
00:36:15,690 --> 00:36:17,580
going to enter it.
581
00:36:17,580 --> 00:36:20,170
And you want to find out, what's
the expected time that
582
00:36:20,170 --> 00:36:23,810
it takes to get to that
particular state?
583
00:36:23,810 --> 00:36:26,110
OK?
584
00:36:26,110 --> 00:36:30,160
If you're a Ph.D. student, you
have this Markov chain of
585
00:36:30,160 --> 00:36:32,310
doing your research.
586
00:36:32,310 --> 00:36:36,180
And at some point, you're going
to get a Ph.D. So we can
587
00:36:36,180 --> 00:36:39,900
think of this as the first-passage
time to your first
588
00:36:39,900 --> 00:36:44,500
Ph.D. I mean, if you want to
get more Ph.D.'s, fine, but
589
00:36:44,500 --> 00:36:47,560
that's probably a different
Markov chain.
590
00:36:47,560 --> 00:36:48,550
OK.
591
00:36:48,550 --> 00:36:53,110
So anyway, that's the problem
we're trying to solve here.
592
00:36:53,110 --> 00:36:56,690
We can view this problem
as a reward problem.
593
00:36:56,690 --> 00:36:59,750
We have to go through a number
of steps if we want to view it
594
00:36:59,750 --> 00:37:01,940
as a reward problem.
595
00:37:01,940 --> 00:37:07,390
The first one, first step is to
assign one unit of reward
596
00:37:07,390 --> 00:37:11,430
to each successive state until
you enter state 1.
597
00:37:11,430 --> 00:37:15,040
So you're bombing through this
Markov chain, a frog jumping
598
00:37:15,040 --> 00:37:17,120
from lily pad to lily pad.
599
00:37:17,120 --> 00:37:19,590
And finally, the frog
gets to the lily pad
600
00:37:19,590 --> 00:37:21,500
with the food on it.
601
00:37:21,500 --> 00:37:25,780
And the frog wants to know, is
it going to starve before he
602
00:37:25,780 --> 00:37:28,830
gets to this lily pad
with the food on it?
603
00:37:28,830 --> 00:37:32,940
So, if we're trying to find
the expected time to get
604
00:37:32,940 --> 00:37:35,850
there, here what we're really
interested in is a cost,
605
00:37:35,850 --> 00:37:39,920
because the frog is in
danger of starving.
606
00:37:39,920 --> 00:37:42,220
Or on the other hand, there
might be a snake lying under
607
00:37:42,220 --> 00:37:44,470
this one lily pad.
608
00:37:44,470 --> 00:37:47,770
And then he's getting a reward
for staying alive.
609
00:37:47,770 --> 00:37:51,390
You can look at these things
whichever way you want to.
610
00:37:51,390 --> 00:37:51,880
OK.
611
00:37:51,880 --> 00:37:55,020
We're going to assign one unit
of reward to each successive state
612
00:37:55,020 --> 00:37:56,800
until state 1 is entered.
613
00:37:56,800 --> 00:38:01,430
1 is just an arbitrary state
that we've selected.
614
00:38:01,430 --> 00:38:04,760
That's where the snake is
underneath a lily pad, or
615
00:38:04,760 --> 00:38:08,130
that's where the food is,
or what have you.
616
00:38:08,130 --> 00:38:10,450
Now, there's something
else we have to do.
617
00:38:10,450 --> 00:38:17,010
Because if we're starting out at
some arbitrary state i, and
618
00:38:17,010 --> 00:38:19,910
we're trying to look for the
first time that we enter state
619
00:38:19,910 --> 00:38:23,695
1, what do you do after
you enter state 1?
620
00:38:26,670 --> 00:38:32,400
Well eventually, normally you're
going to go away from
621
00:38:32,400 --> 00:38:34,110
state 1, and you're
going to start
622
00:38:34,110 --> 00:38:36,380
picking up rewards again.
623
00:38:36,380 --> 00:38:38,990
You don't want that to happen.
624
00:38:38,990 --> 00:38:42,020
So you do something we do all
the time when we're dealing
625
00:38:42,020 --> 00:38:45,510
with Markov chains, which is
we start with one Markov
626
00:38:45,510 --> 00:38:49,070
chain, and we say, to solve this
problem I'm interested
627
00:38:49,070 --> 00:38:52,110
in, I've got to change
the Markov chain.
628
00:38:52,110 --> 00:38:54,350
So how are we going
to change it?
629
00:38:54,350 --> 00:38:58,160
We're going to change it to say,
once we get in state 1,
630
00:38:58,160 --> 00:38:59,455
we're going to stay
there forever.
631
00:39:02,070 --> 00:39:04,600
Or in other words, the frog gets
eaten by the snake, and
632
00:39:04,600 --> 00:39:09,650
therefore its remains always
stay at that one lily pad.
633
00:39:09,650 --> 00:39:11,750
So we change the Markov
chain again.
634
00:39:11,750 --> 00:39:14,450
The frog can't jump anymore.
635
00:39:14,450 --> 00:39:18,290
And the way we change it is
to change the transition
636
00:39:18,290 --> 00:39:23,910
probabilities out of state 1
to p sub 1, 1, namely the
637
00:39:23,910 --> 00:39:27,010
probability given you're in
state 1, of going back to
638
00:39:27,010 --> 00:39:30,320
state 1 in the next transition
is equal to 1.
639
00:39:30,320 --> 00:39:32,670
So whenever you get
to state 1, you
640
00:39:32,670 --> 00:39:35,270
just stay there forever.
641
00:39:35,270 --> 00:39:39,210
We're going to say r1 equal to
zero, namely the reward you
642
00:39:39,210 --> 00:39:42,240
get in state 1 will be zero.
643
00:39:42,240 --> 00:39:46,070
So you keep getting rewards
until you go to state 1.
644
00:39:46,070 --> 00:39:49,840
And then when you go to state
1, you don't get any reward.
645
00:39:49,840 --> 00:39:54,150
You don't get any reward at
any time after that.
646
00:39:54,150 --> 00:39:56,600
So in fact, we've converted
the problem.
647
00:39:56,600 --> 00:39:59,970
We've converted the Markov chain
to be able to solve the
648
00:39:59,970 --> 00:40:03,160
problem that we want to solve.
649
00:40:03,160 --> 00:40:07,660
Now, how do we know that we
haven't changed the problem in
650
00:40:07,660 --> 00:40:10,330
some awful way?
651
00:40:10,330 --> 00:40:13,710
I mean, any time you start out
with a Markov chain and you
652
00:40:13,710 --> 00:40:16,510
modify it, and you solve a
problem for the modified
653
00:40:16,510 --> 00:40:20,410
chain, you have to really think
through whether you
654
00:40:20,410 --> 00:40:23,550
changed the problem that
you started to solve.
655
00:40:23,550 --> 00:40:27,790
Well, think of any sample path
which starts in some state i,
656
00:40:27,790 --> 00:40:29,610
which is not equal to 1.
657
00:40:29,610 --> 00:40:33,930
Think of the sample path
as going forever.
658
00:40:33,930 --> 00:40:38,430
In the original Markov chain,
that sample path at some
659
00:40:38,430 --> 00:40:43,050
point, presumably, is going
to get to state 1.
660
00:40:43,050 --> 00:40:47,100
After it gets to state 1, we
don't care what happens,
661
00:40:47,100 --> 00:40:51,520
because we then know how long
it's taken to get to state 1.
662
00:40:51,520 --> 00:40:54,550
And after it gets to state
1, the transition
663
00:40:54,550 --> 00:40:56,410
probabilities change.
664
00:40:56,410 --> 00:40:58,410
We don't care about that.
665
00:40:58,410 --> 00:41:03,570
So for every sample path, the
time that it takes the first
666
00:41:03,570 --> 00:41:08,370
passage time to state 1 is the
same in the modified chain
667
00:41:08,370 --> 00:41:10,920
as it is in the actual chain.
668
00:41:10,920 --> 00:41:15,750
The transition probabilities are
the same up until the time
669
00:41:15,750 --> 00:41:17,770
when you first get to state 1.
670
00:41:17,770 --> 00:41:22,300
So for first-passage-time
problems, it doesn't make any
671
00:41:22,300 --> 00:41:26,550
difference what you do after
you get to state 1.
672
00:41:26,550 --> 00:41:30,590
So to make the problem easy,
we're going to set these
673
00:41:30,590 --> 00:41:34,450
transition probabilities in
state 1 to 1, and we're going
674
00:41:34,450 --> 00:41:38,830
to set the reward
equal to zero.
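[Editor's note: the chain modification described here is mechanical. A sketch with an invented 4-state unit chain, using state 0 in the role of the lecture's state 1.]

```python
import numpy as np

# Invented 4-state unit chain; state 0 stands in for "state 1".
P = np.array([[0.3, 0.7, 0.0, 0.0],
              [0.1, 0.4, 0.5, 0.0],
              [0.0, 0.2, 0.5, 0.3],
              [0.4, 0.0, 0.3, 0.3]])

P_mod = P.copy()
P_mod[0] = 0.0
P_mod[0, 0] = 1.0        # trapping state: once entered, stay forever

r = np.ones(4)           # one unit of reward per step...
r[0] = 0.0               # ...until the target state is entered

print(P_mod)
print(r)
```

Sample paths agree with the original chain up to the first entry to state 0, so first-passage times are unchanged by this surgery.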
675
00:41:38,830 --> 00:41:46,710
What do you call a state which
has p sub i, i equal to 1?
676
00:41:46,710 --> 00:41:48,700
You call it a trapping state.
677
00:41:48,700 --> 00:41:51,080
It's a trapping state because
once you get there,
678
00:41:51,080 --> 00:41:52,330
you can't get out.
679
00:41:55,500 --> 00:41:59,710
And since we started out with
a unit chain, and since
680
00:41:59,710 --> 00:42:03,650
presumably state 1 is a
recurrent state in that unit
681
00:42:03,650 --> 00:42:06,500
chain, eventually you're going
to get to state 1.
682
00:42:06,500 --> 00:42:08,560
But once you get there,
you can't get out.
683
00:42:08,560 --> 00:42:11,690
So what you've done is you've
turned the unit chain into
684
00:42:11,690 --> 00:42:15,200
another unit chain where the
recurrent set of states has
685
00:42:15,200 --> 00:42:17,900
only this one state,
state 1 in it.
686
00:42:17,900 --> 00:42:19,690
So it's a trapping state.
687
00:42:19,690 --> 00:42:23,920
Everything eventually
leads to state 1.
688
00:42:23,920 --> 00:42:26,600
All roads lead to Rome, but it's
not obvious that they're
689
00:42:26,600 --> 00:42:28,350
leading to Rome.
690
00:42:28,350 --> 00:42:31,480
And all of these states
eventually lead to state 1,
691
00:42:31,480 --> 00:42:34,420
but not for quite a
while sometimes.
692
00:42:34,420 --> 00:42:35,050
OK.
693
00:42:35,050 --> 00:42:37,710
So the probability of an initial
segment until 1 is
694
00:42:37,710 --> 00:42:41,960
entered is unchanged, and
expected first-passage time
695
00:42:41,960 --> 00:42:43,210
is unchanged.
696
00:42:45,630 --> 00:42:45,770
OK.
697
00:42:45,770 --> 00:42:50,430
The modified Markov chain is now
an ergodic unit chain.
698
00:42:50,430 --> 00:42:53,580
It has a single recurrent
state.
699
00:42:53,580 --> 00:42:57,150
State 1 is a trapping
state, we call it.
700
00:42:57,150 --> 00:43:03,730
ri is equal to 1 for i unequal
to 1, and r1 is equal to zero.
701
00:43:03,730 --> 00:43:08,480
This says that if state 1 is
first entered at time l, and
702
00:43:08,480 --> 00:43:13,770
the aggregate reward from 0 to
n is l for all n greater than
703
00:43:13,770 --> 00:43:14,335
or equal to l.
704
00:43:14,335 --> 00:43:16,780
In other words, after you get
to the trapping state, you
705
00:43:16,780 --> 00:43:19,410
stay there, and you don't
pick up any more
706
00:43:19,410 --> 00:43:21,250
reward from then on.
707
00:43:21,250 --> 00:43:23,970
One of the things that's
maddening about problems like
708
00:43:23,970 --> 00:43:26,720
this, at least that's maddening
for me, because I
709
00:43:26,720 --> 00:43:30,710
can't keep those things
straight, is the difference
710
00:43:30,710 --> 00:43:34,290
between n and n plus 1,
or n and n minus 1.
711
00:43:34,290 --> 00:43:37,280
There's always that strange
thing, we've started at time
712
00:43:37,280 --> 00:43:40,270
m, we get reward at time m.
713
00:43:40,270 --> 00:43:43,600
So if we're looking at n
transitions, as we go from m
714
00:43:43,600 --> 00:43:46,860
to m plus n minus 1.
715
00:43:46,860 --> 00:43:50,150
And that's just life.
716
00:43:50,150 --> 00:43:52,910
If you try to do it in a
different way, you wind up
717
00:43:52,910 --> 00:43:54,800
with a similar problem.
718
00:43:54,800 --> 00:43:56,220
You can't avoid it.
719
00:43:56,220 --> 00:44:02,130
OK, so what we're trying to find
is the expected value of
720
00:44:02,130 --> 00:44:06,470
v sub i of n, and the limit as n
goes to infinity, we'll just
721
00:44:06,470 --> 00:44:10,640
call that v sub i without
the n on it.
722
00:44:10,640 --> 00:44:14,620
And what we want to do is to
calculate this expected time
723
00:44:14,620 --> 00:44:18,040
until we first enter
state one.
724
00:44:18,040 --> 00:44:22,900
We want to calculate that for
all of the other states i.
725
00:44:22,900 --> 00:44:26,980
Well fortunately, there's a
sneaky way to calculate this.
726
00:44:26,980 --> 00:44:29,170
For most of these problems,
there's a sneaky way to
727
00:44:29,170 --> 00:44:30,680
calculate these limits.
728
00:44:30,680 --> 00:44:34,640
And you don't have to worry
about the limit.
729
00:44:34,640 --> 00:44:37,010
So the next thing I'm going to
do is to explain what this
730
00:44:37,010 --> 00:44:39,760
sneaky way is.
731
00:44:39,760 --> 00:44:44,710
You will see the same sneaky
method done about 100 times
732
00:44:44,710 --> 00:44:46,460
from now on until the
end of the course.
733
00:44:46,460 --> 00:44:48,760
We use it all the time.
734
00:44:48,760 --> 00:44:52,250
And each time we do it, we'll
get a better sense of what it
735
00:44:52,250 --> 00:44:53,710
really amounts to.
736
00:44:53,710 --> 00:44:59,150
So for each state unequal to
the trapping state, let's
737
00:44:59,150 --> 00:45:02,290
start out by assuming that
we start at time
738
00:45:02,290 --> 00:45:04,470
zero, in state i.
739
00:45:04,470 --> 00:45:08,580
In other words, what this means
is first we're going to
740
00:45:08,580 --> 00:45:12,490
assume that x sub 0 equals
i for some given i.
741
00:45:12,490 --> 00:45:14,300
We're going to go through
whatever we're going to go
742
00:45:14,300 --> 00:45:17,620
through, then we'll go back
and assume that x sub 0 is
743
00:45:17,620 --> 00:45:18,890
some other i.
744
00:45:18,890 --> 00:45:21,800
And we don't have to worry about
that, because i is just
745
00:45:21,800 --> 00:45:22,900
a generic state.
746
00:45:22,900 --> 00:45:26,320
So we'll do it for everything
at once.
747
00:45:26,320 --> 00:45:30,630
There's a unit reward
at time 0.
748
00:45:30,630 --> 00:45:32,970
r sub i is equal to 1.
749
00:45:32,970 --> 00:45:37,270
So we start out at time
zero, in state i.
750
00:45:37,270 --> 00:45:41,070
We pick up our reward of 1, and
then we go on from there
751
00:45:41,070 --> 00:45:46,370
to see how much longer it
takes to get to state 1.
752
00:45:46,370 --> 00:45:53,170
In addition to this unit reward
at time zero, which
753
00:45:53,170 --> 00:45:56,430
means it's already taken us one
unit of time to get to
754
00:45:56,430 --> 00:46:02,120
state 1, given that x sub 1
equals j, namely, given that
755
00:46:02,120 --> 00:46:07,910
we go from state i to state j,
the remaining expected reward
756
00:46:07,910 --> 00:46:10,380
is v sub j.
757
00:46:10,380 --> 00:46:15,830
In other words, if it's times
0, I'm in some state i.
758
00:46:15,830 --> 00:46:21,110
Given that I go to some state
j, the next unit of time,
759
00:46:21,110 --> 00:46:24,930
what's the remaining
expected time to
760
00:46:24,930 --> 00:46:27,560
get to state 1?
761
00:46:27,560 --> 00:46:32,830
The remaining expected time is
just v sub j, because that's
762
00:46:32,830 --> 00:46:34,050
the expected time.
763
00:46:34,050 --> 00:46:37,550
I mean, if v sub j is something
where it's very hard
764
00:46:37,550 --> 00:46:41,560
to get to state 1, then
we really lost out.
765
00:46:41,560 --> 00:46:44,370
If it's something which is
closer to state 1 in some
766
00:46:44,370 --> 00:46:45,730
sense, then we've gained.
767
00:46:45,730 --> 00:46:51,180
But what we wind up with is the
expected time to get to
768
00:46:51,180 --> 00:46:55,370
state 1 from state i is one.
769
00:46:55,370 --> 00:46:59,450
That's the instant reward that
we get, or the instant cost
770
00:46:59,450 --> 00:47:04,880
that we pay, plus each of
the possible states
771
00:47:04,880 --> 00:47:06,420
we might get to.
772
00:47:06,420 --> 00:47:11,290
There's a cost to go, or
reward to go from that
773
00:47:11,290 --> 00:47:12,470
particular j.
774
00:47:12,470 --> 00:47:15,320
So this is the formula
we have to solve.
775
00:47:15,320 --> 00:47:16,190
What's this mean?
776
00:47:16,190 --> 00:47:20,280
It means we have to solve
this formula for all i.
777
00:47:20,280 --> 00:47:24,870
If I solve it for all i, and
I've solved this for all i,
778
00:47:24,870 --> 00:47:28,910
then that's a set of linear equations
in the variables v
779
00:47:28,910 --> 00:47:40,010
sub 2 up to v sub m, one equation
for each i equals 2, up to m.
780
00:47:40,010 --> 00:47:44,660
We also have decided that
v sub 1 is equal to 0.
781
00:47:44,660 --> 00:47:48,350
In other words, if we start out
in state 1, you expect the
782
00:47:48,350 --> 00:47:50,670
time to get to state 1 is 0.
783
00:47:50,670 --> 00:47:53,260
We're already there.
784
00:47:53,260 --> 00:47:53,730
OK.
785
00:47:53,730 --> 00:47:57,300
So we have to solve these
linear equations.
786
00:47:57,300 --> 00:48:03,130
And if your philosophy on
solving linear equations is
787
00:48:03,130 --> 00:48:08,930
that of, I shouldn't say a
computer scientist because I
788
00:48:08,930 --> 00:48:11,830
don't want to indicate that they
are any different from
789
00:48:11,830 --> 00:48:16,960
any of the rest of us, but for
many people, your philosophy
790
00:48:16,960 --> 00:48:20,720
of solving linear equations
is to try to solve it.
791
00:48:20,720 --> 00:48:24,440
If you can't solve it, it
doesn't have any solution.
792
00:48:24,440 --> 00:48:28,020
And if you're happy with
doing that, fine.
793
00:48:28,020 --> 00:48:33,480
Some people would rather spend
10 hours asking whether in
794
00:48:33,480 --> 00:48:37,030
general it has any solution,
rather than spending five
795
00:48:37,030 --> 00:48:38,806
minutes solving it.
796
00:48:38,806 --> 00:48:48,420
So either way, this expected
first passage time, we've just
797
00:48:48,420 --> 00:48:50,390
stated what it is.
798
00:48:50,390 --> 00:48:57,710
Starting in state i, it's 1 plus
the expected time to go from any
799
00:48:57,710 --> 00:48:59,840
other state you happen
to go to.
800
00:48:59,840 --> 00:49:03,910
If we put this in vector form,
you put things in vector form
801
00:49:03,910 --> 00:49:06,670
because you want to spend two
hours finding the general
802
00:49:06,670 --> 00:49:09,685
solution, rather than five
minutes solving the problem.
803
00:49:14,240 --> 00:49:18,660
If you have 1,000 states, then
it works the other way.
804
00:49:18,660 --> 00:49:22,300
It takes you multiple hours to
work it out by hand, and it
805
00:49:22,300 --> 00:49:25,430
takes you five minutes by
looking at the equation.
806
00:49:25,430 --> 00:49:29,240
So sometimes you win, and
sometimes you lose by looking
807
00:49:29,240 --> 00:49:30,780
at the general solution.
808
00:49:30,780 --> 00:49:37,360
If you look at this as a vector
solution, the vector v
809
00:49:37,360 --> 00:49:43,080
where v1 is equal to zero, and
the other v's are unknowns, is
810
00:49:43,080 --> 00:49:47,590
the vector r, where the
vector r has
811
00:49:47,590 --> 00:49:50,030
0 reward in state 1.
812
00:49:50,030 --> 00:49:53,020
Unit reward in all other states,
because we're trying
813
00:49:53,020 --> 00:49:55,860
to get to this end.
814
00:49:55,860 --> 00:50:00,780
And then we have the matrix
here, P times v.
815
00:50:00,780 --> 00:50:04,780
So we want to solve this set of
linear equations, and what
816
00:50:04,780 --> 00:50:08,720
do we know about this set
of linear equations?
817
00:50:08,720 --> 00:50:11,890
We have an ergodic unit chain.
818
00:50:11,890 --> 00:50:16,410
We know that p has
an eigenvalue,
819
00:50:16,410 --> 00:50:18,700
which is equal to 1.
820
00:50:18,700 --> 00:50:22,040
We know that's a simple
eigenvalue.
821
00:50:22,040 --> 00:50:37,130
So that in fact, when we write
v equals r plus Pv as zero
822
00:50:37,130 --> 00:50:50,070
equals r plus P minus
I times v.
823
00:50:50,070 --> 00:50:52,190
And we try to ask whether
v has any
824
00:50:52,190 --> 00:50:55,040
solution, what's the answer?
825
00:50:55,040 --> 00:50:59,140
Well, this matrix here has
an eigenvalue of 1.
826
00:50:59,140 --> 00:51:02,030
Since it has an eigenvalue of
one, and since it's a simple
827
00:51:02,030 --> 00:51:06,160
eigenvalue, there's a space of
solutions to this equation.
828
00:51:06,160 --> 00:51:11,330
The space of solutions is the
vector of all ones times any
829
00:51:11,330 --> 00:51:12,850
constant, and nothing else.
830
00:51:12,850 --> 00:51:17,650
In other words, it's the vector
e times any constant alpha.
831
00:51:17,650 --> 00:51:21,460
Now we've stuck this in here,
so now we want to find out
832
00:51:21,460 --> 00:51:25,200
what's the set of
solutions now.
833
00:51:25,200 --> 00:51:31,730
We observe v plus alpha e also
satisfies this equation, so we've
834
00:51:31,730 --> 00:51:33,500
found another solution.
835
00:51:33,500 --> 00:51:37,450
So if we found a solution, we
have a one dimensional family
836
00:51:37,450 --> 00:51:40,110
of solutions.
837
00:51:40,110 --> 00:51:47,520
Well, since this eigenvalue is a
simple eigenvalue, the space
838
00:51:47,520 --> 00:51:56,040
of vectors v for which r is equal
to I minus P times v is
839
00:51:56,040 --> 00:51:59,390
a one dimensional space, and
therefore there has to be a
840
00:51:59,390 --> 00:52:02,350
unique solution to
this equation.
841
00:52:02,350 --> 00:52:03,490
OK.
842
00:52:03,490 --> 00:52:07,460
So in fact, in only 15 minutes,
we've solved the
843
00:52:07,460 --> 00:52:13,710
problem in general, so that you
can deal with matrices of
844
00:52:13,710 --> 00:52:17,990
1,000 states, as opposed
to two states.
845
00:52:17,990 --> 00:52:20,170
And you still have
the same answer.
846
00:52:20,170 --> 00:52:21,840
OK.
847
00:52:21,840 --> 00:52:26,970
So this equation has a simple
solution, which says that you
848
00:52:26,970 --> 00:52:29,850
can program your computer to
solve this set of linear
849
00:52:29,850 --> 00:52:33,270
equations, and you're bound
to get an answer.
850
00:52:33,270 --> 00:52:35,740
And the answer will tell you
how long it takes to get to
851
00:52:35,740 --> 00:52:39,958
this particular state.
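The linear-equation solution just described is easy to sketch in code. The following Python snippet is a hypothetical illustration (the chain and rewards are made up, not from the lecture's slide): it pins the value at the trapping state to zero and solves v = r + [P]v on the remaining states.

```python
import numpy as np

# Illustrative sketch: expected number of steps to reach a trapping state,
# for an assumed 4-state unit chain (state 0 is the trapping state).
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # trapping state: stays put
    [0.3, 0.4, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
r = np.array([0.0, 1.0, 1.0, 1.0])  # 0 reward in the trap, unit reward elsewhere

# v = r + P v with v[0] pinned to 0; solve (I - P) v = r on states 1..m-1.
m = P.shape[0]
A = np.eye(m - 1) - P[1:, 1:]
v = np.zeros(m)
v[1:] = np.linalg.solve(A, r[1:])
print(v)  # v[i] = expected time to reach state 0 starting from state i
```

Pinning v at the trap is what makes the solution unique; the reduced matrix I minus P on the transient states is invertible because every one of those states eventually leads to the trap.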
852
00:52:39,958 --> 00:52:40,390
OK.
853
00:52:40,390 --> 00:52:46,705
Let's go on to aggregate
rewards with a final reward.
854
00:52:51,420 --> 00:52:53,560
Starting to sound like-- yes?
855
00:52:53,560 --> 00:52:56,990
STUDENT: I'm sorry, for the
last example, how are we
856
00:52:56,990 --> 00:52:57,970
guaranteed that it's ergodic?
857
00:52:57,970 --> 00:53:01,370
Like, isn't it possible you enter a
loop somewhere that can never
858
00:53:01,370 --> 00:53:05,670
go to your trapping
state, right?
859
00:53:05,670 --> 00:53:09,750
PROFESSOR: But I can't do that
because there always has to be
860
00:53:09,750 --> 00:53:12,520
a way of getting to the trapping
state, because
861
00:53:12,520 --> 00:53:14,770
there's only one recurrent
state.
862
00:53:14,770 --> 00:53:19,170
All these other states
are transient now.
863
00:53:19,170 --> 00:53:19,920
STUDENT: No, but I mean--
864
00:53:19,920 --> 00:53:21,467
OK, like, let's say you
start off with a
865
00:53:21,467 --> 00:53:22,655
general Markov chain.
866
00:53:22,655 --> 00:53:24,560
PROFESSOR: Oh, I start off with
a general Markov chain?
867
00:53:24,560 --> 00:53:27,060
You're absolutely right.
868
00:53:27,060 --> 00:53:30,060
Then there might be no way of
getting from some starting
869
00:53:30,060 --> 00:53:34,610
state to state 1, and therefore,
the amount of time
870
00:53:34,610 --> 00:53:36,890
that it takes you to get from
that state to the starting
871
00:53:36,890 --> 00:53:38,750
state is going to be infinite.
872
00:53:38,750 --> 00:53:40,250
You can't get there.
873
00:53:40,250 --> 00:53:43,960
So in fact, what you have to do
with a problem like this is
874
00:53:43,960 --> 00:53:48,730
to look at it first, and say,
are you in fact dealing with a
875
00:53:48,730 --> 00:53:49,760
unit chain?
876
00:53:49,760 --> 00:53:52,990
Or do you have multiple
recurrent sets?
877
00:53:52,990 --> 00:53:57,100
If you have multiple recurrent
sets, then the expected time
878
00:53:57,100 --> 00:54:00,770
to get into one of the recurrent
states, starting
879
00:54:00,770 --> 00:54:04,840
from either a transient state,
or from some other recurrent
880
00:54:04,840 --> 00:54:08,720
set is infinite.
881
00:54:08,720 --> 00:54:11,820
I mean, just like this business
we were going through
882
00:54:11,820 --> 00:54:13,480
at the beginning.
883
00:54:13,480 --> 00:54:16,050
What you would like to do is not
have to go through a lot
884
00:54:16,050 --> 00:54:20,750
of calculation, or
a lot of thinking, when you
885
00:54:20,750 --> 00:54:24,070
have multiple recurrent
sets of states.
886
00:54:24,070 --> 00:54:25,980
You just know what
happens there.
887
00:54:25,980 --> 00:54:28,540
There's no way to get from this
recurrent set to this
888
00:54:28,540 --> 00:54:30,020
recurrent set.
889
00:54:30,020 --> 00:54:31,440
So that's the end of it.
890
00:54:31,440 --> 00:54:31,888
STUDENT: OK.
891
00:54:31,888 --> 00:54:34,277
So like it works when you have
the unit chain, and then you
892
00:54:34,277 --> 00:54:36,585
choose your trapping state to
be one instance [INAUDIBLE].
893
00:54:36,585 --> 00:54:37,835
PROFESSOR: Yes.
894
00:54:39,700 --> 00:54:40,150
OK.
895
00:54:40,150 --> 00:54:41,400
Good.
896
00:54:44,220 --> 00:54:47,160
Now, yes?
897
00:54:47,160 --> 00:54:50,410
STUDENT: The previous equation
is true for any reward.
898
00:54:50,410 --> 00:54:51,692
But it's not necessary--
899
00:54:51,692 --> 00:54:53,950
PROFESSOR: Yeah, it is true for
any set of rewards, yes.
900
00:54:59,720 --> 00:55:02,090
Although what the interpretation
would be of any
901
00:55:02,090 --> 00:55:05,900
set of rewards is, you
have to sort that out.
902
00:55:05,900 --> 00:55:06,590
But yes.
903
00:55:06,590 --> 00:55:10,200
For any r that you choose,
there's going to be one unique
904
00:55:10,200 --> 00:55:15,530
solution, so long as one is
actually a trapping state, and
905
00:55:15,530 --> 00:55:16,950
everything else leads to one.
906
00:55:20,600 --> 00:55:25,875
OK, so why do I want to
put a-- ah, good.
907
00:55:25,875 --> 00:55:27,537
STUDENT: I feel like there's a
lot of the rewards that are
908
00:55:27,537 --> 00:55:30,625
designed for it, designed with
respect to being in a
909
00:55:30,625 --> 00:55:31,575
particular state.
910
00:55:31,575 --> 00:55:32,060
PROFESSOR: Yes.
911
00:55:32,060 --> 00:55:34,340
STUDENT: But if the rewards are
actually in transition, so
912
00:55:34,340 --> 00:55:38,012
for example, if you go from i to
j, it's going to be a
913
00:55:38,012 --> 00:55:40,015
different number from j to j.
914
00:55:40,015 --> 00:55:41,580
How do you deal with that?
915
00:55:41,580 --> 00:55:42,400
PROFESSOR: How do I
deal with that?
916
00:55:42,400 --> 00:55:45,000
Well, then let's talk
about that.
917
00:55:45,000 --> 00:55:48,200
And in fact, it's fairly simple
so long as you're only
918
00:55:48,200 --> 00:55:50,750
talking about expected
rewards.
919
00:55:50,750 --> 00:55:54,450
Because if I have a reward
associated with--
920
00:55:57,096 --> 00:56:18,574
if I have a reward rij, which is
the reward for transition i
921
00:56:18,574 --> 00:56:36,600
to j, then if I take the sum of
rij times pij, summed over j,
922
00:56:36,600 --> 00:56:51,768
what this gives me is the
expected reward associated
923
00:56:51,768 --> 00:56:53,940
with state i.
924
00:56:59,750 --> 00:57:02,680
Now, you have to be a little bit
careful with this because
925
00:57:02,680 --> 00:57:06,310
before we've been picking up
this reward as soon as we get
926
00:57:06,310 --> 00:57:09,860
to state i, and here suddenly
we have a slightly different
927
00:57:09,860 --> 00:57:14,560
situation where you have a
reward associated with state i
928
00:57:14,560 --> 00:57:17,230
but you don't pick it up
until the next step.
929
00:57:17,230 --> 00:57:23,780
So this is where this problem
of i or i plus 1 comes in.
930
00:57:23,780 --> 00:57:29,070
And you guys can do that much
better than I can, because at
931
00:57:29,070 --> 00:57:36,480
my age I start out with an age
of 60 and an age of 61 is the
932
00:57:36,480 --> 00:57:38,500
same thing.
933
00:57:38,500 --> 00:57:40,880
I mean, these are--
934
00:57:40,880 --> 00:57:42,130
OK.
935
00:57:44,790 --> 00:57:48,660
So anyway, the point of it is,
if you have rewards associated
936
00:57:48,660 --> 00:57:52,320
with transitions you can always
convert that to rewards
937
00:57:52,320 --> 00:57:53,570
associated with states.
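The conversion just described, folding transition rewards into per-state expected rewards, can be sketched in a couple of lines of Python. The two-state chain and reward numbers here are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch: turning transition rewards r_ij into state rewards.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([[ 0.0, 5.0],    # r_ij: reward for the transition i -> j
              [-1.0, 2.0]])

# Expected reward picked up when leaving state i: r_i = sum_j p_ij * r_ij.
r = (P * R).sum(axis=1)
print(r)  # array([0.5, 0.5])
```

As the professor notes, the one subtlety is timing: this expected reward for state i is collected on the transition out of i, a step later than a reward attached to arriving in i.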
938
00:57:58,320 --> 00:58:02,220
Oh, I didn't really
get to this.
939
00:58:02,220 --> 00:58:06,150
What I've been trying to say
now for a while is that
940
00:58:06,150 --> 00:58:13,120
sometimes, for some reason or
other, after you go through
941
00:58:13,120 --> 00:58:16,990
n steps of this Markov
chain, when you get to the
942
00:58:16,990 --> 00:58:21,340
end, you want to consider some
particularly large reward for
943
00:58:21,340 --> 00:58:24,540
having gotten to the end, or
some particularly large cost
944
00:58:24,540 --> 00:58:27,950
of getting to the end, or
something which depends on the
945
00:58:27,950 --> 00:58:30,190
state that you happen
to be in.
946
00:58:30,190 --> 00:58:34,630
So we will assign some final
reward which in general can be
947
00:58:34,630 --> 00:58:37,820
different from the reward that
we're picking up at each of
948
00:58:37,820 --> 00:58:38,840
the other states.
949
00:58:38,840 --> 00:58:41,105
We're going to do this
in a particular way.
950
00:58:47,740 --> 00:58:50,920
You would think that what we
would want to do is, if we
951
00:58:50,920 --> 00:58:55,210
went through n steps, we would
associate this final
952
00:58:55,210 --> 00:58:57,580
reward with the n-th step.
953
00:58:57,580 --> 00:58:59,220
We're going to do it
a different way.
954
00:58:59,220 --> 00:59:02,180
We're going to go through n
steps, and then the final
955
00:59:02,180 --> 00:59:05,980
reward is what happens on
the state after that.
956
00:59:05,980 --> 00:59:09,480
So we're really turning the
problem of looking at n steps
957
00:59:09,480 --> 00:59:13,490
into a problem of looking
at n plus 1 steps.
958
00:59:13,490 --> 00:59:14,490
Why do we do that?
959
00:59:14,490 --> 00:59:16,320
Completely arbitrary.
960
00:59:16,320 --> 00:59:19,320
It turns out to be convenient
when we talk about dynamic
961
00:59:19,320 --> 00:59:24,720
programming, and you'll see
why in just a minute.
962
00:59:24,720 --> 00:59:29,770
So this extra final reward is
just an arbitrary thing that
963
00:59:29,770 --> 00:59:34,230
you add, and we'll see
the main purpose for
964
00:59:34,230 --> 00:59:35,780
it in just a minute.
965
00:59:38,730 --> 00:59:39,380
OK.
966
00:59:39,380 --> 00:59:45,910
So we're going to now look at
what in principle is a much
967
00:59:45,910 --> 00:59:48,880
more complicated situation than
what we were looking at
968
00:59:48,880 --> 00:59:53,180
before, but you still have this
basic Markov condition
969
00:59:53,180 --> 00:59:56,310
which is making things
simple for you.
970
00:59:56,310 --> 01:00:00,990
So the idea is, you're looking
at a discrete time situation.
971
01:00:00,990 --> 01:00:04,260
Things happen in steps.
972
01:00:04,260 --> 01:00:07,655
There's a finite set of states
which don't change over time.
973
01:00:10,530 --> 01:00:13,690
At each unit of time, you're
going to be in one of the set
974
01:00:13,690 --> 01:00:20,420
of m states, and at each time l,
there's some decision maker
975
01:00:20,420 --> 01:00:24,520
sitting around who looks
at the state that
976
01:00:24,520 --> 01:00:26,530
you're in at time l.
977
01:00:26,530 --> 01:00:31,970
And the decision maker says I
have a choice between what
978
01:00:31,970 --> 01:00:38,570
reward I'm going to pick up
at this time and what the
979
01:00:38,570 --> 01:00:43,020
transition probabilities are for
going to the next state.
980
01:00:43,020 --> 01:00:46,110
OK, so it's kind of a
complicated thing.
981
01:00:46,110 --> 01:00:51,440
It's the same thing that
you face all the time.
982
01:00:51,440 --> 01:00:54,300
I mean, in the stock market for
example, you see that one
983
01:00:54,300 --> 01:00:57,010
stock is doing poorly,
so you have a choice.
984
01:00:57,010 --> 01:01:03,620
Should I sell it, eat my losses,
or should I keep on
985
01:01:03,620 --> 01:01:05,980
going and hope it'll
turn around?
986
01:01:05,980 --> 01:01:09,980
If you're doing a thesis, you
have the even worse problem.
987
01:01:09,980 --> 01:01:13,540
You go for three months without
getting the result
988
01:01:13,540 --> 01:01:19,120
that you need, and you say,
well, I don't have a thesis.
989
01:01:19,120 --> 01:01:21,960
I can't say something
about this.
990
01:01:21,960 --> 01:01:25,280
Should I go on for one more
month, or should I can it and
991
01:01:25,280 --> 01:01:27,400
go on to another topic?
992
01:01:27,400 --> 01:01:30,460
OK, it's exactly the
same situation.
993
01:01:30,460 --> 01:01:34,900
So this is really a very broad
set of situations.
994
01:01:34,900 --> 01:01:37,858
The only thing that makes it
really different from real
995
01:01:37,858 --> 01:01:42,260
life is this Markov property
sitting there and the fact
996
01:01:42,260 --> 01:01:46,190
that you actually understand
what the rewards are and you
997
01:01:46,190 --> 01:01:48,180
can predict them in advance.
998
01:01:48,180 --> 01:01:51,990
You can't predict what state
you're going to be in, but you
999
01:01:51,990 --> 01:01:54,230
know that if you're in a
particular state, you know
1000
01:01:54,230 --> 01:01:58,560
what your choices are in the
future as well as now, and all
1001
01:01:58,560 --> 01:02:03,360
you have to do at each unit of
time is to make this choice
1002
01:02:03,360 --> 01:02:05,860
between various different
things.
1003
01:02:05,860 --> 01:02:08,485
You see an interesting
example of that here.
1004
01:02:13,890 --> 01:02:17,430
If you look at this Markov chain
here, it's a two state
1005
01:02:17,430 --> 01:02:18,680
Markov chain.
1006
01:02:21,770 --> 01:02:23,860
And what's the steady
state probability of
1007
01:02:23,860 --> 01:02:25,150
being in state one?
1008
01:02:32,420 --> 01:02:33,670
Anybody?
1009
01:02:35,596 --> 01:02:37,050
It's a half, yes.
1010
01:02:37,050 --> 01:02:40,480
Why is it a half, and
why don't you have
1011
01:02:40,480 --> 01:02:41,930
to solve for this?
1012
01:02:41,930 --> 01:02:45,150
Why can you look at it
and say it's a half?
1013
01:02:45,150 --> 01:02:46,740
Because it's completely
symmetric.
1014
01:02:46,740 --> 01:02:53,930
0.99 here, 0.99 here, 0.01
here, 0.01 here.
1015
01:02:53,930 --> 01:02:56,450
These rewards here had nothing
to do with the
1016
01:02:56,450 --> 01:02:58,290
Markov chain itself.
1017
01:02:58,290 --> 01:03:02,210
The Markov chain is symmetric
between states one and two,
1018
01:03:02,210 --> 01:03:05,200
and therefore, the steady state
probabilities have to be
1019
01:03:05,200 --> 01:03:06,660
one half each.
1020
01:03:06,660 --> 01:03:13,410
So here's something where, if
you happen to be in state two,
1021
01:03:13,410 --> 01:03:15,090
you're going to stay
there typically
1022
01:03:15,090 --> 01:03:17,100
for a very long time.
1023
01:03:17,100 --> 01:03:20,080
And while you're staying there
for a very long time,
1024
01:03:20,080 --> 01:03:24,020
you're going to be picking up
rewards one unit of reward
1025
01:03:24,020 --> 01:03:26,930
every unit of time.
1026
01:03:26,930 --> 01:03:30,720
You work for some very stable
employer who pays you very
1027
01:03:30,720 --> 01:03:33,540
little, and that's a
situation you have.
1028
01:03:33,540 --> 01:03:37,880
You're sitting here, you have
a job but you're not making
1029
01:03:37,880 --> 01:03:42,710
much, but still you're making
something, and you have a lot
1030
01:03:42,710 --> 01:03:45,510
of job security.
1031
01:03:45,510 --> 01:03:49,760
Now, we have a different choice
when we're sitting here
1032
01:03:49,760 --> 01:03:57,390
with a job in state two, we can,
for example, you can go
1033
01:03:57,390 --> 01:04:00,300
to the cash register and take
all the money out of it and
1034
01:04:00,300 --> 01:04:01,550
disappear from the company.
1035
01:04:03,920 --> 01:04:07,170
I don't advocate doing
that, except,
1036
01:04:07,170 --> 01:04:09,190
it's one of your choices.
1037
01:04:09,190 --> 01:04:13,730
So you pick up a big reward of
50, and then for a long period
1038
01:04:13,730 --> 01:04:18,820
of time you go back to this
state over here and you make
1039
01:04:18,820 --> 01:04:21,720
nothing in reward for a
long period of time
1040
01:04:21,720 --> 01:04:23,360
while you're in jail.
1041
01:04:23,360 --> 01:04:28,650
And then eventually you pop back
here, and if we assume
1042
01:04:28,650 --> 01:04:32,320
the judicial system is such
that it has no memory,
1043
01:04:32,320 --> 01:04:33,670
[INAUDIBLE]
1044
01:04:33,670 --> 01:04:40,020
you can cut into the cash
register, and, well, OK.
1045
01:04:40,020 --> 01:04:43,850
So anyway, this decision two,
you're looking for instant
1046
01:04:43,850 --> 01:04:45,410
gratification here.
1047
01:04:45,410 --> 01:04:48,340
You're getting a big reward all
at once, but by getting a
1048
01:04:48,340 --> 01:04:53,040
big reward with probability
one, you're going back to
1049
01:04:53,040 --> 01:04:54,390
state zero.
1050
01:04:54,390 --> 01:04:57,830
From state zero, it takes a long
time to get back to the
1051
01:04:57,830 --> 01:05:02,940
point where you can get a big
reward again, so you wonder,
1052
01:05:02,940 --> 01:05:07,020
is it better to use this policy
or is it better to use
1053
01:05:07,020 --> 01:05:08,270
this policy?
1054
01:05:10,670 --> 01:05:14,160
Now, there are two basic ways
to look at this problem.
1055
01:05:14,160 --> 01:05:16,280
I think it's important to
understand what they are
1056
01:05:16,280 --> 01:05:18,330
before we go further.
1057
01:05:18,330 --> 01:05:24,660
One of the ways is to say, OK,
let's suppose that I work out
1058
01:05:24,660 --> 01:05:30,440
which is the best policy
and I use it forever.
1059
01:05:30,440 --> 01:05:34,140
Namely, I use this policy
forever or I
1060
01:05:34,140 --> 01:05:36,910
use this policy forever.
1061
01:05:36,910 --> 01:05:40,570
And if I use this policy
forever, I can pretty easily
1062
01:05:40,570 --> 01:05:43,470
work out what the steady state
probabilities of these two
1063
01:05:43,470 --> 01:05:44,680
states are.
1064
01:05:44,680 --> 01:05:50,040
I can then work out what my
expected gain is per unit time
1065
01:05:50,040 --> 01:05:52,690
and I can compare
this with that.
1066
01:05:55,260 --> 01:05:58,140
And who thinks that this is
going to be better than that
1067
01:05:58,140 --> 01:06:01,370
and who thinks that this is
going to be better than that?
1068
01:06:01,370 --> 01:06:03,600
Well, you can work
it out easily.
1069
01:06:03,600 --> 01:06:07,080
It's kind of interesting because
the steady state gain
1070
01:06:07,080 --> 01:06:12,940
here and here are very
close to the same.
1071
01:06:12,940 --> 01:06:17,480
It turns out that this is just
a smidgen better than this,
1072
01:06:17,480 --> 01:06:19,610
only by a very small amount.
1073
01:06:19,610 --> 01:06:19,950
OK.
1074
01:06:19,950 --> 01:06:25,620
See, what happens here is that
here, you tend to go for about
1075
01:06:25,620 --> 01:06:28,110
100 steps here.
1076
01:06:28,110 --> 01:06:33,090
So you pick up a total reward of
about 100 if you use this very
1077
01:06:33,090 --> 01:06:34,890
simple minded analysis.
1078
01:06:34,890 --> 01:06:37,810
Then for 100 steps, you're
sitting here, you're getting
1079
01:06:37,810 --> 01:06:43,020
no reward, so you think we ought
to get an average reward of
1080
01:06:43,020 --> 01:06:46,310
one half,
and that's exactly
1081
01:06:46,310 --> 01:06:48,090
what you do get here.
1082
01:06:48,090 --> 01:06:52,820
And here, you get this big
reward of 50, but then you go
1083
01:06:52,820 --> 01:06:57,560
over here and you spend 100
units of time in purgatory and
1084
01:06:57,560 --> 01:07:00,470
then you get back again, you get
another reward of 50 and
1085
01:07:00,470 --> 01:07:03,830
then spend 100 units
of time in purgatory.
1086
01:07:03,830 --> 01:07:07,300
So again, you're getting pretty
close to a half of a
1087
01:07:07,300 --> 01:07:10,690
unit of reward, but it turns
out, when you work it out,
1088
01:07:10,690 --> 01:07:12,690
that here it's just a smidgen less.
1089
01:07:12,690 --> 01:07:18,190
It's 1% less than a half, so
this is not as good as that.
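The stationary comparison between the two policies can be checked numerically. This Python sketch uses the numbers from the example: under the patient policy you collect 1 per step in state two, while grabbing the 50 sends you back to state one with probability one.

```python
import numpy as np

def stationary_gain(P, r):
    """Gain of a fixed policy: steady-state probabilities dotted with rewards."""
    m = P.shape[0]
    # Solve pi (I - P) = 0 together with sum(pi) = 1 (replace one equation).
    A = np.vstack([(np.eye(m) - P).T[:-1], np.ones(m)])
    b = np.zeros(m); b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return pi @ r

# Policy 1: sit in state two and collect 1 per unit of time.
P1 = np.array([[0.99, 0.01],
               [0.01, 0.99]])
r1 = np.array([0.0, 1.0])

# Policy 2: grab the reward of 50 in state two, which sends you back to state one.
P2 = np.array([[0.99, 0.01],
               [1.00, 0.00]])
r2 = np.array([0.0, 50.0])

print(stationary_gain(P1, r1))  # 0.5
print(stationary_gain(P2, r2))  # 50/101, about 0.495 -- a smidgen worse
```

The second policy's steady-state probabilities are 100/101 and 1/101, so its gain is 50/101, which is the "1% less than a half" mentioned above.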
1090
01:07:18,190 --> 01:07:24,900
But suppose that you have
a shorter time horizon.
1091
01:07:24,900 --> 01:07:28,380
Suppose you don't want to wait
for 1,000 steps to see what's
1092
01:07:28,380 --> 01:07:32,280
going on, so you don't want
to look at the average.
1093
01:07:32,280 --> 01:07:34,280
Suppose this was a
gambling game.
1094
01:07:34,280 --> 01:07:38,230
You have your choice of these
two gambling options, and
1095
01:07:38,230 --> 01:07:41,820
suppose you're only going to be
playing for a short time.
1096
01:07:41,820 --> 01:07:43,180
Suppose you're going
to be only playing
1097
01:07:43,180 --> 01:07:44,830
for one unit of time.
1098
01:07:44,830 --> 01:07:47,180
You can only play for one unit
of time and then you have to
1099
01:07:47,180 --> 01:07:50,780
stop, you have to go home, you
have to go back to work, or
1100
01:07:50,780 --> 01:07:52,180
something else.
1101
01:07:52,180 --> 01:07:54,870
And you happen to be sitting
in state two.
1102
01:07:54,870 --> 01:07:57,180
What do you want to do
if you only have one
1103
01:07:57,180 --> 01:07:58,730
unit of time to play.
1104
01:07:58,730 --> 01:08:03,630
Well, obviously, you want to get
the reward of 50, because
1105
01:08:03,630 --> 01:08:07,620
delayed gratification doesn't
work here, because you don't
1106
01:08:07,620 --> 01:08:11,330
get any opportunity for that
gratification later.
1107
01:08:11,330 --> 01:08:14,900
So you pick up the big
reward at first.
1108
01:08:14,900 --> 01:08:18,630
So when you have this problem of
playing for a finite amount
1109
01:08:18,630 --> 01:08:24,649
of time, whatever kind of
situation you're in, what you
1110
01:08:24,649 --> 01:08:28,310
would like to do is say, for
this finite amount of time
1111
01:08:28,310 --> 01:08:34,290
that I'm going to play, what's
my best strategy then?
1112
01:08:34,290 --> 01:08:39,850
Dynamic programming is
the
1113
01:08:39,850 --> 01:08:43,600
algorithm which finds out what
the best thing to do is
1114
01:08:43,600 --> 01:08:45,000
dynamically.
1115
01:08:45,000 --> 01:08:48,670
Namely, if you're going to stop
in 10 steps, stop in 100
1116
01:08:48,670 --> 01:08:52,710
steps, stop in one step, it
tells you what to do under all
1117
01:08:52,710 --> 01:08:55,350
of those circumstances.
1118
01:08:55,350 --> 01:08:59,340
And the stationary policy tells
you what to do if you're
1119
01:08:59,340 --> 01:09:02,630
going to play forever.
1120
01:09:02,630 --> 01:09:05,760
But in a situation like this
where things happen rather
1121
01:09:05,760 --> 01:09:10,189
slowly, it might not be the
relevant thing to deal with.
1122
01:09:10,189 --> 01:09:13,170
A lot of the notes deal with
comparing the stationary
1123
01:09:13,170 --> 01:09:17,180
policy with this
dynamic policy.
1124
01:09:17,180 --> 01:09:21,399
And I'm not going to do that
here because, well, we have
1125
01:09:21,399 --> 01:09:23,290
too many other interesting
things that we
1126
01:09:23,290 --> 01:09:24,170
want to deal with.
1127
01:09:24,170 --> 01:09:26,939
So we're just going to skip
all of that stuff about
1128
01:09:26,939 --> 01:09:28,470
stationary policies.
1129
01:09:28,470 --> 01:09:30,670
You don't have to bother to
read it unless you're
1130
01:09:30,670 --> 01:09:32,580
interested in it.
1131
01:09:32,580 --> 01:09:35,029
I mean, if you're interested in
it, by all means, read it.
1132
01:09:35,029 --> 01:09:38,950
It's a very interesting topic.
1133
01:09:38,950 --> 01:09:41,580
It's not all that interesting
to find out what the best
1134
01:09:41,580 --> 01:09:42,990
stationary policy is.
1135
01:09:42,990 --> 01:09:45,210
That's kind of simple.
1136
01:09:45,210 --> 01:09:48,729
The interesting topic
is the comparison
1137
01:09:48,729 --> 01:09:53,100
between the dynamic policy and
the stationary policy.
1138
01:09:53,100 --> 01:09:56,500
But all we're going to do
is worry about what the
1139
01:09:56,500 --> 01:09:58,160
dynamic policy is.
1140
01:09:58,160 --> 01:10:03,460
That seems like a hard problem,
and someone by the
1141
01:10:03,460 --> 01:10:09,720
name of Bellman figured out what
the optimal solution to
1142
01:10:09,720 --> 01:10:12,025
that dynamic policy was.
1143
01:10:12,025 --> 01:10:16,900
And it turned out to be a
trivially simple algorithm,
1144
01:10:16,900 --> 01:10:20,030
and Bellman became
famous forever.
1145
01:10:20,030 --> 01:10:23,080
One of the things I want to
point out to you, again, I
1146
01:10:23,080 --> 01:10:27,250
keep coming back to this because
you people are just
1147
01:10:27,250 --> 01:10:29,970
starting a research career.
1148
01:10:29,970 --> 01:10:34,490
Everyone in this class, given
the formulation of this
1149
01:10:34,490 --> 01:10:38,670
dynamic programming problem,
could develop and would
1150
01:10:38,670 --> 01:10:43,440
develop, I'm pretty sure, the
dynamic programming algorithm.
1151
01:10:43,440 --> 01:10:47,020
Developing the algorithm,
understanding what the problem
1152
01:10:47,020 --> 01:10:50,210
is is a trivial matter.
1153
01:10:50,210 --> 01:10:52,390
Why is Bellman famous?
1154
01:10:52,390 --> 01:10:56,270
Because he formulated
the problem.
1155
01:10:56,270 --> 01:11:01,010
He said, aha, this dynamic
problem is interesting.
1156
01:11:01,010 --> 01:11:04,710
I don't have to go through
the stationary problem.
1157
01:11:04,710 --> 01:11:08,430
And in fact, my sense from
reading his book and from
1158
01:11:08,430 --> 01:11:11,470
reading things he's written is
that he couldn't have solved
1159
01:11:11,470 --> 01:11:14,240
the stationary problem because
he didn't understand
1160
01:11:14,240 --> 01:11:16,750
probability that well.
1161
01:11:16,750 --> 01:11:20,600
But he did understand how to
formulate what this really
1162
01:11:20,600 --> 01:11:24,330
important problem was
and he solved it.
1163
01:11:24,330 --> 01:11:27,880
So, all the more credit to him,
but when you're doing
1164
01:11:27,880 --> 01:11:32,460
research, the time you spend
on formulating the right
1165
01:11:32,460 --> 01:11:37,430
problem is far more important
than the time you spend
1166
01:11:37,430 --> 01:11:38,390
solving it.
1167
01:11:38,390 --> 01:11:41,490
If you start out with the right
problem, the solution is
1168
01:11:41,490 --> 01:11:45,650
trivial and you're all done.
1169
01:11:45,650 --> 01:11:49,930
It's hard to formulate the right
problem, and you learn
1170
01:11:49,930 --> 01:11:57,810
to formulate the problem not
by grinding through all of this
1171
01:11:57,810 --> 01:12:01,570
calculating, but by
sitting back and thinking
1172
01:12:01,570 --> 01:12:04,480
about the problem and trying
to look at things in a more
1173
01:12:04,480 --> 01:12:06,050
general way.
1174
01:12:06,050 --> 01:12:07,660
So just another plug.
1175
01:12:07,660 --> 01:12:10,440
I've been saying this, I will
probably say it every three or
1176
01:12:10,440 --> 01:12:14,420
four lectures throughout
the term.
1177
01:12:14,420 --> 01:12:14,860
OK.
1178
01:12:14,860 --> 01:12:18,450
So let's go back and look
at what the problem is.
1179
01:12:18,450 --> 01:12:21,330
We haven't quite formulated
it yet.
1180
01:12:21,330 --> 01:12:24,940
We're going to assume this
process of random transitions
1181
01:12:24,940 --> 01:12:27,790
combined with decisions based
on the current state.
1182
01:12:27,790 --> 01:12:30,380
In other words, in this decision
maker, the decision
1183
01:12:30,380 --> 01:12:34,960
maker at each unit of time sees
what state you're in at
1184
01:12:34,960 --> 01:12:37,040
this unit of time.
1185
01:12:37,040 --> 01:12:40,940
And seeing what state you're in
at this given unit of time,
1186
01:12:40,940 --> 01:12:45,020
the decision maker has a choice
between how much reward
1187
01:12:45,020 --> 01:12:51,740
is to be taken and along with
how much reward is to be
1188
01:12:51,740 --> 01:12:54,940
taken, what the transition
probabilities are
1189
01:12:54,940 --> 01:12:56,160
for the next state.
1190
01:12:56,160 --> 01:13:00,150
If you rob the cash register,
your transition probabilities
1191
01:13:00,150 --> 01:13:02,230
are going to be very different
than if you don't
1192
01:13:02,230 --> 01:13:04,680
rob the cash register.
1193
01:13:04,680 --> 01:13:08,190
By robbing the cash register,
your transition probabilities
1194
01:13:08,190 --> 01:13:10,770
go into a rather high transition
probability that
1195
01:13:10,770 --> 01:13:12,270
you're going to be caught.
1196
01:13:12,270 --> 01:13:16,050
OK, so you don't want that.
1197
01:13:16,050 --> 01:13:20,600
So you can't avoid the problem
of having the rewards at a
1198
01:13:20,600 --> 01:13:24,290
given time locked into what the
transition probabilities
1199
01:13:24,290 --> 01:13:27,990
are for going to the next state,
and that's the essence
1200
01:13:27,990 --> 01:13:29,890
of this problem.
1201
01:13:29,890 --> 01:13:30,500
OK.
1202
01:13:30,500 --> 01:13:33,470
So, the decision maker observes
the state and
1203
01:13:33,470 --> 01:13:36,530
chooses one of a finite
set of alternatives.
1204
01:13:36,530 --> 01:13:39,790
Each alternative consists of
a current reward, which we'll
1205
01:13:39,790 --> 01:13:44,030
call r sub j of k, the
alternative is k, and a set of
1206
01:13:44,030 --> 01:13:45,980
transition probabilities.
1207
01:13:45,980 --> 01:13:50,250
pjl of k, one less than or
equal to l less than or
1208
01:13:50,250 --> 01:13:52,750
equal to m for going
to the next state.
1209
01:13:52,750 --> 01:13:56,450
OK, the notation here is
horrifying, but the idea is
1210
01:13:56,450 --> 01:13:57,880
very simple.
1211
01:13:57,880 --> 01:14:01,370
I mean, once you get used to the
notation, there's nothing
1212
01:14:01,370 --> 01:14:04,880
complicated here at all.
1213
01:14:04,880 --> 01:14:08,940
OK, so in this example here,
well, we already
1214
01:14:08,940 --> 01:14:10,190
talked about that.
1215
01:14:13,120 --> 01:14:14,990
We're going to start
out at time m.
1216
01:14:17,960 --> 01:14:21,150
We're going to make a decision
at time m, pick up the
1217
01:14:21,150 --> 01:14:28,090
associated reward for that
decision, and pick the
1218
01:14:28,090 --> 01:14:30,970
transition probabilities that
we're going to use at that
1219
01:14:30,970 --> 01:14:33,460
time m, and then go on
to the next state.
1220
01:14:33,460 --> 01:14:36,380
We're going to continue doing
this until time m
1221
01:14:36,380 --> 01:14:37,960
plus n minus 1.
1222
01:14:37,960 --> 01:14:41,450
Mainly, we're going to do this
for n steps of time.
1223
01:14:41,450 --> 01:14:43,690
After the n-th decision--
1224
01:14:43,690 --> 01:14:47,140
you make the n-th decision
at m plus n minus t--
1225
01:14:47,140 --> 01:14:52,270
there's a final transition
based on that decision.
1226
01:14:52,270 --> 01:14:55,490
The final transition is based
on that decision, but the
1227
01:14:55,490 --> 01:14:58,345
final reward is fixed
ahead of time.
1228
01:14:58,345 --> 01:15:01,500
You know what the final reward
is going to be, which happens
1229
01:15:01,500 --> 01:15:03,480
at time m plus n.
1230
01:15:03,480 --> 01:15:07,465
So the things which are variable
is how much reward do
1231
01:15:07,465 --> 01:15:14,070
you get at each of these first
n time units, and what
1232
01:15:14,070 --> 01:15:17,870
probabilities you choose for
going through the next state.
1233
01:15:17,870 --> 01:15:20,170
Is this still a Markov chain?
1234
01:15:20,170 --> 01:15:21,420
Is this still Markov?
1235
01:15:24,750 --> 01:15:26,560
You can talk about this
for a long time.
1236
01:15:26,560 --> 01:15:30,460
You can think about it for
a long time because this
1237
01:15:30,460 --> 01:15:34,770
decision maker might or
might not be Markov.
1238
01:15:34,770 --> 01:15:38,870
What is Markov is the transition
probabilities that
1239
01:15:38,870 --> 01:15:41,380
are taking place in
each unit of time.
1240
01:15:41,380 --> 01:15:46,410
After I make a decision, the
transition probabilities are
1241
01:15:46,410 --> 01:15:51,370
fixed for that decision and
that initial state, and have
1242
01:15:51,370 --> 01:15:54,650
nothing to do with the decisions
that had been made
1243
01:15:54,650 --> 01:15:58,650
before that or the states you've
been in before that.
1244
01:15:58,650 --> 01:16:02,220
The Markov condition says that
what happens in the next unit
1245
01:16:02,220 --> 01:16:06,020
of time is a function simply
of those transition
1246
01:16:06,020 --> 01:16:10,370
probabilities that
had been chosen.
1247
01:16:10,370 --> 01:16:13,530
We will see that when we look at
the algorithm, and then you
1248
01:16:13,530 --> 01:16:16,670
can sort out for yourselves
whether there's something
1249
01:16:16,670 --> 01:16:18,190
dishonest here or not.
1250
01:16:18,190 --> 01:16:25,480
Turns out there isn't, but to
Bellman's credit he did sort
1251
01:16:25,480 --> 01:16:28,740
out correctly that this worked,
and many people for a
1252
01:16:28,740 --> 01:16:30,520
long time did not
think it worked.
1253
01:16:34,080 --> 01:16:37,150
So the objective of dynamic
programming is both to
1254
01:16:37,150 --> 01:16:41,540
determine the optimal decision
at each time and to determine
1255
01:16:41,540 --> 01:16:45,040
the expected reward for each
starting state and for each
1256
01:16:45,040 --> 01:16:47,690
number of steps.
1257
01:16:47,690 --> 01:16:51,090
As one might suspect, now here's
the first thing that
1258
01:16:51,090 --> 01:16:52,500
Bellman did.
1259
01:16:52,500 --> 01:16:54,010
He said, here, I have
this problem.
1260
01:16:54,010 --> 01:16:57,880
I want to find out what happens
after 1,000 steps.
1261
01:16:57,880 --> 01:17:00,850
How do I solve the problem?
1262
01:17:00,850 --> 01:17:04,330
Well, anybody with any sense
will tell you don't solve the
1263
01:17:04,330 --> 01:17:06,740
problem with 1,000
steps first.
1264
01:17:06,740 --> 01:17:10,220
Solve the problem with one step
first, and then see if
1265
01:17:10,220 --> 01:17:13,330
you find out anything from it
and then maybe you can solve
1266
01:17:13,330 --> 01:17:17,030
the problem with two steps and
then maybe something nice will
1267
01:17:17,030 --> 01:17:20,600
happen, or maybe it won't.
1268
01:17:20,600 --> 01:17:25,320
When we do this, it'll turn
out that what we're really
1269
01:17:25,320 --> 01:17:30,820
doing is we're starting at the
end and working our way back,
1270
01:17:30,820 --> 01:17:34,010
and this algorithm is due to
Richard Bellman, as I said.
1271
01:17:34,010 --> 01:17:38,400
And he was the one who sorted
out how it worked.
1272
01:17:38,400 --> 01:17:40,630
So what is the algorithm?
1273
01:17:40,630 --> 01:17:45,250
We're going to start out making
a decision at time 1.
1274
01:17:45,250 --> 01:17:50,500
So we're going to
start at time m.
1275
01:17:50,500 --> 01:17:53,610
We're going to start
in a given state i.
1276
01:17:53,610 --> 01:17:58,580
You make a decision, decision
k at time m.
1277
01:17:58,580 --> 01:18:03,040
This provides a reward at time
m, and the selected transition
1278
01:18:03,040 --> 01:18:06,240
probabilities lead to a
final expected reward.
1279
01:18:06,240 --> 01:18:11,380
These are the final rewards,
which occur at time m plus 1.
1280
01:18:11,380 --> 01:18:13,710
It's nice to have that u because
it's what lets us
1281
01:18:13,710 --> 01:18:15,550
generalize the problem.
1282
01:18:15,550 --> 01:18:18,460
So this was another clever
thing that went on here.
1283
01:18:18,460 --> 01:18:24,710
So the expected optimal
aggregate reward for a one
1284
01:18:24,710 --> 01:18:32,230
step problem is the sum of the
reward that you get at time m
1285
01:18:32,230 --> 01:18:37,260
plus this final reward you get
at time m plus 1, and you're
1286
01:18:37,260 --> 01:18:40,290
maximizing over the different
policies you have
1287
01:18:40,290 --> 01:18:41,490
available to you.
1288
01:18:41,490 --> 01:18:44,970
So it looks like a trivial
problem, but the optimal
1289
01:18:44,970 --> 01:18:47,980
reward with a one step
problem is just this.
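A minimal sketch of this one-step maximization in code, using a made-up two-state, two-decision chain (all the numbers and the array layout here are hypothetical, not from the lecture):

```python
import numpy as np

# Hypothetical two-state chain with decisions k = 0, 1 (made-up numbers).
# P[k, i, j] = transition probability from state i to state j under decision k.
P = np.array([[[0.9, 0.1],
               [0.4, 0.6]],
              [[0.5, 0.5],
               [0.2, 0.8]]])
# r[k, i] = immediate reward in state i under decision k.
r = np.array([[1.0, 2.0],
              [0.5, 3.0]])
# u[j] = fixed final reward, picked up one time unit later.
u = np.array([0.0, 5.0])

# v1[i] = max over k of ( r[k, i] + sum over j of P[k, i, j] * u[j] )
v1 = np.max(r + P @ u, axis=0)   # v1 == [3.0, 7.0] for these numbers
```

Each row of r + P @ u is the one-step value of a single decision in every state; maximizing over the rows gives the optimal expected aggregate reward for each starting state.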
1290
01:18:51,170 --> 01:18:54,820
OK, next you want to consider
the two step problem.
1291
01:18:54,820 --> 01:18:58,900
What's the maximum expected
reward starting at xm equals i
1292
01:18:58,900 --> 01:19:03,480
with decisions at times
m and m plus 1.
1293
01:19:03,480 --> 01:19:05,400
You make two decisions.
1294
01:19:05,400 --> 01:19:08,240
Now, before, we just made
one decision at time m.
1295
01:19:08,240 --> 01:19:13,000
Now we make a decision at time
m and at time m plus 1, and
1296
01:19:13,000 --> 01:19:17,750
finally we pick up a final
reward at time m plus 2.
1297
01:19:17,750 --> 01:19:20,540
Knowing what that final reward
is going to be is going to
1298
01:19:20,540 --> 01:19:26,230
affect the decision you make at
time m plus 1, but it's a
1299
01:19:26,230 --> 01:19:29,770
fixed reward which is a
function of the state.
1300
01:19:29,770 --> 01:19:32,720
You can adjust the transition
probabilities of getting to
1301
01:19:32,720 --> 01:19:35,110
those different rewards.
1302
01:19:35,110 --> 01:19:38,420
The key to dynamic programming
is an optimal decision at time
1303
01:19:38,420 --> 01:19:42,630
m plus 1 can be selected based
only on the state j
1304
01:19:42,630 --> 01:19:45,060
at time m plus 1.
1305
01:19:45,060 --> 01:19:48,960
This decision, given that you're
in state j at time n
1306
01:19:48,960 --> 01:19:53,600
plus 1, is optimal independent
of what you did before that,
1307
01:19:53,600 --> 01:19:55,770
which is why we're starting
out looking at what we're
1308
01:19:55,770 --> 01:19:59,240
going to do at time m plus 1
before we even worry about
1309
01:19:59,240 --> 01:20:02,630
what we're going to
do at time m.
1310
01:20:02,630 --> 01:20:06,340
So, whatever decision you made
at time m, you observe what
1311
01:20:06,340 --> 01:20:10,900
state you're in at time m plus
1 and the maximal expected
1312
01:20:10,900 --> 01:20:15,510
reward over times m plus 1 and
m plus 2, given that you
1313
01:20:15,510 --> 01:20:20,610
happen to be in state j is just
the maximum over k of the
1314
01:20:20,610 --> 01:20:26,430
reward you're going to get by
choosing policy k, plus the
1315
01:20:26,430 --> 01:20:30,670
expected value of the final
reward you get if you're using
1316
01:20:30,670 --> 01:20:32,480
this policy k.
1317
01:20:32,480 --> 01:20:36,850
This is just vj* of 1 and
u, as you just found.
1318
01:20:36,850 --> 01:20:40,070
In other words, you have the
same situation at time m plus
1319
01:20:40,070 --> 01:20:42,090
1 as you have at time m.
1320
01:20:44,600 --> 01:20:49,785
Well, surprisingly, you've just
solved the whole problem.
1321
01:20:52,810 --> 01:20:58,410
So we've seen that what we
should do at time m plus 1 is
1322
01:20:58,410 --> 01:21:00,450
do this maximization.
1323
01:21:00,450 --> 01:21:05,670
So the optimal reward, aggregate
reward over times m,
1324
01:21:05,670 --> 01:21:11,815
m plus 1, and m plus 2 is what
we get maximizing over our
1325
01:21:11,815 --> 01:21:18,110
choice at time m of the reward
we get at time m from the
1326
01:21:18,110 --> 01:21:21,750
decision, plus the transition
probabilities which we've
1327
01:21:21,750 --> 01:21:27,340
decided on which get us to this
reward at time m plus 1
1328
01:21:27,340 --> 01:21:29,020
and m plus 2.
1329
01:21:29,020 --> 01:21:33,070
We found out what the reward
is for times m plus 1 and m
1330
01:21:33,070 --> 01:21:34,370
plus 2 together.
1331
01:21:34,370 --> 01:21:38,060
That's the reward-to-go, and
we know what that is, so we
1332
01:21:38,060 --> 01:21:40,210
have this same formula
we used before.
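A small sketch of that reuse in code, with made-up two-state, two-decision numbers (hypothetical, not from the lecture): the two-step reward applies exactly the same formula, with the one-step reward-to-go standing in for the final reward vector.

```python
import numpy as np

# Hypothetical two-state, two-decision chain (made-up numbers).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[k, i, j]
r = np.array([[1.0, 2.0], [0.5, 3.0]])     # r[k, i]
u = np.array([0.0, 5.0])                   # fixed final reward vector

v1 = np.max(r + P @ u, axis=0)    # one-step reward-to-go, v*(1, u)
v2 = np.max(r + P @ v1, axis=0)   # same formula; v1 plays the role of u
```

Nothing in the second maximization depends on how v1 was produced; it only needs the reward-to-go, which is exactly the Markov point above.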
1333
01:21:40,210 --> 01:21:46,965
Why do we want to look at
these final rewards now?
1334
01:21:46,965 --> 01:21:50,980
Well, you can view this as a
final reward at time m plus 1.
1335
01:21:50,980 --> 01:21:54,220
It's the final reward which
tells you what you get both
1336
01:21:54,220 --> 01:21:57,930
from times m plus
1 and m plus 2.
1337
01:21:57,930 --> 01:22:04,790
And, going quickly, if we look
at playing this game for three
1338
01:22:04,790 --> 01:22:11,280
steps, the optimal reward for
the three step game is the
1339
01:22:11,280 --> 01:22:16,600
immediate reward optimized over
k plus the rewards at m
1340
01:22:16,600 --> 01:22:22,900
plus 1, m plus 2, and m plus 3,
which we've already found.
1341
01:22:22,900 --> 01:22:28,450
And in general, the optimal
reward at time n--
1342
01:22:28,450 --> 01:22:33,900
when you play the game for n
steps, the optimal reward is
1343
01:22:33,900 --> 01:22:35,170
this maximum.
1344
01:22:35,170 --> 01:22:39,980
So, all you do in the algorithm
is, for each value
1345
01:22:39,980 --> 01:22:43,500
of n when you start with n
equal to 1, you solve the
1346
01:22:43,500 --> 01:22:48,950
problem for all states and you
maximize over all policies you
1347
01:22:48,950 --> 01:22:52,100
have a choice over, and then you
go on to the next larger
1348
01:22:52,100 --> 01:22:56,140
value of n, you solve the
problem for all states and you
1349
01:22:56,140 --> 01:22:56,950
keep on going.
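The loop just described can be sketched as follows; the helper name and all the numbers are made-up for illustration, not from the lecture.

```python
import numpy as np

# Hypothetical two-state, two-decision chain (made-up numbers).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[k, i, j]
r = np.array([[1.0, 2.0], [0.5, 3.0]])     # r[k, i]
u = np.array([0.0, 5.0])                   # fixed final reward vector

def dynamic_program(P, r, u, n):
    """Backward induction: solve the 1-step problem, then 2 steps, up to n."""
    v = u.copy()
    policies = []
    for _ in range(n):
        q = r + P @ v                          # q[k, i]: value of decision k in state i
        policies.append(np.argmax(q, axis=0))  # optimal decision for every state
        v = np.max(q, axis=0)                  # new reward-to-go
    return v, policies

v3, policies = dynamic_program(P, r, u, 3)
# v3 is the optimal 3-step aggregate reward per starting state;
# policies[-1] gives the optimal first decision in each state.
```

Each pass does one maximization per state, so the work grows linearly in the horizon and, for dense P, quadratically in the number of states per decision, which is why the remark about millions of states bites.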
1350
01:22:56,950 --> 01:22:59,820
If you don't have many
states, it's easy.
1351
01:22:59,820 --> 01:23:05,100
If you have 100,000 states, it's
kind of tedious to run
1352
01:23:05,100 --> 01:23:05,880
the algorithm.
1353
01:23:05,880 --> 01:23:09,380
Today it's not bad, but today
we look at problems with
1354
01:23:09,380 --> 01:23:11,770
millions and millions of states
or billions of states,
1355
01:23:11,770 --> 01:23:18,100
and no matter how fast
computation gets, the
1356
01:23:18,100 --> 01:23:22,280
ingenuity of people in inventing
harder problems always makes
1357
01:23:22,280 --> 01:23:24,630
it hard to solve
these problems.
1358
01:23:24,630 --> 01:23:29,060
So anyway, that's the dynamic
programming algorithm.
1359
01:23:29,060 --> 01:23:31,320
And next time, we're going to
start on renewal processes.