WEBVTT

00:00:01.580 --> 00:00:03.920
The following content is
provided under a Creative

00:00:03.920 --> 00:00:05.340
Commons license.

00:00:05.340 --> 00:00:07.550
Your support will help
MIT OpenCourseWare

00:00:07.550 --> 00:00:11.640
continue to offer high quality
educational resources for free.

00:00:11.640 --> 00:00:14.180
To make a donation or to
view additional materials

00:00:14.180 --> 00:00:18.110
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:18.110 --> 00:00:19.090
at ocw.mit.edu.

00:00:22.420 --> 00:00:24.370
PROFESSOR WILLIAMS: OK,
so today's lecture--

00:00:27.010 --> 00:00:31.380
we're going to be talking about
probabilistic planning later,

00:00:31.380 --> 00:00:33.310
and in these cases
where you're planning

00:00:33.310 --> 00:00:36.570
in large state spaces
is very difficult.

00:00:36.570 --> 00:00:38.740
You do the MDP planning.

00:00:38.740 --> 00:00:41.690
It could be stress that
activity planning, or the like.

00:00:41.690 --> 00:00:43.250
But you have to be
able to figure out

00:00:43.250 --> 00:00:44.950
how to deal with
these state spaces.

00:00:44.950 --> 00:00:48.160
So Monte Carlo tree search
is one of the techniques

00:00:48.160 --> 00:00:51.040
that people have identified,
over the last five years,

00:00:51.040 --> 00:00:54.667
as having an amazing performance
improvement over other kinds

00:00:54.667 --> 00:00:56.120
of sample-based approaches.

00:00:56.120 --> 00:00:58.510
So MCTS is very interesting
from that standpoint.

00:00:58.510 --> 00:01:00.845
And then if we [? link it to ?]
the last lecture,

00:01:00.845 --> 00:01:02.980
then the combination
of something,

00:01:02.980 --> 00:01:07.370
we just learn about [INAUDIBLE]
and combine it with search,

00:01:07.370 --> 00:01:11.035
is very powerful, in this case,
through the state-of-the-art

00:01:11.035 --> 00:01:15.472
techniques for that, as much as
tree search [INAUDIBLE] later

00:01:15.472 --> 00:01:20.610
[INAUDIBLE]

00:01:20.610 --> 00:01:22.110
PROFESSOR 2: Good
morning, everyone.

00:01:22.110 --> 00:01:24.411
As Professor Williams
just said, we

00:01:24.411 --> 00:01:26.910
are going to be talking about
Monte Carlo tree search today.

00:01:26.910 --> 00:01:30.102
My name is Eann
and I'll be leading

00:01:30.102 --> 00:01:32.310
the introduction and motivation
of this presentation.

00:01:32.310 --> 00:01:34.420
By the end of this
presentation, you

00:01:34.420 --> 00:01:36.890
will know not only why we
care about Monte Carlo tree

00:01:36.890 --> 00:01:37.390
search.

00:01:37.390 --> 00:01:39.930
As Professor Williams said,
there's so many algorithms

00:01:39.930 --> 00:01:40.680
out there.

00:01:40.680 --> 00:01:43.440
Why do we care about
this specific one?

00:01:43.440 --> 00:01:46.260
And second, we'll be
going through the pros

00:01:46.260 --> 00:01:49.620
and cons of MCTS, as well
as the algorithm itself.

00:01:49.620 --> 00:01:52.440
And then lastly, we will
have a pretty cool demo

00:01:52.440 --> 00:01:55.650
on how it's applied to Super
Mario Brothers and the latest

00:01:55.650 --> 00:02:01.350
AlphaGo AI that beat
the second-best leading Go

00:02:01.350 --> 00:02:03.520
player in the world.

00:02:03.520 --> 00:02:05.820
So the outline for
today's presentation

00:02:05.820 --> 00:02:08.220
is, first, we're going to talk
about pre-MCTS algorithms.

00:02:08.220 --> 00:02:11.130
There are other algorithms
that currently exist out there,

00:02:11.130 --> 00:02:15.600
and just a few of them to lead
into why we do care about MCTS

00:02:15.600 --> 00:02:18.510
and why these other
algorithms fail.

00:02:18.510 --> 00:02:20.850
And second, we'll talk about
Monte Carlo tree search

00:02:20.850 --> 00:02:21.890
itself with Yo.

00:02:21.890 --> 00:02:25.210
And lastly, Nick will tell you
more about the applications

00:02:25.210 --> 00:02:27.300
of Monte Carlo tree search.

00:02:27.300 --> 00:02:31.800
So the motivation of
these kinds of algorithms

00:02:31.800 --> 00:02:33.440
is we want to be
able to play games

00:02:33.440 --> 00:02:36.972
and we want to be able to create
programs to play these games,

00:02:36.972 --> 00:02:38.430
but we want to play
them optimally.

00:02:38.430 --> 00:02:40.880
We want to be able
to win, but we also

00:02:40.880 --> 00:02:43.890
want to be able to do this in
a reasonable amount of time.

00:02:43.890 --> 00:02:45.965
So these three
constraints lead

00:02:45.965 --> 00:02:47.680
to different kinds
of algorithms,

00:02:47.680 --> 00:02:50.600
and different algorithms
with different complexities

00:02:50.600 --> 00:02:53.420
and time, or times to search.

00:02:53.420 --> 00:02:55.025
And so that's why
today we're going

00:02:55.025 --> 00:02:57.140
to be talking about Monte
Carlo tree search.

00:02:57.140 --> 00:03:00.614
And you'll figure out in a
few slides why we do care.

00:03:00.614 --> 00:03:02.900
So these are the types
of games we have.

00:03:02.900 --> 00:03:04.300
You have this
chart where there's

00:03:04.300 --> 00:03:07.940
fully observable games,
partially observable games,

00:03:07.940 --> 00:03:10.450
deterministic, and
games of chance.

00:03:10.450 --> 00:03:13.240
And so today, the games
that we care about

00:03:13.240 --> 00:03:16.790
are the games that are fully
observable and deterministic.

00:03:16.790 --> 00:03:21.470
And these games are games like
chess and checkers and Go.

00:03:21.470 --> 00:03:23.580
And we'll also be talking
about another example

00:03:23.580 --> 00:03:25.730
with Tic-tac-toe.

00:03:25.730 --> 00:03:29.280
So these pre-MCTS
algorithms include

00:03:29.280 --> 00:03:32.660
deterministic, fully observable
games, like we said earlier.

00:03:32.660 --> 00:03:36.510
And the idea of this, and the
nice thing about these games,

00:03:36.510 --> 00:03:38.760
is that they have
perfect information,

00:03:38.760 --> 00:03:41.180
and that you have
all of the states

00:03:41.180 --> 00:03:45.650
that you need and there's
no opportunity for chance.

00:03:45.650 --> 00:03:47.615
And so the idea is
that we can construct

00:03:47.615 --> 00:03:50.240
a tree that contains
all possible outcomes

00:03:50.240 --> 00:03:52.990
because everything
is fully determined.

00:03:52.990 --> 00:03:55.250
And so one of these
algorithms, to address this,

00:03:55.250 --> 00:03:58.960
is the algorithm Minimax, which
you might have heard before.

00:03:58.960 --> 00:04:00.600
And the idea of
Minimax is to minimize

00:04:00.600 --> 00:04:02.317
the maximum possible loss.

00:04:02.317 --> 00:04:04.150
That sounds a little
weird in the beginning,

00:04:04.150 --> 00:04:06.540
but if you take a
look at this tree,

00:04:06.540 --> 00:04:08.440
this red dot, for
example, is the computer.

00:04:08.440 --> 00:04:11.730
And so in the computer's eyes,
it wants to beat its opponent.

00:04:11.730 --> 00:04:14.570
And we're assuming the
opponent wants to win also,

00:04:14.570 --> 00:04:16.815
so they're playing
their best game as well.

00:04:16.815 --> 00:04:21.990
And so the computer wants to
maximize his or her points,

00:04:21.990 --> 00:04:25.820
but also knowing that the
opponent, or the human,

00:04:25.820 --> 00:04:29.870
wants to maximize
their own win as well.

00:04:29.870 --> 00:04:31.730
And so in the
computer's eyes, it

00:04:31.730 --> 00:04:34.000
wants to minimize the
maximum possible loss.

00:04:34.000 --> 00:04:37.038
Does that make
sense to everyone?

00:04:37.038 --> 00:04:38.012
Yes?

00:04:38.012 --> 00:04:39.480
OK.

00:04:39.480 --> 00:04:41.310
And so in the
example of Minimax,

00:04:41.310 --> 00:04:42.810
we're going to start
with a connect,

00:04:42.810 --> 00:04:45.450
or a Tic-tac-toe board,
where the computer is

00:04:45.450 --> 00:04:49.230
this board right here, and
the blue Tic-tac-toe boards

00:04:49.230 --> 00:04:52.236
are the states that the
computer finally chooses.

00:04:52.236 --> 00:04:55.030
It's anticipating the
moves a human could play.

00:04:57.880 --> 00:04:59.820
So if you take a
look up here, here's

00:04:59.820 --> 00:05:02.292
the current state of the board.

00:05:02.292 --> 00:05:03.700
The current state of the board.

00:05:03.700 --> 00:05:09.380
And the possible options for the
human are this guy, this guy.

00:05:09.380 --> 00:05:09.880
Nope.

00:05:09.880 --> 00:05:11.820
Possible options
for the computer,

00:05:11.820 --> 00:05:13.540
we have three different options.

00:05:13.540 --> 00:05:16.200
And so you'll notice that this
is clearly the obvious winner.

00:05:16.200 --> 00:05:18.180
But in the state
of Minimax, it goes

00:05:18.180 --> 00:05:19.710
through the entire
tree, which is

00:05:19.710 --> 00:05:21.270
different from
depth-first search.

00:05:21.270 --> 00:05:24.520
It goes through the entire
tree until it finds the winning

00:05:24.520 --> 00:05:30.460
move and the minimize of
the maximum possible points

00:05:30.460 --> 00:05:31.800
it could win.
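
NOTE
Editor's sketch, not from the lecture: the Minimax idea described above, written out for Tic-tac-toe in Python. The board encoding (a list of nine cells) and the names winner and minimax are assumptions made for illustration; the slides may formulate it differently.
```python
from typing import List, Optional
Board = List[str]  # nine cells, each "X", "O", or " " (hypothetical encoding)
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None
def minimax(board: Board, player: str) -> int:
    """Value of the position for X (+1 win, 0 draw, -1 loss) with `player` to move."""
    won = winner(board)
    if won is not None:
        return 1 if won == "X" else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0  # board is full and nobody has won
    values = []
    for i in moves:
        board[i] = player  # try the move
        values.append(minimax(board, "O" if player == "X" else "X"))
        board[i] = " "     # undo it
    # X picks the best value for X; O picks the worst value for X.
    return max(values) if player == "X" else min(values)
```
For example, minimax(list("XX OO    "), "X") evaluates to 1, because X can complete the top row at once. The search still visits every reachable position, which is the cost the lecture goes on to discuss.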

00:05:31.800 --> 00:05:34.080
So is there a way we
can make this better?

00:05:34.080 --> 00:05:34.620
Yes.

00:05:34.620 --> 00:05:36.720
I'm sure you've
heard about pruning,

00:05:36.720 --> 00:05:39.900
where, in our human
intuition, it makes sense.

00:05:39.900 --> 00:05:41.790
Well, why don't we
just stop when we win,

00:05:41.790 --> 00:05:43.940
or when we know
we're going to have

00:05:43.940 --> 00:05:47.116
a game that allows us to win?

00:05:47.116 --> 00:05:49.940
And so this idea is the
idea of simple pruning.

00:05:49.940 --> 00:05:54.250
And so when we combine Minimax
and simple pruning, we have--

00:05:54.250 --> 00:05:54.750
anyone know?

00:05:57.612 --> 00:05:58.570
AUDIENCE: Alpha-beta.

00:05:58.570 --> 00:05:59.278
PROFESSOR 3: Yes.

00:05:59.278 --> 00:06:02.915
Our 6.034 head TA
knows about this.

00:06:02.915 --> 00:06:05.800
We have alpha-beta pruning,
where we prune away any

00:06:05.800 --> 00:06:09.090
branches that cannot
influence the final decision.

00:06:09.090 --> 00:06:13.100
So in other words, you wouldn't
keep exploring the tree

00:06:13.100 --> 00:06:15.595
if you already knew that
a previous term would

00:06:15.595 --> 00:06:16.890
allow you to win.

00:06:16.890 --> 00:06:19.630
And so this idea in
alpha-beta pruning,

00:06:19.630 --> 00:06:21.150
we have an alpha and a beta.

00:06:21.150 --> 00:06:24.740
And so the details
aren't important

00:06:24.740 --> 00:06:26.850
for you to know right
now, but the idea

00:06:26.850 --> 00:06:29.490
is that we stop whenever
we know we don't

00:06:29.490 --> 00:06:31.930
need to go on any further.
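
NOTE
Editor's sketch, not from the lecture: one common way to write alpha-beta pruning in Python. The game tree is assumed to be given as nested lists whose leaves are integer scores, and the name alphabeta is hypothetical; the lecture itself leaves the details of alpha and beta aside.
```python
import math
from typing import List, Union
Tree = Union[int, List["Tree"]]  # a leaf score, or a list of child subtrees
def alphabeta(node: Tree, maximizing: bool, alpha: float = -math.inf, beta: float = math.inf) -> float:
    """Minimax value of `node`, skipping branches that cannot change the result."""
    if isinstance(node, int):
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:  # the minimizer above will never allow this line
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:      # the maximizer above already has something better
            break
    return best
```
With the tree [[3, 5], [2, 9], [0, 1]] and the maximizer at the root, the result is 3. After the first branch guarantees 3, the second branch is abandoned as soon as the leaf 2 is seen, so the leaf 9 is never examined.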

00:06:31.930 --> 00:06:34.434
So in games like
Tic-tac-toe

00:06:34.434 --> 00:06:37.380
and Connect 4 and chess,
we have relatively low

00:06:37.380 --> 00:06:38.800
branching factor.

00:06:38.800 --> 00:06:41.130
So in the case of
Tic-tac-toe, we have 2

00:06:41.130 --> 00:06:43.720
to the fourth branching factor.

00:06:43.720 --> 00:06:46.230
But what if we have really
large branching factors,

00:06:46.230 --> 00:06:47.640
like Go?

00:06:47.640 --> 00:06:50.440
In Go, we
have 2 to the 250.

00:06:50.440 --> 00:06:53.760
Do you think that Minimax,
or even alpha-beta pruning,

00:06:53.760 --> 00:06:57.140
would be an optimal
algorithm for this?

00:06:57.140 --> 00:06:59.169
The answer is?

00:06:59.169 --> 00:06:59.710
AUDIENCE: No.

00:06:59.710 --> 00:07:00.376
PROFESSOR 3: No.

00:07:00.376 --> 00:07:04.370
And this leads us
to our next section.

00:07:04.370 --> 00:07:08.210
Yo is going to talk about
how we can use the Monte Carlo

00:07:08.210 --> 00:07:11.210
tree search algorithm for
games with really high

00:07:11.210 --> 00:07:16.120
branching factors, and using
the random extension to allow us

00:07:16.120 --> 00:07:21.490
to see, ultimately, how AlphaGo,
which is Google's AI,

00:07:21.490 --> 00:07:25.843
was able to beat the leading
Go player in the world.

00:07:29.140 --> 00:07:31.024
PROFESSOR 3: All right, guys.

00:07:31.024 --> 00:07:34.000
So this is the part
where we re-explain

00:07:34.000 --> 00:07:35.410
the algorithm itself.

00:07:35.410 --> 00:07:37.240
And before we dive
into this, I want

00:07:37.240 --> 00:07:38.860
to make something
really clear, which

00:07:38.860 --> 00:07:41.470
is that because these
are technical details

00:07:41.470 --> 00:07:43.700
and because we actually
want you to understand them,

00:07:43.700 --> 00:07:45.760
and because I definitely didn't
understand this the first three

00:07:45.760 --> 00:07:46.920
times I read the paper,

00:07:46.920 --> 00:07:49.420
I really want you to feel
free to ask any questions

00:07:49.420 --> 00:07:53.590
on your mind, with the knowledge
that, in my experience,

00:07:53.590 --> 00:07:56.492
it is very rare that someone
asks a question in class that's

00:07:56.492 --> 00:08:00.350
[INAUDIBLE] OK, so really,
whenever you have one.

00:08:00.350 --> 00:08:01.630
OK.

00:08:01.630 --> 00:08:04.130
So why are we doing this?

00:08:04.130 --> 00:08:06.860
Well, the ideal
goal behind MCTS is

00:08:06.860 --> 00:08:09.160
that we want to
selectively build up

00:08:09.160 --> 00:08:10.910
different parts of the tree.

00:08:10.910 --> 00:08:16.630
So the depth-first search
way, the exhaustive search,

00:08:16.630 --> 00:08:19.270
would have us exploring
the entire tree,

00:08:19.270 --> 00:08:21.480
and that our depth
is limited by looking

00:08:21.480 --> 00:08:23.630
at all the possible
nodes of that level.

00:08:23.630 --> 00:08:25.270
But what we want is we want--

00:08:25.270 --> 00:08:28.350
because the amount of
computation required for that

00:08:28.350 --> 00:08:30.080
explodes really quickly

00:08:30.080 --> 00:08:32.373
with the number of moves
that you're basically

00:08:32.373 --> 00:08:33.789
looking into the
future, we want

00:08:33.789 --> 00:08:37.495
to be able to search selectively
in certain parts of the tree.

00:08:37.495 --> 00:08:41.230
And so for example, if there are
less promising parts over here,

00:08:41.230 --> 00:08:44.290
then we care less about looking
into the future of those areas.

00:08:44.290 --> 00:08:46.030
But if we have a certain move--

00:08:46.030 --> 00:08:48.050
in chess, for example,
there's a certain move

00:08:48.050 --> 00:08:49.670
where in two moves, you're
going to be able to take

00:08:49.670 --> 00:08:50.545
the opponent's queen.

00:08:50.545 --> 00:08:52.412
You really want
to search that region

00:08:52.412 --> 00:08:53.870
and figure out
whether that's going

00:08:53.870 --> 00:08:58.130
to end up being a significantly
positive move for me.

00:08:58.130 --> 00:09:00.230
And so the whole
goal of our algorithm

00:09:00.230 --> 00:09:02.977
is going to be growing
this asymmetric tree.

00:09:02.977 --> 00:09:03.810
How does that sound?

00:09:06.820 --> 00:09:08.700
OK, great.

00:09:08.700 --> 00:09:11.210
So how do we actually do this?

00:09:11.210 --> 00:09:13.200
We're going to go over
a high-level outline,

00:09:13.200 --> 00:09:14.800
but before we do
that, let's talk

00:09:14.800 --> 00:09:16.400
about our tree,
which you're going

00:09:16.400 --> 00:09:17.483
to get very familiar with.

00:09:20.250 --> 00:09:24.710
Can people see that this
is red and this is blue?

00:09:24.710 --> 00:09:28.850
So this is our game state
when we start our game.

00:09:28.850 --> 00:09:32.570
We can be given a Tic-tac-toe
board with a [INAUDIBLE] place,

00:09:32.570 --> 00:09:35.780
a game of chess with the pieces
configured a certain way.

00:09:35.780 --> 00:09:38.420
And so our player,
which is the computer,

00:09:38.420 --> 00:09:41.070
has three separate
moves that it can take.

00:09:41.070 --> 00:09:43.560
And so each of those moves
are represented by a node.

00:09:43.560 --> 00:09:48.170
And each of those moves have
response moves by the opponent.

00:09:48.170 --> 00:09:50.870
So you can imagine
that if one of these

00:09:50.870 --> 00:09:53.730
is a Tic-tac-toe board with
just a circle, that one of these

00:09:53.730 --> 00:09:57.440
is with that circle and
an X placed right by it.

00:09:57.440 --> 00:10:00.620
And as you go down
this tree,

00:10:00.620 --> 00:10:02.840
you start understanding
basically,

00:10:02.840 --> 00:10:06.260
it's the way that humans think
about playing these games.

00:10:06.260 --> 00:10:10.160
If I go here, then
what if they go there,

00:10:10.160 --> 00:10:12.280
and then what if
I go right here.

00:10:12.280 --> 00:10:14.990
You try to think through
the set of future moves

00:10:14.990 --> 00:10:17.930
and try to evaluate
whether your move will

00:10:17.930 --> 00:10:20.799
be good in the long term sense.

00:10:20.799 --> 00:10:23.090
The way that we are going to
expand our tree, as we said,

00:10:23.090 --> 00:10:26.464
to create an asymmetric
tree is first of all,

00:10:26.464 --> 00:10:28.130
we're going to descend
through the tree.

00:10:28.130 --> 00:10:30.296
We're going to start at the
top and basically

00:10:30.296 --> 00:10:34.560
jump down some sequence of
branches until we figure out

00:10:34.560 --> 00:10:38.750
where we're going to place
our new node, which seems

00:10:38.750 --> 00:10:39.920
like a key operation here.

00:10:39.920 --> 00:10:42.018
To create an asymmetric
tree it's all about how

00:10:42.018 --> 00:10:43.707
you [INAUDIBLE].

00:10:43.707 --> 00:10:45.290
For example, in this
case, we're going

00:10:45.290 --> 00:10:48.580
to pick this sequence of nodes.

00:10:48.580 --> 00:10:51.596
And once we get to the bottom
and find every location,

00:10:51.596 --> 00:10:53.680
we're going to
create a new node.

00:10:53.680 --> 00:10:55.750
It's not very hard.

00:10:55.750 --> 00:10:59.690
Then we're going to simulate
a game from this new node.

00:10:59.690 --> 00:11:03.260
And this is the
key part of MCTS.

00:11:03.260 --> 00:11:06.296
Once you get to a new
location, what

00:11:06.296 --> 00:11:07.670
you're going to
be doing then, is

00:11:07.670 --> 00:11:10.465
you're going to be simulating
a game from that new location.

00:11:10.465 --> 00:11:11.840
We're going to
talk about how you

00:11:11.840 --> 00:11:17.300
go about simulating a game from
this more advanced game state

00:11:17.300 --> 00:11:18.907
than what we started out with.

00:11:18.907 --> 00:11:20.957
Does anyone have any
questions right now?

00:11:20.957 --> 00:11:23.040
We will be going in depth
into all of these steps,

00:11:23.040 --> 00:11:24.556
but just in a high level sense.

00:11:24.556 --> 00:11:25.420
AUDIENCE: Just a quick question.

00:11:25.420 --> 00:11:25.660
PROFESSOR 3: Yeah.

00:11:25.660 --> 00:11:27.035
AUDIENCE: To create
the new node,

00:11:27.035 --> 00:11:29.617
is it probabilistic, just
creating a new node as the most

00:11:29.617 --> 00:11:30.450
probable [INAUDIBLE]

00:11:30.450 --> 00:11:31.300
PROFESSOR 3: No, no.

00:11:31.300 --> 00:11:32.590
You're creating some new node.

00:11:32.590 --> 00:11:34.140
We'll talk about how
we pick that new node,

00:11:34.140 --> 00:11:36.806
but we're just making a new node
and we're not thinking anything

00:11:36.806 --> 00:11:37.780
about probability.

00:11:37.780 --> 00:11:40.030
The next thing is that we're
going to update the tree.

00:11:40.030 --> 00:11:43.195
So whatever the value of
the simulation delta was--

00:11:43.195 --> 00:11:50.360
delta, remember-- we're going to
propagate that up and basically

00:11:50.360 --> 00:11:52.550
add that to all
of the nodes that

00:11:52.550 --> 00:11:54.416
are parents of
that node in the tree

00:11:54.416 --> 00:11:56.332
and update some information
that goes in there

00:11:56.332 --> 00:11:58.090
and that they're storing.

00:11:58.090 --> 00:12:00.980
This is going to be good because
it's going to mean that--

00:12:00.980 --> 00:12:02.975
it's a lot like in
search algorithms where

00:12:02.975 --> 00:12:05.360
you have trees that then
the entirety of the tree

00:12:05.360 --> 00:12:07.713
remains up to date with the
information from every given

00:12:07.713 --> 00:12:08.642
simulation.

00:12:08.642 --> 00:12:10.100
And we're just
going to repeat this

00:12:10.100 --> 00:12:11.390
over and over and over again.

00:12:11.390 --> 00:12:13.640
And slowly, our
tree will grow out

00:12:13.640 --> 00:12:15.946
until whenever we
feel like stopping.
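
NOTE
Editor's sketch, not from the lecture: the four repeated steps (descend, create a new node, simulate, update) as a Python loop. Everything concrete here is an assumption for illustration: the toy game (a pile of stones, each player removes 1 or 2, whoever takes the last stone wins), the names Node, simulate and mcts, and the pick_child argument, which stands in for whatever descent rule is used.
```python
import random
from typing import Callable, List, Optional, Tuple
State = Tuple[int, int]  # (stones left, player to move: 0 or 1); toy game, an assumption
def legal_moves(state: State) -> List[int]:
    return [m for m in (1, 2) if m <= state[0]]
def play(state: State, move: int) -> State:
    return (state[0] - move, 1 - state[1])
class Node:
    def __init__(self, state: State, parent: Optional["Node"] = None) -> None:
        self.state = state
        self.parent = parent
        self.children: List["Node"] = []
        self.untried = legal_moves(state)  # moves that have no child node yet
        self.n = 0                         # games played through this node
        self.w = 0                         # wins for player 0 among those games
def simulate(state: State, rng: random.Random) -> int:
    """Play random moves to the end; return 1 if player 0 took the last stone."""
    while state[0] > 0:
        state = play(state, rng.choice(legal_moves(state)))
    return 1 if state[1] == 1 else 0       # the player who just moved is the winner
def mcts(root: Node, iterations: int, pick_child: Callable[[Node], Node], rng: random.Random) -> None:
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:  # 1. descend through the tree
            node = pick_child(node)
        if node.untried:                           # 2. create a new node
            child = Node(play(node.state, node.untried.pop()), parent=node)
            node.children.append(child)
            node = child
        delta = simulate(node.state, rng)          # 3. simulate a game from it
        while node is not None:                    # 4. update every node up to the root
            node.n += 1
            node.w += delta
            node = node.parent
```
A possible call is rng = random.Random(0); root = Node((5, 0)); mcts(root, 200, lambda node: rng.choice(node.children), rng). Picking children at random is only a placeholder; the loop can be stopped after any number of iterations, and the statistics on root.children are then used to choose the move.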

00:12:15.946 --> 00:12:17.570
This is actually one
of the nice things

00:12:17.570 --> 00:12:22.220
about MCTS, is that whenever
we decide that we're out

00:12:22.220 --> 00:12:25.510
of time, like for example, if
you're in a competition playing

00:12:25.510 --> 00:12:29.060
a champion Go player, you
can stop the simulation.

00:12:29.060 --> 00:12:30.710
And then all you
have to do is pick

00:12:30.710 --> 00:12:34.220
between one of the
best first moves

00:12:34.220 --> 00:12:35.780
that you're going to make.

00:12:35.780 --> 00:12:38.510
Because at the end of
the day, after you're

00:12:38.510 --> 00:12:41.010
doing all the simulation,
we're still right here.

00:12:41.010 --> 00:12:43.820
And we're still only picking
between the moves that go

00:12:43.820 --> 00:12:45.850
immediately where we started.

00:12:45.850 --> 00:12:47.260
Yeah.

00:12:47.260 --> 00:12:50.080
AUDIENCE: Could this
[INAUDIBLE] good tree?

00:12:50.080 --> 00:12:52.290
And then on some initial
region of interest,

00:12:52.290 --> 00:12:56.151
or is it arbitrary how
you get to create it?

00:12:56.151 --> 00:12:57.900
PROFESSOR 3: We'll go
through how you pick

00:12:57.900 --> 00:13:00.410
where to descend right now.

00:13:00.410 --> 00:13:04.030
I guess, it's any
possible move that starts

00:13:04.030 --> 00:13:06.412
at your starting game state.

00:13:06.412 --> 00:13:10.480
Does that make-- great.

00:13:10.480 --> 00:13:12.970
Before we move on to
the algorithm itself,

00:13:12.970 --> 00:13:17.360
let's talk about what we store
in each one of these nodes.

00:13:17.360 --> 00:13:19.400
So now we've added
these numbers.

00:13:19.400 --> 00:13:22.510
And what these numbers
represent is that n_k,

00:13:22.510 --> 00:13:25.730
as in the value on the
right, is the number of games

00:13:25.730 --> 00:13:28.500
that have been played that
involve a certain node.

00:13:28.500 --> 00:13:31.070
So for example, if
I look at this node,

00:13:31.070 --> 00:13:33.410
that means that
four games have been

00:13:33.410 --> 00:13:34.737
played that involve this node.

00:13:34.737 --> 00:13:36.820
A game that has been played
that involves the node

00:13:36.820 --> 00:13:38.570
just means that
one of the states

00:13:38.570 --> 00:13:40.940
of the board at some
point in the game

00:13:40.940 --> 00:13:45.480
was the state of the board
that this represents.

00:13:45.480 --> 00:13:48.400
For example, if I have a
game that was played here,

00:13:48.400 --> 00:13:50.275
if I know that I've
played this once,

00:13:50.275 --> 00:13:51.650
then that guarantees
to me that I

00:13:51.650 --> 00:13:53.191
played this game
once because this is

00:13:53.191 --> 00:13:55.444
a precursor state to this one.

00:13:55.444 --> 00:13:56.920
Make sense?

00:13:56.920 --> 00:13:57.904
Yeah.

00:13:57.904 --> 00:14:00.734
AUDIENCE: How can the two
n's below that node not

00:14:00.734 --> 00:14:03.000
add up to a value of [INAUDIBLE]

00:14:03.000 --> 00:14:05.960
PROFESSOR 3: That will come when
we start expanding our game.

00:14:05.960 --> 00:14:07.180
But that's a great question.

00:14:07.180 --> 00:14:10.270
And intuitively
speaking, it should.

00:14:10.270 --> 00:14:12.940
AUDIENCE: You're saying you're
storing data from past games

00:14:12.940 --> 00:14:13.742
about what we've--

00:14:13.742 --> 00:14:14.450
PROFESSOR 3: Yes.

00:14:14.450 --> 00:14:15.944
AUDIENCE: --done before.

00:14:15.944 --> 00:14:18.360
AUDIENCE: If past games outside
of the script simulation?

00:14:18.360 --> 00:14:19.360
PROFESSOR 3: No, no, no.

00:14:19.360 --> 00:14:21.850
Past games in the
script simulation.

00:14:21.850 --> 00:14:23.590
And then the other
value is the number

00:14:23.590 --> 00:14:26.724
of wins associated
with a certain node.

00:14:26.724 --> 00:14:28.890
And these are going to be
wins for player one, which

00:14:28.890 --> 00:14:30.494
is red in this case.

00:14:30.494 --> 00:14:32.410
It would get confusing
if we put both of them,

00:14:32.410 --> 00:14:34.120
but they're complementary.

00:14:34.120 --> 00:14:37.020
So for example, three
out of the four times

00:14:37.020 --> 00:14:42.317
that the red player visited this
node, they won in that node.

00:14:42.317 --> 00:14:44.650
And these are the two numbers
that we're going to store.

00:14:44.650 --> 00:14:46.066
And we're going
to see why they're

00:14:46.066 --> 00:14:48.760
significant to store later.

00:14:48.760 --> 00:14:52.629
So first, descending, the
key part of our algorithm

00:14:52.629 --> 00:14:53.670
that we're talking about.

00:14:53.670 --> 00:14:55.900
And when descending,
there are these two

00:14:55.900 --> 00:14:59.260
counterbalanced
desires that we have.

00:14:59.260 --> 00:15:03.670
The first of them is that
we want to explore really

00:15:03.670 --> 00:15:05.410
deeply into our tree.

00:15:05.410 --> 00:15:08.650
We want to think about, OK, if
they do this then I'll do this.

00:15:08.650 --> 00:15:11.427
And then, well, then I'll do
that, and so on and so forth.

00:15:11.427 --> 00:15:13.510
And we want to think through
a long term strategy.

00:15:13.510 --> 00:15:16.870
But at the same time, we don't
want to get caught in that.

00:15:16.870 --> 00:15:18.700
We want to make
sure that we're not

00:15:18.700 --> 00:15:22.750
missing a really promising
other movie that we weren't even

00:15:22.750 --> 00:15:24.670
considering because we
were really going down

00:15:24.670 --> 00:15:27.410
this certain rabbit
hole of the move

00:15:27.410 --> 00:15:28.840
that we had thought
about before.

00:15:28.840 --> 00:15:33.260
This is illustrated by the
xkcd [INAUDIBLE] SMBC.

00:15:33.260 --> 00:15:37.222
The SMBC comic about academia
and how, if someone tells you

00:15:37.222 --> 00:15:38.680
that a lot of really
great work has

00:15:38.680 --> 00:15:40.346
been done in an area,
that means nothing

00:15:40.346 --> 00:15:44.082
about how promising
the future will be.

00:15:44.082 --> 00:15:45.790
It's all about exploitation
and exploration.

00:15:45.790 --> 00:15:47.831
And the way that we're
going to balance exploitation

00:15:47.831 --> 00:15:49.520
and exploration
in order to create

00:15:49.520 --> 00:15:54.083
our really nice asymmetric
tree is the following formula.

00:15:54.083 --> 00:15:57.610
And it's fine if that looks
really confusing and messy.

00:15:57.610 --> 00:16:03.220
But actually, it breaks down
quite nicely into two parts.

00:16:03.220 --> 00:16:04.860
This formula is
known as the UCB.

00:16:04.860 --> 00:16:07.600
You don't need to know why it's
the Upper Confidence Bound.

00:16:07.600 --> 00:16:09.231
Let's just talk about
what's inside it.

00:16:09.231 --> 00:16:11.230
So first of all, you have
this term on the left.

00:16:11.230 --> 00:16:14.590
And this term on the left
is the exploitation term.

00:16:14.590 --> 00:16:18.030
It's basically proportional
to the likelihood

00:16:18.030 --> 00:16:21.050
that the expected number of
times that you're going to win,

00:16:21.050 --> 00:16:23.272
given that you are
in a certain node

00:16:23.272 --> 00:16:24.730
and that you are
a certain player.

00:16:27.334 --> 00:16:29.000
It's basically the
quality of your state

00:16:29.000 --> 00:16:30.310
in some abstract level.

00:16:30.310 --> 00:16:32.260
If we knew this
perfectly, then we

00:16:32.260 --> 00:16:33.760
would be doing
great because that's

00:16:33.760 --> 00:16:37.780
the thing we're looking for on
some grand level: the expected

00:16:37.780 --> 00:16:39.910
likelihood of winning
from a certain state.

00:16:39.910 --> 00:16:42.192
On the other hand, you
have this exploration term.

00:16:42.192 --> 00:16:44.150
And you may not be able
to read the font there.

00:16:44.150 --> 00:16:45.700
But what this is
basically saying

00:16:45.700 --> 00:16:49.150
is that it looks at
the number of games

00:16:49.150 --> 00:16:54.580
that I have been played through,
and it looks at the number of games

00:16:54.580 --> 00:16:56.470
that my parent has
been played through.

00:16:56.470 --> 00:17:00.460
And it tries to preserve those
numbers at a certain ratio,

00:17:00.460 --> 00:17:01.910
at a log ratio.

00:17:01.910 --> 00:17:06.849
And what that effectively means,
is that the number of times

00:17:06.849 --> 00:17:08.200
that I have been--

00:17:08.200 --> 00:17:10.490
if I have been visited
relatively few times,

00:17:10.490 --> 00:17:14.180
and the denominator is small,

00:17:14.180 --> 00:17:16.740
whereas my parent has been
visited many times, which

00:17:16.740 --> 00:17:19.040
means that my siblings have
gotten much more attention,

00:17:19.040 --> 00:17:23.140
then the likelihood that I
will be visited again actually

00:17:23.140 --> 00:17:24.380
increases.

00:17:24.380 --> 00:17:27.480
So this is biased
on the one hand,

00:17:27.480 --> 00:17:29.450
towards nodes that
are really promising,

00:17:29.450 --> 00:17:32.200
and on the other
hand, towards nodes

00:17:32.200 --> 00:17:34.663
that haven't been explored
yet, where there's a gold mine

00:17:34.663 --> 00:17:36.996
and all you need to do is dig
a little bit, potentially.

00:17:39.650 --> 00:17:42.300
We don't actually have an
analytical expression for this.

00:17:42.300 --> 00:17:45.140
But we can approximate
it because you

00:17:45.140 --> 00:17:48.150
can think that the expected
value from a certain node

00:17:48.150 --> 00:17:51.860
is, roughly speaking,
approximately the ratio of wins

00:17:51.860 --> 00:17:54.080
at that node to
the number of times

00:17:54.080 --> 00:17:55.898
that that node has
been visited at all.
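
NOTE
Editor's sketch, not from the lecture: the two-part score described above, written in the widely used UCB1 form. The slide's exact formula is not in the transcript, so the form below, the constant c = sqrt(2), and the name ucb1 are assumptions. The first term is wins divided by visits; the second grows when the parent has been visited often and this node rarely.
```python
import math
def ucb1(wins: float, visits: int, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Win-rate term plus exploration term for one child node."""
    if visits == 0:
        return math.inf                   # a never-visited child is always tried first
    win_rate = wins / visits              # estimated chance of winning from this node
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return win_rate + exploration
```
When descending, the child with the largest score is chosen. With equal visit counts the child with more wins scores higher, and with equal win rates the less-visited child scores higher.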

00:17:59.560 --> 00:18:01.820
Let's talk about actually
applying this statement.

00:18:01.820 --> 00:18:04.153
Because what the statement
is going to give you, is it's

00:18:04.153 --> 00:18:06.790
going to give you some number
for here and some number

00:18:06.790 --> 00:18:09.140
here, and some number
for here, and so on.

00:18:09.140 --> 00:18:10.890
When we start descending
through the tree,

00:18:10.890 --> 00:18:12.830
we're going to start
at the top node.

00:18:12.830 --> 00:18:15.520
And then we're going
to look at the three

00:18:15.520 --> 00:18:17.500
children of that node.

00:18:17.500 --> 00:18:19.290
And we're going to
compute this UCB

00:18:19.290 --> 00:18:21.560
value for each of
these children and pick

00:18:21.560 --> 00:18:23.780
whichever one is the highest.

00:18:23.780 --> 00:18:27.650
So just as a thought
for a moment,

00:18:27.650 --> 00:18:28.850
what if we ignore this one?

00:18:28.850 --> 00:18:31.600
And what if we're just
computing the UCB of these two?

00:18:31.600 --> 00:18:35.890
Does anyone have any intuition
on whether the UCB would

00:18:35.890 --> 00:18:39.088
be higher for this
node or for this node?

00:18:39.088 --> 00:18:40.430
AUDIENCE: The left node.

00:18:40.430 --> 00:18:42.170
PROFESSOR 3: The left node?

00:18:42.170 --> 00:18:43.040
OK.

00:18:43.040 --> 00:18:44.270
So why is that?

00:18:44.270 --> 00:18:46.460
AUDIENCE: It has
a win [INAUDIBLE]

00:18:46.460 --> 00:18:47.210
PROFESSOR 3: Yeah.

00:18:47.210 --> 00:18:47.967
It has a win.

00:18:47.967 --> 00:18:49.800
AUDIENCE: And they both
have a [INAUDIBLE]..

00:18:49.800 --> 00:18:50.675
PROFESSOR 3: Exactly.

00:18:50.675 --> 00:18:53.540
And so clearly, you think the
exploration term is the same

00:18:53.540 --> 00:18:56.040
because you know it's not that
one child has been loved less

00:18:56.040 --> 00:18:57.950
than the other, but
the exploitation term

00:18:57.950 --> 00:18:59.404
is going to be different.

00:18:59.404 --> 00:19:01.320
And so it's definitely
going to pick this one.

00:19:01.320 --> 00:19:02.850
In this case, what
we're going to say

00:19:02.850 --> 00:19:05.475
is actually that this is so much
more promising than the others

00:19:05.475 --> 00:19:07.885
that it's actually going
to pick this left node.

00:19:07.885 --> 00:19:10.290
And so it's going to expand,
and it's going to look down.

00:19:10.290 --> 00:19:11.665
And then when it
looks down, it's

00:19:11.665 --> 00:19:13.150
going to compare
between these two.

00:19:13.150 --> 00:19:17.290
And this time, remember,
that this is the opponent.

00:19:17.290 --> 00:19:22.590
The opponent wants to minimize the
number of wins that we have.

00:19:22.590 --> 00:19:24.250
Which means that our
opponent is going

00:19:24.250 --> 00:19:29.980
to want to pick the one that
we're less likely to win in

00:19:29.980 --> 00:19:31.710
and they're more
likely to win in.

00:19:31.710 --> 00:19:34.570
This is the idea of
minimax, minimizing how well

00:19:34.570 --> 00:19:36.520
my enemy does in this game.

00:19:40.190 --> 00:19:41.910
Although again,
the exploration term

00:19:41.910 --> 00:19:44.935
might counterbalance it a little
bit because, technically, this

00:19:44.935 --> 00:19:48.024
has been explored more.

00:19:48.024 --> 00:19:49.940
We're going to pick the
one on the left again.

00:19:49.940 --> 00:19:51.700
And we're going to
get to that location

00:19:51.700 --> 00:19:54.480
that we got to originally.

00:19:54.480 --> 00:19:57.750
Now when we're comparing
between these two,

00:19:57.750 --> 00:19:59.896
between a node that
has been visited once

00:19:59.896 --> 00:20:01.520
and a node that has
never been visited,

00:20:01.520 --> 00:20:06.121
can anyone guess which one
of these it is going to pick?

00:20:06.121 --> 00:20:06.620
Yeah.

00:20:06.620 --> 00:20:08.185
AUDIENCE: Never
has been visited.

00:20:08.185 --> 00:20:09.310
PROFESSOR 3: Yeah, exactly.

00:20:09.310 --> 00:20:11.690
Because this number is zero.

00:20:11.690 --> 00:20:14.056
And so if the
parent has ever been

00:20:14.056 --> 00:20:16.535
visited but the node hasn't,
this is going to be infinite

00:20:16.535 --> 00:20:18.909
and it's going to have to pick
the node that it has never

00:20:18.909 --> 00:20:20.512
seen before.

00:20:20.512 --> 00:20:22.262
So that's how we descend
through the tree.
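
NOTE
Editorial sketch, not taken from the lecture slides: one possible way to code
the descent rule just described. It assumes the common UCB1 form
w_k / n_k + c * sqrt(ln(N) / n_k). The names ucb and select_child, and the
constant c, are hypothetical choices made for illustration.
```python
import math
def ucb(wins, visits, parent_visits, c=math.sqrt(2)):
    """Exploitation term plus exploration term for one child."""
    if visits == 0:
        return math.inf  # a never-visited child wins the comparison outright
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
def select_child(children, parent_visits):
    """children is a list of (wins, visits) pairs; return the index with the highest UCB."""
    scores = [ucb(w, n, parent_visits) for w, n in children]
    return scores.index(max(scores))
# Two siblings visited equally often: the one with a recorded win scores higher.
same_visits = select_child([(1, 2), (0, 2)], parent_visits=4)
# A visited node against a never-visited one: the unvisited node is chosen.
unvisited = select_child([(1, 1), (0, 0)], parent_visits=2)
```
Here same_visits is index 0 and unvisited is index 1, matching the two audience answers above.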

00:20:22.262 --> 00:20:23.886
Does anyone have any
questions on that?

00:20:23.886 --> 00:20:25.070
Really, it's totally fine.

00:20:25.070 --> 00:20:27.440
We're going to be talking
about this for a while.

00:20:27.440 --> 00:20:28.056
Yeah.

00:20:28.056 --> 00:20:31.287
AUDIENCE: With the left node
that has the four for n sub k,

00:20:31.287 --> 00:20:36.344
wouldn't that be three because
there's two and one below?

00:20:36.344 --> 00:20:37.760
PROFESSOR 3: No
because of the way

00:20:37.760 --> 00:20:39.468
that we're going to
be updating the tree.

00:20:39.468 --> 00:20:41.490
Next, we'll talk about
some [INAUDIBLE].

00:20:41.490 --> 00:20:42.698
AUDIENCE: I like the concept.

00:20:42.698 --> 00:20:44.742
But if it's a deterministic
game, why couldn't it

00:20:44.742 --> 00:20:46.499
hold its [INAUDIBLE]
pretty strictly?

00:20:46.499 --> 00:20:48.040
PROFESSOR 3: That's
a great question.

00:20:48.040 --> 00:20:50.606
That's really up to
computer memory limits.

00:20:50.606 --> 00:20:54.280
As I think that Leah
mentioned, the number of states

00:20:54.280 --> 00:20:55.794
in the game of Go--

00:20:55.794 --> 00:20:57.835
it's a 19 by 19 board,
and you can play something

00:20:57.835 --> 00:20:58.500
at every state.

00:20:58.500 --> 00:21:00.150
It's only like 2 to the--

00:21:00.150 --> 00:21:01.150
PROFESSOR 2: [INAUDIBLE]

00:21:01.150 --> 00:21:01.340
PROFESSOR 3: What?

00:21:01.340 --> 00:21:02.048
PROFESSOR 2: 250.

00:21:02.048 --> 00:21:04.460
PROFESSOR 3: 250.

00:21:04.460 --> 00:21:07.000
You could never explore
the entire search tree.

00:21:07.000 --> 00:21:09.180
AUDIENCE: [INAUDIBLE]
over the first few layers

00:21:09.180 --> 00:21:12.010
or are we going polite.

00:21:12.010 --> 00:21:14.340
We try to do this real
time where you could

00:21:14.340 --> 00:21:15.710
have done something offline.

00:21:15.710 --> 00:21:17.330
PROFESSOR 3: It's
definitely true.

00:21:17.330 --> 00:21:18.440
If you know a state
that you're going

00:21:18.440 --> 00:21:20.814
to arrive at ahead of time,
then you can totally do that.

00:21:20.814 --> 00:21:22.420
But in a game
that's large enough

00:21:22.420 --> 00:21:25.660
that to do that for
all the possible states

00:21:25.660 --> 00:21:29.050
would take that much more time
and take that much more memory.

00:21:29.050 --> 00:21:30.970
It doesn't end up
making that much sense.

00:21:30.970 --> 00:21:32.550
Also, something
to point out here,

00:21:32.550 --> 00:21:34.841
is that for most of the games
that we're talking about,

00:21:34.841 --> 00:21:38.730
simulating a run through
of the game is really fast.

00:21:38.730 --> 00:21:40.460
So if you think about it--

00:21:40.460 --> 00:21:43.170
let's actually get to
that in the next piece.

00:21:43.170 --> 00:21:44.890
But the point is
that building up

00:21:44.890 --> 00:21:46.885
this many levels of
a tree for a computer

00:21:46.885 --> 00:21:50.780
takes probably on the order
of less than a millisecond.

00:21:50.780 --> 00:21:55.410
So doing this for a
really, really huge tree,

00:21:55.410 --> 00:21:58.504
it's peanuts because they're
such simple operations.

00:21:58.504 --> 00:22:00.670
But it will get expensive
when we start building up

00:22:00.670 --> 00:22:04.650
the tree to serious depths.

00:22:04.650 --> 00:22:08.425
AUDIENCE: But a game like Go,
how many nodes would you have?

00:22:08.425 --> 00:22:10.300
PROFESSOR 3: On each
level, in the beginning,

00:22:10.300 --> 00:22:12.280
we have something on
the order of 400 nodes.

00:22:12.280 --> 00:22:14.580
And we have a depth
of about, I think

00:22:14.580 --> 00:22:17.542
most games have up to 250
steps, or something like that.

00:22:17.542 --> 00:22:19.750
AUDIENCE: So just to build,
if you go in there blank,

00:22:19.750 --> 00:22:21.958
without any nodes built,
you have to in the computer,

00:22:21.958 --> 00:22:23.939
like you said, it
hasn't visited a node,

00:22:23.939 --> 00:22:26.450
it has to go there before
it descends further.

00:22:26.450 --> 00:22:27.782
Basically, like breadth first.

00:22:27.782 --> 00:22:30.240
PROFESSOR 3: It's sort of like
breadth first but not quite.

00:22:30.240 --> 00:22:31.823
There's an important
distinction here,

00:22:31.823 --> 00:22:37.387
which is that it doesn't have
to build up this or this node.

00:22:37.387 --> 00:22:39.220
It doesn't have to build
up all of the nodes

00:22:39.220 --> 00:22:40.430
at a certain level.

00:22:40.430 --> 00:22:44.970
All it has to do is, if it
branches down to a certain sub

00:22:44.970 --> 00:22:48.050
region, then it can't
descend in that sub region

00:22:48.050 --> 00:22:51.160
below one of its siblings
without having at least looked

00:22:51.160 --> 00:22:52.410
once at all its siblings.

00:22:52.410 --> 00:22:55.190
After it looks once it
can do whatever it wants.

00:22:55.190 --> 00:22:57.130
And the point is,
that it doesn't

00:22:57.130 --> 00:22:59.440
mean the tree has to be
kept at an even level.

00:22:59.440 --> 00:23:02.551
All it means is that
the tree, in order

00:23:02.551 --> 00:23:04.300
to descend on a specific
part of the tree,

00:23:04.300 --> 00:23:10.220
it has to have at least visited
direct neighbors once before.

00:23:10.220 --> 00:23:12.400
Any more questions
on this before--

00:23:12.400 --> 00:23:12.940
Yeah.

00:23:12.940 --> 00:23:14.850
AUDIENCE: What's the
advantage necessarily

00:23:14.850 --> 00:23:16.779
of having to visit every single?

00:23:21.821 --> 00:23:23.320
PROFESSOR 3: The
advantage of having

00:23:23.320 --> 00:23:25.740
to visit every single--
the way that I think of it,

00:23:25.740 --> 00:23:28.470
is that you don't
want to be missing out

00:23:28.470 --> 00:23:32.860
on potentially being interested
in some of the things

00:23:32.860 --> 00:23:35.380
and not others.

00:23:35.380 --> 00:23:41.690
It comes back to the exploration
versus exploitation distinction.

00:23:41.690 --> 00:23:46.050
We do want to descend into
the region of the tree that

00:23:46.050 --> 00:23:47.200
is really valuable to us.

00:23:47.200 --> 00:23:50.280
But at least have
explored a little bit,

00:23:50.280 --> 00:23:51.760
at least maintaining
some baseline,

00:23:51.760 --> 00:23:53.820
which really isn't
that costly compared

00:23:53.820 --> 00:23:55.120
to the size of the tree.

00:23:55.120 --> 00:23:59.444
400 moves is not that bad
compared with 400 to the 250.
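
NOTE
Editorial sketch of the size comparison above, using the rough figures quoted
in this lecture (about 400 moves per level, about 250 moves per game). The
variable names are hypothetical.
```python
moves_per_level = 400   # rough branching factor quoted for early Go positions
depth = 250             # rough game length quoted in the lecture
baseline_visits = moves_per_level            # visiting each sibling once
full_tree_leaves = moves_per_level ** depth  # enumerating every line of play
digits = len(str(full_tree_leaves))          # 651 decimal digits
```
Visiting every sibling once costs a few hundred simulations, while the full tree has a leaf count with hundreds of digits.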

00:23:59.444 --> 00:24:01.110
AUDIENCE: Are these
simulations, they're

00:24:01.110 --> 00:24:02.180
just random simulations?

00:24:02.180 --> 00:24:03.835
PROFESSOR 3: We're going to
talk about that in a minute.

00:24:03.835 --> 00:24:05.626
Any more questions
before I move onto that?

00:24:08.790 --> 00:24:10.280
Next step is expanding.

00:24:10.280 --> 00:24:11.280
And this is very simple.

00:24:11.280 --> 00:24:15.619
You just create a node and you
set the two initial values.

00:24:15.619 --> 00:24:17.160
And the initial
values are the number

00:24:17.160 --> 00:24:18.840
of times it's been
visited is zero,

00:24:18.840 --> 00:24:20.720
and the number of times that
someone has won from there

00:24:20.720 --> 00:24:21.220
is zero.
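
NOTE
Editorial sketch of the expansion step, assuming a simple linked tree. The
class name Node and the function expand are hypothetical; the two counters
correspond to the n_k and w_k values discussed in the lecture.
```python
class Node:
    """One tree node holding the two statistics mentioned above."""
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0   # n_k: times this node has been visited
        self.wins = 0     # w_k: simulated wins recorded through this node
        self.children = []
def expand(parent):
    """Create an empty child under parent and return it."""
    child = Node(parent)
    parent.children.append(child)
    return child
root = Node()
leaf = expand(root)
```
The new leaf starts at zero visits and zero wins, so it only gains statistics once a simulation is run from it.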

00:24:21.220 --> 00:24:25.020
AUDIENCE: [INAUDIBLE] So
the easy part is solving it.

00:24:25.020 --> 00:24:27.180
PROFESSOR 3: Now, simulating.

00:24:27.180 --> 00:24:29.320
Simulating is really hard.

00:24:29.320 --> 00:24:31.470
You can imagine that if
you get to a single node

00:24:31.470 --> 00:24:33.480
and you've never seen
that node before,

00:24:33.480 --> 00:24:36.270
and you don't know what to
do from this node onward,

00:24:36.270 --> 00:24:39.484
that if we knew how the
game was going to play out,

00:24:39.484 --> 00:24:41.150
that is exactly what
were searching for,

00:24:41.150 --> 00:24:42.360
and we would be done.

00:24:42.360 --> 00:24:43.320
But we don't.

00:24:43.320 --> 00:24:47.770
And in fact, we have no idea
how to go about simulating

00:24:47.770 --> 00:24:49.790
a realistic game,
and a game that

00:24:49.790 --> 00:24:51.990
will tell us something
meaningful about the quality

00:24:51.990 --> 00:24:53.410
of a certain state.

00:24:53.410 --> 00:24:56.180
And so, as you
correctly guessed,

00:24:56.180 --> 00:24:58.560
we're going to do it randomly.

00:24:58.560 --> 00:25:00.380
We're going to be
at a certain state.

00:25:00.380 --> 00:25:01.960
And then from that
state, we're just

00:25:01.960 --> 00:25:04.530
going to pick random nodes
for each of the players

00:25:04.530 --> 00:25:07.280
until the game ends.

00:25:07.280 --> 00:25:11.990
And if we, as player one, win
then we're going to add one.

00:25:11.990 --> 00:25:13.980
Then we're going to say
delta equals plus one.

00:25:13.980 --> 00:25:18.140
And if we don't win,
or if we tie or lose,

00:25:18.140 --> 00:25:20.427
then we're going
to call it a zero.
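
NOTE
Editorial sketch of a purely random playout, using tic-tac-toe as a stand-in
game. The board encoding (a list of nine cells holding "X", "O" or None) and
the names LINES, winner and rollout are assumptions made for illustration.
```python
import random
# The eight winning lines on a 3x3 board, as cell indices.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
def winner(board):
    """Return "X" or "O" if that player owns a full line, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None
def rollout(board, to_move, me="X", rng=random):
    """Play uniformly random moves to the end; return delta (1 for a win, else 0)."""
    board = list(board)  # work on a copy so the caller's board is untouched
    while winner(board) is None and None in board:
        square = rng.choice([i for i, v in enumerate(board) if v is None])
        board[square] = to_move
        to_move = "O" if to_move == "X" else "X"
    return 1 if winner(board) == me else 0  # ties and losses both count as 0
```
Calling rollout([None] * 9, "X") plays one random game from the empty board and returns either 0 or 1.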

00:25:20.427 --> 00:25:22.760
You can see in this graph, we're
descending randomly and not

00:25:22.760 --> 00:25:23.510
thinking about it.

00:25:23.510 --> 00:25:25.370
And it turns out that
this is actually great

00:25:25.370 --> 00:25:28.570
because it's really, really
computationally efficient.

00:25:28.570 --> 00:25:31.860
If you have a board, even
if it has 400 open squares,

00:25:31.860 --> 00:25:33.810
populating it by a
bunch of random moves

00:25:33.810 --> 00:25:35.860
doesn't take you very
long, on the order

00:25:35.860 --> 00:25:38.276
of not that many machine cycles.

00:25:38.276 --> 00:25:40.390
AUDIENCE: So why
don't you store--

00:25:40.390 --> 00:25:44.332
if you go down a tree randomly,
you already have a simulation.

00:25:44.332 --> 00:25:46.560
So the node's going
to get to someplace.

00:25:46.560 --> 00:25:49.060
But you don't store it because
it would lose the randomness?

00:25:49.060 --> 00:25:51.920
PROFESSOR 3: You're totally
right, actually, in this case.

00:25:51.920 --> 00:25:54.420
I've thought through this, and
I can't come up with a reason

00:25:54.420 --> 00:25:55.780
why you wouldn't
store it, that is,

00:25:55.780 --> 00:25:58.363
the temporary values that you
find all the way down the tree.

00:25:58.363 --> 00:26:01.610
But they don't in most of
the literature [INAUDIBLE]

00:26:01.610 --> 00:26:03.574
But you're totally
right about that.

00:26:03.574 --> 00:26:06.270
Does everyone understand
that distinction?

00:26:06.270 --> 00:26:08.460
The fact that we only
hold onto the result

00:26:08.460 --> 00:26:10.110
here and don't
theoretically make

00:26:10.110 --> 00:26:13.320
nodes for every place down in
the tree just because we could,

00:26:13.320 --> 00:26:15.000
just because we've
seen them before.

00:26:15.000 --> 00:26:17.166
We don't, and it doesn't
really matter in this case.

00:26:17.166 --> 00:26:19.762
But it's theoretically a slight
speed up that you could do.

00:26:19.762 --> 00:26:22.420
AUDIENCE: But you reduce that
question to generalities?

00:26:22.420 --> 00:26:25.950
PROFESSOR 3: Yeah, a little bit.

00:26:25.950 --> 00:26:29.940
So we can look at an example of
simulating out a random game.

00:26:29.940 --> 00:26:32.610
We get some intuition for
why a random game would

00:26:32.610 --> 00:26:35.760
be correlated with how good
your board position is.

00:26:35.760 --> 00:26:38.470
For example, here we
have a tic-tac-toe game.

00:26:38.470 --> 00:26:40.210
Circle is going to move next.

00:26:40.210 --> 00:26:42.540
But as hopefully you can
see, because you have played

00:26:42.540 --> 00:26:46.120
tic-tac-toe before, this is not a
particularly promising board

00:26:46.120 --> 00:26:47.990
for circle.

00:26:47.990 --> 00:26:51.500
Because no matter
what circle does,

00:26:51.500 --> 00:26:54.802
if x is an intelligent
player x can win right now.

00:26:54.802 --> 00:26:56.510
It has two different
options for winning.

00:26:56.510 --> 00:26:59.333
And so, if you simulated this
forward randomly, what you'll

00:26:59.333 --> 00:27:01.856
get is that 2/3 of the
time, x will in fact win,

00:27:01.856 --> 00:27:03.230
even if the players
aren't really

00:27:03.230 --> 00:27:04.410
thinking of it ahead of time.

00:27:04.410 --> 00:27:04.909
Yeah.

00:27:04.909 --> 00:27:07.170
AUDIENCE: Then why
not do n simulations

00:27:07.170 --> 00:27:09.860
at a node instead of
just a single simulation?

00:27:09.860 --> 00:27:10.510
PROFESSOR 3: You
totally can do that.

00:27:10.510 --> 00:27:12.470
That's, in fact, something
that makes sense to do

00:27:12.470 --> 00:27:13.740
and that some people do.

00:27:13.740 --> 00:27:16.110
Although what you'll
find somewhat soon,

00:27:16.110 --> 00:27:18.780
is that considering that
we're going down the tree,

00:27:18.780 --> 00:27:20.520
and that sometime
soon we're going

00:27:20.520 --> 00:27:22.170
to explore all of
its children, there's

00:27:22.170 --> 00:27:24.930
a good question of why
you'd do n simulations now

00:27:24.930 --> 00:27:28.080
when you could just descend
through the tree n times

00:27:28.080 --> 00:27:31.030
and thereby do n simulations
by going through the thing

00:27:31.030 --> 00:27:34.210
and also building
out the children?

00:27:34.210 --> 00:27:35.650
This case is-- yeah.

00:27:35.650 --> 00:27:37.360
AUDIENCE: This gives
more importance

00:27:37.360 --> 00:27:38.440
to why you do randomness.

00:27:38.440 --> 00:27:40.605
Because if you're doing
random simulations

00:27:40.605 --> 00:27:42.696
you would ignore the
possibility of the best one.

00:27:42.696 --> 00:27:45.255
When you first ran a simulation
here was that o wins.

00:27:45.255 --> 00:27:47.090
If I ignore this node--

00:27:47.090 --> 00:27:48.230
PROFESSOR 3: Absolutely.

00:27:48.230 --> 00:27:52.530
Which is why it matters that we
do this so many times that we

00:27:52.530 --> 00:27:55.515
drown out all the noise that
is associated with playing

00:27:55.515 --> 00:27:57.010
a game out randomly.

00:27:57.010 --> 00:27:58.930
Let's talk about that.

00:27:58.930 --> 00:28:02.010
If there's a lot of distance
between where we are right now

00:28:02.010 --> 00:28:03.600
and our end result--

00:28:03.600 --> 00:28:05.100
For example, in
this game, if I were

00:28:05.100 --> 00:28:08.522
to tell you how good is this
board position, if you are one

00:28:08.522 --> 00:28:10.730
of those people who played
out every game of tic-tac-toe,

00:28:10.730 --> 00:28:12.900
you'll know that this is
great if you want it to be

00:28:12.900 --> 00:28:15.660
[INAUDIBLE]

00:28:15.660 --> 00:28:17.820
Anyway, the point
is, that is not

00:28:17.820 --> 00:28:20.550
easy to do if you are doing
random simulations from where

00:28:20.550 --> 00:28:21.730
you start.

00:28:21.730 --> 00:28:24.500
The correlation between
your friend's board state

00:28:24.500 --> 00:28:27.989
and the quality of that state
actually drops precipitously.

00:28:27.989 --> 00:28:29.780
And this for me is one
of the hardest parts

00:28:29.780 --> 00:28:31.940
to study about Monte
Carlo Tree Search.

00:28:31.940 --> 00:28:33.890
Although, as Nick
will explain to you,

00:28:33.890 --> 00:28:36.270
it actually works quite well.

00:28:36.270 --> 00:28:38.840
And one of the reasons that it
works quite well in practice

00:28:38.840 --> 00:28:40.215
for more complicated
applications

00:28:40.215 --> 00:28:42.320
is they do away
with the assumption

00:28:42.320 --> 00:28:43.480
of random simulation.

00:28:43.480 --> 00:28:45.032
Because even though
random simulation

00:28:45.032 --> 00:28:47.240
does allow you to explore
all the states, if you have

00:28:47.240 --> 00:28:50.600
some idea of where a reasonable
quality approach would be,

00:28:50.600 --> 00:28:54.510
then using that, as long as it's
not that much more expensive

00:28:54.510 --> 00:28:56.770
computationally, can help
you with your simulation.

00:28:56.770 --> 00:28:59.140
Right now we're still talking
about total randomness.

00:28:59.140 --> 00:29:00.640
How are people doing
with that idea?

00:29:04.205 --> 00:29:06.330
Now we're going to update
the tree with the results

00:29:06.330 --> 00:29:07.320
of our simulation.

00:29:07.320 --> 00:29:10.330
So given that we had
some result delta,

00:29:10.330 --> 00:29:12.140
we're going to try to
get up the parents.

00:29:12.140 --> 00:29:13.960
And for each parent
we're going to add

00:29:13.960 --> 00:29:15.780
that the game has been
played there once,

00:29:15.780 --> 00:29:20.790
and that the result
of that simulation

00:29:20.790 --> 00:29:24.460
gets added if it was a one.

00:29:24.460 --> 00:29:27.300
So for example, if there
was a win in this game,

00:29:27.300 --> 00:29:30.520
then this becomes one, one
because now it's won once

00:29:30.520 --> 00:29:32.190
and it's been visited once.

00:29:32.190 --> 00:29:34.630
And these two get
incremented by one,

00:29:34.630 --> 00:29:37.280
and these two get
incremented by one.
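
NOTE
Editorial sketch of the update step, assuming each node keeps a link to its
parent. The names Node and backpropagate are hypothetical; delta is the 0 or 1
result of the simulation.
```python
class Node:
    """Minimal node: a parent link plus the visit and win counters."""
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0   # n_k
        self.wins = 0     # w_k
def backpropagate(node, delta):
    """Walk from the simulated node up to the root, updating every ancestor."""
    while node is not None:
        node.visits += 1     # the game has been played through here once more
        node.wins += delta   # add 1 for a win, 0 otherwise
        node = node.parent
root = Node()
child = Node(root)
leaf = Node(child)
backpropagate(leaf, 1)   # a simulated win
backpropagate(leaf, 0)   # a simulated loss or tie
```
After these two calls every node on the path holds one win out of two visits.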

00:29:37.280 --> 00:29:41.060
That in itself comprises
a complete iteration,

00:29:41.060 --> 00:29:44.610
the complete single iteration
of running Monte Carlo Tree

00:29:44.610 --> 00:29:49.950
Search, which means that
now we can keep doing this

00:29:49.950 --> 00:29:52.620
over and over again,
building up the tree

00:29:52.620 --> 00:29:55.350
and slowly making it
deeper, and making it deeper

00:29:55.350 --> 00:29:56.740
in selective areas.

00:29:56.740 --> 00:29:59.450
And having these numbers
increase and increase.

00:29:59.450 --> 00:30:01.080
And be more and
more proportional

00:30:01.080 --> 00:30:05.430
to the actual expected value
of the quality of the state,

00:30:05.430 --> 00:30:06.080
until--

00:30:06.080 --> 00:30:08.226
does anyone have any
questions about this idea?--

00:30:11.740 --> 00:30:12.550
until we terminate.

00:30:12.550 --> 00:30:15.040
And we have to come up
with a way to terminate it.

00:30:15.040 --> 00:30:18.670
Now again, we said we're going
to pick what the best child is

00:30:18.670 --> 00:30:21.850
going to be, what the best
immediate move from the start

00:30:21.850 --> 00:30:24.335
state is going to be.

00:30:24.335 --> 00:30:26.650
That's the move that we're
actually going to play.

00:30:26.650 --> 00:30:29.010
And so, how do we
determine what the best is?

00:30:29.010 --> 00:30:33.240
Well, the trivial solution
is just the highest

00:30:33.240 --> 00:30:36.790
expected win given k.

00:30:36.790 --> 00:30:38.550
What that, in our
case, is going to be

00:30:38.550 --> 00:30:41.190
is the ratio of number
of times that I've

00:30:41.190 --> 00:30:44.250
won from a given early
state to the number of times

00:30:44.250 --> 00:30:45.880
that I visited.

00:30:45.880 --> 00:30:48.745
However, this doesn't actually
work as well as we might hope.

00:30:48.745 --> 00:30:50.530
Let's suppose the
following scenario,

00:30:50.530 --> 00:30:54.020
which is that you have the
tic-tac-toe game like this.

00:30:54.020 --> 00:30:57.220
And you have been exploring
the tree for a while.

00:30:57.220 --> 00:31:00.520
And you're really mostly
looking at these two nodes.

00:31:00.520 --> 00:31:04.390
One of these nodes, if
you think it through,

00:31:04.390 --> 00:31:06.070
this node is quite
promising and you've

00:31:06.070 --> 00:31:07.339
been exploring it for a while.

00:31:07.339 --> 00:31:09.130
There is a winning
strategy from this node.

00:31:09.130 --> 00:31:11.260
It's that circle goes
here, and then x goes here,

00:31:11.260 --> 00:31:13.964
and then circle loses because
x has two options to win.

00:31:16.694 --> 00:31:18.610
However, if you explore
this a bunch of times,

00:31:18.610 --> 00:31:20.480
and for some reason,
due to the randomness,

00:31:20.480 --> 00:31:21.970
this is at 11 out of 20.

00:31:21.970 --> 00:31:25.687
Whereas this state, which
is inherently inferior,

00:31:25.687 --> 00:31:28.020
is at three out of five because
of a bunch of randomness

00:31:28.020 --> 00:31:30.380
and because it hasn't
been explored as much.

00:31:30.380 --> 00:31:32.390
And if we had looked at
this one as exhaustively

00:31:32.390 --> 00:31:35.631
as we had at this one,
then you probably

00:31:35.631 --> 00:31:37.880
would actually say that this
state is actually better.

00:31:37.880 --> 00:31:40.900
And so, you can create
an alternative criterion,

00:31:40.900 --> 00:31:43.920
which is that it's the
highest expected win

00:31:43.920 --> 00:31:46.060
value of one of the children.

00:31:46.060 --> 00:31:49.602
But also, that value
has to be the node that

00:31:49.602 --> 00:31:51.310
has been most visited
so that they aren't

00:31:51.310 --> 00:31:54.590
explored by different amounts.

00:31:54.590 --> 00:31:56.080
What this sacrifice
is however, is

00:31:56.080 --> 00:32:01.050
that this means that we
can't terminate on demand.

00:32:01.050 --> 00:32:02.870
This is not always
going to be true,

00:32:02.870 --> 00:32:05.161
and therefore, we're going
to have to let the algorithm

00:32:05.161 --> 00:32:07.362
run until that's true for
some start state, which

00:32:07.362 --> 00:32:09.320
means that maybe it's not
a criterion that we want

00:32:09.320 --> 00:32:11.886
to apply even though we know
that it would be wise to do so.
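
NOTE
Editorial sketch comparing the two stopping criteria, using the 11 out of 20
and 3 out of 5 figures from the example above. The function names and the move
labels "promising" and "inferior" are hypothetical.
```python
def max_child(stats):
    """Pick the move with the highest win ratio w_k / n_k."""
    return max(stats, key=lambda move: stats[move][0] / stats[move][1])
def robust_child(stats):
    """Pick the move that has been visited most often."""
    return max(stats, key=lambda move: stats[move][1])
def max_robust_child(stats):
    """Return a move only if it has both the best ratio and the most visits."""
    best = max_child(stats)
    return best if best == robust_child(stats) else None
# (wins, visits) for the two root moves in the example.
stats = {"promising": (11, 20), "inferior": (3, 5)}
by_ratio = max_child(stats)            # "inferior", since 0.60 beats 0.55
by_visits = robust_child(stats)        # "promising", visited 20 times
agreed = max_robust_child(stats)       # None: the two criteria disagree
```
A None result is the situation described above: the stricter criterion is not yet satisfied, so the search would have to keep running.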

00:32:11.886 --> 00:32:13.260
Are there any
questions about how

00:32:13.260 --> 00:32:15.222
we pick the terminating guide?

00:32:19.280 --> 00:32:20.435
That was the whole thing.

00:32:20.435 --> 00:32:22.560
And now we're going to do
it lots and lots of times

00:32:22.560 --> 00:32:25.780
until you guys are sick of
Monte Carlo Tree Search.

00:32:25.780 --> 00:32:26.970
So this is our tree.

00:32:26.970 --> 00:32:29.826
It's more or less
what we've had before.

00:32:29.826 --> 00:32:31.200
The first thing
we're going to do

00:32:31.200 --> 00:32:32.616
is we're going to
look at the top.

00:32:32.616 --> 00:32:35.250
And then we're going to
pick one of these children.

00:32:35.250 --> 00:32:37.260
Now let's say that
we looked at this,

00:32:37.260 --> 00:32:39.210
and it turns out that the one
on the left is really valuable.

00:32:39.210 --> 00:32:40.060
I think it's the one.

00:32:40.060 --> 00:32:40.583
Nope, yeah.

00:32:40.583 --> 00:32:41.083
Never mind.

00:32:41.083 --> 00:32:42.319
It's wrong.

00:32:42.319 --> 00:32:43.860
The one on the left
has been explored

00:32:43.860 --> 00:32:44.818
a whole bunch of times.

00:32:44.818 --> 00:32:47.730
Remember, this term
starts becoming larger

00:32:47.730 --> 00:32:49.980
than the ones that haven't
been visited as much.

00:32:49.980 --> 00:32:53.390
And so we're going to
descend from this one.

00:32:53.390 --> 00:32:57.175
And now we're going to descend,
and we have these two options.

00:32:57.175 --> 00:33:00.612
Given what you know,
would you expect

00:33:00.612 --> 00:33:02.070
that the one it's going
to pick is going

00:33:02.070 --> 00:33:04.153
to be the one on the right
or the one on the left?

00:33:04.153 --> 00:33:05.295
AUDIENCE: [INAUDIBLE]

00:33:05.295 --> 00:33:06.250
PROFESSOR 3: On the
right because it's never

00:33:06.250 --> 00:33:06.880
been visited before.

00:33:06.880 --> 00:33:08.380
And so, this term
is going to explode.

00:33:08.380 --> 00:33:10.060
And so, we're going
to build a node there.

00:33:10.060 --> 00:33:11.726
And then we're going
to simulate a game.

00:33:11.726 --> 00:33:15.954
And the result is a win,
which is bad for this player.

00:33:15.954 --> 00:33:18.370
That means that he probably
didn't want to make that move.

00:33:18.370 --> 00:33:20.820
And so we're going to
propagate that value up.

00:33:20.820 --> 00:33:24.420
And we're going to start
the algorithm again.

00:33:24.420 --> 00:33:26.430
And it's going to compare
between these three.

00:33:26.430 --> 00:33:31.500
And now it's going to
pick the one on the left.

00:33:34.686 --> 00:33:36.310
Now that it picked
the one on the left,

00:33:36.310 --> 00:33:39.420
it's going to compare
between these two states.

00:33:39.420 --> 00:33:43.933
Which of the two is going to
have a higher exploitation factor?

00:33:43.933 --> 00:33:46.397
AUDIENCE: The left.

00:33:46.397 --> 00:33:47.980
AUDIENCE: Don't you
invert it, though,

00:33:47.980 --> 00:33:49.280
because this is the opponent.

00:33:49.280 --> 00:33:50.480
PROFESSOR 3: Exactly.

00:33:50.480 --> 00:33:52.331
Because two out of three
is actually better.

00:33:52.331 --> 00:33:54.247
Because it's one out of
three for the opponent

00:33:54.247 --> 00:33:55.530
that's currently
making the move.

00:33:55.530 --> 00:33:57.440
So the one on the left is going
to have a higher exploitation

00:33:57.440 --> 00:33:58.485
factor, and the
one on the right is

00:33:58.485 --> 00:33:59.560
going to have a higher
exploration factor.

00:33:59.560 --> 00:34:01.412
Does that make sense for people?

00:34:01.412 --> 00:34:05.114
It's OK if it doesn't.

00:34:05.114 --> 00:34:07.280
So we're actually going to
pick the one on the right

00:34:07.280 --> 00:34:09.969
because the other one was
is doing three and has lots

00:34:09.969 --> 00:34:11.994
of its mother's
love than that one's.

00:34:11.994 --> 00:34:14.129
Anyone else need a drink?

00:34:14.129 --> 00:34:15.462
We're going to expand that node.

00:34:15.462 --> 00:34:16.370
It doesn't matter.

00:34:16.370 --> 00:34:18.502
They are both equally
likely to be expanded.

00:34:18.502 --> 00:34:20.918
We're going to simulate forward,
and it's going to be one.

00:34:20.918 --> 00:34:24.639
Which means that that was
probably a wise countermove.

00:34:24.639 --> 00:34:25.220
Yeah.

00:34:25.220 --> 00:34:26.969
AUDIENCE: So when it's
the opponent's turn

00:34:26.969 --> 00:34:29.300
versus your turn, the
exploration factor

00:34:29.300 --> 00:34:33.562
is the same but we complement
the exploitation factor, right?

00:34:33.562 --> 00:34:34.270
PROFESSOR 3: Yes.

00:34:34.270 --> 00:34:36.739
So the key here
being that this takes

00:34:36.739 --> 00:34:39.162
in both the state that
you're talking about

00:34:39.162 --> 00:34:40.870
and the player that
you're talking about.

00:34:40.870 --> 00:34:42.239
AUDIENCE: But regardless
of the player,

00:34:42.239 --> 00:34:44.570
the exploration factor will
always be like this is.

00:34:44.570 --> 00:34:46.945
PROFESSOR 3: Because it's only
the number of visits.

00:34:46.945 --> 00:34:49.716
It has nothing to do with
results of exploration.
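
NOTE
Editorial sketch of the point just made: the exploitation term is complemented
when the opponent is choosing, while the exploration term depends only on
visit counts. The function names and the constant c are hypothetical, and the
UCB1 form of the exploration term is an assumption. Because ties are recorded
as 0, this simple complement counts them in the opponent's favor.
```python
import math
def exploitation(wins, visits, opponent_to_move):
    """Our win ratio, or its complement when the opponent picks the move."""
    ours = wins / visits
    return 1 - ours if opponent_to_move else ours
def exploration(visits, parent_visits, c=math.sqrt(2)):
    """Depends only on how often the node and its parent were visited."""
    return c * math.sqrt(math.log(parent_visits) / visits)
# A node at two wins out of three visits, seen from each side.
for_us = exploitation(2, 3, opponent_to_move=False)        # 2/3
for_opponent = exploitation(2, 3, opponent_to_move=True)   # 1/3
bonus = exploration(3, 6)   # identical whichever player is choosing
```
This mirrors the exchange above: two out of three for us reads as one out of three for the opponent, and the exploration bonus is unchanged.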

00:34:53.176 --> 00:34:55.584
AUDIENCE: If you win and
you have the plus one,

00:34:55.584 --> 00:34:57.375
double plus one, and
you've propagated out,

00:34:57.375 --> 00:35:00.485
but I'm wondering--

00:35:00.485 --> 00:35:03.602
so if the opponent wins
do you also propagate

00:35:03.602 --> 00:35:07.950
out the win increment itself?

00:35:07.950 --> 00:35:09.574
If the opponent's
winning, wouldn't you

00:35:09.574 --> 00:35:11.572
want to [INAUDIBLE] node here?

00:35:11.572 --> 00:35:13.530
PROFESSOR 3: If the
opponent wins then what you

00:35:13.530 --> 00:35:14.780
do is you propagate up a zero.

00:35:14.780 --> 00:35:20.390
Which means that wk is not
incremented, but nk is.

00:35:23.695 --> 00:35:26.580
Have we seen a zero yet?

00:35:26.580 --> 00:35:28.440
There's one soon.

00:35:28.440 --> 00:35:31.900
But the idea is that rather
than subtract or anything,

00:35:31.900 --> 00:35:34.010
all you do is propagate
up the result of the game,

00:35:34.010 --> 00:35:37.820
which in this case is zero.

00:35:37.820 --> 00:35:39.360
Which means that
all of those states

00:35:39.360 --> 00:35:41.820
seem to become more valuable
to the blue and less valuable

00:35:41.820 --> 00:35:42.830
to the red.

00:35:42.830 --> 00:35:46.159
Because these numbers are
lower than the other ones were.

00:35:46.159 --> 00:35:46.700
AUDIENCE: OK.

00:35:50.750 --> 00:35:52.250
PROFESSOR 3: So we
propagate this up

00:35:52.250 --> 00:35:54.580
and this becomes better.

00:35:54.580 --> 00:35:56.930
What we've done here
is we've figured out

00:35:56.930 --> 00:36:00.057
a theoretical countermove
to blue moving here.

00:36:00.057 --> 00:36:02.140
That's how you should think
about this whole tree.

00:36:02.140 --> 00:36:04.540
It's really a lot like
the way the humans think

00:36:04.540 --> 00:36:05.440
about these things.

00:36:05.440 --> 00:36:07.950
If I do this, then
what if they do this?

00:36:07.950 --> 00:36:09.140
Well, then I'll do this.

00:36:09.140 --> 00:36:14.142
And I see that I'm
successful when I do that.

00:36:14.142 --> 00:36:16.580
We're going to look
again at the top.

00:36:16.580 --> 00:36:18.765
And we're going to pick
the one on the left

00:36:18.765 --> 00:36:20.015
because it's really promising.

00:36:20.015 --> 00:36:21.687
Five out of six
is a good number.

00:36:21.687 --> 00:36:23.270
And we're going to
look at both sides.

00:36:23.270 --> 00:36:25.571
And which one is blue
going to pick now?

00:36:25.571 --> 00:36:27.321
Well, it's going to
pick the one that it's

00:36:27.321 --> 00:36:29.090
going to be more successful
in, which is two out of three.

00:36:29.090 --> 00:36:31.246
I realize that this is
actually not the kind of thing

00:36:31.246 --> 00:36:32.746
where I could
necessarily ask people

00:36:32.746 --> 00:36:37.680
because I'm the one who's
decided which node to stop.

00:36:37.680 --> 00:36:39.360
Then we go down here.

00:36:39.360 --> 00:36:41.430
And there's an equal
likelihood of picking

00:36:41.430 --> 00:36:42.347
either of those nodes.

00:36:42.347 --> 00:36:44.054
And so we're going to
pick one at random.

00:36:44.054 --> 00:36:45.530
So that's going to
be the left one.

00:36:45.530 --> 00:36:47.227
And we're going to
create an empty node.

00:36:47.227 --> 00:36:48.560
Then we're going to play it out.

00:36:48.560 --> 00:36:50.660
And it was a success
for blue, which

00:36:50.660 --> 00:36:54.080
is amazing because what this
means now is that suddenly,

00:36:54.080 --> 00:36:57.180
in this tree of this really
good move that red could make

00:36:57.180 --> 00:36:59.090
that blue wasn't finding a
response to, suddenly

00:36:59.090 --> 00:37:02.280
there's hope because we're
going to propagate this back.

00:37:02.280 --> 00:37:03.830
And that means
that blue actually

00:37:03.830 --> 00:37:06.772
has a response move to that
sequence of red's moves.

00:37:06.772 --> 00:37:08.380
And so it's going
to propagate up.

00:37:08.380 --> 00:37:10.915
And this state's going to be
more promising to blue and less

00:37:10.915 --> 00:37:12.350
promising for red.

00:37:12.350 --> 00:37:14.230
That region of the tree
that we had dug into

00:37:14.230 --> 00:37:17.164
is a little less promising.

00:37:17.164 --> 00:37:18.330
We're going to look back up.

00:37:18.330 --> 00:37:19.788
And this time,
instead, we're going

00:37:19.788 --> 00:37:22.880
to evaluate the thing
that is both promising

00:37:22.880 --> 00:37:25.910
from the exploitation
factor, and also

00:37:25.910 --> 00:37:27.800
promising because
we haven't looked

00:37:27.800 --> 00:37:29.930
at it very much [INAUDIBLE]
exploration factor.
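
NOTE A minimal sketch of the selection rule described here, assuming the
common UCB1 form: an exploitation term (win rate) plus an exploration term
that grows for rarely visited children. The helper name, the constant c, and
the win/visit counts are illustrative assumptions, not taken from the slides.
```python
import math
# Hypothetical helper: UCB1 score of a child with `wins` out of `visits` playouts.
def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
# Two children with equal visit counts: the exploration terms match,
# so the child with the higher win rate is selected.
children = {"left": (2, 3), "right": (1, 3)}  # made-up (wins, visits)
parent_visits = 7
best = max(children, key=lambda k: ucb1(*children[k], parent_visits))
```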

00:37:29.930 --> 00:37:31.513
We're going to pick
between these two.

00:37:31.513 --> 00:37:33.589
Which one is going
to be picked here?

00:37:33.589 --> 00:37:37.392
AUDIENCE: [INAUDIBLE]

00:37:37.392 --> 00:37:39.683
PROFESSOR 3: Because the
exploration factor is the same

00:37:39.683 --> 00:37:44.080
but the exploitation factor is
higher for the one on the left.

00:37:44.080 --> 00:37:45.700
And it's going to
show us a node.

00:37:45.700 --> 00:37:48.649
And the result is going to
be a win for red, which

00:37:48.649 --> 00:37:51.190
means that red has found a good
countermove to the thing that

00:37:51.190 --> 00:37:52.760
was previously
promising for blue.

00:37:52.760 --> 00:37:53.926
And we propagate it back up.

00:37:53.926 --> 00:37:57.610
And finally, we're going to pick
the one furthest on the right.

00:37:57.610 --> 00:37:59.360
Because even though
it's terrible for red,

00:37:59.360 --> 00:38:01.443
and even though it's never
won when it's tried it,

00:38:01.443 --> 00:38:04.285
it has to obey this idea
of the exploration mode

00:38:04.285 --> 00:38:06.910
to find out whether maybe there
isn't something possible there.

00:38:06.910 --> 00:38:09.110
So it explores,
and it goes down,

00:38:09.110 --> 00:38:10.870
and it has to pick
the one on the right.

00:38:10.870 --> 00:38:12.090
And so it does.

00:38:12.090 --> 00:38:13.420
And it plays this game out.

00:38:13.420 --> 00:38:16.180
And it's a loss, again.

00:38:16.180 --> 00:38:19.212
Which goes to show
you, that blue

00:38:19.212 --> 00:38:20.670
has found yet
another superior move

00:38:20.670 --> 00:38:22.570
to this really bad
move of red, where

00:38:22.570 --> 00:38:24.886
probably this move of red,
if this is a game of chess,

00:38:24.886 --> 00:38:26.260
is like putting
my queen directly

00:38:26.260 --> 00:38:27.926
in front of the
opponent's row of pawns,

00:38:27.926 --> 00:38:29.010
and I just leave it there.

00:38:29.010 --> 00:38:31.175
There's nothing good that's
ever going to come of it

00:38:31.175 --> 00:38:33.070
but we have to explore
it just to find out

00:38:33.070 --> 00:38:36.290
whether there isn't some magical
way that I should protect.

00:38:36.290 --> 00:38:39.250
And as you can see,
we've built up this tree

00:38:39.250 --> 00:38:40.460
over and over and over again.

00:38:40.460 --> 00:38:41.970
And it's starting
to look asymmetric.

00:38:41.970 --> 00:38:43.845
And we're starting to
see that there's really

00:38:43.845 --> 00:38:47.170
this disparity between exploring
the regions that are promising in

00:38:47.170 --> 00:38:49.420
this tree and exploring
the regions that are not

00:38:49.420 --> 00:38:52.500
and that don't really
matter to us very much.

00:38:52.500 --> 00:38:55.940
And this is exactly what we
wanted from Monte Carlo trees.

00:38:55.940 --> 00:38:58.475
That was why we started
the whole endeavor

00:38:58.475 --> 00:39:00.030
in the first place.

00:39:00.030 --> 00:39:02.530
The next thing I'm going to
talk about is the pros and cons.

00:39:02.530 --> 00:39:03.905
But before I do
that, does anyone

00:39:03.905 --> 00:39:06.771
have any more questions
about the algorithm?

00:39:06.771 --> 00:39:07.270
Yeah.

00:39:07.270 --> 00:39:09.966
AUDIENCE: It's still not
clear how we're getting nodes

00:39:09.966 --> 00:39:11.322
with different denominators--

00:39:11.322 --> 00:39:13.500
[INAUDIBLE]

00:39:13.500 --> 00:39:16.011
PROFESSOR 3: The reason for
that is because of the way

00:39:16.011 --> 00:39:17.260
that we're simulating through.

00:39:17.260 --> 00:39:19.970
We're actually not holding
onto the results

00:39:19.970 --> 00:39:23.130
of the simulation as we're
going farther down the tree

00:39:23.130 --> 00:39:25.200
than the lowest node we expand.

00:39:25.200 --> 00:39:27.540
For example, when you
simulate from here,

00:39:27.540 --> 00:39:31.030
you're going to propagate that
value here and here, and so on.

00:39:31.030 --> 00:39:32.800
But then when we
expand below, even

00:39:32.800 --> 00:39:35.110
if in the course of
this guy's simulation

00:39:35.110 --> 00:39:36.300
it happened to go
through one of the states

00:39:36.300 --> 00:39:38.160
that we expanded
below, it will not

00:39:38.160 --> 00:39:40.140
have incremented the
values of that state

00:39:40.140 --> 00:39:42.836
because we weren't
keeping track of it.

00:39:42.836 --> 00:39:44.210
Theoretically, if
we were to keep

00:39:44.210 --> 00:39:47.050
track of all of the simulations
that we have in fact run,

00:39:47.050 --> 00:39:51.480
the numbers beneath these
things would be higher.

00:39:51.480 --> 00:39:54.114
AUDIENCE: If you've already
run a simulation from that--

00:39:54.114 --> 00:39:55.530
if you've already
run a simulation

00:39:55.530 --> 00:39:58.080
from that red node when
you first built it,

00:39:58.080 --> 00:40:02.480
and then when you created those
two ones, each of those have

00:40:02.480 --> 00:40:03.156
[INAUDIBLE]

00:40:03.156 --> 00:40:03.822
PROFESSOR 3: OK.

00:40:03.822 --> 00:40:04.520
I see.

00:40:04.520 --> 00:40:06.228
AUDIENCE: So would
the denominator always

00:40:06.228 --> 00:40:08.100
be one more than the
sum of the children?

00:40:08.100 --> 00:40:10.960
PROFESSOR 3: Yeah,
in [INAUDIBLE] Yeah.
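
NOTE A small sketch of why a node's denominator is one more than the sum of
its children's, assuming each node gets exactly one playout when it is
created and every playout result is propagated up to the root. The Node class
and the outcomes are hypothetical; wins are counted from a single player's
point of view to keep the example short.
```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0
    def add_child(self):
        child = Node(parent=self)
        self.children.append(child)
        return child
def backpropagate(node, won):
    # Update the given node and every ancestor; other nodes are untouched.
    while node is not None:
        node.visits += 1
        node.wins += int(won)
        node = node.parent
red = Node()
backpropagate(red, won=True)         # playout run when `red` was first created
for outcome in (True, False, True):  # made-up playout results, one per new child
    backpropagate(red.add_child(), outcome)
children_visits = sum(child.visits for child in red.children)
```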

00:40:13.581 --> 00:40:15.330
AUDIENCE: I understand
how you built that.

00:40:18.304 --> 00:40:20.900
Is there a rule of thumb, like
it's time to choose a move?

00:40:20.900 --> 00:40:23.150
And it seems like you
have very low numbers here

00:40:23.150 --> 00:40:25.080
to make a [INAUDIBLE]

00:40:25.080 --> 00:40:27.000
Is there a rule of
thumb on giving games

00:40:27.000 --> 00:40:29.550
like it's 2 to the 4 or 2
to the 350, whatever it is.

00:40:29.550 --> 00:40:32.335
What kind of numbers do
you need for that first row

00:40:32.335 --> 00:40:35.582
before you [INAUDIBLE]?

00:40:35.582 --> 00:40:38.010
PROFESSOR 3: What we'll get
to soon is that there isn't one.

00:40:38.010 --> 00:40:40.750
That's one of the
problems with MCTS.

00:40:40.750 --> 00:40:44.210
But in terms of which of
the moves you will choose,

00:40:44.210 --> 00:40:48.350
there are actually variants of
MCTS that suggest that you more

00:40:48.350 --> 00:40:51.480
selectively add or
insert new children based

00:40:51.480 --> 00:40:56.430
on something more than just
the blind look right now.

00:40:56.430 --> 00:41:00.320
In terms of, if I'm here and
it's creating my next children

00:41:00.320 --> 00:41:02.805
as the equivalent, then there
are some intelligent guesses

00:41:02.805 --> 00:41:04.430
that you can make in
terms of which one

00:41:04.430 --> 00:41:05.674
you should score first.

00:41:05.674 --> 00:41:07.340
Although it doesn't
particularly matter.

00:41:07.340 --> 00:41:09.540
AUDIENCE: I'm just
saying computational time

00:41:09.540 --> 00:41:11.400
being what it is,
you might say, OK,

00:41:11.400 --> 00:41:13.860
if this is the timeline
of this game I can expect

00:41:13.860 --> 00:41:16.274
to do a million simulations,
which will give me

00:41:16.274 --> 00:41:18.440
if there's 400 nodes, I'm
going to have so much use.

00:41:18.440 --> 00:41:21.310
In other words, is
that enough time

00:41:21.310 --> 00:41:22.962
to say that I can
play through a game?

00:41:22.962 --> 00:41:24.920
I couldn't play through
a game with 400 options

00:41:24.920 --> 00:41:26.982
if I've gotten five out
of seven [INAUDIBLE]

00:41:26.982 --> 00:41:28.190
three out of four [INAUDIBLE]

00:41:28.190 --> 00:41:29.190
PROFESSOR 3: Absolutely.

00:41:29.190 --> 00:41:30.810
And I would say that
so far as I know,

00:41:30.810 --> 00:41:32.260
that's something
that's basically very

00:41:32.260 --> 00:41:33.096
high experimentally.

00:41:33.096 --> 00:41:34.770
They don't have
good bounds on it.

00:41:34.770 --> 00:41:35.645
[INAUDIBLE]

00:41:35.645 --> 00:41:37.020
So let's get on
the first comment

00:41:37.020 --> 00:41:39.070
because that is a
computer element.

00:41:39.070 --> 00:41:41.962
So why should you
use this algorithm?

00:41:41.962 --> 00:41:43.920
Even though we've seen
tremendous breakthroughs

00:41:43.920 --> 00:41:45.607
in this algorithm,
and you're going

00:41:45.607 --> 00:41:47.440
to have to ignore
everything that I tell you

00:41:47.440 --> 00:41:49.020
and remember that
this does actually

00:41:49.020 --> 00:41:51.540
work quite well in
certain scenarios.

00:41:51.540 --> 00:41:53.920
Should we use it or not?

00:41:53.920 --> 00:41:56.225
The pros are that it
actually does the thing

00:41:56.225 --> 00:41:57.141
that we want it to do.

00:41:57.141 --> 00:41:58.515
It grows the tree
asymmetrically.

00:41:58.515 --> 00:42:00.380
It means that we do
not have to explore.

00:42:00.380 --> 00:42:02.340
And it doesn't
explode exponentially

00:42:02.340 --> 00:42:06.049
with the number of moves that
we're looking into the future.

00:42:06.049 --> 00:42:08.590
And that it selectively grows
the tree towards the areas that

00:42:08.590 --> 00:42:11.050
are most promising.

00:42:11.050 --> 00:42:13.010
The other huge
benefit, if you'll

00:42:13.010 --> 00:42:15.290
notice from what we've
just talked through,

00:42:15.290 --> 00:42:17.220
is that it never
relies on anything

00:42:17.220 --> 00:42:19.120
other than the strict
rules of the game.

00:42:19.120 --> 00:42:21.555
What that means is that the
only way that the game is

00:42:21.555 --> 00:42:23.580
factored in is that the
game is what tells us

00:42:23.580 --> 00:42:26.120
what the next moves we can
take from a given state are,

00:42:26.120 --> 00:42:32.310
and whether a given state
is a victory or a defeat.
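
NOTE A sketch of the point that a playout needs only the rules: the legal
moves from a state and a test for who has won. The toy stone-taking game and
all function names below are assumptions chosen for illustration; they do not
come from the lecture.
```python
import random
# Toy game (an assumption): players alternately take 1-3 stones,
# and whoever takes the last stone wins. A state is (stones, player_to_move).
def legal_moves(state):
    stones, _player = state
    return [n for n in (1, 2, 3) if n <= stones]
def apply_move(state, move):
    stones, player = state
    return (stones - move, 1 - player)
def winner(state):
    stones, player = state
    # With no stones left, the player who just moved (not `player`) has won.
    return 1 - player if stones == 0 else None
def random_playout(state, rng):
    # Uses only the rules: legal moves and the terminal test, no heuristics.
    while winner(state) is None:
        state = apply_move(state, rng.choice(legal_moves(state)))
    return winner(state)
rng = random.Random(0)
results = [random_playout((10, 0), rng) for _ in range(1000)]
win_rate_player0 = results.count(0) / len(results)
```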

00:42:32.310 --> 00:42:35.070
And that's kind of
amazing because we

00:42:35.070 --> 00:42:37.650
had no external heuristic
information about this game.

00:42:37.650 --> 00:42:39.850
Which means that if I
took a completely new game

00:42:39.850 --> 00:42:42.720
that someone had just invented,
and I plugged MCTS into it,

00:42:42.720 --> 00:42:47.720
MCTS would be a slightly or
somewhat competitive player

00:42:47.720 --> 00:42:50.600
for this game, which
is a powerful idea.

00:42:50.600 --> 00:42:52.350
It leads to our next two pros.

00:42:52.350 --> 00:42:56.220
The first of which is that it's
very easy to adapt to new games

00:42:56.220 --> 00:42:58.850
that it hasn't seen
before, or even that people

00:42:58.850 --> 00:43:02.160
haven't seen before.

00:43:02.160 --> 00:43:03.720
This is clearly valuable.

00:43:03.720 --> 00:43:05.100
But the other nice
thing about it

00:43:05.100 --> 00:43:07.020
is that even though
heuristics are not

00:43:07.020 --> 00:43:11.810
required to make MCTS
work [INAUDIBLE],

00:43:11.810 --> 00:43:12.840
it can work [INAUDIBLE].

00:43:12.840 --> 00:43:14.340
There are a number
of [? advanced ?]

00:43:14.340 --> 00:43:16.340
places in the algorithm
that you can actually

00:43:16.340 --> 00:43:17.630
incorporate heuristics into.

00:43:17.630 --> 00:43:20.880
Nick is going to talk about how
AlphaGo uses this very heavily.

00:43:20.880 --> 00:43:22.460
AlphaGo is not vanilla MCTS.

00:43:22.460 --> 00:43:24.270
It has a lot of
external information

00:43:24.270 --> 00:43:26.430
that's built into the
way that it works.

00:43:26.430 --> 00:43:29.841
But MCTS is a framework-- you
can imagine there are heuristics you

00:43:29.841 --> 00:43:31.257
can apply in the
simulation, there

00:43:31.257 --> 00:43:33.420
are heuristics you can
apply in the UCB in the way

00:43:33.420 --> 00:43:35.550
that we choose the next node.

00:43:35.550 --> 00:43:37.210
There are places
that it can fit in.

00:43:37.210 --> 00:43:39.376
And this serves as a nice
infrastructure to do so.

00:43:41.320 --> 00:43:45.150
The other benefit is that it's
an on demand algorithm, which

00:43:45.150 --> 00:43:47.660
is particularly valuable when
you're under some sort of time

00:43:47.660 --> 00:43:49.909
pressure, when you're competing
against someone that's

00:43:49.909 --> 00:43:53.100
a mathematician, or when
something is about to explode

00:43:53.100 --> 00:43:57.240
and you have to make a decision
on which reactor to shut down.
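
NOTE A sketch of the on-demand property: keep sampling until a time budget
runs out, then answer with the best estimate so far. To stay short this uses
flat Monte Carlo over the first moves as a stand-in for a full tree search.
The function names, the budget, and the win chances are made-up assumptions.
```python
import random
import time
def best_move_within(budget_seconds, moves, simulate, rng):
    stats = {m: [0, 0] for m in moves}  # move: [wins, playouts]
    deadline = time.monotonic() + budget_seconds
    while True:
        for m in moves:  # one playout per move per pass
            stats[m][0] += simulate(m, rng)
            stats[m][1] += 1
        if time.monotonic() >= deadline:
            break  # stop whenever time is up; the estimates are usable as-is
    return max(moves, key=lambda m: stats[m][0] / stats[m][1]), stats
PROBS = {"a": 0.2, "b": 0.8}  # made-up win chances per first move
def simulate(move, rng):
    # Stand-in for a random playout: returns 1 for a win, 0 for a loss.
    return int(rng.random() < PROBS[move])
move, stats = best_move_within(0.05, list(PROBS), simulate, random.Random(0))
```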

00:43:57.240 --> 00:44:00.180
And lastly-- or not
lastly, actually, it's

00:44:00.180 --> 00:44:02.590
complete, which is
really nice because you

00:44:02.590 --> 00:44:04.740
know that if you run
this game for long enough

00:44:04.740 --> 00:44:08.270
it's going to start looking
a lot like a BFS tree.

00:44:08.270 --> 00:44:09.936
No, it's actually
going to start looking

00:44:09.936 --> 00:44:14.820
like an alpha-beta tree, which
is what it converges to.

00:44:14.820 --> 00:44:16.650
It's a nice property to have.

00:44:16.650 --> 00:44:18.470
Although, this
property does slightly

00:44:18.470 --> 00:44:20.595
get compromised if you
remove the red in this idea,

00:44:20.595 --> 00:44:24.690
and if you only simulate
these [INAUDIBLE].

00:44:24.690 --> 00:44:25.594
Yeah.

00:44:25.594 --> 00:44:27.370
PROFESSOR: You made
an interesting comment

00:44:27.370 --> 00:44:29.410
when you said, oh, it
looks like an alpha-beta tree.

00:44:29.410 --> 00:44:32.290
So it looked like
a mini-max tree.

00:44:32.290 --> 00:44:35.190
But have they also
incorporated notions

00:44:35.190 --> 00:44:37.530
of pruning in the
MCTS, which would make

00:44:37.530 --> 00:44:38.947
it look like an alpha-beta tree?

00:44:38.947 --> 00:44:40.780
PROFESSOR 3: Sorry,
you're completely right.

00:44:40.780 --> 00:44:42.990
It does look like
a mini-max tree.

00:44:42.990 --> 00:44:45.380
I think I've seen variants
where they do pruning,

00:44:45.380 --> 00:44:46.963
but I haven't looked
into it as much.

00:44:46.963 --> 00:44:48.690
But I would imagine
that they would

00:44:48.690 --> 00:44:50.500
converge to whatever
you know pruning

00:44:50.500 --> 00:44:52.020
a certain tree [INAUDIBLE].

00:44:52.020 --> 00:44:54.780
AUDIENCE: But people have
explored incorporating pruning

00:44:54.780 --> 00:44:55.380
into MCTS?

00:44:55.380 --> 00:44:57.170
PROFESSOR 3: I think so.

00:44:57.170 --> 00:45:01.350
I can't say [INAUDIBLE]
And then lastly, it's

00:45:01.350 --> 00:45:02.680
really parallelizable.

00:45:02.680 --> 00:45:05.610
You'll notice, none of
the regions of this tree,

00:45:05.610 --> 00:45:08.005
other than the
original choice, ever

00:45:08.005 --> 00:45:09.380
have to interact
with each other.

00:45:09.380 --> 00:45:12.030
So if you have 200
processors and you decide,

00:45:12.030 --> 00:45:15.169
OK, I'm going to break up this
tree in the first 200 decisions

00:45:15.169 --> 00:45:16.710
and then have each
one of those flesh

00:45:16.710 --> 00:45:20.600
out one of those decisions, that
actually means that they can

00:45:20.600 --> 00:45:22.400
all combine information
right at the end

00:45:22.400 --> 00:45:24.025
and make a decision
[INAUDIBLE], which

00:45:24.025 --> 00:45:29.280
is a really nice, powerful
principle as you [INAUDIBLE].
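
NOTE A sketch of the parallel scheme described here: each worker fleshes out
one first decision on its own, and the results are combined only at the end.
Threads are used just to keep the example self-contained; a real setup would
more likely use separate processes or machines. The win chances and the
explore_subtree helper are made-up assumptions.
```python
import random
from concurrent.futures import ThreadPoolExecutor
PROBS = {"a": 0.3, "b": 0.6, "c": 0.5}  # made-up win chances per first move
def explore_subtree(move, playouts=2000):
    # Stand-in for running a full search below `move`; each worker owns
    # its own random generator and shares no state with the others.
    rng = random.Random(move)
    wins = sum(rng.random() < PROBS[move] for _ in range(playouts))
    return move, wins / playouts
with ThreadPoolExecutor(max_workers=3) as pool:
    # Combine the per-move estimates once every worker has finished.
    estimates = dict(pool.map(explore_subtree, PROBS))
best = max(estimates, key=estimates.get)
```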

00:45:29.280 --> 00:45:31.290
It does have its fair
share of problems.

00:45:31.290 --> 00:45:34.950
The first problem being
that it does break down

00:45:34.950 --> 00:45:38.290
under extreme tree depth.

00:45:38.290 --> 00:45:41.340
The main reason for this
being that as you add

00:45:41.340 --> 00:45:45.150
more moves between you
and the end of the game,

00:45:45.150 --> 00:45:47.250
you're increasing
the probability--

00:45:47.250 --> 00:45:49.604
you are decreasing the
correlation between your game

00:45:49.604 --> 00:45:51.270
state and whether a
random playout would

00:45:51.270 --> 00:45:54.750
suggest that you're in a good
position or a bad position.

00:45:54.750 --> 00:45:56.397
The same goes for
branching factors.

00:45:56.397 --> 00:45:58.605
One of the things that people
sometimes talk about

00:45:58.605 --> 00:46:03.930
is that MCTS AIs cannot
play first-person shooters

00:46:03.930 --> 00:46:07.590
because of the distance between the
number of things that you can

00:46:07.590 --> 00:46:11.460
do at every given moment and
what would be a successful

00:46:11.460 --> 00:46:14.200
approach in the long term
after making many, many,

00:46:14.200 --> 00:46:16.360
many moves that each have
a large branching factor.

00:46:16.360 --> 00:46:20.937
It never begins to explore
the size of the search tree.

00:46:20.937 --> 00:46:22.770
For the most part, it's
not really coming up

00:46:22.770 --> 00:46:24.460
with a long term policy.

00:46:24.460 --> 00:46:27.736
It's really thinking about what
is the next sequence of moves

00:46:27.736 --> 00:46:31.190
that I should [INAUDIBLE].

00:46:31.190 --> 00:46:34.000
Another problem is
that it requires

00:46:34.000 --> 00:46:38.530
simulation to be very
easy and very repeatable.

00:46:38.530 --> 00:46:42.820
So for example, if we
wanted to tell our AI,

00:46:42.820 --> 00:46:44.920
how do I take over Ontario?

00:46:44.920 --> 00:46:46.630
There's not a
particularly good way

00:46:46.630 --> 00:46:49.480
that you can simulate
taking over Ontario.

00:46:49.480 --> 00:46:50.995
If you try it once,
you're not going

00:46:50.995 --> 00:46:52.810
to have an opportunity
to try it again,

00:46:52.810 --> 00:46:56.470
at least with the same
set of configurations.

00:46:56.470 --> 00:46:59.030
And actually, one of the things
that we really took advantage

00:46:59.030 --> 00:47:01.238
of is that random simulation
happens really quickly,

00:47:01.238 --> 00:47:02.865
on the order of microseconds.

00:47:02.865 --> 00:47:07.494
On the other hand, the
bigger your computational

00:47:07.494 --> 00:47:08.910
resources that you
have access to,

00:47:08.910 --> 00:47:10.300
the better the algorithm works.

00:47:10.300 --> 00:47:12.799
That means that I can't run it
off my Mac particularly well.

00:47:12.799 --> 00:47:15.670
At least not for large games.

00:47:15.670 --> 00:47:18.195
It relies on this tenuous
assumption that random play

00:47:18.195 --> 00:47:21.257
is weakly correlated with the
quality of our game state.

00:47:21.257 --> 00:47:23.132
And this is one of the
first assumptions that

00:47:23.132 --> 00:47:25.548
is going to be thrown out the
window for a lot of the more

00:47:25.548 --> 00:47:27.880
advanced MCTS approaches,
which are going to have

00:47:27.880 --> 00:47:29.857
more intelligent playouts.

00:47:29.857 --> 00:47:31.940
But those are going to
lose some of the generality

00:47:31.940 --> 00:47:35.380
that we had before.

00:47:35.380 --> 00:47:38.719
Something that goes off of that
is that MCTS is a framework.

00:47:38.719 --> 00:47:41.260
But in order to actually make
it effective for a lot of games

00:47:41.260 --> 00:47:44.251
it does require a lot of
tuning, in the sense that there

00:47:44.251 --> 00:47:45.500
are a whole bunch of variants.

00:47:45.500 --> 00:47:47.140
And that you need to be
able to implement whatever

00:47:47.140 --> 00:47:48.745
flavor is best suited for you.

00:47:48.745 --> 00:47:51.290
Which means that it's not
quite as nice and black boxy

00:47:51.290 --> 00:47:54.890
as we would want it to be
as far as give it the rules

00:47:54.890 --> 00:47:58.270
and have it magically come up
with a strategy [INAUDIBLE]..

00:47:58.270 --> 00:48:00.160
And then lastly,
as you mentioned,

00:48:00.160 --> 00:48:03.280
there is not a great amount
of literature right now

00:48:03.280 --> 00:48:06.080
about the properties of
MCTS and its convergence,

00:48:06.080 --> 00:48:09.040
and what the actual
proportion of time

00:48:09.040 --> 00:48:11.950
to quality of your solution is.

00:48:11.950 --> 00:48:15.610
This is true of all modern
machine learning things,

00:48:15.610 --> 00:48:18.261
is that there is certainly a lot
more work that could be done.

00:48:18.261 --> 00:48:19.760
But right now,
that's a gap in terms

00:48:19.760 --> 00:48:23.577
of using this for a simulation
that's supposed to be reliable.

00:48:23.577 --> 00:48:27.630
Anyone have any questions
on the Pros and Cons?

00:48:27.630 --> 00:48:29.940
Before we jump dive
into applications,

00:48:29.940 --> 00:48:32.320
let's talk through
a few examples

00:48:32.320 --> 00:48:34.770
of what games could be
solved and could not

00:48:34.770 --> 00:48:36.750
be solved by MCTS.

00:48:36.750 --> 00:48:38.842
Do you guys think that
checkers is a game that

00:48:38.842 --> 00:48:40.610
could be solved by MCTS?

00:48:40.610 --> 00:48:41.369
AUDIENCE: Yes.

00:48:41.369 --> 00:48:43.160
PROFESSOR 3: It's
completely deterministic.

00:48:43.160 --> 00:48:43.826
It's two-player.

00:48:43.826 --> 00:48:46.770
It satisfies all of the criteria
that we've laid out before.

00:48:46.770 --> 00:48:48.240
Checkers is
definitely a game that

00:48:48.240 --> 00:48:51.240
can and has been solved by
MCTS, although not solved

00:48:51.240 --> 00:48:53.760
to the extent that you can
defeat the thing that actually

00:48:53.760 --> 00:48:57.270
has the solution [INAUDIBLE].

00:48:57.270 --> 00:48:58.860
How about "Settlers of Catan?"

00:48:58.860 --> 00:49:00.234
This one's a little
bit trickier.

00:49:00.234 --> 00:49:02.680
Do you guys think that MCTS
is likely to be able to play

00:49:02.680 --> 00:49:04.650
"Settlers of Catan?"

00:49:04.650 --> 00:49:07.844
If not, let's throw out reasons
why or why not it would be

00:49:07.844 --> 00:49:09.080
[INAUDIBLE].

00:49:09.080 --> 00:49:09.580
Yeah.

00:49:09.580 --> 00:49:11.800
AUDIENCE: No because
there's randomness.

00:49:11.800 --> 00:49:14.050
PROFESSOR 3: So yes, that
is absolutely the criticism.

00:49:14.050 --> 00:49:16.640
And that's why we
can't apply it vanilla.

00:49:16.640 --> 00:49:18.820
I put this on here
as a trick question,

00:49:18.820 --> 00:49:20.500
though, because it
turns out that MCTS

00:49:20.500 --> 00:49:22.460
is robust to randomness.

00:49:22.460 --> 00:49:23.990
That you can actually play--

00:49:23.990 --> 00:49:25.614
and I realize that's
just me and we do.

00:49:25.614 --> 00:49:26.470
[LAUGHTER]

00:49:26.470 --> 00:49:29.349
You can actually
play through games.

00:49:29.349 --> 00:49:30.765
If you think about
the simulation,

00:49:30.765 --> 00:49:32.680
the simulation is
actually applicable

00:49:32.680 --> 00:49:35.692
even if the game is
not deterministic

00:49:35.692 --> 00:49:37.650
because it does give you
a sense of the quality

00:49:37.650 --> 00:49:38.870
of your position.

00:49:38.870 --> 00:49:42.830
And the MCTS-based
AI to play "Settlers"

00:49:42.830 --> 00:49:47.946
is, I think, at least 49%
competitive with the best AI

00:49:47.946 --> 00:49:50.860
to play, at least in the
autonomous non-scale space.

00:49:50.860 --> 00:49:53.780
So it does work.

00:49:53.780 --> 00:49:57.710
Let's talk about the War
Operation Plan Response.

00:49:57.710 --> 00:50:00.831
Who here has seen the
movie "War Games?"

00:50:00.831 --> 00:50:01.330
OK.

00:50:01.330 --> 00:50:04.030
Well, it should be more of you.

00:50:04.030 --> 00:50:06.730
The idea of "War
Games" is that one

00:50:06.730 --> 00:50:09.640
of the core characters
in this world

00:50:09.640 --> 00:50:11.380
is this computer
that has been put

00:50:11.380 --> 00:50:15.130
in charge of the national
defense strategy with respect

00:50:15.130 --> 00:50:16.700
to Russia.

00:50:16.700 --> 00:50:19.092
And that it needs to think
through the possible future

00:50:19.092 --> 00:50:21.550
scenarios and decide whether
it's going to launch the nukes

00:50:21.550 --> 00:50:23.170
or not.

00:50:23.170 --> 00:50:27.810
Do you think that WOPR
can be MCTS-based?

00:50:27.810 --> 00:50:29.010
AUDIENCE: No.

00:50:29.010 --> 00:50:29.900
PROFESSOR 3: No.

00:50:29.900 --> 00:50:32.191
AUDIENCE: It could, it
just wouldn't be very good.

00:50:32.191 --> 00:50:33.190
PROFESSOR 3: Absolutely.

00:50:33.190 --> 00:50:34.606
Once you fire the
nukes you're not

00:50:34.606 --> 00:50:36.060
going to get another chance.

00:50:36.060 --> 00:50:37.600
So you can't
particularly simulate

00:50:37.600 --> 00:50:39.620
through what the possible
scenarios are going to be like.

00:50:39.620 --> 00:50:39.910
Yeah.

00:50:39.910 --> 00:50:41.160
AUDIENCE: So what if you had--

00:50:41.160 --> 00:50:43.390
I agree you can't simulate
it in the real world.

00:50:43.390 --> 00:50:45.790
But what if you had
a really good model

00:50:45.790 --> 00:50:47.710
and you just simulated
based on that model?

00:50:51.074 --> 00:50:52.990
PROFESSOR 3: In that
case, it probably depends

00:50:52.990 --> 00:50:55.536
on the quality of your model.

00:50:55.536 --> 00:50:59.236
If you have a good model for
how World War III is going to

00:50:59.236 --> 00:50:59.736
[INAUDIBLE].

00:50:59.736 --> 00:51:02.850
[LAUGHTER]

00:51:02.850 --> 00:51:05.170
AUDIENCE: It is the case
that the military does

00:51:05.170 --> 00:51:10.200
have simulators and they
do war games in simulation.

00:51:10.200 --> 00:51:12.070
PROFESSOR 3: Yes, that's true.

00:51:12.070 --> 00:51:14.655
They could certainly try it
and run MCTS if they wanted.

00:51:14.655 --> 00:51:16.560
And that's what
happened in the movie.

00:51:16.560 --> 00:51:18.667
[INTERPOSING VOICES]

00:51:18.667 --> 00:51:20.542
AUDIENCE: And there
you're putting your money

00:51:20.542 --> 00:51:22.430
in the simulation not in the--

00:51:22.430 --> 00:51:24.910
AUDIENCE: It's like having an
MCTS play SOCOM or something

00:51:24.910 --> 00:51:25.410
like that.

00:51:25.410 --> 00:51:26.160
PROFESSOR 3: Yeah.

00:51:26.160 --> 00:51:29.142
It's definitely about putting
money into the simulation

00:51:29.142 --> 00:51:30.600
and you get really
good simulation.

00:51:30.600 --> 00:51:33.400
If you have a really
good simulations then you

00:51:33.400 --> 00:51:35.490
[INAUDIBLE] to play WOPR.

00:51:35.490 --> 00:51:36.066
Yeah.

00:51:36.066 --> 00:51:37.816
AUDIENCE: Back to
"Settlers" for a second.

00:51:37.816 --> 00:51:40.600
I'm curious if there's a way
for the whole player trading

00:51:40.600 --> 00:51:42.970
resources thing,
or would it have

00:51:42.970 --> 00:51:47.216
to be only purely
like using the ports.

00:51:47.216 --> 00:51:48.760
PROFESSOR 3: That's
a good question.

00:51:48.760 --> 00:51:53.620
I haven't looked closely at
whether they do that or not.

00:51:53.620 --> 00:51:55.400
If it's playing a
two-player game,

00:51:55.400 --> 00:51:58.790
then I would imagine that they
wouldn't because you don't

00:51:58.790 --> 00:52:00.290
really trade in a two-player game.

00:52:00.290 --> 00:52:01.748
But if they weren't,
I bet that you

00:52:01.748 --> 00:52:03.257
can incorporate it with WOPR.

00:52:03.257 --> 00:52:05.090
AUDIENCE: Is it limited
to two-player games?

00:52:05.090 --> 00:52:06.070
PROFESSOR 3: No, not at all.

00:52:06.070 --> 00:52:07.240
In fact, there are
lots of approaches

00:52:07.240 --> 00:52:08.890
that do only
one-player games, where

00:52:08.890 --> 00:52:11.482
you think of what's the best
move that you can make.

00:52:11.482 --> 00:52:12.190
AUDIENCE: I know.

00:52:12.190 --> 00:52:15.215
But I mean, couldn't MCTS handle
three- or four-player games?

00:52:15.215 --> 00:52:16.840
PROFESSOR 3: Yeah,
it absolutely could.

00:52:16.840 --> 00:52:19.622
I'm not sure how they
computed their head-to-head.

00:52:19.622 --> 00:52:21.730
That might be
completely flat cursors.

00:52:21.730 --> 00:52:24.734
I'm not even sure how
the settlers interact.

00:52:24.734 --> 00:52:25.234
Yeah.

00:52:25.234 --> 00:52:26.359
AUDIENCE: A quick question.

00:52:26.359 --> 00:52:29.460
So at first you know if I
reduce the chess board to only 4

00:52:29.460 --> 00:52:32.530
by 4 or 5 by 5, and
I run MCTS versus

00:52:32.530 --> 00:52:35.050
the traditional algorithm that
AlphaGo offered as a tree.

00:52:35.050 --> 00:52:38.222
Do you think MCTS will
prefer theory and perform

00:52:38.222 --> 00:52:40.219
this computational requirement.

00:52:40.219 --> 00:52:42.510
PROFESSOR 3: The thing about
the way that Deep Blue is,

00:52:42.510 --> 00:52:44.570
which is the AI that
ended the Kasparov

00:52:44.570 --> 00:52:47.370
thing, the
chess grandmaster,

00:52:47.370 --> 00:52:49.970
is that it has a tremendous
amount of heuristic

00:52:49.970 --> 00:52:50.587
information.

00:52:50.587 --> 00:52:52.170
There's a lot of
external stuff that's

00:52:52.170 --> 00:52:54.310
incorporated into the
system that makes it

00:52:54.310 --> 00:52:57.250
able to explore the best paths.

00:52:57.250 --> 00:52:59.500
What I would say is
that knowledge-less

00:52:59.500 --> 00:53:03.730
MCTS based on randomness,
would take a very long

00:53:03.730 --> 00:53:07.850
computational time to even
become competitive with those

00:53:07.850 --> 00:53:10.382
kinds of algorithms, and
probably feasibly never would.

00:53:10.382 --> 00:53:12.340
But if you incorporated heuristic information,
heuristic information,

00:53:12.340 --> 00:53:15.320
I think that there's a bunch of
hope in terms of getting MCTS

00:53:15.320 --> 00:53:16.600
to start performing better.

00:53:16.600 --> 00:53:18.850
And you can look at what
next I'm going to talk about,

00:53:18.850 --> 00:53:19.460
AlphaGo.

00:53:19.460 --> 00:53:22.040
It takes inspiration for how
we go about incorporating

00:53:22.040 --> 00:53:22.940
these heuristics.

00:53:22.940 --> 00:53:27.147
AUDIENCE: So only the
heuristic you [INAUDIBLE]

00:53:27.147 --> 00:53:28.980
PROFESSOR 3: It definitely
seems like if you

00:53:28.980 --> 00:53:33.330
have a really good
heuristic model for what

00:53:33.330 --> 00:53:38.530
good states in the game are,
that if it's a smaller search

00:53:38.530 --> 00:53:42.420
space, that some other
models could perform better.

00:53:42.420 --> 00:53:44.546
Although, I'm probably
going to eat my foot here

00:53:44.546 --> 00:53:47.390
because this is going to be
on OCW some massive amount,

00:53:47.390 --> 00:53:49.936
massive chess
playing algorithms.

00:53:49.936 --> 00:53:53.429
Eat my shoe not my foot.

00:53:53.429 --> 00:53:55.430
[LAUGHTER]

00:53:55.430 --> 00:53:58.100
One last game.

00:53:58.100 --> 00:54:00.364
Does anyone know
what this game is?

00:54:00.364 --> 00:54:01.280
AUDIENCE: "Total War?"

00:54:01.280 --> 00:54:02.380
PROFESSOR 3: Yes.

00:54:02.380 --> 00:54:03.060
Nice.

00:54:03.060 --> 00:54:04.850
This is "Rome, Total War II."

00:54:04.850 --> 00:54:09.890
It's a simulator for this
tremendous real time strategy

00:54:09.890 --> 00:54:13.100
game, where you play, I
think, the Roman Empire.

00:54:13.100 --> 00:54:17.540
And you're controlling armies
and huge infrastructure systems

00:54:17.540 --> 00:54:20.980
that move and conquer
states and continents,

00:54:20.980 --> 00:54:24.530
and meet in the field, and
manage resources, and do

00:54:24.530 --> 00:54:26.860
all of these incredible
diplomacy feats.

00:54:26.860 --> 00:54:29.347
And so do you think that this
game can be solved by MCTS?

00:54:29.347 --> 00:54:29.930
AUDIENCE: Yes.

00:54:29.930 --> 00:54:32.409
AUDIENCE: Yes.

00:54:32.409 --> 00:54:33.450
PROFESSOR 3: Let's say no.

00:54:33.450 --> 00:54:34.658
But I guess I put it on here.

00:54:34.658 --> 00:54:36.870
So that's good on you.

00:54:36.870 --> 00:54:40.925
The way that the AI in
"Rome, Total War II" is built

00:54:40.925 --> 00:54:43.200
is that it's built
on an MCTS structure.

00:54:43.200 --> 00:54:45.980
And it in fact does
do resource allocation

00:54:45.980 --> 00:54:47.780
and a lot of its
political maneuvers

00:54:47.780 --> 00:54:49.439
based on Monte Carlo
Tree Search moves.

00:54:49.439 --> 00:54:51.355
There are a bunch of
reasons that they explain

00:54:51.355 --> 00:54:53.390
in the game for
why they do this,

00:54:53.390 --> 00:54:54.961
or in papers released
about the game.

00:54:54.961 --> 00:54:56.835
But one of the nice ones
is that it's random,

00:54:56.835 --> 00:54:58.293
which means that
you're never going

00:54:58.293 --> 00:55:01.280
to play against the same kind
of AI twice because every time

00:55:01.280 --> 00:55:02.750
the set of decisions that
it's going to think about

00:55:02.750 --> 00:55:03.736
is completely different.

00:55:03.736 --> 00:55:04.608
AUDIENCE: I have
a quick question.

00:55:04.608 --> 00:55:05.358
PROFESSOR 3: Yeah.

00:55:05.358 --> 00:55:07.646
AUDIENCE: So if I want to
model any game with MCTS,

00:55:07.646 --> 00:55:10.996
does it have to be that the
actions in playing a game

00:55:10.996 --> 00:55:14.272
have to be discretizable?

00:55:14.272 --> 00:55:14.980
PROFESSOR 3: Yes.

00:55:14.980 --> 00:55:17.755
So far as I know, I haven't
seen many continuous variants

00:55:17.755 --> 00:55:19.520
in MCTS.

00:55:19.520 --> 00:55:22.680
And so, I think that it is about
choosing discrete actions, which

00:55:22.680 --> 00:55:26.130
on its most narrow level does
actually bring it down to here.

00:55:26.130 --> 00:55:27.630
I think one of the
reasons that this

00:55:27.630 --> 00:55:30.046
is nice is that there are so
many different decisions that

00:55:30.046 --> 00:55:32.610
could be made that MCTS is
really the only approach that

00:55:32.610 --> 00:55:35.525
could even begin to handle the
massive branching factor that's

00:55:35.525 --> 00:55:37.810
associated with the
game Rome, Total War.

00:55:37.810 --> 00:55:38.579
Yeah.

00:55:38.579 --> 00:55:40.162
AUDIENCE: This is
also the consequence

00:55:40.162 --> 00:55:43.407
of this year you get the play
off when this game comes.

00:55:43.407 --> 00:55:44.740
PROFESSOR 3: That's interesting.

00:55:44.740 --> 00:55:46.600
That's probably totally it.

00:55:46.600 --> 00:55:47.170
That's cool.

00:55:50.640 --> 00:55:54.710
That's everything about how
the algorithm actually works.

00:55:54.710 --> 00:55:56.134
I'm going to pass
it off to Nick,

00:55:56.134 --> 00:55:58.550
and he's going to talk to us
about some actual applications

00:55:58.550 --> 00:56:00.317
for this game [INAUDIBLE].

00:56:04.221 --> 00:56:05.596
PROFESSOR 3: So
as you have said,

00:56:05.596 --> 00:56:07.980
I'm going to start diving
into some applications here.

00:56:07.980 --> 00:56:12.180
And not only applications
but also some modifications

00:56:12.180 --> 00:56:13.460
or augmentations of MCTS.

00:56:13.460 --> 00:56:16.590
It should hopefully clarify
some of the side questions

00:56:16.590 --> 00:56:21.610
you all have been having
on slight tweaks to MCTS.

00:56:21.610 --> 00:56:23.470
Now let's get started.

00:56:23.470 --> 00:56:24.230
Wait for it.

00:56:24.230 --> 00:56:25.267
Now let's get started.

00:56:25.267 --> 00:56:25.766
[LAUGHTER]

00:56:25.766 --> 00:56:27.600
Part III, applications.

00:56:27.600 --> 00:56:29.025
First thing we're
going to look at

00:56:29.025 --> 00:56:31.110
is an MCTS-based
"Mario" controller.

00:56:31.110 --> 00:56:35.290
And "Mario" might seem like
some weird thing to test AI on,

00:56:35.290 --> 00:56:38.320
but there actually is a "Super
Mario Bros" AI benchmark,

00:56:38.320 --> 00:56:39.930
which is used to
test a lot of AI

00:56:39.930 --> 00:56:42.280
on how well they could
play this platformer.

00:56:42.280 --> 00:56:45.290
In case any of you don't
know what "Super Mario

00:56:45.290 --> 00:56:47.420
Bros" is, this is a screenshot.

00:56:47.420 --> 00:56:49.170
Basically, you control
this one character.

00:56:49.170 --> 00:56:52.780
It's a single-player game.

00:56:52.780 --> 00:56:55.920
The ultimate goal is to
reach this flag at the end.

00:56:55.920 --> 00:56:58.180
But along the way
there's enemies,

00:56:58.180 --> 00:57:01.046
there's some bonus
shrooms you can get.

00:57:01.046 --> 00:57:03.870
If you break open some
boxes you might get coins,

00:57:03.870 --> 00:57:06.360
things like that.

00:57:06.360 --> 00:57:09.642
But first, let's just highlight
some of the modifications that

00:57:09.642 --> 00:57:12.100
need to be made, or some of
the differences between vanilla

00:57:12.100 --> 00:57:16.590
MCTS and an MCTS that's going
to be able to work for "Mario."

00:57:16.590 --> 00:57:18.296
First thing is that
it's single-player.

00:57:18.296 --> 00:57:21.000
The second is, we use a
slightly different simulation

00:57:21.000 --> 00:57:25.130
strategy than the initial
just vanilla simulation.

00:57:25.130 --> 00:57:27.760
And someone actually hinted at
doing more than one simulation

00:57:27.760 --> 00:57:32.280
because you, you were asking
about n simulations, I think.

00:57:32.280 --> 00:57:33.810
We'll touch on that.

00:57:33.810 --> 00:57:36.630
Then this also introduces
what I would consider

00:57:36.630 --> 00:57:38.840
to be domain knowledge.

00:57:38.840 --> 00:57:42.854
Then finally, there's a 50 to
40 millisecond computation time.

00:57:42.854 --> 00:57:45.270
And that has to do with the
frames per second of the game.

00:57:45.270 --> 00:57:48.240
So you would think that
"Mario" is a continuous game,

00:57:48.240 --> 00:57:50.950
but if we discretize
time into these chunks,

00:57:50.950 --> 00:57:54.380
then we can use MCTS.

00:57:54.380 --> 00:57:56.570
Now let's just think about
how we could possibly

00:57:56.570 --> 00:57:57.840
formulate this problem.

00:57:57.840 --> 00:58:00.630
Can anyone think of
what each of these nodes

00:58:00.630 --> 00:58:02.586
would be if we're
playing "Super Mario?"

00:58:02.586 --> 00:58:04.169
AUDIENCE: Jump.

00:58:04.169 --> 00:58:04.960
PROFESSOR 3: Sorry?

00:58:04.960 --> 00:58:05.585
AUDIENCE: Jump.

00:58:05.585 --> 00:58:07.960
It would be like, first
node you're going to jump.

00:58:07.960 --> 00:58:11.600
PROFESSOR 3: That might
be a way to formulate it.

00:58:11.600 --> 00:58:13.522
But I think that could get--

00:58:13.522 --> 00:58:16.099
AUDIENCE: Oh, it's not your
controller inputs [INAUDIBLE].

00:58:16.099 --> 00:58:16.890
PROFESSOR 3: Right.

00:58:16.890 --> 00:58:22.110
So the node itself isn't
going to be an action.

00:58:22.110 --> 00:58:23.490
AUDIENCE: Equal frames.

00:58:23.490 --> 00:58:24.698
PROFESSOR 3: Yeah, basically.

00:58:24.698 --> 00:58:27.320
So it's going to be the
state of a game, what

00:58:27.320 --> 00:58:28.490
we'll call a state.

00:58:28.490 --> 00:58:30.360
So it's basically
just a screen grab.

00:58:30.360 --> 00:58:31.990
And I take it,
in this case, it's

00:58:31.990 --> 00:58:35.190
a 15 by 19 grid screen
grab of the game.

00:58:35.190 --> 00:58:37.083
And it will have
information about-- it

00:58:37.083 --> 00:58:40.240
knows Mario's position, it knows
the enemy's position, position

00:58:40.240 --> 00:58:42.600
of the blocks, et cetera.

00:58:42.600 --> 00:58:45.370
And then, as Yo
was saying, in MCTS

00:58:45.370 --> 00:58:47.660
we have values associated
with our nodes.

00:58:47.660 --> 00:58:49.240
And so it will
also have a value.

00:58:49.240 --> 00:58:52.890
But we'll get into the
value in the next slide

00:58:52.890 --> 00:58:56.820
because I can't really
fit it all in here.

00:58:56.820 --> 00:58:58.930
With that being said
for our node, that

00:58:58.930 --> 00:59:02.490
being the state of the game,
what makes sense for the edge?

00:59:02.490 --> 00:59:03.420
Does anyone know?

00:59:03.420 --> 00:59:05.990
How do we transition from
one state to another state?

00:59:05.990 --> 00:59:07.220
AUDIENCE: Jump.

00:59:07.220 --> 00:59:07.710
PROFESSOR 3: Yeah, exactly.

00:59:07.710 --> 00:59:09.560
So this is where the
jump and all the actions

00:59:09.560 --> 00:59:10.268
come into play.

00:59:10.268 --> 00:59:11.970
So the actions that you take--

00:59:11.970 --> 00:59:13.230
I didn't list all the actions.

00:59:13.230 --> 00:59:16.502
You can also have a jump left,
jump right, all those things.

00:59:16.502 --> 00:59:17.960
But basically, the
actions are what

00:59:17.960 --> 00:59:19.209
takes you from state to state.

00:59:19.209 --> 00:59:22.792
So I just drew out
what a node might

00:59:22.792 --> 00:59:24.375
look like if you
used the jump action.

00:59:24.375 --> 00:59:27.078
You might have Mario
go up in the sky.
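
NOTE
A minimal sketch of the formulation just described: each node holds a game state, and each edge is an action that leads to the next state.
The GameState fields, the action names, and the step function are hypothetical stand-ins for illustration, not the benchmark's actual interface.
```python
from dataclasses import dataclass, field
# Hypothetical action set; the talk only names a few of these.
ACTIONS = ("left", "right", "jump", "jump_left", "jump_right")
@dataclass(frozen=True)
class GameState:
    mario_x: int
    mario_y: int
    enemy_x: int
    enemy_dx: int  # enemy speed and direction per time step
@dataclass
class Node:
    state: GameState
    total_score: float = 0.0  # sum of simulation scores backed up to this node
    visits: int = 0
    children: dict = field(default_factory=dict)  # action -> child Node
def step(state: GameState, action: str) -> GameState:
    """Toy transition: advance one discretized time step."""
    dx = {"left": -1, "jump_left": -1, "right": 1, "jump_right": 1}.get(action, 0)
    dy = 1 if action.startswith("jump") else 0
    # The enemy moves on its own, so its next position is known in advance.
    return GameState(state.mario_x + dx, state.mario_y + dy,
                     state.enemy_x + state.enemy_dx, state.enemy_dx)
def expand(node: Node, action: str) -> Node:
    """Follow the edge labeled `action`, creating the child node if needed."""
    if action not in node.children:
        node.children[action] = Node(step(node.state, action))
    return node.children[action]
root = Node(GameState(mario_x=0, mario_y=0, enemy_x=5, enemy_dx=-1))
child = expand(root, "jump")
```
Here the jump edge leads to a child whose state has Mario one cell higher and the enemy one cell closer.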

00:59:27.078 --> 00:59:28.572
Are there questions?

00:59:28.572 --> 00:59:30.790
AUDIENCE: Does it just
run the rest of it?

00:59:30.790 --> 00:59:34.520
Because that little thing's
moving as they move on?

00:59:34.520 --> 00:59:37.260
PROFESSOR 3: Well, it's not
moving in this moment in time.

00:59:37.260 --> 00:59:39.799
We're discretizing
time right now.

00:59:39.799 --> 00:59:41.840
AUDIENCE: But I'm saying,
if your action is jump,

00:59:41.840 --> 00:59:46.047
just you would have 1,000
nodes because if you did

00:59:46.047 --> 00:59:48.130
plan out where that thing's
moving, left or right,

00:59:48.130 --> 00:59:48.710
then it could be--

00:59:48.710 --> 00:59:49.751
PROFESSOR 3: Yeah, right.

00:59:49.751 --> 00:59:52.450
So in each state we
have the enemy position.

00:59:52.450 --> 00:59:54.110
And we know the
speed and direction.

00:59:54.110 --> 00:59:57.290
And so we know when we go from
this node to one time step

00:59:57.290 --> 01:00:00.694
later, we'll know where
the enemy's moving.

01:00:00.694 --> 01:00:04.530
Any other questions?

01:00:04.530 --> 01:00:06.550
Moving on.

01:00:06.550 --> 01:00:07.290
Sorry.

01:00:07.290 --> 01:00:09.080
Let me just preface
this part real quick.

01:00:09.080 --> 01:00:11.970
So in our other simulations,
at the end of the simulation

01:00:11.970 --> 01:00:14.700
we would get either a one or a
zero, if we'd won tic-tac-toe

01:00:14.700 --> 01:00:17.910
or we lost tic-tac-toe.

01:00:17.910 --> 01:00:21.600
But that won't really work
too well here because there's

01:00:21.600 --> 01:00:23.850
a lot of other factors
that go into play when

01:00:23.850 --> 01:00:25.410
you're playing "Mario."

01:00:25.410 --> 01:00:28.450
Also, if you're doing a
simulation, more than likely,

01:00:28.450 --> 01:00:30.210
you're going to end
up hitting an enemy

01:00:30.210 --> 01:00:32.380
and dying or falling
into a gap and dying.

01:00:32.380 --> 01:00:34.830
So a lot of these simulations
might all return zero.

01:00:34.830 --> 01:00:38.440
And that is, you can't really
distinguish between them.

01:00:38.440 --> 01:00:41.670
So this is why I say,
this version of MCTS

01:00:41.670 --> 01:00:44.370
introduces what I would
consider to be domain knowledge.

01:00:44.370 --> 01:00:46.860
Basically, they're
assigning scores

01:00:46.860 --> 01:00:50.680
to potential things that
could happen along the way.

01:00:50.680 --> 01:00:55.630
And this is basically telling
the AI that collecting a flower

01:00:55.630 --> 01:00:57.936
is a little bit better
than collecting a mushroom.

01:00:57.936 --> 01:00:59.760
It's telling it that
getting hurt is bad.

01:00:59.760 --> 01:01:02.130
Right off the bat, all
these things in the score

01:01:02.130 --> 01:01:05.100
are giving the AI some domain
knowledge about "Super Mario

01:01:05.100 --> 01:01:07.901
Bros," that is helping
it calculate the simulation

01:01:07.901 --> 01:01:08.400
results.

01:01:10.985 --> 01:01:14.280
As it says here, it's just doing
a multi-objective weighted sum

01:01:14.280 --> 01:01:15.450
of all these things.

01:01:15.450 --> 01:01:17.825
Throughout the simulation it's
just adding up your score.

01:01:17.825 --> 01:01:20.742
And then that's the score that
is going to be propagated.
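
NOTE
A rough sketch of the weighted-sum scoring described here, assuming the simulation reports a count for each kind of event.
Only the distance weight (0.1) and the time-left weight (2) are quoted in the talk; every other name and weight below is a made-up placeholder.
```python
# "distance" and "time_left" use the weights quoted in the talk;
# the remaining entries are placeholders for illustration only.
WEIGHTS = {
    "distance": 0.1,
    "time_left": 2.0,
    "coins": 1.0,          # placeholder
    "kills_by_fire": 1.0,  # placeholder
    "hurt": -1.0,          # placeholder: getting hurt is bad
}
def simulation_score(events: dict) -> float:
    """Weighted sum of the event counts gathered during one simulation."""
    return sum(WEIGHTS[name] * count for name, count in events.items())
score = simulation_score({"distance": 50, "coins": 3, "hurt": 1})
```
Changing a weight changes which simulations look best, which is the tuning knob discussed in the questions that follow.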

01:01:20.742 --> 01:01:24.479
Are there questions
about the score?

01:01:24.479 --> 01:01:26.520
AUDIENCE: You said that
it adds up all these guys

01:01:26.520 --> 01:01:27.730
and it propagates it over.

01:01:27.730 --> 01:01:33.086
Is it possible to just propagate
the multi-part sum [INAUDIBLE]

01:01:33.086 --> 01:01:37.655
as opposed to propagating
one value that you create?

01:01:37.655 --> 01:01:39.600
Are you essentially
propagating all--

01:01:39.600 --> 01:01:42.375
what's this?-- 15 values
upwards at every node, or are

01:01:42.375 --> 01:01:43.500
you propagating one value--

01:01:43.500 --> 01:01:44.220
PROFESSOR 3: Well,
it's one value.

01:01:44.220 --> 01:01:45.000
It's the collective--

01:01:45.000 --> 01:01:46.315
AUDIENCE: Then you make
them add it together

01:01:46.315 --> 01:01:48.065
and you got each one
of them a sub factor.

01:01:50.164 --> 01:01:52.330
PROFESSOR 3: Then also,
just one thing to note here,

01:01:52.330 --> 01:01:55.230
is distance, you get 0.1.

01:01:55.230 --> 01:01:58.650
And these are all parameters
that have been tuned.

01:01:58.650 --> 01:02:01.470
In the initial version,
distance was, I think,

01:02:01.470 --> 01:02:05.035
a reward of five,
but he probably realized

01:02:05.035 --> 01:02:08.500
that that made Mario skip past
a lot of coins and things.

01:02:08.500 --> 01:02:11.050
And so he tweaked
the score for that.

01:02:11.050 --> 01:02:13.100
And also, time left is two.

01:02:13.100 --> 01:02:14.460
So there's some weight there.

01:02:14.460 --> 01:02:16.700
You want to get to the
very end of the game.

01:02:16.700 --> 01:02:18.720
AUDIENCE: If you're
pushing up this score,

01:02:18.720 --> 01:02:20.850
it's no longer a
win over losses.

01:02:20.850 --> 01:02:22.920
So it's not w over n.

01:02:22.920 --> 01:02:23.950
What is it affecting?

01:02:23.950 --> 01:02:25.764
PROFESSOR 3: You can
just use the score.

01:02:25.764 --> 01:02:26.930
AUDIENCE: The score is the--

01:02:26.930 --> 01:02:27.860
PROFESSOR 3: Yeah.

01:02:27.860 --> 01:02:31.620
In MCTS you have
this idea of when

01:02:31.620 --> 01:02:36.540
you're propagating your q value,
you could have that to be zero,

01:02:36.540 --> 01:02:37.080
one.

01:02:37.080 --> 01:02:39.665
AUDIENCE: It's like the sum of
all the scores and the nodes

01:02:39.665 --> 01:02:41.290
below over the number
of games you win.

01:02:41.290 --> 01:02:42.150
PROFESSOR 3: So
basically, what you

01:02:42.150 --> 01:02:44.700
would be getting when you divide
by the number of simulations

01:02:44.700 --> 01:02:47.980
is your average
score at that node.
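
NOTE
One way to picture the answer above: each node keeps a running total of simulation scores and a visit count, and the value used for comparison is their ratio.
The class and function names are illustrative assumptions, not taken from the controller.
```python
class Stats:
    """Per-node running totals."""
    def __init__(self) -> None:
        self.total_score = 0.0
        self.visits = 0
    def average(self) -> float:
        return self.total_score / self.visits if self.visits else 0.0
def backpropagate(path: list, score: float) -> None:
    """Add one simulation's score to every node from the leaf up to the root."""
    for stats in path:
        stats.total_score += score
        stats.visits += 1
root, leaf = Stats(), Stats()
backpropagate([leaf, root], 12.0)  # a simulation that started below the leaf
backpropagate([root], 4.0)         # a simulation through a different child
```
With win/loss games the same ratio is wins over visits; here it is the average score seen through that node.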

01:02:47.980 --> 01:02:50.330
AUDIENCE: OK.

01:02:50.330 --> 01:02:53.550
AUDIENCE: When you have
killsByFire and [INAUDIBLE]

01:02:53.550 --> 01:02:56.490
like that, if you
have a positive value,

01:02:56.490 --> 01:02:58.562
then isn't it good
to be killed by fire,

01:02:58.562 --> 01:02:59.520
or something like that?

01:02:59.520 --> 01:03:01.436
PROFESSOR 3: This is
killing an enemy by fire.

01:03:01.436 --> 01:03:04.350
Like Mario could collect a
certain flower or mushroom?

01:03:04.350 --> 01:03:07.060
I think flower, then you
have a fire breath and you

01:03:07.060 --> 01:03:07.560
[INAUDIBLE].

01:03:07.560 --> 01:03:10.864
AUDIENCE: So that's Mario's
status if Mario never dies?

01:03:10.864 --> 01:03:11.530
PROFESSOR 3: No.

01:03:11.530 --> 01:03:13.120
Mario's status is--

01:03:13.120 --> 01:03:15.086
I believe, Mario's
status is the fact

01:03:15.086 --> 01:03:17.710
that you could upgrade Mario by
collecting [INAUDIBLE] mushroom

01:03:17.710 --> 01:03:20.154
from a fire Mario.

01:03:20.154 --> 01:03:21.570
So that gives you
a lot of points.

01:03:21.570 --> 01:03:22.945
Because if you
become fire Mario,

01:03:22.945 --> 01:03:26.795
then you're more likely to not
die by running into enemies

01:03:26.795 --> 01:03:28.434
because you have fire-spewing--

01:03:28.434 --> 01:03:30.225
AUDIENCE: You said they
spent a lot of time

01:03:30.225 --> 01:03:31.900
tuning these parameters.

01:03:31.900 --> 01:03:34.170
Isn't it generally, though,
just an optimization

01:03:34.170 --> 01:03:36.540
framework if that's
some formula?

01:03:36.540 --> 01:03:38.215
So they tuned the
parameters just

01:03:38.215 --> 01:03:41.370
to make it behave the way
that we think is nice.

01:03:41.370 --> 01:03:43.480
But if you change
the values, they'll

01:03:43.480 --> 01:03:45.120
do the right thing
for that equation.

01:03:45.120 --> 01:03:45.870
PROFESSOR 3: Yeah.

01:03:45.870 --> 01:03:46.320
AUDIENCE: OK.

01:03:46.320 --> 01:03:47.028
PROFESSOR 3: Yes.

01:03:47.028 --> 01:03:50.830
But they were tuning this to
make it play how they wanted.

01:03:50.830 --> 01:03:56.069
AUDIENCE: [INAUDIBLE] can't just
be a reflection of [INAUDIBLE]

01:03:56.069 --> 01:03:57.360
PROFESSOR 3: That's a strategy.

01:03:57.360 --> 01:03:59.380
If you choose that,
I don't see why not.

01:03:59.380 --> 01:04:02.352
That might affect
certain things.

01:04:02.352 --> 01:04:04.560
Obviously, you can change
these to whatever you want.

01:04:04.560 --> 01:04:06.730
It'll slightly tweak
which simulations

01:04:06.730 --> 01:04:10.240
end up working better, in terms
of changing which nodes you

01:04:10.240 --> 01:04:12.640
end up choosing [INAUDIBLE].

01:04:12.640 --> 01:04:15.670
So we move on.

01:04:15.670 --> 01:04:17.820
So we know about
scoring simulations.

01:04:17.820 --> 01:04:20.160
Now we're going to look at
exactly the simulation type

01:04:20.160 --> 01:04:23.580
that's used to play
this MCTS controller.

01:04:23.580 --> 01:04:25.500
So the regular version
that Yo talked about

01:04:25.500 --> 01:04:28.290
is just choosing a
random node at each level

01:04:28.290 --> 01:04:30.300
in your simulation.

01:04:30.300 --> 01:04:31.840
But there are some
other strategies.

01:04:31.840 --> 01:04:32.965
And someone brought one up.

01:04:32.965 --> 01:04:34.500
The first is, look at best of n.

01:04:34.500 --> 01:04:39.280
So in this one, you choose three
random nodes at each level,

01:04:39.280 --> 01:04:43.122
except that you stick with
the best of those three.

01:04:43.122 --> 01:04:45.080
Choose three random nodes,
stick with this one.

01:04:45.080 --> 01:04:45.871
Go to the next one.

01:04:45.871 --> 01:04:48.671
You would choose another random
three, take the best one,

01:04:48.671 --> 01:04:49.920
and then go to the next level.

01:04:49.920 --> 01:04:53.216
You are able to do that
in this game because

01:04:53.216 --> 01:04:54.590
of the way the
scoring works, you

01:04:54.590 --> 01:04:56.923
don't have to get to the end
of the game for your score.

01:04:56.923 --> 01:05:00.425
You actually could collect
a coin along the way.

01:05:00.425 --> 01:05:01.860
If this is jump,
and then it gets

01:05:01.860 --> 01:05:03.610
you a coin versus
moving left and right.

01:05:03.610 --> 01:05:04.720
That doesn't give
you any points.

01:05:04.720 --> 01:05:07.310
Then this is the node that would
give you the highest scores,

01:05:07.310 --> 01:05:09.376
so I would choose
that one, et cetera.
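
NOTE
A sketch of the best-of-n simulation strategy as described: at each level, sample n random actions and keep the one with the best immediate score.
The toy world, its reward, and the function signature are hypothetical and only serve to make the example run.
```python
import random
def best_of_n_rollout(state, actions, step, reward, depth, n=3, rng=random):
    """Roll out `depth` steps; at each one sample n random actions, keep the best."""
    total = 0.0
    for _ in range(depth):
        candidates = [rng.choice(actions) for _ in range(n)]
        best = max(candidates, key=lambda a: reward(state, a))
        total += reward(state, best)
        state = step(state, best)
    return total
# Toy world: a position on a line, with a coin above every even position.
ACTIONS = ("left", "right", "jump")
def step(pos, action):
    return pos + {"left": -1, "right": 1, "jump": 0}[action]
def reward(pos, action):
    return 1.0 if action == "jump" and pos % 2 == 0 else 0.0
value = best_of_n_rollout(0, ACTIONS, step, reward, depth=10, rng=random.Random(0))
```
This only works because the score is available at every step, not just at the end of the game.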

01:05:09.376 --> 01:05:11.250
And then the final one,
which is the one that

01:05:11.250 --> 01:05:12.791
is actually used
for this controller,

01:05:12.791 --> 01:05:14.700
is multi-simulation.

01:05:14.700 --> 01:05:17.043
This was brought up by him.

01:05:17.043 --> 01:05:18.005
I don't know your name.

01:05:18.005 --> 01:05:18.505
Sorry.

01:05:18.505 --> 01:05:21.410
But basically, you run
multiple random simulations

01:05:21.410 --> 01:05:22.035
from your node.

01:05:22.035 --> 01:05:24.990
And then you propagate up
whichever of those simulations

01:05:24.990 --> 01:05:26.580
give you the highest value.

01:05:26.580 --> 01:05:28.600
And the reason to do
multiple simulations

01:05:28.600 --> 01:05:33.767
is to attempt to increase the
accuracy of your simulations.

01:05:33.767 --> 01:05:35.141
If you just do
one simulation you

01:05:35.141 --> 01:05:36.307
might just get really lucky.

01:05:36.307 --> 01:05:40.490
But if you do three then you
can take the highest value

01:05:40.490 --> 01:05:42.450
and use that as your value.

01:05:42.450 --> 01:05:45.424
Since the whole point of this is
to try to make moves that get you

01:05:45.424 --> 01:05:47.330
the highest values,
then that will

01:05:47.330 --> 01:05:50.700
make your random simulation
value more accurate.
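
NOTE
A sketch of multi-simulation as described: run several independent random simulations from the same node and back up the highest score.
The toy step function and the default of three simulations are assumptions for illustration.
```python
import random
def random_rollout(simulate_step, state, depth, rng):
    """One plain random simulation: sum the score of `depth` random steps."""
    total = 0.0
    for _ in range(depth):
        state, gained = simulate_step(state, rng)
        total += gained
    return total
def multi_simulation(simulate_step, state, depth, k=3, rng=None):
    """Run k independent random rollouts and return the highest score."""
    rng = rng or random.Random()
    return max(random_rollout(simulate_step, state, depth, rng) for _ in range(k))
# Toy step: move right and gain 0 or 1 points at random.
def toy_step(pos, rng):
    return pos + 1, float(rng.randint(0, 1))
single = random_rollout(toy_step, 0, 20, random.Random(1))
best = multi_simulation(toy_step, 0, 20, k=3, rng=random.Random(1))
```
Taking the maximum costs k times the work per node, which is the time trade-off raised in the next question.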

01:05:50.700 --> 01:05:52.915
Are there questions
about multi-simulation?

01:05:52.915 --> 01:05:55.040
AUDIENCE: So what do you
think about the simulation

01:05:55.040 --> 01:05:58.940
[INAUDIBLE] how many [INAUDIBLE]

01:05:58.940 --> 01:06:00.990
PROFESSOR 3: So there's
a trade off here.

01:06:00.990 --> 01:06:04.655
The more simulations you
do the more accurate--

01:06:04.655 --> 01:06:06.280
the more representative
your simulation

01:06:06.280 --> 01:06:08.120
will be at the end of the game.

01:06:10.720 --> 01:06:13.600
You could run two to
the whatever simulations

01:06:13.600 --> 01:06:16.390
to try to get every
single possible action

01:06:16.390 --> 01:06:17.700
and then take the max of that.

01:06:17.700 --> 01:06:19.450
And that would give
you the maximum value.

01:06:19.450 --> 01:06:20.791
That would be ideal.

01:06:20.791 --> 01:06:22.290
But obviously, that
takes more time.

01:06:22.290 --> 01:06:24.248
So there's a trade off
between computation time

01:06:24.248 --> 01:06:25.990
and the number of
simulations you run.

01:06:25.990 --> 01:06:27.820
And that's just something
that they probably just

01:06:27.820 --> 01:06:28.611
played around with.

01:06:28.611 --> 01:06:37.472
AUDIENCE: Do you
use [INAUDIBLE] have

01:06:37.472 --> 01:06:41.422
to finish the decision losing a
couple of minutes or 10 minutes

01:06:41.422 --> 01:06:43.380
or they're going to take
your [INAUDIBLE] away.

01:06:43.380 --> 01:06:46.340
PROFESSOR 3: In this competition
there are different computation

01:06:46.340 --> 01:06:48.480
time budgets that you get.

01:06:48.480 --> 01:06:51.140
And I believe the reason for
the different computation time

01:06:51.140 --> 01:06:53.091
budgets is the frames
per second of the game.

01:06:58.390 --> 01:07:00.380
I told you all
about the setup, we

01:07:00.380 --> 01:07:03.800
went over the scoring, the
nodes, what the edges are,

01:07:03.800 --> 01:07:05.935
what simulation
strategy is used.

01:07:05.935 --> 01:07:08.490
So you probably want
to see it in action.

01:07:08.490 --> 01:07:11.150
So this is always a risky move
trying to get video to play.

01:07:11.150 --> 01:07:12.830
AUDIENCE: It's actually
in the back up.

01:07:12.830 --> 01:07:14.000
Hit Escape.

01:07:14.000 --> 01:07:14.800
PROFESSOR 3: OK.

01:07:14.800 --> 01:07:15.660
Got it.

01:07:15.660 --> 01:07:17.075
AUDIENCE: And now, I guess, we--

01:07:17.075 --> 01:07:18.765
PROFESSOR 3: And
drag it over again?

01:07:18.765 --> 01:07:19.390
AUDIENCE: Yeah.

01:07:24.667 --> 01:07:26.250
PROFESSOR 3: Running
this full screen.

01:07:26.250 --> 01:07:28.760
AUDIENCE: Hit the [INAUDIBLE]

01:07:28.760 --> 01:07:33.330
PROFESSOR 3:
[INAUDIBLE] All right.

01:07:33.330 --> 01:07:35.887
Here's this MCTS-based
"Mario" playing controller.

01:07:35.887 --> 01:07:37.470
You can see he's
actually wrecking, so

01:07:37.470 --> 01:07:39.240
doing some serious damage here.

01:07:39.240 --> 01:07:42.840
But those lines that you
see, the reason they're

01:07:42.840 --> 01:07:46.616
different colors it's not
showing different players,

01:07:46.616 --> 01:07:47.532
or anything like that.

01:07:47.532 --> 01:07:49.157
It's just using
different colors so you

01:07:49.157 --> 01:07:52.065
can see the different
layers of this tree search.

01:07:52.065 --> 01:07:53.940
You can see he actually
went backwards there.

01:07:53.940 --> 01:07:55.520
And that's because
in a simulation,

01:07:55.520 --> 01:07:58.670
when one of the backward
ones landed on an enemy--

01:07:58.670 --> 01:08:01.710
and in fact gets you points
from our scoring system versus

01:08:01.710 --> 01:08:04.376
if you had just gone forward you
would have gotten some distance

01:08:04.376 --> 01:08:06.110
points but not--

01:08:06.110 --> 01:08:08.740
also, he is just [INAUDIBLE]

01:08:08.740 --> 01:08:12.754
The simulation is quickly being
able to figure out that he

01:08:12.754 --> 01:08:13.920
can jump on all his enemies.

01:08:13.920 --> 01:08:16.340
So he's just wrecking
all these guys.

01:08:16.340 --> 01:08:19.350
Getting lots of points here,
collecting the coin, et cetera.

01:08:19.350 --> 01:08:20.760
You get the idea.

01:08:20.760 --> 01:08:22.230
It's pretty awesome to watch.

01:08:22.230 --> 01:08:23.979
There's that flower
we were talking about.

01:08:23.979 --> 01:08:28.532
So now he's actually a
fire-spewing Mario demon.

01:08:28.532 --> 01:08:30.990
He's doing some serious
damage with that.

01:08:30.990 --> 01:08:31.979
Stepping on missiles.

01:08:31.979 --> 01:08:34.689
I didn't even know you
could step on the missiles.

01:08:34.689 --> 01:08:36.930
All right.

01:08:36.930 --> 01:08:38.370
You could watch
this for a while.

01:08:38.370 --> 01:08:41.599
But we'll exit now.

01:08:41.599 --> 01:08:44.430
It looks super
promising in this video.

01:08:44.430 --> 01:08:46.450
I don't know how
to close Mac stuff.

01:08:46.450 --> 01:08:48.590
AUDIENCE: Just click
on back [INAUDIBLE]

01:08:48.590 --> 01:08:50.300
PROFESSOR 3: There it is.

01:08:50.300 --> 01:08:51.819
OK.

01:08:51.819 --> 01:08:54.300
The demo looks really cool,
looks really promising.

01:08:54.300 --> 01:08:57.330
Let's take a look at the
charts here because we all

01:08:57.330 --> 01:09:00.240
want some quantitative stuff.

01:09:00.240 --> 01:09:01.286
This is the chart.

01:09:01.286 --> 01:09:02.410
The score is on the y-axis.

01:09:02.410 --> 01:09:05.060
The bottom is computation
budget, which is something

01:09:05.060 --> 01:09:06.439
that you were talking about.

01:09:06.439 --> 01:09:11.760
I just want to highlight
to make this a little more

01:09:11.760 --> 01:09:13.319
visually appealing here.

01:09:13.319 --> 01:09:16.870
All of these things
that I highlighted,

01:09:16.870 --> 01:09:17.939
it's labelled as UCT.

01:09:17.939 --> 01:09:20.387
That's Upper
Confidence Bound Tree.

01:09:20.387 --> 01:09:22.470
Remember, Yo talked about
upper confidence bounds.

01:09:22.470 --> 01:09:24.990
That's essentially
what's used in MCTS

01:09:24.990 --> 01:09:26.262
for guiding your tree search.
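
NOTE
For reference, a sketch of the usual upper-confidence-bound rule for picking a child: average score plus an exploration bonus.
The exploration constant c = sqrt(2) is a common textbook choice and an assumption here, not a value from the talk.
```python
import math
def uct_value(total_score, visits, parent_visits, c=math.sqrt(2)):
    """Average score plus an exploration bonus that shrinks with more visits."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return total_score / visits + c * math.sqrt(math.log(parent_visits) / visits)
def select_child(children, parent_visits):
    """children maps action -> (total_score, visits); pick the highest UCT value."""
    return max(children, key=lambda a: uct_value(*children[a], parent_visits))
children = {"jump": (9.0, 3), "right": (10.0, 5), "left": (0.0, 0)}
first = select_child(children, parent_visits=8)
```
When scores are not confined to the range zero to one, as with the Mario score, c would likely need rescaling to keep the bonus meaningful.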

01:09:26.262 --> 01:09:27.470
So these are all the methods.

01:09:27.470 --> 01:09:31.200
But then UCT multi, which is
this purple square, that's

01:09:31.200 --> 01:09:34.930
saying it's using MCTS but it's
doing the multiple simulations.

01:09:34.930 --> 01:09:41.090
And you can see this multi
plus carry is also in the top.

01:09:41.090 --> 01:09:43.510
Both these use the
multi-simulation technique.

01:09:43.510 --> 01:09:47.279
And then the plus carry is
they added an extra scoring

01:09:47.279 --> 01:09:49.800
mechanism for carries.

01:09:49.800 --> 01:09:52.670
I believe that's probably
like carrying a shell.

01:09:52.670 --> 01:09:54.715
That made it do better.

01:09:54.715 --> 01:09:56.340
Then these ones that
aren't highlighted

01:09:56.340 --> 01:10:01.130
are using plain A*, and then
a refined version of A*.

01:10:01.130 --> 01:10:03.630
With increasing time,
they do increase scores,

01:10:03.630 --> 01:10:07.950
but they're even worse
than just your UCT

01:10:07.950 --> 01:10:13.424
with just random simulation,
no multi-simulations.

01:10:13.424 --> 01:10:15.340
We're running low on
time, which is not ideal.

01:10:15.340 --> 01:10:19.810
But another thing that I want to
point out is down at the bottom

01:10:19.810 --> 01:10:23.830
here, these are the
multi-simulations.

01:10:23.830 --> 01:10:27.540
They have the lowest
maximal search depth, which

01:10:27.540 --> 01:10:30.846
at first would seem like, what?

01:10:30.846 --> 01:10:33.840
I have the lowest search depth
but my score is the most?

01:10:33.840 --> 01:10:35.730
But that comes
into play when you

01:10:35.730 --> 01:10:38.220
were saying about the trade
off between the simulations

01:10:38.220 --> 01:10:41.220
and the amount of time it takes.

01:10:41.220 --> 01:10:43.530
So because I'm doing
multiple simulations,

01:10:43.530 --> 01:10:46.110
I'm taking more
time at each node.

01:10:46.110 --> 01:10:49.770
But that's giving me a more
accurate value assessment.

01:10:49.770 --> 01:10:52.300
So that lets me choose
my actions more carefully,

01:10:52.300 --> 01:10:53.907
or with more information.

01:10:53.907 --> 01:10:56.240
And so that's what's able to
give me these better scores.

01:10:59.590 --> 01:11:00.400
That's all "Mario."

01:11:00.400 --> 01:11:02.010
So we're going to
move on to AlphaGo.

01:11:02.010 --> 01:11:04.590
Are there any questions about
"Mario" before I go to AlphaGo?

01:11:04.590 --> 01:11:05.090
Yeah.

01:11:05.090 --> 01:11:07.300
AUDIENCE: What's the table
[INAUDIBLE] inference?

01:11:07.300 --> 01:11:08.800
PROFESSOR 3: That's
a good question.

01:11:08.800 --> 01:11:12.420
I have a feeling it's because
if you're doing best of n,

01:11:12.420 --> 01:11:16.480
that's really heavily relying
on your scoring metrics.

01:11:19.360 --> 01:11:21.786
Let's say at one step
if I jump and collect

01:11:21.786 --> 01:11:23.660
a coin versus if I go
left or right and play,

01:11:23.660 --> 01:11:25.326
I'll get more points
if I get that coin.

01:11:25.326 --> 01:11:27.690
But maybe, a missile is
going to hit me in the face

01:11:27.690 --> 01:11:28.690
if I do that.

01:11:28.690 --> 01:11:30.940
It gets rid of
some of the-- it's

01:11:30.940 --> 01:11:32.670
forcing you to do certain moves.

01:11:32.670 --> 01:11:34.873
AUDIENCE: Is the A*
heuristic using the same

01:11:34.873 --> 01:11:37.718
value, the same value
that you're getting

01:11:37.718 --> 01:11:38.861
by your simulation?

01:11:38.861 --> 01:11:39.610
PROFESSOR 3: Yeah.

01:11:39.610 --> 01:11:42.820
I'm not exactly sure what
the A* heuristic is.

01:11:42.820 --> 01:11:48.520
The whole reason that A* is
difficult is because coming up

01:11:48.520 --> 01:11:51.900
with heuristics for
these types of games is hard.

01:11:51.900 --> 01:11:54.670
But this is not his
version of A*.

01:11:54.670 --> 01:11:57.190
I believe this is the
A* that was used by--

01:11:57.190 --> 01:12:00.890
I forget the name of the guy--
but he won the AI competition

01:12:00.890 --> 01:12:04.060
a couple of years ago.

01:12:04.060 --> 01:12:06.390
I'm going to try to
move onto AlphaGo.

01:12:06.390 --> 01:12:09.985
Does someone have how
many minutes I have left?

01:12:09.985 --> 01:12:10.610
AUDIENCE: Four.

01:12:10.610 --> 01:12:11.276
PROFESSOR 3: OK.

01:12:11.276 --> 01:12:13.005
We're going to power through.

01:12:13.005 --> 01:12:14.124
Here's AlphaGo.

01:12:14.124 --> 01:12:15.540
Hopefully, you all
know the rules.

01:12:15.540 --> 01:12:17.490
Just in case, I'll just
go through a quick--

01:12:17.490 --> 01:12:18.220
19 by 19.

01:12:18.220 --> 01:12:20.300
You alternate black
stones and white stones.

01:12:20.300 --> 01:12:23.480
You collect enemy stones by
completely surrounding them.

01:12:23.480 --> 01:12:25.740
You can surround a single
stone or groups of stones.

01:12:25.740 --> 01:12:28.365
And your score is your
territory plus the number

01:12:28.365 --> 01:12:29.115
of captured pieces.

01:12:29.115 --> 01:12:31.573
So your territory is just the
area that you're surrounding,

01:12:31.573 --> 01:12:34.506
and then you just add the
stones you've collected.

01:12:34.506 --> 01:12:35.880
The rules aren't
super important.

01:12:35.880 --> 01:12:39.399
The main emphasis is there's
very few rules so you

01:12:39.399 --> 01:12:40.690
would think it's really simple.

01:12:40.690 --> 01:12:43.105
But the complexity of the
game is quite extreme.

01:12:45.660 --> 01:12:50.290
At each turn you have about
250 options that you can play.

01:12:50.290 --> 01:12:52.440
Each Go game lasts
about 150 turns.

01:12:52.440 --> 01:12:54.750
So that gives you a total
of 10 to the 761 games,

01:12:54.750 --> 01:12:56.370
approximately.

01:12:56.370 --> 01:12:58.470
And to put that in
comparison, here's chess.

01:12:58.470 --> 01:12:59.820
You can read those numbers.

01:12:59.820 --> 01:13:01.400
Chess is also pretty complex.

01:13:01.400 --> 01:13:03.610
But there are 35
options per turn.

01:13:03.610 --> 01:13:06.432
Deep Blue.

01:13:06.432 --> 01:13:08.890
I think you were talking about
building out the whole tree.

01:13:08.890 --> 01:13:12.100
So Deep Blue would build
out the tree for six levels.

01:13:12.100 --> 01:13:14.545
And then use this
hard core chess

01:13:14.545 --> 01:13:17.745
master-inputted heuristic
evaluation that it

01:13:17.745 --> 01:13:20.230
used to find the best move.

01:13:20.230 --> 01:13:22.712
Except with Go, you
have 250 options,

01:13:22.712 --> 01:13:26.670
which already is adding
a lot more complexity.

01:13:26.670 --> 01:13:30.970
So that strategy won't
work quite as nicely.
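
To make the comparison concrete, here is a small arithmetic sketch. It assumes a uniform branching factor (35 for chess and 250 for Go, the figures quoted above) and the six-level lookahead mentioned for Deep Blue. The function name is illustrative only.

```python
def positions_at_depth(branching_factor: int, depth: int) -> int:
    """Number of leaf positions in a full tree with a uniform branching factor."""
    return branching_factor ** depth


# Figures quoted in the lecture: 35 options per turn in chess, about 250 in Go,
# and a tree built out for six levels.
chess_leaves = positions_at_depth(35, 6)
go_leaves = positions_at_depth(250, 6)

print(f"chess, depth 6: {chess_leaves:,} positions")
print(f"Go, depth 6:    {go_leaves:,} positions")
print(f"Go tree is roughly {go_leaves // chess_leaves:,} times larger")
```

Under these assumptions, the same six-level lookahead costs over a hundred thousand times more positions in Go than in chess.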

01:13:30.970 --> 01:13:31.870
What do we do?

01:13:31.870 --> 01:13:34.075
We use a modified
version of MCTS.

01:13:34.075 --> 01:13:35.210
Well, it's not what we do.

01:13:35.210 --> 01:13:39.220
That's what Google's
DeepMind team did with Go.

01:13:39.220 --> 01:13:41.900
They combined neural
networks with MCTS.

01:13:41.900 --> 01:13:45.430
Coincidentally, we learned about
neural networks last class.

01:13:45.430 --> 01:13:47.900
Probably not a coincidence.

01:13:47.900 --> 01:13:49.400
PROFESSOR 3: It's
not a coincidence.

01:13:49.400 --> 01:13:51.500
PROFESSOR 3: There were
two policy networks

01:13:51.500 --> 01:13:53.430
in AlphaGo, and
one value network.

01:13:53.430 --> 01:13:55.580
And another big
coincidence here,

01:13:55.580 --> 01:13:57.475
the two policy
networks are actually

01:13:57.475 --> 01:14:00.140
CNNs, which we learned
specifically about last class,

01:14:00.140 --> 01:14:01.390
convolutional neural nets.

01:14:01.390 --> 01:14:04.515
And the reason for
that is the input

01:14:04.515 --> 01:14:07.995
to the policy neural networks
is an image of the game.

01:14:07.995 --> 01:14:10.120
And remember, convolutional
neural nets work really

01:14:10.120 --> 01:14:12.170
well with images.

01:14:12.170 --> 01:14:15.520
What it outputs, though, is
a probability distribution

01:14:15.520 --> 01:14:16.770
over the legal moves.

01:14:16.770 --> 01:14:20.740
And the idea is, that if a
move has a higher probability

01:14:20.740 --> 01:14:23.830
it will be a more promising
move for you to take.

01:14:23.830 --> 01:14:27.195
But another key point is
that it's not deterministic.

01:14:27.195 --> 01:14:28.820
It's not telling you
to take this move.

01:14:28.820 --> 01:14:32.310
It's just assigning a higher
probability to this move.
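
A minimal sketch of what "a probability distribution over the legal moves" can look like in code. The raw scores stand in for a network's outputs, and the softmax normalization, the move names, and both function names are assumptions for illustration, not AlphaGo's actual code.

```python
import math
import random


def move_distribution(scores: dict[str, float], legal_moves: list[str]) -> dict[str, float]:
    """Softmax restricted to the legal moves, so illegal moves get no probability."""
    peak = max(scores[m] for m in legal_moves)  # subtract the max for numerical stability
    weights = {m: math.exp(scores[m] - peak) for m in legal_moves}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def sample_move(dist: dict[str, float], rng: random.Random) -> str:
    """Draw a move at random in proportion to its probability (not an argmax)."""
    moves = list(dist)
    return rng.choices(moves, weights=[dist[m] for m in moves], k=1)[0]


# Hypothetical raw scores for three board points; "C3" is occupied, so it is not legal.
scores = {"D4": 2.0, "Q16": 1.0, "C3": 5.0}
dist = move_distribution(scores, legal_moves=["D4", "Q16"])
print(dist)
print(sample_move(dist, random.Random(0)))
```

The more promising move gets the larger share of the probability, but sampling can still pick the other one, which is the non-deterministic behavior described above.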

01:14:32.310 --> 01:14:34.990
And this network was generated
by doing supervised learning

01:14:34.990 --> 01:14:39.390
on 30 million positions
from human expert games.

01:14:39.390 --> 01:14:42.640
Apparently, there's a giant
database of Go expert games.

01:14:42.640 --> 01:14:44.260
So that came in handy.

01:14:44.260 --> 01:14:46.870
And there were two
different networks trained.

01:14:46.870 --> 01:14:49.810
One of them was a slow policy,
the other was a fast policy.

01:14:49.810 --> 01:14:54.980
The slow was able to predict an
expert move with 57% accuracy,

01:14:54.980 --> 01:14:57.250
which to me was mind blowing.

01:14:57.250 --> 01:15:00.460
Using this neural
network, 57% of the time

01:15:00.460 --> 01:15:04.260
it could pinpoint where the
expert would place his move.

01:15:04.260 --> 01:15:05.780
That took 3,000 microseconds.

01:15:05.780 --> 01:15:08.995
Versus the fast policy, which
suffered a bit in the accuracy,

01:15:08.995 --> 01:15:11.259
but it's 1,500 times faster.

01:15:11.259 --> 01:15:12.675
And we'll see where
they used each

01:15:12.675 --> 01:15:15.580
of these different
policies later on.

01:15:15.580 --> 01:15:20.680
But it could predict the
expert move with 57% accuracy.

01:15:20.680 --> 01:15:22.472
The AlphaGo team said,
that's not our goal.

01:15:22.472 --> 01:15:24.138
We don't want to
predict an expert move.

01:15:24.138 --> 01:15:25.630
We want to predict
a winning move.

01:15:25.630 --> 01:15:28.180
And so to do that, they
took their policy network,

01:15:28.180 --> 01:15:30.170
and then they would use
reinforcement learning.

01:15:30.170 --> 01:15:32.950
That's where you play the
network against iterations

01:15:32.950 --> 01:15:35.830
of itself in order to hone
in on a better policy that's

01:15:35.830 --> 01:15:39.560
geared towards winning moves.

01:15:39.560 --> 01:15:42.889
Then they tested this
against Pachi, which uses--

01:15:42.889 --> 01:15:44.555
for the camera, I
have no idea if that's

01:15:44.555 --> 01:15:45.340
how you pronounce Pachi.

01:15:45.340 --> 01:15:46.295
It might be Patchey.

01:15:46.295 --> 01:15:47.320
I'm not sure.

01:15:47.320 --> 01:15:52.330
But there's 100,000 MCTS
simulations at each turn.

01:15:52.330 --> 01:15:55.420
So this is purely MCTS.

01:15:55.420 --> 01:15:59.842
If it were playing just
the AlphaGo policy network,

01:15:59.842 --> 01:16:03.700
the policy network
won 85% of the games.

01:16:03.700 --> 01:16:06.780
So without any sort of tree
search or anything involved,

01:16:06.780 --> 01:16:08.680
it won 85%, which
is pretty great.

01:16:08.680 --> 01:16:11.810
And that suggests that
maybe intuition wins

01:16:11.810 --> 01:16:13.880
over long reflections in Go.

01:16:13.880 --> 01:16:16.535
And interestingly, if you
talk to expert Go players

01:16:16.535 --> 01:16:19.340
and you ask them why they did a
certain move, they'll just say,

01:16:19.340 --> 01:16:22.840
It felt good, or I
had a hunch about this.

01:16:22.840 --> 01:16:26.660
That's indicative there.

01:16:26.660 --> 01:16:28.330
Hopefully, I'm not
going over time.

01:16:28.330 --> 01:16:29.970
Sorry.

01:16:29.970 --> 01:16:31.505
Those are the two
policy networks.

01:16:31.505 --> 01:16:32.713
There's also a value network.

01:16:32.713 --> 01:16:35.450
What the value network does
is it takes in a board,

01:16:35.450 --> 01:16:40.140
and it'll give you a value,
like how good is this board?

01:16:40.140 --> 01:16:42.260
It'll give you a win
probability number.

01:16:42.260 --> 01:16:45.592
So 77%, it would say,
77% of the time you

01:16:45.592 --> 01:16:47.362
should win from the board.

01:16:47.362 --> 01:16:49.820
That's similar to the evaluation
that comes from Deep Blue.

01:16:49.820 --> 01:16:53.570
But rather than a Go master
coming in and telling you,

01:16:53.570 --> 01:16:55.730
well, if these are
connected in this way,

01:16:55.730 --> 01:16:57.730
and down here we have
this certain thing

01:16:57.730 --> 01:17:00.060
then here's the score
we should expect,

01:17:00.060 --> 01:17:01.579
in chess, they
had chess masters,

01:17:01.579 --> 01:17:03.620
like if the knight is here
and the queen is here,

01:17:03.620 --> 01:17:04.840
all these specific things.

01:17:04.840 --> 01:17:07.649
This was actually learned from
the reinforcement learning that

01:17:07.649 --> 01:17:09.440
was happening when the
policy networks were

01:17:09.440 --> 01:17:10.240
playing each other.

01:17:10.240 --> 01:17:12.890
The value network was
learning about those positions

01:17:12.890 --> 01:17:14.294
during that time.

01:17:14.294 --> 01:17:16.210
And the predictions get
better towards the end

01:17:16.210 --> 01:17:21.590
of the game, which I think
Yo mentioned in his talk.

01:17:21.590 --> 01:17:23.651
So how do you combine
all these into MCTS?

01:17:23.651 --> 01:17:25.359
The slow policy network,
if you remember,

01:17:25.359 --> 01:17:27.830
is slower but should
give us stronger moves.

01:17:27.830 --> 01:17:29.780
It is used to guide our
tree search in order

01:17:29.780 --> 01:17:33.322
to help us decide which
nodes to expand next.
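
One way to picture the policy network guiding the search is a selection rule that adds a prior-weighted exploration bonus to each move's average value. The lecture does not give AlphaGo's exact formula, so the PUCT-style bonus, the constant `c_explore`, and the `Edge` class below are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float               # probability the (slow) policy network gave this move
    visits: int = 0            # how many simulations have gone through this move
    total_value: float = 0.0   # sum of the values backed up through this move

    @property
    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def select_move(edges: dict[str, Edge], c_explore: float = 1.0) -> str:
    """Pick the move with the best mean value plus prior-weighted exploration bonus."""
    parent_visits = sum(e.visits for e in edges.values())

    def score(edge: Edge) -> float:
        # The bonus is large for high-prior moves and shrinks as a move gets visited.
        bonus = c_explore * edge.prior * math.sqrt(parent_visits + 1) / (1 + edge.visits)
        return edge.mean_value + bonus

    return max(edges, key=lambda move: score(edges[move]))


# Before any simulations, the search follows the policy network's favorite.
fresh = {"D4": Edge(prior=0.7), "Q16": Edge(prior=0.3)}
print(select_move(fresh))

# After many disappointing visits to D4, the less-visited move gets its turn.
later = {"D4": Edge(prior=0.7, visits=100, total_value=10.0), "Q16": Edge(prior=0.3)}
print(select_move(later))
```

The prior decides where the search looks first, while the backed-up values decide where it keeps looking.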

01:17:33.322 --> 01:17:35.840
When we expand that
node to get the value,

01:17:35.840 --> 01:17:38.390
the value of the state is
the simulation, like before,

01:17:38.390 --> 01:17:41.000
like normal MCTS,
except it's not

01:17:41.000 --> 01:17:42.560
a completely random simulation.

01:17:42.560 --> 01:17:45.200
We use our fast policy network
to give us a more educated

01:17:45.200 --> 01:17:46.117
simulation here.

01:17:46.117 --> 01:17:47.700
But we're using a
fast one, obviously,

01:17:47.700 --> 01:17:49.990
to save some computation time.

01:17:49.990 --> 01:17:53.810
It's giving us probably a more
indicative random simulation

01:17:53.810 --> 01:17:55.920
of what's going to
actually happen.

01:17:55.920 --> 01:17:58.830
And then we also combine that
with our value network output.

01:17:58.830 --> 01:18:01.070
So we run our value network
on this node, as well.

01:18:01.070 --> 01:18:02.695
And we add that to
our simulation value

01:18:02.695 --> 01:18:03.800
and we propagate it.

01:18:03.800 --> 01:18:06.440
Interestingly, the
AlphaGo team tested out

01:18:06.440 --> 01:18:09.140
just using the fast
policy simulation value

01:18:09.140 --> 01:18:11.180
and scrapping the value network.

01:18:11.180 --> 01:18:13.160
And they also just
used the value network

01:18:13.160 --> 01:18:14.720
and scrapped the
simulation value.

01:18:14.720 --> 01:18:17.030
And those both performed
worse than if it had both.

01:18:17.030 --> 01:18:19.625
And another added
interesting point here,

01:18:19.625 --> 01:18:22.630
is that these two
factors in our value

01:18:22.630 --> 01:18:24.130
have about the same weight.

01:18:24.130 --> 01:18:27.970
They were both about
equally important.
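
A rough sketch of combining the two estimates at a leaf and propagating the result. The weighted average with an equal mix of 0.5 reflects the "about the same weight" remark above, but the exact combination rule, the function names, and the dictionary layout are assumptions. The sketch also ignores switching perspective between the two players.

```python
def leaf_value(value_net_estimate: float, rollout_outcome: float, mix: float = 0.5) -> float:
    """Blend the value network's estimate with the fast-policy rollout result."""
    return (1.0 - mix) * value_net_estimate + mix * rollout_outcome


def backup(path: list[dict], value: float) -> None:
    """Propagate the leaf value to every node visited on the way down."""
    for stats in path:
        stats["visits"] += 1
        stats["total"] += value


# Hypothetical numbers: the value network says 77% win chance, the rollout was a win (1.0).
value = leaf_value(0.77, 1.0)
root = {"visits": 0, "total": 0.0}
child = {"visits": 0, "total": 0.0}
backup([root, child], value)
print(value, root, child)
```

Setting `mix` to 0.0 or 1.0 gives the value-network-only and rollout-only variants that the team tested.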

01:18:27.970 --> 01:18:29.635
I think I'll get
into that later.

01:18:29.635 --> 01:18:30.410
But first--

01:18:30.410 --> 01:18:32.160
AUDIENCE: Can I just
ask a quick question?

01:18:32.160 --> 01:18:33.980
PROFESSOR 3: Yeah.

01:18:33.980 --> 01:18:36.230
AUDIENCE: So when you said
the policy network is used,

01:18:36.230 --> 01:18:38.240
is that used when you're
navigating through the tree

01:18:38.240 --> 01:18:40.350
to get to a leaf,
or is policy network

01:18:40.350 --> 01:18:43.470
being used to do the
simulation once you're

01:18:43.470 --> 01:18:46.340
at the leaf, or both?

01:18:46.340 --> 01:18:49.177
PROFESSOR 3: The slow policy
is done for this part.

01:18:49.177 --> 01:18:51.176
Then the fast policy is
used for the simulation.

01:18:51.176 --> 01:18:54.455
Because the slow policy does
take 1,500 faster than--

01:18:54.455 --> 01:18:58.052
or the slow takes 1,500 times
longer than the fast policy.

01:18:58.052 --> 01:19:00.010
You don't want to use
that in your simulations.

01:19:00.010 --> 01:19:01.927
That would just
take way too long.

01:19:01.927 --> 01:19:03.510
It's basically just
a way of making it

01:19:03.510 --> 01:19:05.260
so our simulation isn't
completely random.

01:19:05.260 --> 01:19:06.755
It has some educated moves.

01:19:09.490 --> 01:19:11.392
Why use policy and
value network synergy?

01:19:11.392 --> 01:19:13.100
Why can't we just use
the policy network?

01:19:13.100 --> 01:19:15.070
Why can't we just use
the value network?

01:19:15.070 --> 01:19:16.920
If we have the
value network alone,

01:19:16.920 --> 01:19:18.572
we'll actually--
here's a side point.

01:19:18.572 --> 01:19:20.030
Remember, the value
network learned

01:19:20.030 --> 01:19:21.320
from the policy network.

01:19:21.320 --> 01:19:23.140
And then also, later
on, the policy network

01:19:23.140 --> 01:19:26.070
is improved by our values.

01:19:26.070 --> 01:19:27.400
They work hand-in-hand.

01:19:27.400 --> 01:19:29.114
But if we had the
value network alone,

01:19:29.114 --> 01:19:30.780
when we're deciding
on it the next move,

01:19:30.780 --> 01:19:33.113
we're going to have to evaluate
every single move, which

01:19:33.113 --> 01:19:34.510
would take forever.

01:19:34.510 --> 01:19:36.010
And so, what the
policy network does

01:19:36.010 --> 01:19:41.040
is predict the best move
with a probability distribution.

01:19:41.040 --> 01:19:43.110
And it narrows our search space.

01:19:43.110 --> 01:19:45.010
And then, if we had the
policy network alone,

01:19:45.010 --> 01:19:48.366
we'd be unable to compare nodes
in different parts of our tree.

01:19:48.366 --> 01:19:50.450
The policy network
is able to tell us

01:19:50.450 --> 01:19:52.370
a distribution over
which move we should

01:19:52.370 --> 01:19:54.230
take from a certain node.

01:19:54.230 --> 01:19:57.350
But then, if I ask it if
I'm in a better position

01:19:57.350 --> 01:19:59.724
here than in some other
place, it won't know.

01:19:59.724 --> 01:20:01.390
That's where the value
network comes in.

01:20:01.390 --> 01:20:06.140
It will give us an estimated
number of the value assigned

01:20:06.140 --> 01:20:08.570
and an evaluation
of that node.

01:20:08.570 --> 01:20:10.760
And then these
values are later used

01:20:10.760 --> 01:20:12.860
to direct our tree
searches based

01:20:12.860 --> 01:20:16.360
on updating the policy
once it realizes,

01:20:16.360 --> 01:20:19.470
oh, I thought this would be
a good path but the value is

01:20:19.470 --> 01:20:23.240
this, so update all that.

01:20:23.240 --> 01:20:25.440
Then why do we combine
neural networks with MCTS?

01:20:25.440 --> 01:20:27.500
Remember, the
policy network alone

01:20:27.500 --> 01:20:31.000
played against Pachi,
which was purely MCTS,

01:20:31.000 --> 01:20:33.000
and it did pretty well.

01:20:33.000 --> 01:20:37.220
So how does MCTS improve
our policy network?

01:20:37.220 --> 01:20:42.055
Remember, MCTS did win
15% of those games.

01:20:42.055 --> 01:20:44.900
So already, that makes you
think there's something there

01:20:44.900 --> 01:20:47.145
that maybe the policy
network is missing.

01:20:47.145 --> 01:20:49.220
Also, the policy network
is just a prediction.

01:20:49.220 --> 01:20:51.410
So by using this
tree structure, we're

01:20:51.410 --> 01:20:57.730
able to use these Monte Carlo
rollouts to adjust our policy

01:20:57.730 --> 01:21:01.520
to move towards nodes that are
actually evaluated to be good.

01:21:01.520 --> 01:21:03.960
And then, how do neural
networks improve MCTS?

01:21:03.960 --> 01:21:06.280
The point should
probably be clear by now.

01:21:06.280 --> 01:21:09.930
We're able to more intelligently
lead our tree exploration.

01:21:09.930 --> 01:21:13.420
Our simulations are more
reflective of actual games.

01:21:13.420 --> 01:21:17.530
And the value network
and our simulation value

01:21:17.530 --> 01:21:21.400
are complementary, which
I've mentioned before.

01:21:21.400 --> 01:21:25.150
And just to highlight that,
basically, the value network

01:21:25.150 --> 01:21:27.910
is going to give us a
value that is reflective

01:21:27.910 --> 01:21:30.680
as if we've played the
slow policy the whole time.

01:21:30.680 --> 01:21:35.170
And the simulation is if
we used a faster policy.

01:21:35.170 --> 01:21:38.070
So they are complementary.

01:21:38.070 --> 01:21:39.710
And I know I'm over time.

01:21:39.710 --> 01:21:44.390
So I just wanted to skim
through the stats real quick.

01:21:44.390 --> 01:21:47.395
Distributed AlphaGo
won 77% of the games

01:21:47.395 --> 01:21:49.039
against regular AlphaGo.

01:21:49.039 --> 01:21:51.080
So it's the only thing
that beat regular AlphaGo.

01:21:51.080 --> 01:21:54.250
And then distributed AlphaGo
won 100% of the games

01:21:54.250 --> 01:21:55.130
against all these.

01:21:55.130 --> 01:21:57.720
In a rematch against Pachi,
now that we've added MCTS

01:21:57.720 --> 01:21:59.886
to our policy network and
we have our value network,

01:21:59.886 --> 01:22:03.170
we slaughtered Pachi 100%.

01:22:03.170 --> 01:22:05.460
Then we decided to see how
we fare against humans.

01:22:05.460 --> 01:22:08.540
And by we, I mean not
me, I mean Google.

01:22:08.540 --> 01:22:11.190
And they won 4 to 1.

01:22:11.190 --> 01:22:14.680
And Lee Sedol's rating was 3,520.

01:22:14.680 --> 01:22:17.880
Now AlphaGo's rating is
estimated to be about 3,586.

01:22:17.880 --> 01:22:19.960
So you're like, whoo,
we beat the best dude.

01:22:19.960 --> 01:22:22.180
Except we didn't because
there's another dude

01:22:22.180 --> 01:22:31.320
who has an even higher
score, apparently, 3,621.

01:22:31.320 --> 01:22:32.970
This should be the last part.

01:22:32.970 --> 01:22:34.750
Here's this timeline.

01:22:34.750 --> 01:22:39.410
Basically, tic-tac-toe,
checkers were conquered in '50.

01:22:39.410 --> 01:22:42.155
About 40 years later, we
conquered checkers, chess.

01:22:42.155 --> 01:22:45.800
Then we scroll down
to 2015, is when

01:22:45.800 --> 01:22:48.065
AlphaGo was able to
beat Fan Hui, who

01:22:48.065 --> 01:22:51.340
was a two-dan player, which
is considered lower down

01:22:51.340 --> 01:22:54.425
in the tier of professional Go.

01:22:54.425 --> 01:22:56.470
But then, Lee Sedol
was a nine-dan player.

01:22:56.470 --> 01:23:00.187
And AlphaGo was able to beat
him literally last month.

01:23:00.187 --> 01:23:01.520
PROFESSOR WILLIAMS: So good job.

01:23:01.520 --> 01:23:02.520
PROFESSOR 3: We're done.

01:23:02.520 --> 01:23:03.800
[APPLAUSE]