The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, welcome. Thank you so much for coming. I'm Jeremy Kepner. I'm a fellow at Lincoln Laboratory. I lead the Supercomputing Center there, which means I have the privilege of working every day with pretty much everyone at MIT. I think I have the best job at MIT because I get to help you all pursue your research dreams. And as a result of that, I get an opportunity to see what a really wide range of folks are doing and observe patterns between what different folks are doing.

So with that, I'll get started. This is meant to be some initial motivational material: why you should be interested in learning about this mathematics, the mathematics of big data, and how it relates to machine learning and other really exciting topics. It is a math course. We will be going over a fair amount of math. But we really work hard to make it very accessible to people.

So we start out with a really elementary mathematical concept here, probably one that hopefully most of you are familiar with. It's the basic concept of a circle, right? And I bring that up because many of us know many ways to state this mathematically, right? It's all the points that are an equal distance from a particular point. There are other ways to describe it. But this is the basic mathematical concept of a circle that many of us have grown up with.

But, of course, the other thing we know is that, right, this is the big idea. Although I can write down an equation for a circle, which is the equation for a perfect, ideal circle, we know that such things don't actually exist in nature. There is no true perfect circle in nature. Even this circle that we've drawn here, it has pixels.
If I zoomed in on it enough, it wouldn't look like a circle at all. It would look like a series of blocks. And so consider that approximation process, where we have a mathematical concept of an ideal circle, but we know that such circles don't really exist in nature. We understand that it is worthwhile to think about these mathematical ideals, manipulate them, and then take the results of the manipulation back into the real world. That's a really productive way to think about things and, really, the basis for a lot of what we do here at MIT.

This concept is essentially the basis of modern, or ancient, Western thought on mathematics. If you remember your history courses, this concept of ideal shapes and ideal circles is the foundation of Platonic mathematics some 2,500 years ago. And at the time that they were developing that concept, this idea that there are ideal shapes out there and that thinking about them and manipulating them was a more effective way to reason about the real world, there was a lot of skepticism. You could imagine 2,500 years ago someone walking around and saying, I believe there are these things called ideal circles and ideal squares and ideal shapes, but they don't actually exist in nature. That would probably not be well received. In fact, it was not well received. Many of those philosophers who were thinking about this were very negatively received. And, in fact, if you want to learn about how negative the response was to this, I encourage you to go and read the Allegory of the Cave, which is essentially the story of these philosophers talking about how they're trying to bring the light of this knowledge to the broader world and how they essentially get killed because of it, because people don't want to see it.

So that struggle they experienced 2,500 years ago exists today. You as people at MIT will try to bring mathematical concepts into environments where people are like, I don't see why that's relevant.
And you will experience negative inputs. But you should rest assured that this is a good bet. It's worked well for thousands of years. You know, it's what I base my career on. People ask me, well, what's the basis of it? Well, I'm just betting on math here. It's been a good tool.

So this is why we're beginning to think this way when we talk about big data and machine learning: really looking at the fundamentals, asking what are the ideals that we need in order to effectively reason about the problems that we're facing today in the virtual world. And the fact that this mathematical concept described the natural world so well and also describes the virtual world is sometimes called the unreasonable effectiveness of mathematics. You can look that up. But people talk about math: why does it do such a good job of describing so many things? And people say, well, they don't really know. But it seems to be a good bit of luck that it happens that way.

So circles, that gets us a certain way. But in most of the fields that we work with, and I would say in almost any introductory course that you take in college, whatever the discipline is, whether it be chemistry or mechanical engineering or electrical engineering or physics or biology, the basic fundamental theoretical idea that they will introduce to you is the concept of a linear model.

So there we have a linear model, right? And why do we like linear models? And again, it can be physics. It can be as simple as F = ma. Or, in chemistry, it can be some kind of chemical rate equation. Or, in mechanical engineering, it can be basic concepts of friction. The reason we like these basic linear models is because we can project, right? If that solid line represents what I believe to be-- you know, if I have evidence to support that that is correct, then I feel pretty good about projecting maybe where I don't have data or into a new domain. So linear models allow us to do this reasoning.
And that's why in the first few weeks of almost any introductory course they begin with these linear models, because they have proven to be so effective. Now, there are many non-linear phenomena that are tremendously important, OK? And as a person who deals with large-scale computation, those are a staple of what people do. But in order to do non-linear calculations, or reason about things non-linearly, it usually requires a much more complicated analysis and much more computation, much more data. And so our ability to extrapolate is very limited, OK? It's very limited.

So here I am talking about the benefits of thinking mathematically, talking about linearity. What does this have to do with big data and machine learning? So we would like to be able to do the same things that we've been able to do in other fields in this new emerging field of big data. And this often deals with data that doesn't look like the traditional measurements we see in science. This can be data that has to do with words or images, pictures of people, other types of things that don't feel like the kinds of data that we traditionally deal with in science and engineering. But we know we want to use linear models. So how are we going to do that? How can we take this concept of linearity, which has been so powerful across so many disciplines, and bring it to this field that just feels completely different from the kinds of data that we have?

So to begin with, I need to refresh for you what it really means to be linear. Before, I showed you a line and, hence, the line: linear. But mathematically, linearity means something much deeper. And so here's an equation that you may have first seen in elementary school. We basically have 2 × (3 + 4) = 2 × 3 + 2 × 4. That is called the distributive property. It basically says multiplication distributes over addition.
And this is the fundamental reason why I would say mathematics works in our world, right? If this wasn't true very early on, in the earliest days of inventing mathematics, it would not have been very useful, right? To say that I have two of three plus four of something, and then I can change it and do it in this other way, that's really what makes mathematics useful. And from a deeper perspective, the distributive property is basically what makes math linear. If this property holds, then we can reason about a system linearly.

Now, you're very familiar with this type of mathematics, but there are other types of mathematics. So if you'll allow me, hopefully you will let me just replace those multiplication symbols and addition symbols with this funny circle-times (⊗) and circle-plus (⊕). And we'll get to why I'm going to do that. Because it turns out that, while you have spent most of your careers with traditional arithmetic multiplication and addition, the kind you would do on your calculator or have done in elementary school, there are other pairs of operations that also obey this distributive property and, therefore, allow us to potentially build linear models of very different types of data using this property.

So, as I mentioned, the classic pair: circle-plus is just equal to regular arithmetic addition, as we show on the first line, and circle-times is equal to regular arithmetic multiplication. So those are the standard ones. And, by far, this is the most common pair that we use across the world today. But there are others. So, for instance, I can replace the plus operation with max and the multiplication operation with addition, OK? And the above distributive equation will still hold, right? That's a little confusing. I often get confused that multiplication is now addition.
But this pair, sometimes referred to as max-plus-- you'll sometimes hear about it as the max-plus algebra-- is actually very important in machine learning and neural networks. This is actually the back end of the rectified linear unit; it is essentially this operation. If you didn't understand what that meant, that's OK. We'll get to that later. It's very important in finance. There are certain finance operations that rely on this type of mathematics.

There are other pairs, also. So here's one. I can replace addition with union and multiplication with intersection, right? Now, that also obeys that linear property. This is essentially the pair of operations that, anytime you make a transaction and work with what's called a relational database, that's the mathematical operation pair sitting inside it. It's why those databases work. It allows us to reason about queries, which are just a series of intersections and unions, and then reorder them. In databases, this is called query planning. And if that property wasn't true, we wouldn't be able to do that. So this is a deep property.

So we can put all different types of pairs in here and reason about them linearly. And this is why many, many of the systems we use today work. And so this class is about really exposing that: the mathematics that allows us to think linearly about data that we haven't really thought of as obeying some kind of linear model. This is essentially the critical point of this class.

So it goes beyond that, though. So hopefully you'll allow me to replace those numbers with letters, right? So that's basic algebra there: A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C). Just for a refresher, in the previous equation we had A = 2, B = 3, C = 4. But we're not limited to these variables, or these letters, being just simple scalar numbers, in this case, real numbers or integers or something like that. They can be other things, too.
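[To make that concrete, here is a minimal sketch, not from the lecture slides, that numerically checks the distributive property for each of the operator pairs just mentioned. The helper name `distributes` is purely illustrative.]

```python
# Check a (x) (b (+) c) == (a (x) b) (+) (a (x) c) for several pairs.

def distributes(otimes, oplus, a, b, c):
    """Return True if otimes distributes over oplus for these values."""
    return otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c))

a, b, c = 2, 3, 4

# Standard arithmetic: oplus = +, otimes = *
print(distributes(lambda x, y: x * y, lambda x, y: x + y, a, b, c))  # True

# Max-plus algebra: oplus = max, otimes = +
print(distributes(lambda x, y: x + y, max, a, b, c))                 # True

# Sets: oplus = union, otimes = intersection
A, B, C = {1, 2}, {2, 3}, {3, 4}
print(distributes(lambda x, y: x & y, lambda x, y: x | y, A, B, C))  # True
```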
So, for instance, A, B, and C could be spreadsheets. And that's something we'll go over extensively in the class, so that I can basically have A, B, and C be whole spreadsheets of data and the linear equation will still hold. And, in fact, that's probably the key concept in big data: the necessity to reason about data as whole collections and to transform whole collections. Going and looking at things one element at a time is essentially the thing that is extremely difficult to do when you have large amounts of data.

A, B, and C can be database tables, right? Those don't differ too much from spreadsheets. And as I told you on the previous slide, that union/intersection pair naturally lines up, and we can reason about whole tables in a database using linear properties.

They can be matrices. I think, for those of you who have had linear algebra and matrix mathematics, that would have been the first example, right, when I substituted the A, B, and C and had these linear equations. Often, in many of the sciences, we think about matrix operations and linearity as being coupled together. And through the duality between matrices and graphs and networks, we can represent graphs and networks through matrices. Any time you work with a neural network, you're representing that network as a matrix. And, of course, all these equations apply there as well, and you can reason about those systems linearly.

So that provides a little motivation there. As we like to say, enough about me, let me tell you about my book. So this will be the text that we will use in the class. We are not going to go through the full text, but we have printed out copies of the first seven chapters that we will go through. And we will hand those out later when you do the class.
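[Coming back to the whole-collection idea for a moment: here is a tiny illustrative sketch, assuming nothing beyond the Python standard library, of treating entire key-value tables (a stand-in for spreadsheets or associative arrays) as single values that can be combined in one operation. The `combine` helper and the sample data are invented for the example.]

```python
# Treat whole key-value tables as single algebraic objects.
# Keys act like row labels; missing keys behave like empty entries.

def combine(A, B, op):
    """Apply op over the union of keys, passing through absent entries."""
    out = {}
    for k in A.keys() | B.keys():
        if k in A and k in B:
            out[k] = op(A[k], B[k])
        else:
            out[k] = A.get(k, B.get(k))
    return out

sales_jan = {"apple": 3, "banana": 5}
sales_feb = {"banana": 2, "cherry": 7}

# "Add" two whole spreadsheets at once instead of row by row.
totals = combine(sales_jan, sales_feb, lambda x, y: x + y)
print(totals)  # {'apple': 3, 'banana': 7, 'cherry': 7}
```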
So let me now switch gears a little bit and talk about how this relates to, I think, one of the most wonderful breakthroughs that we have seen, or I've seen in my career, and many of my colleagues here at MIT have seen, which is what's been going on in machine learning. It's not hype. There's a real there there, and it's tremendously exciting.

So let me give you a little history, a basic history of this field. So in a certain sense, before 2010, machine learning looked like this. And then, after 2015, it kind of looks like this. So when people talk about the hype in machine learning, or AI, really, deep neural networks are the elephant inside the machine learning snake. It has stormed onto the scene in the last five years and basically allowed us to do things that we had almost taken for granted were impossible. Just the fact that you're able to talk to computers and they can understand you, that we can have computers that can see, at least in a way that approximates the way humans do, these are really almost technological miracles that, for those of us who have been working on this field for fifty years, we had almost literally given up on. And then all of a sudden it became possible.

So let me give you a little sense of appreciation for this field and its roots. So machine learning, like any field, is defined as a set of techniques and problems. When you ask what defines a field, you ask, well, what are the problems that they work on that other fields don't really work on? And what are the techniques they employ that are not really being employed by other fields? So the core techniques, as I mentioned earlier, are these neural networks. These are meant to crudely approximate maybe the way humans think about problems. We have these circles, which are neurons. They have connections to other neurons. You know, those connections have different weights associated with them. As information comes in, they get multiplied by those weights.
They get summed together. And if they pass certain thresholds or criteria, then they send a signal on to another neuron. And this is, to a certain degree, how we believe the human brain works, and it is a natural starting point for asking, how could we make computers do similar things?

The big problems that people have worked on are these classic problems in machine learning: language, how do we make computers understand human language; vision, how do we make computers see pictures or explain pictures back to us the way we would like; and strategy and games and other types of things like that. So how do we get them to solve problems?

This is not new. These core concepts trace back to the earliest days of the field. In fact, these four figures here, each one is taken from a paper that was presented at the very first machine learning conference in the mid-1950s. So there was a machine learning conference in the mid-1950s. It was in Los Angeles. It had four papers presented. These were the four papers. And I will say that three of them were done by folks at MIT Lincoln Laboratory, which is where I work. And so that was basically the neural networks of language and vision. And we didn't play games, so that was it.

And you might say, well, why is that? Why was there so much work going on at Lincoln Laboratory in the mid-1950s that they would want to pioneer in these directions? At that time, people were first building computers, and computers were very special purpose. So different organizations around the world were building computers to do different things. Some were doing them to simulate complex fluid dynamics systems-- think about designing ships or airplanes or other types of things like that. Others were doing them to, say, like what Alan Turing was doing, break codes. And our task was to help people who were watching radar scopes make decisions, right?
How could computers enable humans to watch more sensors and see where they're going? How could we do that? So at Lincoln Laboratory, we were building special purpose computers to do this. And we built the first large computer with reliable, fast memory. This system had 4,096 bytes of memory, which, at the time, people thought was too much. What could you possibly do with 4,096 numbers? The human brain, of course! Right, that's enough, right? Most of us can remember five, six, seven digits, right? So a computer that can remember 4,096 numbers should be able to do things like language and vision and strategy. So why not? So they went out and they started working on these problems, OK?

But Lincoln Laboratory, being an applied research laboratory, we are required to get answers to our sponsors in a few years' time frame. If problems are going to take longer than that, then they really are the purview of the basic research community, universities. And it became apparent pretty early on that this problem was going to be more difficult. It was not going to be solved right away.

So we did what we often do: we partnered. We found some bright young people at MIT, people just like yourselves. In this case, we found a young professor named Marvin Minsky. And we said, why don't you go and get some of your friends together and create a meeting where you can lay out what the fundamental challenges of this field are? And then we will figure out how to get that funded so that you can go and do that research. And that was the famous Dartmouth AI conference, which kicked off the field. And the person leading this group, Oliver Selfridge at Lincoln Laboratory, basically arranged for that conference to happen and then subsequently arranged for what would become the MIT AI Lab, which was founded by Professor Minsky. And likewise, Professor Selfridge also realized that we would need more computing power.
So he left Lincoln Laboratory and formed what was called Project MAC, which became the Laboratory for Computer Science. And those two entities merged 30 years later to become CSAIL. So that was the initial thing.

Now, it was pretty clear that, when this problem was handed off to the basic research community, there was a feeling that these problems would be solved in about a decade. So we were really thinking the mid-1960s is when these problems would be solved. So it's like giving someone an assignment, right? You all are given assignments by professors, and they give you a week to do it. But it took a little longer. In this case, it took five weeks-- or, rather, five decades-- to solve this problem. But we have. We have now really, using those techniques, made tremendous progress on those problems.

But we don't know why it works. We've made this tremendous progress, but we don't really understand why. So let me show you a little bit of what we have learned, and this course will explore the deeper mathematics to help us gain insight. We still don't know why it works, but at least we can lay the foundations, and maybe you can figure it out. So here I am, fifty years later, a person from Lincoln Laboratory saying, "All right. Question one has been answered. Here's question two: why does this work?" And hopefully you can be the generation that figures it out. Hopefully it'll take less than fifty years. Historically, once we know that something works, it usually takes about twenty years to figure out why. But maybe, you know, some of you are smarter and you'll figure it out faster.

So this is what a neural network looks like. On the left you have your input, in this case a vector, y_0. It's just these dots that are called features. What is a feature?
Anything can be a feature. That is the power of neural networks: they don't require you to state a priori what the inputs can be. They can be anything. People have said, well, you know, neural networks, machine learning, it's just curve fitting. Yeah, but it's curve fitting without domain knowledge. Because domain knowledge is so costly and expensive to create that having a general system that can do this is really what's so powerful.

So the inputs: we have an input feature vector, which we call y_0. And that can just be an image, right, the canonical thing being an image of a cat, right? And that can just be the pixel values rolled out into a vector, and they will be the inputs. And then we have a series of layers. These are called hidden layers. The circles are often referred to as neurons, OK? And each line connecting each dot has a value associated with it, a weight. And the strength of the connection between any two neurons is given by that weight.

And then, ultimately, the output, in this case, the output classification, the series of blue dots there, are the different possible categories. So if I put in a cat picture, one of those dots would be cat, maybe one would be dog, maybe one would be apple or orange, whatever I desired. And the whole idea is that, if I put in a picture of a cat and I set all these values correctly, then the dot corresponding to cat will end up with the highest score, right?

And then, I mentioned earlier that each one of these neurons collects inputs. And if it's above a certain threshold, it then chooses to pass on information to the next. And that's where these b values come in. They are vectors, just the thresholds associated with each layer: one value for each of those neurons.
This entire system can be represented relatively simply with one equation: y_(i+1) = h(W_i y_i + b_i). That is, y_(i+1), the vector at the next layer, can be computed from the previous vector, y_i, matrix multiplied by the weights, W_i. So whenever you see transformations from one set of neurons to the next layer, you should think, oh, I have a matrix that represents all those weights, and I'm going to multiply it by the vector to get the next one. Then we apply these thresholds, all right? So we add these b_i's, and then we have a function, h, that we pass it through.

Typically, this h function has been given the name rectified linear unit. It's much simpler than that name suggests. If the value that comes out of this matrix multiply is greater than zero, don't touch it. Just let it pass through. If it's less than zero, make it zero, right? You know, it's a pretty complicated name for a very simple function. That's actually critical, though. If you didn't have that h function, this nonlinear function there, then we could roll up all of these layers together and we would just have one big matrix equation, right? So that's really considered a pretty important part of it.

So that's pretty much what's going on. When you want to know what the big deal is with neural networks, that's all that's going on. It's just that equation. The challenge is we don't know what the W's and the b's are. And we don't know how many layers there should be. And we don't know how many neurons there should be in each layer. And although the features can be arbitrary, picking the right ones does matter. And picking the right categories does matter. So when people say, I do machine learning, they're basically playing with all of these parameters to try and find the ones that will work best for their problem. And there's a lot of trial and error.
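[As a concrete illustration, here is a minimal numpy sketch of that inference equation, y_(i+1) = h(W_i y_i + b_i). The layer sizes and random weights are made up for the example; in practice the W's and b's would come from training.]

```python
import numpy as np

def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return np.maximum(x, 0)

def forward(y0, weights, biases):
    """Apply y_{i+1} = h(W_i y_i + b_i) layer by layer."""
    y = y0
    for W, b in zip(weights, biases):
        y = relu(W @ y + b)
    return y

rng = np.random.default_rng(0)

# Made-up layer sizes: 4 input features -> 3 hidden neurons -> 2 categories.
sizes = [4, 3, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y0 = rng.standard_normal(4)          # stand-in for a flattened image
print(forward(y0, weights, biases))  # scores for each output category
```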
And you'll hear that there are now systems that try to use machine learning to do that process automatically. You know, how do you make machines that learn how to do machine learning?

The basic approach is a trial and error approach. I take a whole bunch of pictures that I know have cats in them, OK, and other things, right? And I randomly set all those weights and thresholds. I guess what I think the number of layers and neurons and all that should be. And I put in the vector, run it through the system, and get an estimate, or a calculation, of what I think these final values should be, and I compare it with the truth. That is, I just basically subtract it. And then I use those corrections to very carefully adjust the weights. Basically, with the last weights first, I do what's called back propagation of these little changes to try and make a better guess on what these weights should be. So if you hear the term back propagation, that's that process of taking those differences and using them to adjust these weights by about 0.01% at a time. And then we just do this over and over again until eventually we get a set of weights that we think does the problem well enough for our purpose. So that's called back propagation, all right?

Once we have the set of weights and we have a new picture that we want to know what it is, we drop it in there and it tells us it's a cat or a dog or whatever. That forward step is called inference. These are two words you'll hear frequently in machine learning: back propagation and inference. And that's all there is to it. There's really nothing else to it. If you can understand this equation, you'll be way ahead of most people in machine learning, you know? There are lots of people who understand all the software and the packages and the data. All of them are just doing that.
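[Here is a rough numpy sketch, not the lecture's code, of that trial-and-error loop: random weights, a forward pass, subtracting from the truth, and back propagating small adjustments, last weights first. The toy data and layer sizes are invented for the example, and inputs are batched as rows, so each layer computes x @ W rather than W y.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: inputs x and one-hot "truth" labels t (e.g., cat vs. not-cat).
x = rng.standard_normal((50, 4))
t = np.zeros((50, 2))
t[np.arange(50), (x[:, 0] > 0).astype(int)] = 1.0

# Randomly set all the weights and thresholds, as described above.
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 2)) * 0.1, np.zeros(2)
lr = 0.1  # each step nudges the weights a small amount

for step in range(500):
    # Inference (forward step): run the inputs through the network.
    h = np.maximum(x @ W1 + b1, 0)      # hidden layer with ReLU
    y = h @ W2 + b2                     # output scores

    # Compare with the truth: just subtract.
    err = y - t

    # Back propagation: push the differences backward, last weights first.
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)         # gradient gated by the ReLU
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)

    # Carefully adjust every weight against its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", (err ** 2).mean())
```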
And I'd say one of the most powerful ways to be ahead in your field is to actually understand the mathematical principles. Because then the software, and what it's doing, is much clearer. And other people who don't understand these mathematical principles, they're really guessing. They're like, oh, well, I do this and I throw this module in. They don't really know that all it's doing is making adjustments to these various equations, how many different layers there are, and stuff like that.

Now, why is this important? You're like, well, what does it matter? As I said before, we have this system. It works, but we don't know why. Well, why is it important to know why? Well, there are two reasons. One is that, if we want to be able to apply this incredible innovation to other domains-- and many of you probably want to do that. Many of you want to say, how can I apply machine learning to something other than language or vision or some of these other standard problems? I kind of need some theory to know. Like, OK, if I have a problem that's like this one over here and I changed it in this way, there's a good chance it'll work. There's some basis for why I'm going to try something, right? Right now there's a lot of trial and error. It's like, well, it's an idea. But if you can have some math that says, you know, I think that will probably work, that really is a great way to guide your reasoning and guide your efforts.

Another reason is that-- so here's a picture of a very cute poodle, right? And the machine learning system correctly identifies it as a poodle. One thing we realized is that the way you and I see that picture is actually very, very different from the way the neural network sees that picture, all right?
And, in fact, I can make changes to that picture that are imperceptible to you or me but will completely change how the neural network sees it. That is, given our neural network, I can basically make it think anything, right? And so, for instance, this is a famous paper. And they got the system to think that that was an ostrich, right? And you can basically show this for anything, right? So what's called robust AI, or robust machine learning, machine learning that can't be tricked, is going to become more and more important. And again, having a deeper understanding of the theory is very, very critical to that.

So how are we going to do this? What's the main concept that we are going to go through in this class? This has mostly been motivational. But how are we going to understand the data at a deeper level? You know, what's the big idea? And the big idea is captured now in this, I apologize for this eye chart slide, which is what we call declarative, mathematically rigorous data. So we have this mathematical concept called an associative array and its corresponding algebra, which basically encompasses the data you would put in databases, the data you would put in graphs, the data you would put in matrices, and it makes it all one linear system.

And the key operations are outlined there at the bottom. If you recall, we have our basic little addition and multiplication. And then what's going to be very important, probably the real workhorse for this-- and I didn't show it before-- is what's essentially array multiplication or matrix multiplication. And that's the far one on the right there, which we often abbreviate with no symbol at all, just AB. But if we really want to explicitly call out that it's matrix multiplication, as a combination of both multiplication and addition, we put in what we call the punch-drunk emoji, which is a plus dot times: ⊕.⊗.
You're probably all young enough that you don't even remember when emojis had to be typed out with just little characters and we didn't have icons, right? So that one meant you went to the bar and lost the fight, right? But, anyway, that's really going to be the workhorse of what we're doing here.
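[To make the ⊕.⊗ idea concrete, here is a small sketch, not from the course materials, of matrix multiplication with the addition and multiplication operations swapped out. The function name `semiring_matmul` is just illustrative.]

```python
# Matrix multiply C = A (+).(x) B where "add" and "mul" are pluggable.
# With add=+, mul=* this is ordinary matrix multiplication; with
# add=max, mul=+ it is the max-plus product mentioned earlier.
from functools import reduce

def semiring_matmul(A, B, add, mul):
    """C[i][j] = add-reduction over k of mul(A[i][k], B[k][j])."""
    n, m, p = len(A), len(B), len(B[0])
    return [[reduce(add, (mul(A[i][k], B[k][j]) for k in range(m)))
             for j in range(p)] for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Ordinary +.* matrix multiply.
print(semiring_matmul(A, B, lambda x, y: x + y, lambda x, y: x * y))
# [[19, 22], [43, 50]]

# Max-plus (max.+) matrix multiply.
print(semiring_matmul(A, B, max, lambda x, y: x + y))
# [[9, 10], [11, 12]]
```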