The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Welcome. Happy Halloween. And for those of you watching this video at home, you'll be glad to know the whole audience has joined me, and they're all dressed up in costumes. It's a real fun day here as we do this, so I don't feel alone in my costume. A lot of moral support there, so that's great.

So this is Lecture 05 on Signal Processing on Databases. Just for a recap, for those of you who missed earlier classes or are doing this out of order on the web: signal processing really alludes to detection theory, finding things, which alludes to the underlying mathematical basis of that, which is linear algebra. And databases really refers to working with unstructured data, strings, and other types of things. Those are two things that aren't really talked about together, but we're bringing them together here because we have lots of new data sets that require it. And this talk is probably the one that gets most into something we would say relates to detection theory, because we're going to be dealing a lot with background data models. In particular, power laws and methods of constructing power law data sets, methods for sampling and fitting, and using that as a basis for doing the kind of work that you want to do.

So moving in here, just your outline. We've got a lot of material to go over today. This all uses the data set that we talked about in the last lecture, which is the Reuters data set, so we'll be applying some of these ideas to that data set. I'm going to give an introduction here, and then I'm going to get to sampling theory, subsampling theory, and various types of distributions.
And then we'll end up with the Reuters data set.

The overall goal of this lecture is to develop a background model for these types of data sets that is based on what I'm calling a perfect power law. Then, after we can construct a perfect power law, we're going to sample that power law and look at what happens when we sample it. What are the effects of sampling it? And then we can use the power law to look at things like deviations and such.

Now you might ask, well, why are we so concerned about backgrounds and power laws? It's because here is the basis of detection theory on one slide. In detection theory, you basically have a model which consists of noise and signal, and you have two hypotheses: H0, that there's only noise in your data, and H1, that there's signal plus noise. Essentially, when you do detection theory, given these models you can compute optimal filters for answering the question: is there a signal there, or is it just noise? That's essentially what detection theory boils down to. Now, when we deal with graph theory, it's obviously not so clean in terms of our dimensions here. We'll have some kind of high dimensional space, and our signal will be projected into that high dimensional space. But nevertheless, the concept is still just as important: we have noise and a signal, and that's what we're trying to work with here.

Detection theory works in the traditional domains that we've applied it to because we have a fairly good model for the background, which tends to be Gaussian random noise. It's kind of the fundamental distribution that we use in lots of our data sets, and if we didn't have that model, it would be very difficult for us to proceed with much of detection theory. And the Gaussian random noise model works because, in many respects, if you're collecting sensor data, you really will have Gaussian physics going on.
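As a quick symbolic recap of that one-slide version of detection theory, in standard textbook notation (this is my restatement, not copied from the slide):

\[
H_0: x = n, \qquad H_1: x = s + n, \qquad
\Lambda(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \eta .
\]

For a Gaussian background this likelihood ratio test leads to the familiar matched filter; the question for the rest of the lecture is what to use as the background term when the data is power law rather than Gaussian.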
There's also the law of large numbers: if you have lots of different distributions that are pulled together, they will end up beginning to look like a Gaussian as well. So that's what we have in many of the traditional fields that we've worked in in signal processing. But now we're in this new area where a lot of our data sets arise from artificial processes, processes that are a result of human actions. Be it data on a network, be it data in a social network, be it other types of data that have a strong artificial element to them, we find that the Gaussian model does not reveal itself in the same way that we've seen in other data sets. But we really need a background model, so we have to do something about it.

There has been a fair amount of research and literature on coming up with first-principles methods for creating power law distributions in data sets. We talked a little bit about these distributions in the previous lectures. And they've met with mixed results. It's been difficult to come up with the underlying physics of the processes that result in certain vertices in graphs having an enormous number of edges and other vertices only having a few. There has been work on that, and I encourage you to look at that literature. Here we're going to go from the reverse direction, which is: let's begin by coming up with some way to construct a perfect power law, with no concept of what the underlying physics motivating it might be. Essentially, a basic linear model for a perfect power law. That we probably can do, and then we'll go from there. And linear models are something that we often use in our business; they're a good first starting point.

So along those lines, here is a way to construct a perfect power law in a matrix. This is basically a slide of definitions, so let me spend a little time going through them.
So we're going to represent our graph, or our data, as a random matrix: basically zero where there's no connection. This is a set of vertices connected with another set of vertices. There are Nout of these vertices and Nin of these vertices. If you have a dot here, the row corresponds to the vertex the edge leaves, and the column corresponds to the vertex that the edge is going into. This adjacency matrix A is just going to be constructed by randomly filling the matrix with entries. And the only real constraint is on the sum of A. We're going to allow multiple edges, so the values aren't just zero and one; they can be more than that. But when you sum the matrix A, all its values up, you get a value M, which is the total number of edges in the graph. So we have essentially a graph, which could be a bipartite graph or not, with Nout out vertices, Nin in vertices, and M total edges.

For the perfect power law, we're going to have essentially two perfect power laws here. One is on the out degree: if you sum the rows and then do a histogram, you want to produce a histogram that looks something like this. You have an out degree for each vertex, and this shows how many vertices have that out degree. And the power law says that these points should fall on a slope with a negative power law coefficient of alpha out. So that's essentially one definition there. And then likewise you have another one for the other direction, so the in degrees have their own power law. These are the definitions; this is what we're saying a perfect power law is. We're saying a perfect power law has these properties, and now we're going to attempt to construct it. We have no physical basis for saying why the data should look this way. We're just saying this is a linear model of the data, and we're going to construct it that way.
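In symbols, the definitions just described might be written roughly like this (my notation, inferred from the description rather than copied from the slide):

\[
\sum_{i,j} \mathbf{A}(i,j) = M, \qquad
d^{out}(i) = \sum_j \mathbf{A}(i,j), \qquad d^{in}(j) = \sum_i \mathbf{A}(i,j),
\]
\[
n^{out}(d) = \bigl|\{\, i : d^{out}(i) = d \,\}\bigr| \propto d^{-\alpha_{out}}, \qquad
n^{in}(d) = \bigl|\{\, j : d^{in}(j) = d \,\}\bigr| \propto d^{-\alpha_{in}} .
\]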
And again, these can be undirected, multi-edge; we can allow self-loops and disconnected vertices, and hyper-edges. Anything you can get by just randomly throwing down values onto a matrix. And again, the only constraint is that the sum in both directions is equal to the number of edges.

So given that, can we construct such a thing? Well, it turns out we can construct such a thing fairly simply. In MATLAB we can construct a perfect power law graph with this four-line function here. It will construct a degree distribution that has this property. The three parameters to this distribution are alpha, which is the slope; dmax, which is the maximum degree vertex; and then this number Nd, which is roughly proportional to the number of bins that we're going to have here, essentially the number of points. It's not exactly that, but roughly proportional to it. And when you do this little equation, the first thing that you will see is that we are creating a logarithmic spacing of bins. We kind of need to do that here. But at a certain point we get below a value where the spacing becomes essentially one bin per integer. And so these are two very separate regimes. You have one which we call the integer regime, where basically each integer has one bin representing it, and then it transitions to a logarithmic regime. You might say this is somewhat artificial, but it's actually very reflective of what's really going on. We really have a dmax, we really have, almost always, a count at 1, and then we have a count at 2 or 3, and then they start spreading out. So this is just an artificial way to create this type of distribution, which is a perfect power law distribution. It's very simple, very efficient code for creating one of these, and it has a smooth transition from what we call the integer bins to the logarithmic bins.
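The actual four-line function lives on the slide rather than in the transcript, so here is only a minimal sketch of the same idea, assuming the convention that the count at dmax is 1 (the function name and details are my own, not the course's code):

    function [d, n] = PerfectPowerLaw(alpha, dmax, Nd)
      % Candidate degree bins: logarithmically spaced from 1 to dmax,
      % rounded to integers. Small values collapse onto one bin per
      % integer (the integer regime); larger values stay log-spaced.
      d = unique(round(logspace(0, log10(dmax), Nd)));
      % Counts fall on a slope of -alpha, normalized so n(dmax) = 1.
      n = round((d ./ dmax) .^ (-alpha));
    end

With this convention, sum(n) gives the number of vertices N and sum(d.*n) the number of edges M, which is the inversion used a little later in the lecture.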
And it also gives a very nice, what we call, poor man's slope estimator. There's a lot of research out there about how you estimate the slope of your power law, and there are all kinds of algorithms for doing this. Well, the simplest way is just to take the two endpoints. Take the first point and the last point, and you know you're perfectly fitting two points; and you could argue you're perfectly fitting the two most important points. You get this nice, simple value for the slope here. In addition, you can make the argument that regardless of how you bin the data, you'll always have these two bins: you will always have a bin at dmax and you will always have a bin at 1. All the other bins are going to be somewhat a matter of choice, or of fitting. So again, that's another reason to rationalize this alpha. I would say, if you plot your data and you have to estimate an alpha, then here's what you do, and it's as good an estimate of alpha as any. And it's very nicely defined. So when we talk about estimating the slope here, this is the formula we're going to use.

So far, this code has just constructed a degree distribution; that is, the degree, and the number of vertices with that degree, are the outputs of this perfect power law function. We still have to assign that degree distribution to an actual set of edges. OK? And here's the code that will do that for us: given a degree distribution, it will create a set of vertices that realize it.

Now, the actual pairing of the vertices into edges is arbitrary. And in fact, these are all different adjacency matrices for the same degree distribution. That is, every single one of these has the same degree distribution in both the rows and the columns. So which vertices are actually connected to which is a second order statistic.
The degree distribution is the first order [INAUDIBLE], but how you want to connect those vertices up is somewhat arbitrary, and so that's a freedom that you have here. For example, if I just take the vertices out of here and I say, all right, every single vertex in your list, I'm just going to pair you up with yourself, I will get an adjacency matrix that's all diagonal: essentially, all self-loops. If I take that list and just randomly reorder the vertex labels themselves, then I get something that looks like this. If I just randomly reorder the edge pairs, I get something like this. And if I randomly relabel the vertices and reconnect the vertices into different edges, I get something like this. For the most part, when we talk about randomly generating our perfect power laws, we're going to talk about this last one, which is probably most like what we really encounter. It's essentially something that's equivalent to randomly labeling your vertices, and then randomly taking those vertices and randomly pairing them together. So that basically covers how we can actually construct a graph from our perfect power law degree distribution; a small sketch of that random pairing follows below.

Now, this is a forward model. Given a set of these three parameters, we can generate a perfect power law. But if we're dealing with data, we often want slightly different parameters. As I said before, our three parameters were alpha, which is greater than 0; dmax, which is the highest degree in the data, which we're saying is greater than 1; and this parameter Nd, which roughly corresponds to the number of bins. We can generate a power law model for any values that satisfy those constraints, so that's a large space. However, what we'll typically see is that we want to use different parameters: an alpha, a number of vertices, and a number of edges are more often the parameters we want to work with.
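Here is that sketch: a hedged, minimal version of turning the (d, n) degree histogram into randomly paired edges (the variable names and the use of repelem are my own assumptions, not the course's code):

    % Expand the histogram into a degree for each individual vertex.
    deg   = repelem(d(:), n(:));                % deg(i) = degree of vertex i
    % Each vertex contributes deg(i) edge "stubs"; shuffle the stubs
    % independently for the out side and the in side of each edge.
    stubs = repelem((1:numel(deg)).', deg);
    out   = stubs(randperm(numel(stubs)));
    in    = stubs(randperm(numel(stubs)));
    % Accumulate into a sparse adjacency matrix; duplicate pairs become
    % multi-edges, and the row and column sums both reproduce deg.
    A = sparse(out, in, 1, numel(deg), numel(deg));

Pairing each stub with itself instead of a shuffled copy would give the all-self-loop diagonal matrix mentioned above.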
And we can compute those by inverting these formulas. That is, if we compute the degree distribution, we can sum it to compute the number of vertices, and likewise, we can sum the distribution times the degree to get the number of edges. So given an alpha and our model, we can invert these. All right? And what you see here is, for a given value of alpha, the allowed values of N and M; that is, the values of N and M that can be a power law. So what you see is that not all combinations of vertex count and edge count can be constructed as a power law. There's a band here. This is a logarithmic graph, so it's a wide band, but there's a band here of allowable data that will produce a power law. And typically, the middle of this band is at a ratio of around 10, which happens to be the magic number that we see in lots of our data sets. When people say, I have power law data, and someone asks, well, what's your edge to vertex ratio? [INAUDIBLE] we say, it's like 8, or 10, or 20, or something like that. And again, you see it's because in order for it to be power law data, at least according to this model, it has to fall into this band here. You'll also see this is a very nonlinear function, and we'll get into fitting it later. It's a nasty, nasty function to invert, because we have integer data and data that's almost continuous. We can do it, but it's kind of a nasty business. But given an alpha, an N, and an M that are consistent, we can actually then generate a dmax and an Nd that will best fit those parameters.

So let's do an example here. I didn't just dress up in this crazy outfit for nothing; we have a whole Halloween theme to our lecture today. When I go trick or treating with my daughter, of course, our favorite thing to do is the distribution of the candy when we're done. And so this shows last year's candy distribution. We'll see how it varies.
As you can see, Hershey's chocolate bars, not surprisingly, are extremely popular. What else is popular here? Swedish Fish, not so popular. Nestle Crunch bars, not so popular. I actually found this somewhat interesting: this list hasn't changed since I went trick or treating. This is a tough list to break into. Getting a new candy to the point where it's Halloween-worthy is pretty hard.

So this shows the distribution of all the candy that we collected this year, and here is some basic information. We had 77 pieces of candy, or distinct edges. We had 19 types of candy. Our edge to vertex ratio was 4. The dmax was 15, so we had 15 Hershey's Kisses. N1 was 8: we had eight types of candy that we only got one of. And then our power law slope was alpha. And our fit parameters, when we actually fit, were 77, 21, and an M/N of 3.7. And this shows you the data. This is the candy degree, and this is the number of candy types with that degree. This shows you what we measured, this is the poor man's slope here, and this is the model. And then one thing we can do, which is very helpful, is to re-bin the measured data using the bins extracted from the model. That gets you these red x's, or plus signs, here. And we'll discover that's very important, because the data you have is often binned in a way that's not proper for seeing the true distribution. We can use this model to come up with a better set of bins, and then bin the data with respect to that. So that's just an example of this in actual practice.

So now that we have a mechanism for generating a perfect power law, let's see what happens when we sample it. Let's see what happens when we do the things to it that we typically do to clean up our data.
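Before moving on to the cleanup experiments, one quick numeric check on the candy example above (my own arithmetic, with the hypothetical assumption that exactly one candy type sits at the maximum degree of 15):

    M = 77;  N = 19;           % edges (pieces) and vertices (candy types)
    M / N                      % edge-to-vertex ratio, about 4
    log(8 / 1) / log(15 / 1)   % poor man's slope from the endpoints
                               % (d, n) = (1, 8) and (15, 1), about 0.77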
I bring this up because in standard graph theory, as I've talked about in previous lectures, we often have what we call random, undirected Erdos-Renyi graphs, which are basically vertices connected by edges without direction. And usually the edges are unweighted, so we just have a 0 or a 1. So, very simplified graphs. I'm actually going to take this off; it's getting a little hot here in the top hat.

A lot of our graph theory is based on these types of graphs. And as we've talked about before, our data tends to not look like that. So one of the things we do, so that we can apply the theory to that data, is that we often make the data look like that. We'll often make the data undirected, we'll often make the data unweighted, and other types of things, so we can apply all the theory that we've developed over the last several decades on these particular, very well studied graphs. So now that we have a perfect power law graph, we can see what happens if we apply those same corrections to the data.

And here's what we see. We generated a perfect power law graph. The alpha is 1.3, the dmax was 1,000, and our Nd was 50. This generated a data set with 18,000 vertices and 84,000 edges. And here's a very simple way to do the cleanup. We're going to make it undirected by taking the matrix, adding its transpose, and then taking the upper triangular part. This is actually the best way to make an adjacency matrix undirected, to take that upper portion, because it saves you from having to deal with a lot of annoying factors of 2 in the statistics. A lot of times we'll just do A plus A transpose, but then you get these annoying factors of 2 lying around, and this is a way to avoid that. So we've made it undirected by doing that. We're going to make it unweighted by converting everything to a 0 or 1, and then back to double. So that makes it unweighted.
And then we're getting rid of the diagonal, so that eliminates self-loops. So we've done all these things; we've cleaned up our data in this way. So what happens? Well, the triangles were the input model. Now we've cleaned up our data, and we see this sort of mess that we've made of our data. In fact, in keeping with our Halloween theme, we'll call this our witch's broom distribution. And if anybody's looked at degree distributions on real data, you'll recognize this shape instantly, because you have this bendiness coming up here, and then sort of fanning out down here. It's a very common thing that we see in the data sets that we plot. And in fact, there's not an insignificant amount of literature devoted to trying to understand these bumps and wiggles, and whether they really mean something underlying about the physical phenomenon that's taking place. And while it may be the case that those bumps and wiggles are actually representative of some physical phenomenon, based on this we also have to concede that they're also consistent with our cleanup procedure. That is, the thing we're doing to make our data better is introducing nonlinear phenomena in the data, which we may confuse with real phenomena.

So this is very much a cautionary tale. Based on that, I certainly encourage people not to clean up their data in that way. Keep the directedness, don't throw away the self-loops, keep the weightedness. Do your degree distributions on the data as it is, and live with the fact that that's what your data really is like and try to understand it that way, rather than cleaning it up in this way. Sometimes you have no choice; the algorithms that you have will only work on data that's been cleaned up in this way. But you have to recognize that you are introducing new phenomena. It's a highly nonlinear process, this cleaning up, and you have to be careful about that.
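For reference, a minimal sketch of that cleanup sequence as I understand it from the description (assuming A is the sparse, weighted, directed adjacency matrix; this is my paraphrase, not the slide's exact code):

    A = triu(A + A.');        % undirected: symmetrize, keep the upper triangle
                              % so edge counts don't pick up factors of 2
    A = double(A > 0);        % unweighted: collapse multi-edge weights to 0/1
    A = A - diag(diag(A));    % no self-loops: remove the diagonal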
However, given that we've done this, is there a way that we can recover the original power law? We can try. So what we have here: the original data that we cleaned up is now these circles. OK? And we're going to take that data set, compute an alpha and an N and M from it using our inversion formulas, and then compute what the power law of that would be. That's the triangles. So here's our poor man's alpha fit, and this is the model: these triangles here are the model, we're saying. And then we can say, aha, let's use the bins that came from this model to re-bin these circles onto these red plus signs here. So that's our new data set. And what you see is that we've done a pretty good job of recovering the original power law. So if we had data that we observed to look like this, we wouldn't be sure it was a power law. Like, we don't know. We'd say, well, what's this bend here? And what's this fanning out here? But if you go through this process and re-bin it, you can say, oh no, that really looks like a power law. So that's a way of recovering a power law that we may have lost through some filtering procedure.

Here's another example. What we're going to do is essentially take our matrix and compute the correlation of it. We've talked a lot about this: if I have an incidence matrix, we multiply it by itself to do correlation. In this case, we're treating our random matrix not as an adjacency matrix, but as an incidence matrix, a randomly generated incidence matrix. And these are, again, the parameters that we use. We're converting it to all unweighted, all 0's and 1's, and then we are correlating it with itself to construct the adjacency matrix: taking the upper triangular part, and then removing the diagonal. And this is the result of what we see. So here's our input model, again, the triangles.
And then this is the measured result, what we get out of that. If you saw this, you might say, wow, that's a really good power law. In fact, I've certainly seen data like this, and most of the time I would say, yep, that is a power law distribution; we absolutely have a power law distribution. However, we then apply our procedure. So again, we have our measured data. OK, we're going to do our parameters here, get our poor man's alpha parameter, and then fit; the triangles are the new fit. OK. And then we use those bins to re-bin. And we see here that when we actually re-bin the data, we get something that looks very much not like a power law distribution. So there's an example of the reverse. Before, we had data that didn't look like a power law, but when we re-binned it, we recovered the power law. Here we have data that, in its original binning, may look like a power law, but when we re-bin it, we see it has this bump. And then, continuing with our Halloween theme, we can call this the witch's nose distribution, because it comes along here as this giant bump and then goes back to a power law. And there's actually meaning to this; we will see it later in the actual data. When you do these correlation matrices, certain types of them, particularly self-correlations, will very likely produce this type of distribution.

But again, even though we have this bump, you would still argue that our linear power law is a very good first order fit. We've still captured most of the dynamic range of the distribution, and this is now a delta from that. And so we're very comfortable with that, right? We start with our linear model; that models most of the data. And then we have a second order. If we wanted to, we could go in and come up with some kind of second order correction: subtract the linear model from here and you would see some kind of hump distribution here.
And you could then model your data as a linear model plus some kind of correction. Again, that's a very classic signal processing way to deal with our data, and it certainly seems as relevant here as anywhere else.

Let's see here. And again, the power law can be preserved, as we talked about there. So, moving on, another phenomenon that's often documented in the literature is called densification; in fact, there are many papers written on what is called densification. This is the observation that if you construct a graph and you compute the ratio of edges to vertices over time, that ratio will go up. And there's a lot of research talking about the physical phenomena that might produce that type of effect. And while those physical phenomena might be there, it's also a byproduct of just sampling the data. So for instance here, what we're going to do is take the perfect power law graph we created and sample it. We're basically going to take subsamples of that data, and we're going to do it in little chunks, about 10% of the data at a time. The triangles and the circles show what happens when we look at each set of data independently, and then we have these lines that show what happens when we do it cumulatively: we basically take 10% of the data, then 20% of the data, then 30%, and so on. And we have two different ways of sampling our data here. Random means I'm just taking that whole matrix and randomly picking edges out of it. And what you see is that each sample has a relatively low edge to vertex ratio, but as you add more and more of them up, it gets denser. And this is simply the fact that, given a finite number of vertices, if you kept on adding edges and edges and edges, this ratio would eventually become infinite. If you add an infinite number of edges to a finite number of vertices, then it will get denser and denser and denser.
And this is sort of a byproduct of treating these as 0's and 1's, and not recognizing the repeated edges. So this just naturally occurs through sampling. The linear sampling here is where, basically, I'm taking whole rows at a time; I could also have taken whole columns at a time. So I'm taking each row and dropping it in, and there you see it's constant. I'm essentially taking a whole vertex and adding it at a time, and here the density is somewhat independent of the sampling: if you sample whole rows, the density of the result stays the same. So this is just good to know about sampling: these phenomena can take place, and sampling can play an important role in the data that we observe.

Another phenomenon that's been studied extensively is what happens to the slope of the degree distribution as you add data. And again, we do this exact same type of sampling. And you see here that this data had a slope, I believe, of 1.3. You can see that if we sample it randomly, just taking random vertices, the slope starts out very, very high, and each independent sample stays high. But when we start accumulating them, they start converging on the true value, converging from above. And likewise, when you do linear sampling, you approach from the opposite direction. And they both end up converging onto the true value here of 1.3. So again, this just shows that the slope of your degree distribution is also very much a function of the sampling. It could also be a function of the underlying phenomenon, but again, it's a cautionary tale that one needs to be very aware of how one is sampling. And again, these perfect power law data sets are a very useful tool for doing that.
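A minimal sketch of the kind of subsampling experiment just described, under my own assumptions about the details (cumulative random edge samples in 10% chunks, tracking the edge-to-vertex ratio; this is not the course's exact script):

    [i, j] = find(A);                        % edge list of the power law graph
                                             % (each nonzero treated as one edge)
    p = randperm(numel(i));                  % random order for edge sampling
    for frac = 0.1:0.1:1.0
      keep = p(1 : round(frac * numel(i)));  % cumulative random sample of edges
      v = unique([i(keep); j(keep)]);        % vertices touched by the sample
      MoverN = numel(keep) / numel(v)        % edge-to-vertex ratio: grows with frac
    end

The slope experiment is the same loop, with the poor man's estimator applied to the degree histogram of each sample instead of the ratio.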
So if you have a real data set and you're sampling it in some way, and you want to know what is maybe real phenomenon versus what is a sampling effect, you can go and generate a perfect power law that's an approximation of that data set. You can then very quickly see which phenomena are just a result of sampling a perfect power law, and which phenomena are maybe indicative of some deeper underlying correlations in the data. So again, a very useful tool here.

Moving on, we're going to talk about subsampling. One of the problems that we have is very large data sets; often we can't compute the degree distribution on the entire data set. This has been the bread and butter of signal processing for years: if we want to compute a background model, we don't simply sum up all the data. We randomly select data from the data set, and we use that as a model of our background. And that's a much more efficient way, from a computational and data handling perspective, than computing the mean or the variance based on the entire data set. So again, we need good background estimation in order to do our anomaly detection, and it's prohibitive to traverse all the data. So the question is, can we accurately estimate the background from a sample?

So let's see what happens. We have a perfect power law; we can look at what happens when we sample it. So we've generated a power law. OK. And note I've changed the plot: this may look like the degree distribution, but it's actually a different plot. This is showing every single vertex in the data set, and this shows the in degree of that vertex. And we've sorted them, so the highest degree vertex is over here and the lowest degree vertex is over here. OK, so this is all the vertices, and this is the true data. And this is what happens when we take a 1/40 sample. I just say, I'm only going to take 1/40 of the edges.
What does it look like? There's some relatively simple math here, which I won't go over, but it's there for you. We can actually come up with a correction that lets us account for the sampling in these degree distributions. I apologize for the slides; we will correct them for the web version.

All right, so moving on. When we talk about sampling, we talk mainly about single distributions, but we can also talk about joint distributions. We can use the degree as a way of labeling the vertices and look at them that way. It's a way of compressing: if we label each vertex by its degree, we compress many vertices into a smaller dimensional space. And we can then count the correlations. We can look at the distribution of how many edges there are from vertices of this degree to vertices of that degree. So it's a tool for projecting our data and understanding what's going on. And we can also then re-bin that data with a power law, which will make it more easily understood.

So if we look here, we see the degree distribution. This shows us, for perfect power law data, for vertices with this degree in and this degree out, how many edges there were between them. And as you see here, obviously, there were a lot of edges here between low degree vertices, and not so much over here. But this is somewhat misleading because of the way the data comes out; we're not really binning it properly. However, if we go and fit a perfect power law to this data, pick a new set of bins based on that, and re-bin the data, we can see here that we get a much smoother distribution. So while here we may have thought that this was an artificially low-density region, and this was artificially high, what you see when you re-bin it is that there's a very smooth distribution relative to what we expect for our perfect power law.
It's a fairly uniform distribution with respect to the model. Essentially, re-binning puts more bins where we actually have data, instead of wasting bins where we don't really have any data.

Using our perfect power law model, we can also compute analytically what this joint distribution should be. Here's an example of what that looks like, and again it's very similar to what we measure. And given the data and a model for the data, we can then compute the ratio of the observed to the model to get a sense of which data is unusual versus what we expect from the power law fit. We see that very clearly here. This plot is just the data divided by the model, or more precisely the log of that ratio, so zero means the ratio is essentially 1 and everything is as expected. And then we see all these fluctuations: things that are higher than we expected and things that are lower than we expected.

This is the classic situation you see whenever someone shows you a map of the United States, by county, of some phenomenon, the classic being a cancer cluster or heart disease. What you see is that certain counties in the western part of the United States look extremely healthy and certain counties look just deadly. And it's simply because they're very sparsely populated, so you're dealing with a small numbers effect. Basically, that plot is just showing you oscillations between counts of 0 and 1, what we call Poisson sampling, and that makes it very difficult to know which deviations are real. However, if we re-bin the data and then divide by the model, we see that the vast majority of our data, as expected, sits in the normal regime.
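A minimal sketch of that comparison, assuming `observed` and `model` are same-shape arrays of bin counts (for instance, the re-binned grids from the sketch above). The `min_expected` threshold of 5 is an assumed rule of thumb for masking bins dominated by Poisson fluctuation, not a value from the lecture.

```python
import numpy as np

def log_ratio(observed, model, eps=1e-12):
    """log10(observed / model) per bin; 0 means the bin matches the background
    model, positive values are surpluses, negative values are deficits."""
    return np.log10((observed + eps) / (model + eps))

def significant_log_ratio(observed, model, min_expected=5.0):
    """Same ratio, but with NaN in bins whose expected count is so small that
    pure Poisson fluctuation (counts of 0, 1, 2) dominates the ratio."""
    r = log_ratio(observed, model)
    return np.where(model >= min_expected, r, np.nan)
```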
Another thing we can do is look at the most unexpected bins, as well as the most typical bins. (These slide elements got moved around; I don't know what's going on with PowerPoint today.) But this shows us the surpluses, the deficits, and the most typical: the most overrepresented bin, the most underrepresented bin, and the most average bin, together with the regions they correspond to in the real data set. So you can use this to find extremes based on a statistical test.

And you see the same thing here. This plot shows the measured over expected versus the measured. This is the original data set and this is the re-binned data set, and you can see that the re-binning removes a lot of these very sparse points and gives you a very narrow distribution around what you expected.

You can also go and find selected edges if you want. This just shows the different types of edges: if you wanted to go look at them, these would be the maximum, these the minimum, and these the other types. So it's a useful thing. You can say, all right, we found an artificially high correlation between vertices of this degree and that degree; we can then backtrack, find out which specific vertices those are, and see whether anything interesting is going on.

We can also use this plot to look at questions of edge order. And hopefully this will work today. So here, if I randomly select vertices and compute their degree, basically what we did before, subsampling and then computing the observed over the expected, and we play the animation, you can see that when we randomly select the vertices, each sample looks very much the same. Again, up in the high degrees we have this Poisson sampling effect, where a bin with no vertices gives a count of 0, which is lower than expected, and a bin with one or two vertices comes out higher than expected.
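As a sketch of that kind of statistical test, the hypothetical helper below picks out the biggest surplus, the biggest deficit, and the most typical bin from an observed/model pair of count grids; the `min_expected` cutoff is again an assumed threshold for ignoring Poisson-dominated bins.

```python
import numpy as np

def extreme_bins(observed, model, min_expected=5.0):
    """Indices of the biggest surplus, biggest deficit, and most typical bin.

    'observed' and 'model' are same-shape arrays of bin counts. Bins whose
    expected count falls below min_expected (an assumed cutoff) are ignored,
    since Poisson fluctuation dominates them. 'Most typical' means the bin
    whose log ratio is closest to zero.
    """
    ratio = np.log10((observed + 1e-12) / (model + 1e-12))
    ratio = np.where(model >= min_expected, ratio, np.nan)
    flat = ratio.ravel()
    valid = np.flatnonzero(~np.isnan(flat))
    surplus = valid[np.argmax(flat[valid])]
    deficit = valid[np.argmin(flat[valid])]
    typical = valid[np.argmin(np.abs(flat[valid]))]
    return tuple(np.unravel_index(i, ratio.shape) for i in (surplus, deficit, typical))
```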
So again, we still have that Poisson effect even when we sample vertices randomly. Interestingly, if we do linear sampling, where we take whole rows at a time, we get a very different type of phenomenon. Whenever you do run into a high degree row, by definition it looks unusual. Which means you have to be careful: you're going to run into that high degree row eventually by sampling, and you don't want to conclude, oh my goodness, this is a very, very unusual thing, when it's really just an artifact of how you sampled. So again, a cautionary tale about sampling.

All right. We've talked a lot about the theory here, so let's get into some real data. This is our Reuters data again; I showed it to you in the last lecture, along with the various document distributions we had. In this case there are 800,000 documents and 47,000 extracted entities, for a total of essentially 6,000,000 edges. So it's a bipartite graph [INAUDIBLE] between documents and entities, with four different entity types. We can now look at the degree distributions of the different classes and see what we have.

The first ones we want to look at are the locations, so we look at the distribution of the documents and of the entities. To be very clear about what I'm doing: I'm taking just this part of the matrix, and the distribution over documents comes from summing along the rows, while the distribution over locations comes from summing down the columns. We can do that for each one of the types, so we have essentially two different degree distributions, one associated with the documents and one associated with the entities.
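Here is a small sketch of those two sums on a sparse document-by-entity incidence matrix. The matrix `A` below is synthetic, uniformly random stand-in data (so it is not power law, and its dimensions are made up); the point is only the mechanics of getting document degrees from row sums and entity degrees from column sums.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for a document x entity incidence matrix: rows are
# documents, columns are entities of one type (say, locations), and
# A[i, j] = 1 if entity j appears in document i.
rng = np.random.default_rng(0)
n_docs, n_entities, n_links = 5000, 800, 40000
A = sparse.coo_matrix(
    (np.ones(n_links),
     (rng.integers(0, n_docs, n_links), rng.integers(0, n_entities, n_links))),
    shape=(n_docs, n_entities)).tocsr()
A.data[:] = 1                                       # collapse duplicate links

doc_degree = np.asarray(A.sum(axis=1)).ravel()      # sum along each row
entity_degree = np.asarray(A.sum(axis=0)).ravel()   # sum down each column

def degree_distribution(degrees):
    """How many vertices have each degree value (the curves on the slides)."""
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    return values, counts

vals, cnts = degree_distribution(doc_degree)
print("document degrees:", vals[:5], "with counts:", cnts[:5])
```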
So this shows our document distribution: we have the measured data, our fit, our model, and then our re-binning. And you could say that this is approximately a power law, and that when you re-bin it, this sort of S-shaped effect is still there, which probably means it's really there: something in the data really is making this bowing effect. Likewise over here, we have the measured data, our model alpha, which is the blue line, the model fit, and then our re-binning. And again, you could say the power law model is a pretty good fit, but we have some additional phenomenon going on here as well.

We can then do this for each type. We can look at the organizations. We don't have as many organizations, and we see similar kinds of behavior, but this is so sparse that it's difficult to really say what's going on.

Then of course we do people, which are always the first thing you talk about when you talk about power law distributions. And again, very nicely, we have our measured data and our fit. We see this sort of bent broom shape, bending and then fanning out. But when we model it and re-bin it, we get something that looks much more like a true power law, and you can see that very nicely with the actual person data: a very good power law. So this tells us that regardless of what the underlying raw plot looks like, this probably really is a power law distribution.

And then we have the times. Again, a similar type of thing. We actually have a little spike here: the Reuters data has a certain set of times associated with the actual filing of events. There are only 35 of them, so we do get a little spike, which is actually what we expect. You wouldn't see that clearly in the raw data, but when you re-bin it, this bump comes out fairly clearly.
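The exponent fits referred to here can be approximated with a very simple log-log least-squares heuristic; the sketch below is that heuristic, not a maximum-likelihood estimator and not necessarily the fitting procedure behind the slides. The commented usage with `doc_degree` refers to the hypothetical degrees computed in the earlier sketch.

```python
import numpy as np

def fit_power_law(degrees):
    """Fit count(d) ~ C * d**(-alpha) by least squares in log-log space.

    A quick heuristic on the raw degree counts -- not a maximum-likelihood
    estimator, and not necessarily the fitting procedure behind the slides.
    Returns (alpha, C).
    """
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    slope, intercept = np.polyfit(np.log10(values), np.log10(counts), 1)
    return -slope, 10.0 ** intercept

# Hypothetical usage with the document degrees from the previous sketch:
# alpha, C = fit_power_law(doc_degree)
# model_count_at_degree_d = C * d ** (-alpha)
```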
So again, proper binning is extremely important. We can look at our correlations as well. Let's just look at the person-person correlations. This is the raw data, and it looks very much like a power law. But when we go through our re-binning process, we see this kind of witch's nose effect that really is in the data. So to first order it's a power law, but you actually have this correlation sitting right here; that's something that really seems to be going on. You see the same thing when we do time, and we can look at documents as well.

Let's now look at sampling. This is the same sampling experiment we did before, and we're going to look at the document densification. This is selecting whole rows: we select a whole document, so we're getting a whole row. This shows the four different entity types, and you see that they behave exactly as expected. Each individual sample is reflective of the overall densification, because you're taking whole rows.

Now let's take entities instead, so now we're cutting across the rows. And now you see something that looks much more like random sampling. When you randomly select an entity, that's essentially a random set of documents. The individual samples are sparse, but when you start summing them up, they get denser and denser. So an individual document is a reasonably good sample of the overall distribution, whereas when you pick an entity, say a person, you get more and more as you take a higher fraction of the data. So again, all consistent with what we saw before. It's a little noisier to see here, but trust me, the power law exponent also behaves exactly as we expected. Likewise, this is essentially the linear sampling and here is the random sampling, behaving exactly as we expected.
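A minimal sketch of one way to run that comparison on a sparse document-by-entity matrix: accumulate whole rows (documents, the linear sampling) or whole columns (entities) in a random order and watch how quickly the nonzeros pile up. The function and its arguments are hypothetical; the commented usage assumes the synthetic matrix `A` from the earlier sketch.

```python
import numpy as np

def densification_curve(A, axis, fractions, rng):
    """Nonzeros accumulated while sampling whole rows (axis=0) or whole
    columns (axis=1) of a scipy.sparse document x entity matrix A, taken
    in a random order.

    Sampling whole rows corresponds to the 'linear sampling' of documents in
    the lecture; sampling whole columns picks entities instead.
    """
    n = A.shape[axis]
    order = rng.permutation(n)
    nnz = []
    for frac in fractions:
        keep = order[: max(1, int(frac * n))]
        sub = A[keep, :] if axis == 0 else A[:, keep]
        nnz.append(sub.nnz)
    return np.array(nnz)

# Hypothetical usage with the synthetic matrix A from the earlier sketch:
# rng = np.random.default_rng(1)
# fracs = np.linspace(0.05, 1.0, 20)
# by_documents = densification_curve(A, 0, fracs, rng)   # whole rows
# by_entities = densification_curve(A, 1, fracs, rng)    # whole columns
```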
We can also look at the joint distributions. So here's the location cross-correlation, showing the document degree versus the entity degree. This is the measured data, and it's been re-binned. This is the measured divided by the expected, that is, the expected re-binned. This is the measured divided by the model. And here's the model, and the expected divided by the model. You can compare all these different combinations to create different statistical tests. In particular, from the measured re-binned divided by the expected re-binned, we can get our surpluses and deficits and other such features in the actual data. Here you see something that maybe looks like an artificially high grouping, some artificially low regions, and some that are as expected. And you can then use these as ways to go find anomalous documents.

Or you can find the most typical documents. People ask, well, why would you try to find the most normal ones? A lot of times, for summarization, you want to be able to say, here is a very representative set of documents: their statistical properties are very consistent with everything else. Because people will [INAUDIBLE] like, what does it mean to be typical? So again, a very useful way to look at the data.

We do the same thing with the organizations: the measured re-binned, the measured divided by the expected, the expected re-binned, the model, the refit model, and the various ratios of all of them, which you can use to find outliers and such. Again, very useful. It's interesting that it picked the most representative bin way up here, so that's a rather unusual representative sample. And persons, you can do the same thing.
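To make the backtracking step concrete, here is a hypothetical helper that maps one extreme (document-degree, entity-degree) bin back to the actual document/entity pairs that land in it; the argument names and the bin-edge convention are assumptions, not the lecture's code.

```python
import numpy as np

def pairs_in_bin(A, doc_degree, entity_degree, doc_bin, ent_bin, bin_edges):
    """Backtrack from one (document-degree, entity-degree) bin to the actual
    document/entity pairs whose endpoint degrees fall in that bin.

    A is the sparse document x entity matrix, doc_bin/ent_bin are bin indices
    (e.g. the surplus bin found earlier), and bin_edges are the degree bin
    boundaries used for the re-binning. Returns (document ids, entity ids).
    """
    rows, cols = A.nonzero()
    d_doc = doc_degree[rows]
    d_ent = entity_degree[cols]
    mask = ((d_doc >= bin_edges[doc_bin]) & (d_doc < bin_edges[doc_bin + 1]) &
            (d_ent >= bin_edges[ent_bin]) & (d_ent < bin_edges[ent_bin + 1]))
    return rows[mask], cols[mask]
```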
One of the things I'm also trying to show is that these distributions don't all look the same. Hopefully, from looking at them, you can see that the location distribution is a little similar to the organization distribution, but the person distribution looks very different, and the time distribution looks different again. So you have to be careful: sometimes we just take all these different categories and lump them together into one big distribution, and here's a situation where they really are pretty different things. You really want to treat them as four distinct classes.

As an example of selected edges, this just shows the most typical: the various entities that were selected and a very representative document. And this is a very low degree one, with its entities, and here's the surplus example. We could go in and actually find those edges. Same with the person data: very generic, higher degree people here, and here's a very low degree one. This person, Jeremy Smith, is very unusual in terms of what they connected with. And surplus ones here. You can't really read that from where you're sitting, but it gives you examples of how you can use this to go in and find things.

All right. So that brings us to the end of the lecture part. Again, developing this background model is very important for graphs. Basing it on the perfect power law gives us a very simple heuristic for creating a linear model. We can then really quantify the effects of sampling, which is very important: traditional sampling approaches can easily create nonlinear phenomena that we have to be careful of and aware of. It also lets us develop techniques for comparing real data with power law fits, which we can then use as statistical tests for finding unusual bits of data. This is very classic detection theory: come up with a model for the background, create a linear fit for that background,
and then use that model to quantify the data and see which things are unusual. Again, these are very classic detection theory techniques, with the background model being the linchpin of the whole thing. And I should say this is very recent work. This power law model is something we did in the last year or so; we just published it this summer. I can't guarantee that three or four years from now people will still be using this particular model, because it is very new. But I think it is representative of the kinds of things people will be using in three, four, or five years to characterize this kind of data; it may not be exactly this, but something like it. So I think it's very useful.

All right. So with that, we will take a short break, and then we will show the example code and talk about the assignment. So, very good. Thank you very much.