1
00:00:00,090 --> 00:00:02,430
The following content is
provided under a Creative

2
00:00:02,430 --> 00:00:03,820
Commons license.

3
00:00:03,820 --> 00:00:06,050
Your support will help
MIT OpenCourseWare

4
00:00:06,050 --> 00:00:10,160
continue to offer high quality
educational resources for free.

5
00:00:10,160 --> 00:00:12,690
To make a donation or to
view additional materials

6
00:00:12,690 --> 00:00:16,610
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,610 --> 00:00:17,261
at ocw.mit.edu.

8
00:00:21,085 --> 00:00:21,960
PROFESSOR: All right.

9
00:00:21,960 --> 00:00:28,120
So we're going to now
show some of the examples.

10
00:00:28,120 --> 00:00:34,230
And so again, they're in our
directory here, Examples.

11
00:00:34,230 --> 00:00:38,010
We're now in section 2, so we're
going to go to Applications.

12
00:00:38,010 --> 00:00:42,849
And we have this section
called Perfect Power Law.

13
00:00:42,849 --> 00:00:44,640
The actual example
codes are all these ones

14
00:00:44,640 --> 00:00:46,210
called PPL 1, 2, 3, and 4.

15
00:00:46,210 --> 00:00:49,179
We have actually a lot of
other functions in here,

16
00:00:49,179 --> 00:00:50,970
which are actually
really useful functions.

17
00:00:50,970 --> 00:00:53,480
They're the functions
that do all this stuff.

18
00:00:53,480 --> 00:00:56,560
And maybe they should actually
get folded into the main D4M

19
00:00:56,560 --> 00:00:57,130
distribution.

20
00:00:57,130 --> 00:00:59,230
They're here right now.

21
00:00:59,230 --> 00:01:01,510
They probably could be
cleaned up a little bit

22
00:01:01,510 --> 00:01:02,680
as part of the homework.

23
00:01:02,680 --> 00:01:05,330
But these are, again,
very useful functions

24
00:01:05,330 --> 00:01:09,510
for allowing you to generate
and fit these types of things.

25
00:01:09,510 --> 00:01:14,786
So I have already started
my MATLAB session.

26
00:01:14,786 --> 00:01:15,570
There we are.

27
00:01:15,570 --> 00:01:18,050
So we will do the first one.

28
00:01:35,360 --> 00:01:38,463
Going along here.

29
00:01:38,463 --> 00:01:38,963
Figures.

30
00:02:10,370 --> 00:02:12,230
So if we go and look
at what we did here--

31
00:02:12,230 --> 00:02:18,110
so the first thing we did
is we set our parameters

32
00:02:18,110 --> 00:02:20,900
for our Perfect Power Law fit.

33
00:02:20,900 --> 00:02:23,420
So we set alpha at
1.3, a D max of 1,000,

34
00:02:23,420 --> 00:02:26,660
and then approximately 30 bins.

35
00:02:26,660 --> 00:02:28,340
We called our
function here, which

36
00:02:28,340 --> 00:02:31,150
is in this directory, which is
Power Law Distribution, which

37
00:02:31,150 --> 00:02:33,680
will create a power law
distribution from these three

38
00:02:33,680 --> 00:02:34,970
parameters.

39
00:02:34,970 --> 00:02:39,000
And then the first thing we did
is just plot a distribution.

40
00:02:39,000 --> 00:02:43,280
So we go here, then
we go to Figure 1.

41
00:02:43,280 --> 00:02:45,260
You can see there is
that distribution.

42
00:02:45,260 --> 00:02:48,810
So our Perfect Power
Law distribution

43
00:02:48,810 --> 00:02:52,170
with these parameters,
very nicely done.

44
00:02:55,130 --> 00:02:58,690
Then moving along,
we can compute.

45
00:02:58,690 --> 00:03:02,104
This is how you compute the
total number of vertices

46
00:03:02,104 --> 00:03:03,020
from the distribution.

47
00:03:03,020 --> 00:03:04,087
We just sum.

48
00:03:04,087 --> 00:03:05,670
It could be the total
number of edges.

49
00:03:05,670 --> 00:03:07,030
We can sum there.

50
00:03:07,030 --> 00:03:09,960
And then we see that
the total vertices

51
00:03:09,960 --> 00:03:18,960
was 18,187, 84,000 edges with
an edge to vertex ratio of 4.6.
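
The two sums described here can be sketched in a few lines of Python. This is only an illustration: the function name and the toy distribution are assumptions, not the lecture's MATLAB code or its data. The vertex total is the sum of the counts, and the edge total is the sum of degree times count.

```python
def totals_from_distribution(degrees, counts):
    """Return (vertices, edges, edges per vertex) for a degree distribution.

    degrees[i] is a degree value and counts[i] is the number of vertices
    that have that degree.
    """
    n_vertices = sum(counts)
    n_edges = sum(d * n for d, n in zip(degrees, counts))
    return n_vertices, n_edges, n_edges / n_vertices


# Toy distribution (illustrative values, not the lecture's data).
degrees = [1, 2, 4, 8]
counts = [8, 4, 2, 1]
n_vertices, n_edges, ratio = totals_from_distribution(degrees, counts)
print(n_vertices, n_edges, ratio)
```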

52
00:03:18,960 --> 00:03:23,300
We now have a function called
edges from distribution.

53
00:03:23,300 --> 00:03:25,010
So if I pass in
the distribution--

54
00:03:25,010 --> 00:03:27,240
essentially the degrees
and the counts--

55
00:03:27,240 --> 00:03:29,816
it will then generate
a set of vertices

56
00:03:29,816 --> 00:03:31,565
that is consistent
with that distribution.

57
00:03:34,440 --> 00:03:38,820
And now I'm going to create
some permutations, basically,

58
00:03:38,820 --> 00:03:39,950
of these vertices.

59
00:03:39,950 --> 00:03:44,620
I'm going to create a--
randomly permute the edge

60
00:03:44,620 --> 00:03:46,670
order with these two sets here.

61
00:03:46,670 --> 00:03:49,210
So if I make a
random permutation

62
00:03:49,210 --> 00:03:53,960
of the number of edges, that
allows me to permute the edges.

63
00:03:53,960 --> 00:03:57,600
And I can also randomly permute
the vertex labels themselves,

64
00:03:57,600 --> 00:04:01,400
and then I can look at
these different permutations

65
00:04:01,400 --> 00:04:04,250
to see what kind of
graphs they've created.

66
00:04:04,250 --> 00:04:08,930
So if I don't permute the
data, I just have a0 here.

67
00:04:08,930 --> 00:04:11,530
I just pass in the
vertices that were created.

68
00:04:11,530 --> 00:04:15,720
I create a non-permuted
adjacency matrix.

69
00:04:15,720 --> 00:04:17,670
I can then look at that.

70
00:04:21,269 --> 00:04:23,560
And that just shows you--
this is the adjacency matrix.

71
00:04:23,560 --> 00:04:24,320
This is in the plot.

72
00:04:24,320 --> 00:04:26,570
It just shows you that I
just created a graph that's

73
00:04:26,570 --> 00:04:28,640
entirely self loops.
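
A rough Python sketch of why the unpermuted case gives only self loops. The helper below is a hypothetical stand-in for what the lecture calls "edges from distribution", not the D4M function itself: it emits one vertex label per edge endpoint, highest degree first, and the same list is then used for both the start and the end vertices.

```python
def edges_from_distribution(degrees, counts):
    """Emit one vertex label per edge endpoint, highest degree first.

    A vertex of degree d appears d times in the returned list.
    """
    vertices = []
    label = 0
    for d, n in sorted(zip(degrees, counts), reverse=True):
        for _ in range(n):
            vertices.extend([label] * d)
            label += 1
    return vertices


v = edges_from_distribution([1, 2, 4], [4, 2, 1])

# With no permutation, start and end vertices are the same list,
# so every edge connects a vertex to itself.
edges_unpermuted = list(zip(v, v))
all_self_loops = all(s == e for s, e in edges_unpermuted)
print(v, all_self_loops)
```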

74
00:04:28,640 --> 00:04:30,557
Not very interesting,
but nevertheless,

75
00:04:30,557 --> 00:04:32,390
completely consistent
with the perfect power

76
00:04:32,390 --> 00:04:34,230
law degree distribution.

77
00:04:34,230 --> 00:04:41,600
So likewise, if I just permute
the edges-- so basically,

78
00:04:41,600 --> 00:04:43,280
I take my permutation
here, and I just

79
00:04:43,280 --> 00:04:49,060
permute that-- this is just not
permuting the vertex labels,

80
00:04:49,060 --> 00:04:52,364
but just permuting the edges.

81
00:04:52,364 --> 00:04:55,620
To the next figure
here, 3, and we

82
00:04:55,620 --> 00:04:57,430
get this distribution--
basically

83
00:04:57,430 --> 00:04:58,770
start vertex, end vertex.

84
00:04:58,770 --> 00:05:00,470
Looks fairly random.

85
00:05:00,470 --> 00:05:04,190
However, you see the highest
degree vertices are up here

86
00:05:04,190 --> 00:05:08,120
at 0, 0, because that's the
order in which the generator

87
00:05:08,120 --> 00:05:08,870
spits them out.

88
00:05:08,870 --> 00:05:11,360
It gives you the highest
degree vertices first.

89
00:05:11,360 --> 00:05:14,050
And so although we've
reconnected them

90
00:05:14,050 --> 00:05:19,260
with different edges, you see
that their vertex labels still

91
00:05:19,260 --> 00:05:25,510
have an intrinsic order which is
corresponding to vertex degree.

92
00:05:25,510 --> 00:05:27,470
I should say this
generates this naturally.

93
00:05:27,470 --> 00:05:32,795
It's a very common way to--
if you have arbitrary vertices

94
00:05:32,795 --> 00:05:35,260
and you want to put them
on an adjacency matrix,

95
00:05:35,260 --> 00:05:37,290
one of the first things
to do is to reorder them

96
00:05:37,290 --> 00:05:38,580
according to degree.

97
00:05:38,580 --> 00:05:42,280
You'll always get some kind of
structure that emerges there,

98
00:05:42,280 --> 00:05:45,500
and it's often an easier way
to see what's going on there.

99
00:05:45,500 --> 00:05:47,800
And as you can see
from the previous data,

100
00:05:47,800 --> 00:05:49,770
you saw if you ordered
things by degree

101
00:05:49,770 --> 00:05:52,430
and you saw this type
of [INAUDIBLE] aha,

102
00:05:52,430 --> 00:05:54,830
that looks a lot like a
power law distribution

103
00:05:54,830 --> 00:05:57,690
just from the
scatter of the dots.

104
00:06:00,250 --> 00:06:01,200
So moving on here.

105
00:06:01,200 --> 00:06:04,450
So now we're just going to
permute the vertex labels.

106
00:06:04,450 --> 00:06:07,800
So we are not permuting the
edges, just the vertex labels.

107
00:06:12,030 --> 00:06:17,720
So if you see here-- and then
by permuting the vertex labels,

108
00:06:17,720 --> 00:06:19,800
it looks random,
but fairly sparse.

109
00:06:19,800 --> 00:06:22,160
That's because we still--
every single time,

110
00:06:22,160 --> 00:06:24,010
we've moved the
vertex labels around.

111
00:06:24,010 --> 00:06:27,160
But all the edges associated
with that vertex pair

112
00:06:27,160 --> 00:06:29,050
still will ship
around [INAUDIBLE].

113
00:06:29,050 --> 00:06:31,790
We haven't done anything
to break them up.

114
00:06:31,790 --> 00:06:33,470
Whereas before, we
broke up the edges,

115
00:06:33,470 --> 00:06:36,570
but we didn't change the labels.

116
00:06:36,570 --> 00:06:40,200
And so then the final one, which
is the one that we typically

117
00:06:40,200 --> 00:06:41,650
do when we want
to create random,

118
00:06:41,650 --> 00:06:45,540
is we permute both the
vertices and the edges

119
00:06:45,540 --> 00:06:52,240
to create a truly random
looking adjacency matrix.

120
00:06:52,240 --> 00:06:54,030
And there you see
you have something.

121
00:06:54,030 --> 00:06:56,690
And now what you
can also see here

122
00:06:56,690 --> 00:07:00,660
is these high degree
vertices do stand out.

123
00:07:00,660 --> 00:07:03,330
This is a very standard
power law distribution.

124
00:07:03,330 --> 00:07:07,320
You'll see these dense
rows and columns in there

125
00:07:07,320 --> 00:07:10,260
that are indicative of these
very high degree nodes.

126
00:07:10,260 --> 00:07:16,380
When we just permuted the
edges but not the vertices,

127
00:07:16,380 --> 00:07:19,160
these high degree
rows and columns

128
00:07:19,160 --> 00:07:23,550
all got shifted up into
the corner of the plot.

129
00:07:23,550 --> 00:07:28,280
So again, another way to begin
to look at the data in a way

130
00:07:28,280 --> 00:07:29,700
to recognize structure.

131
00:07:29,700 --> 00:07:31,412
And I think this
adjacency matrices,

132
00:07:31,412 --> 00:07:32,870
once you start
looking at them, you

133
00:07:32,870 --> 00:07:35,174
begin to get a comfort
zone, just as we

134
00:07:35,174 --> 00:07:36,340
do with other types of data.

135
00:07:36,340 --> 00:07:39,917
We begin to learn
what they look like.

136
00:07:39,917 --> 00:07:41,750
This is a very good way
to look at the data,

137
00:07:41,750 --> 00:07:43,410
because you can look
at a fair amount.

138
00:07:43,410 --> 00:07:47,240
If I were to actually plot
the graph of 18,000 vertices

139
00:07:47,240 --> 00:07:50,810
as a traditional graph of the
vertices and lines connecting

140
00:07:50,810 --> 00:07:52,170
them, it would just be blue.

141
00:07:52,170 --> 00:07:53,540
The entire screen would be blue.

142
00:07:53,540 --> 00:07:57,000
There would be no way
to properly position

143
00:07:57,000 --> 00:07:58,290
those vertices.

144
00:07:58,290 --> 00:08:00,560
Here, this is
84,000 data points,

145
00:08:00,560 --> 00:08:03,510
and I can still kind
of see it a little bit.

146
00:08:03,510 --> 00:08:05,080
This is the limit.

147
00:08:05,080 --> 00:08:07,710
100,000 edges is
the limit of what

148
00:08:07,710 --> 00:08:09,540
you can put on any
plot to really see it,

149
00:08:09,540 --> 00:08:11,480
unless there's some
hidden structure that

150
00:08:11,480 --> 00:08:15,260
allows you to really,
really move it all together.

151
00:08:15,260 --> 00:08:17,860
But the adjacency
matrix is a great way

152
00:08:17,860 --> 00:08:22,770
to look at fairly large graphs.

153
00:08:22,770 --> 00:08:24,260
So we generally do that.

154
00:08:24,260 --> 00:08:26,470
So this is the procedure.

155
00:08:26,470 --> 00:08:30,590
Create your perfect
power law distribution,

156
00:08:30,590 --> 00:08:34,100
create some edges,
create some permutations,

157
00:08:34,100 --> 00:08:36,210
and then permute it, and
then off you go.

158
00:08:36,210 --> 00:08:39,230
You've just created a randomized
perfect power law graph.

159
00:08:39,230 --> 00:08:42,990
Again, this is an example
of a code where you probably

160
00:08:42,990 --> 00:08:44,490
might even just
take this program

161
00:08:44,490 --> 00:08:47,450
and adjust it to
suit your needs.

162
00:08:47,450 --> 00:08:49,130
And then, I think, we did that.

163
00:08:49,130 --> 00:08:54,000
And then finally, we plotted the
degree distribution of a3 just

164
00:08:54,000 --> 00:08:55,820
to show you that I'm not lying.

165
00:08:55,820 --> 00:08:58,080
And again, the triangle
is the original

166
00:08:58,080 --> 00:08:59,220
and the blue was the thing.

167
00:08:59,220 --> 00:09:01,740
And you see throughout
all those permutations,

168
00:09:01,740 --> 00:09:04,482
our degree distribution
remained the same.

169
00:09:04,482 --> 00:09:05,940
Even though they
look-- those would

170
00:09:05,940 --> 00:09:08,356
be completely different graphs,
their degree distributions

171
00:09:08,356 --> 00:09:09,640
are identical.
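
The claim that the permutations leave the degree distribution unchanged can be checked with a small Python sketch. The variable names and the toy vertex list are assumptions for illustration. Permuting the edge order only reorders the entries, and permuting the labels only renames the vertices, so the histogram of degrees stays the same.

```python
import random
from collections import Counter


def degree_histogram(vertex_list):
    """Map degree -> number of vertices that have that degree."""
    per_vertex = Counter(vertex_list)
    return Counter(per_vertex.values())


rng = random.Random(0)

# One label per edge endpoint: vertex 0 has degree 4, vertices 1 and 2
# have degree 2, and vertices 3 to 6 have degree 1.
v = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6]

# Randomly permute the edge order of the end vertices.
edge_perm = list(range(len(v)))
rng.shuffle(edge_perm)
end = [v[i] for i in edge_perm]

# Randomly permute the vertex labels themselves.
label_perm = list(range(max(v) + 1))
rng.shuffle(label_perm)
start = [label_perm[x] for x in v]
end = [label_perm[x] for x in end]

edges = list(zip(start, end))
print(degree_histogram(v) == degree_histogram(start) == degree_histogram(end))
```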

172
00:09:13,020 --> 00:09:14,570
So let's move on here.

173
00:09:29,460 --> 00:09:33,040
So this is just showing-- so
we actually rolled this up all

174
00:09:33,040 --> 00:09:35,180
into one function here for you.

175
00:09:35,180 --> 00:09:38,430
Ran power law matrix, if you
give it an alpha, a D max,

176
00:09:38,430 --> 00:09:41,800
and an ND, it will do
those three steps for you

177
00:09:41,800 --> 00:09:44,190
that I had in the previous
chart all in one thing

178
00:09:44,190 --> 00:09:48,170
and produce an adjacency
matrix that's a perfect power

179
00:09:48,170 --> 00:09:50,416
law based on these parameters.

180
00:09:50,416 --> 00:09:52,165
And again, you saw we
have the same number

181
00:09:52,165 --> 00:09:54,350
of vertices that we
saw and edges and ratio

182
00:09:54,350 --> 00:09:55,260
that we saw before.

183
00:09:55,260 --> 00:09:57,340
So that's all the same.

184
00:09:57,340 --> 00:10:01,020
Now we're going to transform
this data, clean up

185
00:10:01,020 --> 00:10:06,410
this data by making it
unweighted, undirected.

186
00:10:06,410 --> 00:10:07,910
We're going to
eliminate self loops.

187
00:10:07,910 --> 00:10:09,944
We're going to take the
upper triangular part.

188
00:10:09,944 --> 00:10:11,610
Here's another one
that's unweighted, no

189
00:10:11,610 --> 00:10:14,120
self loops, different
versions of them.

190
00:10:14,120 --> 00:10:18,950
And then we do-- you can
see what those look like.

191
00:10:18,950 --> 00:10:21,950
So if we look at
the first one here,

192
00:10:21,950 --> 00:10:25,700
this just shows the
unweighted, what

193
00:10:25,700 --> 00:10:28,660
basically making the data
unweighted does to the data

194
00:10:28,660 --> 00:10:29,160
set.

195
00:10:29,160 --> 00:10:32,700
So the triangles are the
original data, and just making

196
00:10:32,700 --> 00:10:37,860
it unweighted, how it
distorts that data set.

197
00:10:42,690 --> 00:10:46,550
This shows you what happens
when you make it undirected.

198
00:10:46,550 --> 00:10:50,315
So unweighted means
we took any cases

199
00:10:50,315 --> 00:10:52,440
where we had vertices with
more than one connection

200
00:10:52,440 --> 00:10:54,495
to them, if something
had five connections, now

201
00:10:54,495 --> 00:10:56,430
it just gets one connection.

202
00:10:56,430 --> 00:10:59,130
And so that was a
fairly big distortion.

203
00:10:59,130 --> 00:11:03,100
A perfect example is if you
take a person's social network

204
00:11:03,100 --> 00:11:06,650
graph, and if you were
to make it unweighted,

205
00:11:06,650 --> 00:11:09,490
what you're saying is that
the connection you have

206
00:11:09,490 --> 00:11:11,600
with your spouse is
identical with someone

207
00:11:11,600 --> 00:11:14,910
that you emailed once or
that you friended once.

208
00:11:14,910 --> 00:11:16,410
And I think we all
agree that that's

209
00:11:16,410 --> 00:11:19,410
a fair amount of information
that's lost there.

210
00:11:19,410 --> 00:11:25,670
And so again, encouraging
folks to be aware of that

211
00:11:25,670 --> 00:11:28,630
and to be careful of
when they're doing it.

212
00:11:28,630 --> 00:11:33,310
Again, making it undirected
just means we basically--

213
00:11:33,310 --> 00:11:38,350
if I phone you a lot or I
cite you a lot in papers,

214
00:11:38,350 --> 00:11:42,070
that's the same as you
citing me a lot, which you

215
00:11:42,070 --> 00:11:44,500
lose some information there.

216
00:11:44,500 --> 00:11:47,312
Again, this shows the
kind of distortion

217
00:11:47,312 --> 00:11:48,895
that we get from
making it undirected.

218
00:11:53,380 --> 00:11:55,430
And again, this
shows what happens

219
00:11:55,430 --> 00:11:57,650
when you do no self loops.

220
00:11:57,650 --> 00:11:59,570
Well, there's not a
lot of self loops,

221
00:11:59,570 --> 00:12:02,980
so we only have affected
a few vertices here.

222
00:12:02,980 --> 00:12:08,430
So in this case, eliminating
self loops is not a terribly

223
00:12:08,430 --> 00:12:10,960
distorted-- doesn't really
distort the data very much

224
00:12:10,960 --> 00:12:12,430
at all.
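
The three clean-up steps discussed above can be sketched on a tiny edge list in Python. This is an illustration under assumed names and data, not the lecture's MATLAB code. Unweighted collapses repeated edges to one, undirected merges the two directions into the upper triangular part, and the last step drops self loops.

```python
from collections import Counter

# Toy directed multigraph as (start, end) pairs; (0, 1) occurs three times.
edges = [(0, 1), (0, 1), (0, 1), (1, 0), (2, 2), (0, 3)]
weighted = Counter(edges)  # (start, end) -> number of occurrences

# Unweighted: an edge seen three times now counts the same as one seen once.
unweighted = set(weighted)

# Undirected: keep start <= end, so (1, 0) merges with (0, 1).
undirected = {tuple(sorted(e)) for e in unweighted}

# No self loops: drop edges whose start and end are the same vertex.
no_self_loops = {(s, e) for s, e in undirected if s != e}

print(len(edges), len(unweighted), len(undirected), len(no_self_loops))
```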

225
00:12:12,430 --> 00:12:15,140
And then finally, this shows
the upper correlation matrix.

226
00:12:15,140 --> 00:12:16,890
So when we correlated
the two, basically

227
00:12:16,890 --> 00:12:19,370
multiplied the adjacency
matrix together,

228
00:12:19,370 --> 00:12:21,640
again showing what
we saw before.

229
00:12:24,840 --> 00:12:25,340
Moving on.

230
00:12:42,900 --> 00:12:48,950
You can see the plots that got
eliminated from my PowerPoint.

231
00:12:48,950 --> 00:12:50,740
MATLAB has defeated
PowerPoint's attempts

232
00:12:50,740 --> 00:12:52,640
to deny you your education.

233
00:12:52,640 --> 00:12:56,080
So again, what
we're doing here is

234
00:12:56,080 --> 00:13:00,200
we're creating a
perfect power law.

235
00:13:00,200 --> 00:13:01,170
This is a bigger one.

236
00:13:01,170 --> 00:13:02,940
I want a lot of vertices.

237
00:13:02,940 --> 00:13:07,090
So this time, we had 50,000
vertices, 329,000 edges

238
00:13:07,090 --> 00:13:09,330
with a ratio of 6 and 1/2.

239
00:13:09,330 --> 00:13:11,240
We create our vertices.

240
00:13:11,240 --> 00:13:14,220
We randomize the edge
order, et cetera.

241
00:13:14,220 --> 00:13:17,690
Now we're going to randomly
pick a subsample of these.

242
00:13:17,690 --> 00:13:18,960
And what is F samp set at?

243
00:13:18,960 --> 00:13:20,290
It's 1/40.

244
00:13:20,290 --> 00:13:24,230
So I'm going to take
1/40 of all the vertices.

245
00:13:24,230 --> 00:13:26,440
Now I'm going to go and
compute that degree,

246
00:13:26,440 --> 00:13:29,660
and I'm going to basically
subsample all of these.

247
00:13:29,660 --> 00:13:32,520
And later, what you'll see-- so
let's just take a look at that.

248
00:13:37,450 --> 00:13:39,000
So this just shows
that chart here.

249
00:13:39,000 --> 00:13:41,460
So this is the original data.

250
00:13:41,460 --> 00:13:42,880
Again, this is the vertex.

251
00:13:42,880 --> 00:13:45,870
We're sorting the
vertices by degree here.

252
00:13:45,870 --> 00:13:48,120
So this is the
highest degree vertex.

253
00:13:48,120 --> 00:13:52,160
These are the lowest
degree vertices.

254
00:13:52,160 --> 00:13:54,550
And each vertex is
getting a dot here.

255
00:13:54,550 --> 00:13:56,140
So we have 50,000 vertices.

256
00:13:56,140 --> 00:14:00,570
They all get a dot here, and
we've only taken 1/40 of them.

257
00:14:00,570 --> 00:14:05,720
And this just shows you here--
If you only take 1/40 of them

258
00:14:05,720 --> 00:14:09,800
and then compute their
sample, this is what you get.

259
00:14:09,800 --> 00:14:13,929
Now, standard sampling
theory would say aha, well,

260
00:14:13,929 --> 00:14:15,220
I know how to correct for this.

261
00:14:15,220 --> 00:14:16,740
The way I correct
for this is I just

262
00:14:16,740 --> 00:14:19,880
multiply my sample data by 40.

263
00:14:19,880 --> 00:14:23,610
And we took 1/40, so that
means whenever I measure it,

264
00:14:23,610 --> 00:14:27,090
the true value should
be 40 times higher.

265
00:14:27,090 --> 00:14:29,310
So we can look at
that in figure 2.

266
00:14:29,310 --> 00:14:30,630
So this is the true sample.

267
00:14:35,540 --> 00:14:38,670
So again, we see for our
high-degree vertices here--

268
00:14:38,670 --> 00:14:41,300
this is the highest
degree vertex here again.

269
00:14:41,300 --> 00:14:42,800
So this is the
high-degree vertices.

270
00:14:42,800 --> 00:14:44,030
These are the
low-degree vertices.

271
00:14:44,030 --> 00:14:46,405
I don't know if I said the
opposite when I mentioned it

272
00:14:46,405 --> 00:14:46,910
before.

273
00:14:46,910 --> 00:14:48,420
This is the highest
degree vertex.

274
00:14:48,420 --> 00:14:50,537
And you see that by
sampling the data,

275
00:14:50,537 --> 00:14:52,870
we're doing a very good job
on the high-degree vertices.

276
00:14:52,870 --> 00:14:54,510
We're sampling them just fine.

277
00:14:54,510 --> 00:15:00,200
And that's why statistics works.

278
00:15:00,200 --> 00:15:03,710
If something is really
not rare and you sample,

279
00:15:03,710 --> 00:15:05,530
you're going to get
a good estimate.

280
00:15:05,530 --> 00:15:09,130
However, for these low-degree
vertices over here,

281
00:15:09,130 --> 00:15:12,800
what you see-- by
multiplying by 40,

282
00:15:12,800 --> 00:15:15,110
we're significantly
over-estimating

283
00:15:15,110 --> 00:15:17,555
their probability.
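
A small Python calculation shows why scaling by 40 works at the high end and fails at the low end. It rests on an assumption that is not stated in the lecture: each edge is kept independently with probability 1/40, so a vertex's observed degree is binomial. The function name is hypothetical. A vertex that appears in the sample at all has an observed degree of at least 1, so its scaled estimate is at least 40.

```python
def naive_estimate_if_seen(true_degree, fraction):
    """Expected scaled-up degree of a vertex, given it appears in the sample.

    Assumes each edge is kept independently with probability `fraction`,
    so the observed degree k is Binomial(true_degree, fraction).
    """
    p_seen = 1.0 - (1.0 - fraction) ** true_degree       # P(k >= 1)
    expected_observed = true_degree * fraction / p_seen  # E[k | k >= 1]
    return expected_observed / fraction                  # scale by 1/fraction


f_samp = 1 / 40
low = naive_estimate_if_seen(1, f_samp)      # a degree-1 vertex
high = naive_estimate_if_seen(1000, f_samp)  # a high-degree vertex
print(low, high)
```

Under this assumption the degree-1 vertex is estimated at about 40 when it is seen, while the degree-1,000 vertex is estimated at very nearly its true degree.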

284
00:15:17,555 --> 00:15:19,180
As I say, this is
the curve that proves

285
00:15:19,180 --> 00:15:21,610
that optimists and
pessimists are both correct.

286
00:15:24,680 --> 00:15:26,880
There are so many rare things.

287
00:15:26,880 --> 00:15:28,970
If the world is a
power law distribution,

288
00:15:28,970 --> 00:15:31,990
it means that there are so many
rare events in the world, some

289
00:15:31,990 --> 00:15:34,550
of them are going
to happen to you.

290
00:15:34,550 --> 00:15:38,865
So it means if you're an
optimist, go play the lottery.

291
00:15:38,865 --> 00:15:42,430
If you're a pessimist, it means
that lightning could hit you,

292
00:15:42,430 --> 00:15:45,550
and you better just stay inside.

293
00:15:45,550 --> 00:15:47,960
So there's just so
many rare things

294
00:15:47,960 --> 00:15:51,630
that some really rare
things are going to happen

295
00:15:51,630 --> 00:15:53,060
to you in your lifetime.

296
00:15:53,060 --> 00:15:55,060
Most likely, those rare
things are very mundane.

297
00:15:58,710 --> 00:15:59,980
But we can correct for this.

298
00:15:59,980 --> 00:16:05,190
And so we have basically a way
here of deriving calculations.

299
00:16:05,190 --> 00:16:09,300
So we compute the parameters
of our distribution,

300
00:16:09,300 --> 00:16:12,340
and through these two functions,
compute degree correction,

301
00:16:12,340 --> 00:16:13,890
and apply degree correction.

302
00:16:13,890 --> 00:16:17,730
We can actually go back
and say all right, given

303
00:16:17,730 --> 00:16:21,860
that we believe the data is
power law and we've sampled it,

304
00:16:21,860 --> 00:16:25,000
can we then come up with a
more uniform correction that

305
00:16:25,000 --> 00:16:28,020
basically gives us a better
estimate that works at both

306
00:16:28,020 --> 00:16:30,890
the high and the low end?

307
00:16:30,890 --> 00:16:34,560
And that's what you see here.

308
00:16:34,560 --> 00:16:36,300
So basically, we
haven't changed.

309
00:16:36,300 --> 00:16:38,170
The correction
hasn't changed here.

310
00:16:38,170 --> 00:16:41,366
But we've downgraded
these lower ones.

311
00:16:41,366 --> 00:16:42,740
And essentially,
what we're doing

312
00:16:42,740 --> 00:16:45,570
is instead of just using the
average as the statistic,

313
00:16:45,570 --> 00:16:46,860
we're using the median.

314
00:16:46,860 --> 00:16:51,720
So we're using, essentially,
a quantile-based correction

315
00:16:51,720 --> 00:16:55,121
here, a 50th-percentile
quantile-based correction

316
00:16:55,121 --> 00:16:55,620
here.

317
00:16:55,620 --> 00:16:59,800
And that causes us to lower the
estimates of these vertices.

318
00:16:59,800 --> 00:17:01,810
And so it would be
a better estimate

319
00:17:01,810 --> 00:17:06,300
and allow you to do
the sampling of that.

320
00:17:06,300 --> 00:17:06,890
Very good.

321
00:17:11,960 --> 00:17:13,084
And our final demo.

322
00:17:38,050 --> 00:17:41,300
So now what we're doing
is power law fitting.

323
00:17:41,300 --> 00:17:43,260
And so we have the
routines for doing that.

324
00:17:43,260 --> 00:17:46,115
So again, here's
our distribution.

325
00:17:46,115 --> 00:17:48,020
It's a power law of 1.3.

326
00:17:48,020 --> 00:17:51,785
We've set D max to be 2,000
and about 60-ish bins.

327
00:17:51,785 --> 00:17:54,050
We create our
power law distribution.

328
00:17:54,050 --> 00:17:57,230
It has 50,000 vertices,
329,000 edges.

329
00:17:57,230 --> 00:17:59,720
Ratio is the same as
the one we did before.

330
00:17:59,720 --> 00:18:02,920
We're going to make
it undirected--

331
00:18:02,920 --> 00:18:05,050
undirected and unweighted,
undirected, unweighted,

332
00:18:05,050 --> 00:18:08,910
no self loops, so standard
corrections that we do.

333
00:18:08,910 --> 00:18:11,510
We're going to compute
the degree distribution

334
00:18:11,510 --> 00:18:13,730
of that data and plot it.

335
00:18:13,730 --> 00:18:16,410
Or actually, get it there now.

336
00:18:16,410 --> 00:18:19,910
I'm going to then-- we have this
function called power law fit.

337
00:18:19,910 --> 00:18:24,521
So if I compute-- so I compute
the degree distribution.

338
00:18:24,521 --> 00:18:26,520
So we have this function
called out degree which

339
00:18:26,520 --> 00:18:28,270
gives us the distribution.

340
00:18:28,270 --> 00:18:33,480
And I can find, essentially, the
number of values with the one,

341
00:18:33,480 --> 00:18:36,660
and the one that-- our
maximum, so this is estimating

342
00:18:36,660 --> 00:18:38,160
our poor man's slope.

343
00:18:38,160 --> 00:18:39,490
So we're computing the slope.

344
00:18:39,490 --> 00:18:43,430
We're counting the total
number of edges here,

345
00:18:43,430 --> 00:18:45,980
and then we have this
function called power law fit.

346
00:18:45,980 --> 00:18:50,130
Basically, we can plug in
what the estimated alpha is,

347
00:18:50,130 --> 00:18:54,090
what the number of vertices
is, and the number of edges

348
00:18:54,090 --> 00:18:57,770
is to find our best
fit distribution.

349
00:18:57,770 --> 00:19:00,940
So this basically inverts
those formulas I showed you,

350
00:19:00,940 --> 00:19:03,130
which is given a
degree distribution

351
00:19:03,130 --> 00:19:06,880
that sums to a particular
number of vertices and sums

352
00:19:06,880 --> 00:19:09,360
to a particular number
of edges, can you

353
00:19:09,360 --> 00:19:13,746
give me a new D
max and a new ND,

354
00:19:13,746 --> 00:19:17,530
these parameters that don't
really have as much meaning,

355
00:19:17,530 --> 00:19:19,840
to do that?

356
00:19:19,840 --> 00:19:22,920
And so basically,
we use, essentially,

357
00:19:22,920 --> 00:19:26,600
a combination of three
different techniques here.

358
00:19:26,600 --> 00:19:31,410
Because this is so nonlinear,
and there's this-- basically,

359
00:19:31,410 --> 00:19:34,250
remember I talked about integer
bins and logarithmic bins?

360
00:19:34,250 --> 00:19:37,090
Well, if you look at that plot,
it showed there's that bending.

361
00:19:37,090 --> 00:19:42,540
It's a very nasty manifold,
the surface of this function.

362
00:19:42,540 --> 00:19:45,220
And it has a continuous
part and a discrete part.

363
00:19:45,220 --> 00:19:49,080
So what we do here is we do
essentially a sampled search

364
00:19:49,080 --> 00:19:51,960
where we randomly sample,
looking for a location.

365
00:19:51,960 --> 00:19:55,470
We do a heuristic search, which
is a simulated [INAUDIBLE]

366
00:19:55,470 --> 00:19:56,020
search.

367
00:19:56,020 --> 00:20:00,850
And we also use Broyden's
nonlinear-- essentially a

368
00:20:00,850 --> 00:20:05,420
variation of Newton's method-- all to
try and find the best set

369
00:20:05,420 --> 00:20:07,980
of parameters that
will fit this data.

370
00:20:07,980 --> 00:20:09,440
We rarely get an exact match.

371
00:20:09,440 --> 00:20:11,810
But you can see here it's
choosing different ones.

372
00:20:11,810 --> 00:20:13,854
And this gives you
how it's doing.

373
00:20:13,854 --> 00:20:15,270
From the sample
search, this shows

374
00:20:15,270 --> 00:20:17,860
you the number of vertices
and the number of edges

375
00:20:17,860 --> 00:20:19,030
it was able to achieve.

376
00:20:19,030 --> 00:20:23,190
The heuristic search
didn't do very well at all,

377
00:20:23,190 --> 00:20:27,520
and then the Broyden search
did a pretty good job,

378
00:20:27,520 --> 00:20:28,810
and it got us pretty well.

379
00:20:28,810 --> 00:20:34,360
So actually, it ended up
comparing all of these,

380
00:20:34,360 --> 00:20:40,010
and it ended up choosing the
sample search-- this one,

381
00:20:40,010 --> 00:20:40,910
this first one I did.

382
00:20:40,910 --> 00:20:42,660
It liked that best of all.

383
00:20:42,660 --> 00:20:45,540
So we'll look at that here.

384
00:20:48,490 --> 00:20:53,740
So Figure 2 was
the original data.

385
00:20:53,740 --> 00:20:57,210
Figure 1 shows you the
manifold of this space.

386
00:20:57,210 --> 00:20:59,990
And so plotted in
this coordinate system

387
00:20:59,990 --> 00:21:02,820
of n versus m, the dot
is the dot that it found.

388
00:21:02,820 --> 00:21:06,150
It's actually here
in this [INAUDIBLE]

389
00:21:06,150 --> 00:21:07,820
these lines show the boundaries.

390
00:21:07,820 --> 00:21:12,120
But you can see this very
nonlinear manifold here.

391
00:21:12,120 --> 00:21:14,080
This is the continuous regime.

392
00:21:14,080 --> 00:21:15,560
This is the integer regime.

393
00:21:15,560 --> 00:21:18,240
It cusps right at
the transition.

394
00:21:18,240 --> 00:21:21,465
Again, a very nasty
function to try and invert,

395
00:21:21,465 --> 00:21:23,590
which is why we used all
those different techniques

396
00:21:23,590 --> 00:21:24,240
to invert it.

397
00:21:29,640 --> 00:21:32,290
And then Figure 3
shows the results.

398
00:21:32,290 --> 00:21:37,400
So this black line shows
the original model input

399
00:21:37,400 --> 00:21:38,420
that we provided.

400
00:21:38,420 --> 00:21:41,360
So that was the true model.

401
00:21:41,360 --> 00:21:44,300
The circle shows the data
after it was transformed.

402
00:21:44,300 --> 00:21:47,465
We made it undirected,
unweighted with no self loops.

403
00:21:47,465 --> 00:21:48,940
That's this.

404
00:21:48,940 --> 00:21:51,970
Alpha is-- this is
our poor man's alpha.

405
00:21:51,970 --> 00:21:53,960
What you can see
is almost identical

406
00:21:53,960 --> 00:21:55,660
to the original model.

407
00:21:55,660 --> 00:21:58,740
So the poor man's alpha
does a very good job

408
00:21:58,740 --> 00:22:01,350
of fitting in this case.

409
00:22:01,350 --> 00:22:03,250
The triangle shows
the model fit.

410
00:22:03,250 --> 00:22:07,330
So when we fit the data, we came
up with that best fit it shows,

411
00:22:07,330 --> 00:22:09,420
and we then created
a new distribution.

412
00:22:09,420 --> 00:22:11,000
This is what it looked like.

413
00:22:11,000 --> 00:22:15,400
And then the plus sign
shows us rebinning that data

414
00:22:15,400 --> 00:22:17,610
onto the bins from the model.

415
00:22:17,610 --> 00:22:20,910
And you see that we've done
a very nice job of recovering

416
00:22:20,910 --> 00:22:23,690
the original power
law distribution even

417
00:22:23,690 --> 00:22:25,560
after we did that distortion.

418
00:22:25,560 --> 00:22:28,830
So again, that is
the last example.

419
00:22:28,830 --> 00:22:30,360
So I want to thank you.

420
00:22:30,360 --> 00:22:32,990
And then for the
homework, a lot of you

421
00:22:32,990 --> 00:22:36,020
did Homework 2, which is great.

422
00:22:36,020 --> 00:22:39,720
I think Homework 3
wasn't such a great hit.

423
00:22:39,720 --> 00:22:42,490
I'm going to definitely
rethink that one.

424
00:22:42,490 --> 00:22:44,410
But the next homework
does not require

425
00:22:44,410 --> 00:22:46,130
you to have done Homework 3.

426
00:22:46,130 --> 00:22:48,840
If you did Homework
2, basically,

427
00:22:48,840 --> 00:22:52,940
it's just saying compute
a degree distribution.

428
00:22:52,940 --> 00:22:55,430
And in fact, you don't even
need to have done Homework 2.

429
00:22:55,430 --> 00:22:57,460
You can compute a
degree distribution

430
00:22:57,460 --> 00:23:00,720
on your Homework 2, or you can
compute a degree distribution

431
00:23:00,720 --> 00:23:01,522
on any data set.

432
00:23:01,522 --> 00:23:02,855
For example, today is Halloween.

433
00:23:02,855 --> 00:23:07,450
If you go trick or treating
with somebody-- yourself,

434
00:23:07,450 --> 00:23:10,360
your children, somebody
else's children,

435
00:23:10,360 --> 00:23:17,110
you can maybe email me the
histogram of your candy

436
00:23:17,110 --> 00:23:19,365
and plot it on a
degree distribution,

437
00:23:19,365 --> 00:23:22,900
and maybe compute the poor man's
alpha coefficient from that.

438
00:23:22,900 --> 00:23:26,210
And the other coefficients
from that would be a fun

439
00:23:26,210 --> 00:23:27,070
exercise to do.

440
00:23:27,070 --> 00:23:28,710
So that'll be the next homework.

441
00:23:28,710 --> 00:23:31,440
I'll email that out in
the next couple of days.

442
00:23:31,440 --> 00:23:33,210
So look forward.

443
00:23:33,210 --> 00:23:35,000
Again, if you send
me the homework

444
00:23:35,000 --> 00:23:43,010
prior to next-- actually, just
a reminder, no class next week.

445
00:23:43,010 --> 00:23:44,980
This room has been taken.

446
00:23:44,980 --> 00:23:48,010
But again, if you email me the
homework prior to this time

447
00:23:48,010 --> 00:23:50,964
next week, I will give
you feedback on it.

448
00:23:50,964 --> 00:23:52,880
You can still send me
the homework after that.

449
00:23:52,880 --> 00:23:54,930
I just won't give you
any feedback on it.

450
00:23:54,930 --> 00:23:58,210
So thank you again,
and look forward

451
00:23:58,210 --> 00:24:00,550
to seeing you in two weeks.