WEBVTT

00:00:04.520 --> 00:00:07.170
In this video, we'll
discuss the method

00:00:07.170 --> 00:00:09.770
of hierarchical clustering.

00:00:09.770 --> 00:00:12.490
In hierarchical
clustering, the clusters

00:00:12.490 --> 00:00:17.050
are formed by each data point
starting in its own cluster.

00:00:17.050 --> 00:00:21.100
As a small example, suppose
we have five data points.

00:00:21.100 --> 00:00:25.220
Each data point is labeled as
belonging to its own cluster.

00:00:25.220 --> 00:00:28.340
So this data point's in
the red cluster, this one's

00:00:28.340 --> 00:00:32.070
in the blue cluster, this
one's in the purple cluster,

00:00:32.070 --> 00:00:34.640
this one's in the green
cluster, and this one's

00:00:34.640 --> 00:00:36.840
in the yellow cluster.

00:00:36.840 --> 00:00:40.520
Then hierarchical clustering
combines the two nearest

00:00:40.520 --> 00:00:42.870
clusters into one cluster.

00:00:42.870 --> 00:00:45.620
We'll use Euclidean
and centroid distances

00:00:45.620 --> 00:00:48.840
to decide which two
clusters are the closest.
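
NOTE
Editor's sketch of the distance just described, not taken from the video.
It assumes each cluster is a list of numeric coordinate tuples; the names
centroid and centroid_distance are hypothetical helpers chosen for illustration.
```python
import math
# Assumption: a cluster is a non-empty list of equal-length coordinate tuples.
def centroid(cluster):
    """Return the mean point (centroid) of a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))
def centroid_distance(a, b):
    """Euclidean distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))
```
For example, centroid_distance([(0, 0), (2, 0)], [(1, 4)]) compares the
centroid (1, 0) with the point (1, 4).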

00:00:48.840 --> 00:00:51.570
In our example, the
green and yellow clusters

00:00:51.570 --> 00:00:53.590
are closest together.

00:00:53.590 --> 00:00:57.690
So we would combine these two
clusters into one cluster.

00:00:57.690 --> 00:01:00.170
So now the green
cluster has two points,

00:01:00.170 --> 00:01:02.710
and the yellow cluster is gone.

00:01:02.710 --> 00:01:04.660
Now this process repeats.

00:01:04.660 --> 00:01:07.310
We again find the
two nearest clusters,

00:01:07.310 --> 00:01:11.160
which this time are the green
cluster and the purple cluster,

00:01:11.160 --> 00:01:13.870
and we combine them
into one cluster.

00:01:13.870 --> 00:01:16.050
Now the green cluster
has three points,

00:01:16.050 --> 00:01:18.820
and the purple cluster is gone.

00:01:18.820 --> 00:01:23.130
Now the two nearest clusters
are the red and blue clusters.

00:01:23.130 --> 00:01:25.360
So we would combine
these two clusters

00:01:25.360 --> 00:01:28.400
into one cluster,
the red cluster.

00:01:28.400 --> 00:01:31.380
So now we have just two
clusters, the red one

00:01:31.380 --> 00:01:33.160
and the green one.

00:01:33.160 --> 00:01:36.720
So now the final step is to
combine these two clusters

00:01:36.720 --> 00:01:38.870
into one cluster.

00:01:38.870 --> 00:01:41.580
So at the end of
hierarchical clustering,

00:01:41.580 --> 00:01:45.780
all of our data points
are in a single cluster.
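
NOTE
Editor's sketch of the merging process walked through above, assuming
centroid distance between clusters of coordinate tuples. The function name
agglomerate and the merge-log format are illustrative assumptions, and this
brute-force search is meant for small examples only.
```python
import math
from itertools import combinations
def centroid(cluster):
    """Return the mean point of a cluster (a list of coordinate tuples)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))
def agglomerate(points):
    """Merge the two nearest clusters until one remains; return the merge log."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []  # one (cluster_a, cluster_b, distance) entry per merge step
    def gap(ij):
        # Centroid distance between the clusters at indices ij[0] and ij[1].
        return math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]]))
    while len(clusters) > 1:
        # Pick the pair of clusters whose centroids are closest (i < j).
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        merges.append((clusters[i], clusters[j], gap((i, j))))
        merged = clusters[i] + clusters[j]
        # j > i, so delete j first to keep index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges
```
With five points the log has four entries, and the last entry joins the
final two clusters into one.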

00:01:45.780 --> 00:01:48.060
The hierarchical
clustering process can

00:01:48.060 --> 00:01:51.259
be displayed through
what's called a dendrogram.

00:01:51.259 --> 00:01:54.180
The data points are
listed along the bottom,

00:01:54.180 --> 00:01:57.470
and the lines show how the
clusters were combined.

00:01:57.470 --> 00:01:59.880
The height of the lines
represents how far

00:01:59.880 --> 00:02:03.280
apart the clusters were
when they were combined.

00:02:03.280 --> 00:02:06.180
So points 1 and 4
were pretty close

00:02:06.180 --> 00:02:08.259
together when they
were combined.

00:02:08.259 --> 00:02:11.270
But when we combined the
two clusters at the end,

00:02:11.270 --> 00:02:14.620
they were significantly
farther apart.

00:02:14.620 --> 00:02:18.040
We can use a dendrogram to
decide how many clusters we

00:02:18.040 --> 00:02:21.480
want for our final
clustering model.

00:02:21.480 --> 00:02:24.120
This dendrogram shows
the clustering process

00:02:24.120 --> 00:02:26.230
with ten data points.

00:02:26.230 --> 00:02:29.720
The easiest way to pick the
number of clusters you want

00:02:29.720 --> 00:02:33.930
is to draw a horizontal
line across the dendrogram.

00:02:33.930 --> 00:02:36.590
The number of vertical
lines that line crosses

00:02:36.590 --> 00:02:39.700
is the number of
clusters there will be.
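
NOTE
Editor's sketch of counting clusters at a given cut height. It assumes the
merge heights are already known, one per merge, and that no merge sits
exactly at the cut. The name clusters_at_cut is hypothetical.
```python
def clusters_at_cut(merge_heights, cut):
    """Number of clusters left when the dendrogram is cut at height `cut`."""
    n_points = len(merge_heights) + 1  # n points need n - 1 merges
    # Every merge below the cut has already reduced the cluster count by one.
    merges_below = sum(1 for h in merge_heights if h < cut)
    return n_points - merges_below
```
With heights [1.0, 1.5, 2.0, 6.0], a cut at 4.0 lies above three merges
and below one, leaving two clusters.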

00:02:39.700 --> 00:02:43.780
In this case, our line
crosses two vertical lines,

00:02:43.780 --> 00:02:48.400
meaning that we will have
two clusters-- one cluster

00:02:48.400 --> 00:02:52.720
with points 5, 2, and 7, and
one cluster with the remaining

00:02:52.720 --> 00:02:54.600
points.

00:02:54.600 --> 00:02:56.510
The farther this
horizontal line

00:02:56.510 --> 00:02:59.160
can move up and down
in the dendrogram

00:02:59.160 --> 00:03:01.700
without hitting one of
the horizontal lines

00:03:01.700 --> 00:03:04.420
of the dendrogram,
the better that choice

00:03:04.420 --> 00:03:07.070
of the number of clusters is.

00:03:07.070 --> 00:03:11.200
If we instead select
three clusters,

00:03:11.200 --> 00:03:13.950
this line can't move
as far up and down

00:03:13.950 --> 00:03:17.770
without hitting horizontal
lines in the dendrogram.

00:03:17.770 --> 00:03:22.060
This probably means that the
two-cluster choice is better.
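
NOTE
Editor's sketch of the "room to move" rule of thumb, assuming the merge
heights are known and there are at least two merges. It picks the cluster
count whose cut sits in the widest gap between consecutive merge heights.
The name widest_gap_cluster_count is hypothetical, and this is only a
heuristic, not a substitute for judging what suits the application.
```python
def widest_gap_cluster_count(merge_heights):
    """Return (k, gap): the cluster count with the widest vertical gap."""
    heights = sorted(merge_heights)
    n_points = len(heights) + 1
    # A cut between heights[i] and heights[i + 1] lies above i + 1 merges,
    # so it leaves n_points - (i + 1) clusters.
    options = [(n_points - (i + 1), heights[i + 1] - heights[i])
               for i in range(len(heights) - 1)]
    return max(options, key=lambda option: option[1])
```
With heights [1.0, 1.5, 2.0, 6.0], the widest gap is between 2.0 and 6.0,
which corresponds to two clusters.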

00:03:22.060 --> 00:03:23.970
But when picking the
number of clusters,

00:03:23.970 --> 00:03:26.700
you should also consider
how many clusters

00:03:26.700 --> 00:03:29.190
make sense for the
particular application

00:03:29.190 --> 00:03:32.000
you're working with.

00:03:32.000 --> 00:03:34.770
After selecting the number
of clusters you want,

00:03:34.770 --> 00:03:38.340
you should analyze your clusters
to see if they're meaningful.

00:03:38.340 --> 00:03:41.240
This can be done by
looking at basic statistics

00:03:41.240 --> 00:03:45.600
in each cluster, like the mean,
maximum, and minimum values

00:03:45.600 --> 00:03:48.730
of each variable.
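
NOTE
Editor's sketch of the per-cluster statistics mentioned above. It assumes
each observation is a dict of numeric variables and that a cluster label is
available for every observation. The name cluster_summary is hypothetical.
```python
from statistics import mean
def cluster_summary(rows, labels):
    """Mean, min, and max of every variable, computed within each cluster."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    summary = {}
    for label, members in groups.items():
        summary[label] = {
            var: {"mean": mean(r[var] for r in members),
                  "min": min(r[var] for r in members),
                  "max": max(r[var] for r in members)}
            for var in members[0]  # assumes all rows share the same variables
        }
    return summary
```
Comparing these summaries across clusters shows which variables set the
clusters apart.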

00:03:48.730 --> 00:03:51.390
You can also check to
see if the clusters have

00:03:51.390 --> 00:03:55.150
a feature in common that was
not used in the clustering,

00:03:55.150 --> 00:03:57.210
like an outcome variable.

00:03:57.210 --> 00:03:59.520
This often indicates
that your clusters

00:03:59.520 --> 00:04:02.580
might help improve
a predictive model.
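
NOTE
Editor's sketch of checking an outcome variable that was not used in the
clustering. It assumes a binary outcome coded as 0 or 1 and one cluster
label per observation. The name outcome_rate_by_cluster is hypothetical.
```python
from collections import defaultdict
def outcome_rate_by_cluster(labels, outcomes):
    """Share of positive outcomes within each cluster."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for label, outcome in zip(labels, outcomes):
        totals[label] += 1
        positives[label] += int(outcome)
    return {label: positives[label] / totals[label] for label in totals}
```
Rates that differ a lot between clusters suggest the cluster label may
carry information about the outcome.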

00:04:02.580 --> 00:04:06.610
In the next video, we'll
cluster our movies by genre,

00:04:06.610 --> 00:04:08.730
and then analyze
our clusters to see

00:04:08.730 --> 00:04:12.550
how they can be used to
perform content filtering.