WEBVTT

00:00:04.500 --> 00:00:07.540
In our previous video, we
found the distance matrix,

00:00:07.540 --> 00:00:10.190
which computes the pairwise
distances between all

00:00:10.190 --> 00:00:13.040
the intensity values
in the flower vector.

00:00:13.040 --> 00:00:15.430
Now we can cluster
the intensity values

00:00:15.430 --> 00:00:17.920
using hierarchical clustering.

00:00:17.920 --> 00:00:21.880
So we're going to type
"cluster intensity."

00:00:21.880 --> 00:00:24.540
And then we're going to use
the hclust function, which

00:00:24.540 --> 00:00:27.500
is the hierarchical clustering
function in R, which

00:00:27.500 --> 00:00:30.080
takes as an input
the distance matrix.

00:00:30.080 --> 00:00:32.810
And then we're going to
specify the clustering method

00:00:32.810 --> 00:00:35.450
to be "word."

00:00:35.450 --> 00:00:37.640
As a reminder,
the "words" method

00:00:37.640 --> 00:00:39.480
is a minimum variants
method, which

00:00:39.480 --> 00:00:42.650
tries to find compact
and spherical clusters.

00:00:42.650 --> 00:00:45.500
We can think about it as
trying to minimize the variance

00:00:45.500 --> 00:00:49.250
within each cluster and the
distance among clusters.

00:00:49.250 --> 00:00:51.000
Now we can plot the
cluster dendrogram.

00:00:51.000 --> 00:00:52.170
So-- plot(clusterIntensity).

00:00:58.060 --> 00:01:01.240
And now we obtain the
cluster dendrogram.

00:01:01.240 --> 00:01:04.540
Let's have here a little
aside or a quick reminder

00:01:04.540 --> 00:01:09.120
about how to read a dendrogram
and make sense of it.

00:01:09.120 --> 00:01:12.620
Let us first consider this
toy dendrogram example.

00:01:12.620 --> 00:01:14.390
The lowest row of
nodes represent

00:01:14.390 --> 00:01:17.250
the data or the
individual observations,

00:01:17.250 --> 00:01:20.170
and the remaining nodes
represent the clusters.

00:01:20.170 --> 00:01:22.270
The vertical lines
depict the distance

00:01:22.270 --> 00:01:24.720
between two nodes or clusters.

00:01:24.720 --> 00:01:28.120
The taller the line, the more
dissimilar the clusters are.

00:01:28.120 --> 00:01:33.479
For instance, cluster D-E-F
is closer to cluster B-C-D-E-F

00:01:33.479 --> 00:01:35.600
than cluster B-C is.

00:01:35.600 --> 00:01:38.720
And this is well depicted by the
height of the lines connecting

00:01:38.720 --> 00:01:43.720
each of clusters B-C and
D-E-F to their parent node.

00:01:43.720 --> 00:01:46.759
Now cutting the dendrogram
at a given level

00:01:46.759 --> 00:01:49.160
yields a certain
partitioning of the data.

00:01:49.160 --> 00:01:53.110
For instance, if we cut the tree
between levels two and three,

00:01:53.110 --> 00:01:58.590
we obtain four clusters,
A, B-C, D-E, and F.

00:01:58.590 --> 00:02:02.120
If we cut the dendrogram
between levels three and four,

00:02:02.120 --> 00:02:07.690
then we obtain three
clusters, A, B-C, and D-E-F.

00:02:07.690 --> 00:02:10.580
And if we were to cut the
dendrogram between levels four

00:02:10.580 --> 00:02:16.800
and five, then we obtain two
clusters, A and B-C-D-E-F.

00:02:16.800 --> 00:02:20.120
What to choose, two,
three, or four clusters?

00:02:20.120 --> 00:02:23.670
Well, the smaller the number
of clusters, the coarser

00:02:23.670 --> 00:02:25.230
the clustering is.

00:02:25.230 --> 00:02:27.850
But at the same time,
having many clusters

00:02:27.850 --> 00:02:30.020
may be too much of a stretch.

00:02:30.020 --> 00:02:33.410
We should always have
this trade-off in mind.

00:02:33.410 --> 00:02:35.750
Now the distance
information between clusters

00:02:35.750 --> 00:02:38.970
can guide our choice of
the number of clusters.

00:02:38.970 --> 00:02:42.300
A good partition belongs to a
cut that has a good enough room

00:02:42.300 --> 00:02:43.890
to move up and down.

00:02:43.890 --> 00:02:47.230
For instance, the cut between
levels two and three can go up

00:02:47.230 --> 00:02:51.280
until it reaches cluster D-E-F.
The cut between levels three

00:02:51.280 --> 00:02:54.310
and four has more room to move
until it reaches the cluster

00:02:54.310 --> 00:02:58.590
B-C-D-E-F. And the cut between
levels four and five has

00:02:58.590 --> 00:03:00.080
the least room.

00:03:00.080 --> 00:03:02.720
So it seems like
choosing three clusters

00:03:02.720 --> 00:03:06.040
is reasonable in this case.

00:03:06.040 --> 00:03:08.280
Going back to our
dendrogram, it seems

00:03:08.280 --> 00:03:11.580
that having two clusters
or three clusters

00:03:11.580 --> 00:03:13.390
is reasonable in our case.

00:03:13.390 --> 00:03:15.360
We can actually
visualize the cuts

00:03:15.360 --> 00:03:18.770
by plotting rectangles around
the clusters on this tree.

00:03:18.770 --> 00:03:23.560
To do so, we can use the
rect.hclust function,

00:03:23.560 --> 00:03:26.190
which takes as an input
clusterIntensiy, which

00:03:26.190 --> 00:03:27.540
is our tree.

00:03:27.540 --> 00:03:30.140
And then we can specify
the number of clusters

00:03:30.140 --> 00:03:30.760
that we want.

00:03:30.760 --> 00:03:33.260
So let's set k=3.

00:03:33.260 --> 00:03:35.660
And we can color the
borders of the rectangles.

00:03:35.660 --> 00:03:39.420
And let's color them,
for instance, in red.

00:03:39.420 --> 00:03:41.810
Now going back to
our dendrogram,

00:03:41.810 --> 00:03:44.010
now we can see
the three clusters

00:03:44.010 --> 00:03:46.870
in these red rectangles.

00:03:46.870 --> 00:03:50.400
Now let us split the data
into these three clusters.

00:03:50.400 --> 00:03:51.940
We're going to
call our clusters,

00:03:51.940 --> 00:03:55.650
for instance, flowerClusters.

00:03:55.650 --> 00:03:59.410
And then we're going to
use the function cut tree.

00:03:59.410 --> 00:04:02.870
And literally, this
function cuts the dendrogram

00:04:02.870 --> 00:04:05.420
into however many
clusters we want.

00:04:05.420 --> 00:04:06.920
The input would be
clusterIntensity.

00:04:09.500 --> 00:04:13.610
And then we have to specify k=3,
because we would like to have

00:04:13.610 --> 00:04:16.670
three clusters.

00:04:16.670 --> 00:04:18.769
Now let us output
the flower clusters

00:04:18.769 --> 00:04:20.769
variable to see how it looks.

00:04:20.769 --> 00:04:23.760
So flowerClusters.

00:04:23.760 --> 00:04:26.860
And we see that the
flower cluster is actually

00:04:26.860 --> 00:04:30.330
a vector that assigns each
intensity value in the flower

00:04:30.330 --> 00:04:31.830
vector to a cluster.

00:04:31.830 --> 00:04:35.240
It actually has the same
length, which is 2,005,

00:04:35.240 --> 00:04:38.820
and has values 1,
2, and 3, which

00:04:38.820 --> 00:04:41.030
correspond to each cluster.

00:04:41.030 --> 00:04:44.470
To find the mean intensity
value of each of our clusters,

00:04:44.470 --> 00:04:48.380
we can use the tapply
function and ask R to group

00:04:48.380 --> 00:04:52.630
the values in the flower
vector according to the flower

00:04:52.630 --> 00:04:55.770
clusters, and then
apply the mean statistic

00:04:55.770 --> 00:04:57.650
to each of the groups.

00:04:57.650 --> 00:05:00.540
What we obtain is that the first
cluster has a mean intensity

00:05:00.540 --> 00:05:04.170
value of 0.08, which
is closest to zero,

00:05:04.170 --> 00:05:05.620
and this means
that it corresponds

00:05:05.620 --> 00:05:07.850
to the darkest
shape in our image.

00:05:07.850 --> 00:05:11.210
And then the third cluster
here, which is closest to 1,

00:05:11.210 --> 00:05:13.940
corresponds to
the fairest shade.

00:05:13.940 --> 00:05:15.540
And now the fun part.

00:05:15.540 --> 00:05:18.020
Let us see how the
image was segmented.

00:05:18.020 --> 00:05:20.810
To output an image, we
can use the image function

00:05:20.810 --> 00:05:24.150
in R, which takes a
matrix as an input.

00:05:24.150 --> 00:05:28.460
But the variable flowerClusters,
as we just saw, is a vector.

00:05:28.460 --> 00:05:31.340
So we need to convert
it into a matrix.

00:05:31.340 --> 00:05:33.900
We can do this by setting the
dimension of this variable

00:05:33.900 --> 00:05:35.930
by using the dimension function.

00:05:35.930 --> 00:05:38.940
So, let's use the
dimension function, or dim,

00:05:38.940 --> 00:05:42.510
which takes as input
flowerClusters.

00:05:42.510 --> 00:05:44.690
And then we're going to
use the combined function,

00:05:44.690 --> 00:05:46.050
or the c function.

00:05:46.050 --> 00:05:48.690
And its first argument
will be the number of rows

00:05:48.690 --> 00:05:51.680
that we want for the matrix,
and that would be 50.

00:05:51.680 --> 00:05:54.620
And the second argument would
be the number of columns.

00:05:54.620 --> 00:05:56.330
Why did we use 50?

00:05:56.330 --> 00:06:01.310
Simply because we have a 50
by 50 resolution picture.

00:06:01.310 --> 00:06:03.660
Now pressing Enter,
and flowerClusters

00:06:03.660 --> 00:06:05.960
looks like a matrix.

00:06:05.960 --> 00:06:08.040
Now we can use the
function image,

00:06:08.040 --> 00:06:12.160
which takes as an input the
"flower cl clusters" matrix.

00:06:12.160 --> 00:06:16.650
And let's turn off the axes
by writing axes="false."

00:06:16.650 --> 00:06:20.180
And now, going back to
our graphics window,

00:06:20.180 --> 00:06:22.830
we can see our
segmented image here.

00:06:22.830 --> 00:06:25.280
The darkest shade corresponds
to the background,

00:06:25.280 --> 00:06:29.350
and this is actually associated
with the first cluster.

00:06:29.350 --> 00:06:31.650
The one in the middle is
the core of the flower,

00:06:31.650 --> 00:06:33.290
and this is cluster 2.

00:06:33.290 --> 00:06:35.720
And then the petals
correspond to cluster 3,

00:06:35.720 --> 00:06:39.520
which has the fairest
shade in our image.

00:06:39.520 --> 00:06:42.680
Let us now check how the
original image looked.

00:06:42.680 --> 00:06:46.430
Let's go back to the console
and then maximize it here.

00:06:46.430 --> 00:06:48.470
So let's go back to
our image function,

00:06:48.470 --> 00:06:53.980
but now this time the
input is the flower matrix.

00:06:53.980 --> 00:06:56.190
And then let's keep
the axis as false.

00:06:56.190 --> 00:06:59.950
But now, how about we add an
additional argument regarding

00:06:59.950 --> 00:07:01.220
the color scheme?

00:07:01.220 --> 00:07:02.580
Let's make it grayscale.

00:07:02.580 --> 00:07:05.100
So we're going to
take the color,

00:07:05.100 --> 00:07:07.920
and it's going to take
the function gray.

00:07:07.920 --> 00:07:10.240
And the input to this
function is a sequence

00:07:10.240 --> 00:07:13.470
of values that goes
from 0 to 1, which

00:07:13.470 --> 00:07:15.770
actually is from black to white.

00:07:15.770 --> 00:07:17.860
And then we have to
also specify its length,

00:07:17.860 --> 00:07:22.250
and that's specified as 256,
because this corresponds

00:07:22.250 --> 00:07:24.950
to the convention for grayscale.

00:07:24.950 --> 00:07:26.980
And now, going
back to our image,

00:07:26.980 --> 00:07:30.780
now we can see our original
grayscale image here.

00:07:30.780 --> 00:07:34.200
You can see that it has a
very, very low resolution.

00:07:34.200 --> 00:07:36.860
But in our next video,
we will try to segment

00:07:36.860 --> 00:07:39.760
an MRI image of
the brain that has

00:07:39.760 --> 00:07:42.350
a much, much higher resolution.