WEBVTT

00:00:04.500 --> 00:00:07.560
Let us try to understand the
format of the data handed

00:00:07.560 --> 00:00:10.030
to us in the CSV files.

00:00:10.030 --> 00:00:13.760
Grayscale images are represented
as a matrix of pixel intensity

00:00:13.760 --> 00:00:16.740
values that range
from zero to one.

00:00:16.740 --> 00:00:19.840
The intensity value zero
corresponds to the absence

00:00:19.840 --> 00:00:24.530
of color, or black, and the
value one corresponds to white.

00:00:24.530 --> 00:00:29.250
For 8 bits per pixel images,
we have 256 color levels

00:00:29.250 --> 00:00:31.680
ranging from zero to one.

00:00:31.680 --> 00:00:34.640
For instance, if we have the
following grayscale image,

00:00:34.640 --> 00:00:37.000
the pixel information
can be translated

00:00:37.000 --> 00:00:40.290
to a matrix of values
between zero and one.

00:00:40.290 --> 00:00:44.050
It is exactly this matrix that
we are given in our datasets.

00:00:44.050 --> 00:00:46.680
In other words, the
datasets contain a table

00:00:46.680 --> 00:00:49.000
of values between zero and one.

00:00:49.000 --> 00:00:50.800
And the number of
columns corresponds

00:00:50.800 --> 00:00:53.590
to the width of the image,
whereas the number of rows

00:00:53.590 --> 00:00:56.150
corresponds to the
height of the image.

00:00:56.150 --> 00:01:00.950
In this example, the
resolution is 7 by 7 pixels.

00:01:00.950 --> 00:01:04.370
We have to be careful when
reading the dataset in R.

00:01:04.370 --> 00:01:06.070
We need to make
sure that R reads

00:01:06.070 --> 00:01:08.150
in the matrix appropriately.

00:01:08.150 --> 00:01:10.150
Until now in this
class, our datasets

00:01:10.150 --> 00:01:14.090
were structured in a way where
the rows refer to observations

00:01:14.090 --> 00:01:16.560
and the columns
refer to variables.

00:01:16.560 --> 00:01:19.870
But this is not the case
for the intensity matrix.

00:01:19.870 --> 00:01:22.580
So keep in mind that we
need to do some maneuvering

00:01:22.580 --> 00:01:27.380
to make sure that R recognizes
the data as a matrix.

00:01:27.380 --> 00:01:31.120
Grayscale image segmentation
can be done by clustering pixels

00:01:31.120 --> 00:01:33.560
according to their
intensity values.

00:01:33.560 --> 00:01:35.630
So we can think of our
clustering algorithm

00:01:35.630 --> 00:01:38.870
as trying to divide the
spectrum of intensity values

00:01:38.870 --> 00:01:42.600
from zero to one into
intervals, or clusters.

00:01:42.600 --> 00:01:44.840
For instance, the red
cluster corresponds

00:01:44.840 --> 00:01:48.970
to the darkest shades, and the
green cluster to the lightest.

00:01:48.970 --> 00:01:53.030
Now, what should the input be
to the clustering algorithm?

00:01:53.030 --> 00:01:55.539
Well, our observations
should be all of the 7

00:01:55.539 --> 00:01:57.860
by 7 intensity values.

00:01:57.860 --> 00:02:00.740
Hence, we should
have 49 observations.

00:02:00.740 --> 00:02:02.600
And we only have
one variable, which

00:02:02.600 --> 00:02:04.790
is the pixel intensity value.

00:02:04.790 --> 00:02:07.830
So in other words, the input
to the clustering algorithm

00:02:07.830 --> 00:02:12.410
should be a vector containing 49
elements, or intensity values.

00:02:12.410 --> 00:02:15.780
But what we have
is a 7 by 7 matrix.

00:02:15.780 --> 00:02:18.230
A crucial step before
feeding the intensity

00:02:18.230 --> 00:02:22.380
values to the clustering
algorithm is morphing our data.

00:02:22.380 --> 00:02:24.670
We should modify
the matrix structure

00:02:24.670 --> 00:02:28.640
and lump all the intensity
values into a single vector.

00:02:28.640 --> 00:02:30.640
We will see that
we can do this in R

00:02:30.640 --> 00:02:33.650
using the as.vector function.

00:02:33.650 --> 00:02:36.060
Now, once we have the
vector, we can simply

00:02:36.060 --> 00:02:37.910
feed it into the
clustering algorithm

00:02:37.910 --> 00:02:42.160
and assign each element in
the vector to a cluster.

00:02:42.160 --> 00:02:44.620
Let us first use
hierarchical clustering

00:02:44.620 --> 00:02:46.690
since we are familiar with it.

00:02:46.690 --> 00:02:49.770
The first step is to calculate
the distance matrix, which

00:02:49.770 --> 00:02:52.390
computes the pairwise
distances among the elements

00:02:52.390 --> 00:02:54.290
of the intensity vector.

00:02:54.290 --> 00:02:57.400
How many such distances
do we need to calculate?

00:02:57.400 --> 00:02:59.980
Well, for each element
in the intensity vector,

00:02:59.980 --> 00:03:03.020
we need to calculate its
distance from the other 48

00:03:03.020 --> 00:03:03.920
elements.

00:03:03.920 --> 00:03:07.450
So this makes 48
calculations per element.

00:03:07.450 --> 00:03:11.030
And we have 49 such elements
in the intensity vector.

00:03:11.030 --> 00:03:16.310
In total, we should compute 49
times 48 pairwise distances.

00:03:16.310 --> 00:03:20.340
But due to symmetry, we really
need to calculate half of them.

00:03:20.340 --> 00:03:23.380
So the number of pairwise
distance calculations is

00:03:23.380 --> 00:03:24.210
actually (49*48)/2.

00:03:27.060 --> 00:03:30.990
In general, if we call the
size of the intensity vector n,

00:03:30.990 --> 00:03:36.320
then we need to compute
n*(n-1)/2 pairwise distances

00:03:36.320 --> 00:03:39.320
and store them in
the distance matrix.

00:03:39.320 --> 00:03:42.780
Now we should be
ready to go to R.

00:03:42.780 --> 00:03:45.070
I already navigated
to the directory

00:03:45.070 --> 00:03:48.370
where we saved the
flower.csv file, which

00:03:48.370 --> 00:03:52.020
contains the matrix of pixel
intensities of a flower image.

00:03:52.020 --> 00:03:54.860
Let us read in the matrix
and save it to a data frame

00:03:54.860 --> 00:03:59.170
and call it flower, then
use the read.csv function

00:03:59.170 --> 00:04:02.690
to instruct R to read
in the flower dataset.

00:04:02.690 --> 00:04:05.060
And then we have to
explicitly mention

00:04:05.060 --> 00:04:07.800
that we have no
headers in the CSV file

00:04:07.800 --> 00:04:11.610
because it only contains a
matrix of intensity values.

00:04:11.610 --> 00:04:13.200
So we're going to
type header=FALSE.

00:04:16.240 --> 00:04:18.279
Note that the
default in R assumes

00:04:18.279 --> 00:04:21.140
that the first row in the
dataset is the header.

00:04:21.140 --> 00:04:25.030
So if we didn't specify that we
have no headers in this case,

00:04:25.030 --> 00:04:26.450
we would have lost
the information

00:04:26.450 --> 00:04:29.880
from the first row of the
pixel intensity matrix.

00:04:29.880 --> 00:04:34.460
Now let us look at the structure
of the flower data frame.

00:04:34.460 --> 00:04:36.810
We realize that the
way the data is stored

00:04:36.810 --> 00:04:40.340
does not reflect that this is
a matrix of intensity values.

00:04:40.340 --> 00:04:44.409
Actually, R treats the rows as
observations and the columns

00:04:44.409 --> 00:04:46.340
as variables.

00:04:46.340 --> 00:04:48.820
Let's try to change the
data type to a matrix

00:04:48.820 --> 00:04:51.630
by using the as.matrix function.

00:04:51.630 --> 00:04:55.790
So let's define our
variable flowerMatrix

00:04:55.790 --> 00:04:58.670
and then use the
as.matrix function, which

00:04:58.670 --> 00:05:02.120
takes as an input the
flower data frame.

00:05:02.120 --> 00:05:06.930
And now if we look at the
structure of the flower matrix,

00:05:06.930 --> 00:05:11.770
we realize that we have
50 rows and 50 columns.

00:05:11.770 --> 00:05:14.130
What this suggests is that
the resolution of the image

00:05:14.130 --> 00:05:17.670
is 50 pixels in width
and 50 pixels in height.

00:05:17.670 --> 00:05:20.850
This is actually a very,
very small picture.

00:05:20.850 --> 00:05:24.290
I am very curious to see how
this image looks like, but lets

00:05:24.290 --> 00:05:27.110
hold off now and do
our clustering first.

00:05:27.110 --> 00:05:29.730
We do not want to be influenced
by how the image looks

00:05:29.730 --> 00:05:32.240
like in our decision of
the numbers of clusters

00:05:32.240 --> 00:05:34.840
we want to pick.

00:05:34.840 --> 00:05:36.480
To perform any
type of clustering,

00:05:36.480 --> 00:05:38.070
we saw earlier
that we would need

00:05:38.070 --> 00:05:40.470
to convert the matrix
of pixel intensities

00:05:40.470 --> 00:05:44.500
to a vector that contains all
the intensity values ranging

00:05:44.500 --> 00:05:45.930
from zero to one.

00:05:45.930 --> 00:05:49.550
And the clustering algorithm
divides the intensity spectrum,

00:05:49.550 --> 00:05:52.490
the interval zero to one,
into these joint clusters

00:05:52.490 --> 00:05:53.490
or intervals.

00:05:53.490 --> 00:05:58.100
So let us define the
vector flowerVector,

00:05:58.100 --> 00:06:02.650
and then now we're going to use
the function as.vector, which

00:06:02.650 --> 00:06:06.240
takes as an input
the flowerMatrix.

00:06:06.240 --> 00:06:11.020
And now if we look at the
structure of the flowerVector,

00:06:11.020 --> 00:06:16.570
we realize that we have
2,500 numerical values, which

00:06:16.570 --> 00:06:18.110
range between zero and one.

00:06:18.110 --> 00:06:22.310
And this totally makes sense
because this reflects the 50

00:06:22.310 --> 00:06:26.830
times 50 intensity values
that we had in our matrix.

00:06:26.830 --> 00:06:29.680
Now you might be wondering
why we can't immediately

00:06:29.680 --> 00:06:33.000
convert the data frame
flower to a vector.

00:06:33.000 --> 00:06:34.340
Let's try to do this.

00:06:34.340 --> 00:06:37.700
So let's go back to
our as.vector function

00:06:37.700 --> 00:06:40.620
and then have the input
be the flower data

00:06:40.620 --> 00:06:42.850
frame instead of
the flower matrix.

00:06:42.850 --> 00:06:46.990
And then, let's name this
variable flowerVector2, simply

00:06:46.990 --> 00:06:49.790
so that we don't overwrite
the flower vector.

00:06:49.790 --> 00:06:51.290
And now let's look
at its structure.

00:06:53.930 --> 00:06:57.750
It seems that R reads it exactly
like the flower data frame

00:06:57.750 --> 00:07:02.360
and sees it as 50
observations and 50 variables.

00:07:02.360 --> 00:07:06.010
So converting the data to a
matrix and then to the vector

00:07:06.010 --> 00:07:08.210
is a crucial step.

00:07:08.210 --> 00:07:11.660
Now we should be ready to start
our hierarchical clustering.

00:07:11.660 --> 00:07:14.850
The first step is to create the
distance matrix, as you already

00:07:14.850 --> 00:07:16.850
know, which in
this case computes

00:07:16.850 --> 00:07:19.970
the difference between every two
intensity values in our flower

00:07:19.970 --> 00:07:20.470
vector.

00:07:20.470 --> 00:07:22.340
So let's type
distance=dist(flowerVector,

00:07:22.340 --> 00:07:23.170
method="euclidean").

00:07:35.930 --> 00:07:38.020
Now that we have
the distance, next

00:07:38.020 --> 00:07:41.540
we will be computing the
hierarchical clusters.