15.071 | Spring 2017 | Graduate

The Analytics Edge

6 Clustering

6.2 Recommendations Worth a Million: An Introduction to Clustering

Quick Question

About how many years did it take for a team to submit a 10% improvement over Cinematch?

Explanation The contest started in October 2006 and ended in July 2009, so it took about 2.5 years for a team to submit a solution with a 10% improvement.

Quick Question

Let’s consider a recommendation system on Amazon.com, an online retail site.

If Amazon.com constructs a recommendation system for books, and would like to use exactly the same algorithm for shoes, what type of recommendation system would it have to be?

Explanation The recommendation system would have to be collaborative filtering, since it can't use information about the items.

If Amazon.com would like to suggest books to users based on the previous books they have purchased, what type of recommendation system would it be?

Explanation The recommendation system would be content filtering, since other users are not involved.

Quick Question

In the previous video, we discussed how clustering is used to split the data into similar groups. Which of the following tasks do you think are appropriate for clustering? Select all that apply.

Explanation The first two options are appropriate tasks for clustering. Clustering probably wouldn't help us predict the winner of the World Series.

Quick Question

 

The movie “The Godfather” is in the genres action, crime, and drama, and is defined by the vector: (0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)

The movie “Titanic” is in the genres action, drama, and romance, and is defined by the vector: (0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)

What is the distance between “The Godfather” and “Titanic”, using Euclidean distance?

Explanation

The distance between these two movies is the square root of 2. The vectors differ by 1 in exactly two genres, crime and romance, so the sum of squared differences is 2.
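The calculation can be checked in R. The following is a minimal sketch that copies the two genre vectors from the question; the variable names godfather, titanic, manual, and viaDist are chosen here for illustration and do not come from the lecture.

```r
# Genre vectors copied from the question (19 binary genre flags each).
godfather = c(0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)
titanic   = c(0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)

# Euclidean distance: square root of the sum of squared differences.
manual = sqrt(sum((godfather - titanic)^2))

# The same value from dist(), whose default method is "euclidean".
viaDist = as.numeric(dist(rbind(godfather, titanic)))

print(manual)    # 1.414214, the square root of 2
print(viaDist)   # same value
```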

Quick Question

 

Suppose you are running the hierarchical clustering algorithm with 212 observations.

How many clusters will there be at the start of the algorithm?

How many clusters will there be at the end of the algorithm?

Explanation

The hierarchical clustering algorithm always starts with each data point in its own cluster, and ends with all data points in the same cluster. So there will be 212 clusters at the beginning of the algorithm, and 1 cluster at the end of the algorithm.
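You can see the same counts in R on made-up data. This is only a sketch: the 212 random points, the seed, and the names toy, hc, startClusters, and endClusters are assumptions for illustration, not part of the lecture.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
n = 212
toy = matrix(rnorm(n * 2), ncol = 2)   # made-up data: 212 points, 2 variables

hc = hclust(dist(toy))                 # default linkage; the counts do not depend on it

# Each merge step joins two clusters, so going from n clusters to 1 takes n - 1 steps.
mergeSteps = nrow(hc$merge)

# Cutting the tree at its two extremes recovers the start and end states.
startClusters = length(unique(cutree(hc, k = n)))   # every point on its own
endClusters   = length(unique(cutree(hc, k = 1)))   # everything in one cluster

print(c(startClusters, mergeSteps, endClusters))    # 212 211 1
```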

Quick Question

 

Using the table function in R, please answer the following questions about the dataset “movies”.

How many movies are classified as comedies?

How many movies are classified as westerns?

How many movies are classified as romance AND drama?

Explanation

You can answer these questions by using the following commands:

table(movies$Comedy)

table(movies$Western)

table(movies$Romance, movies$Drama)
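To read the answers off these tables, look at the count under the value 1 (and, for the two-way table, at the cell where both values are 1). Here is a small sketch on a hypothetical stand-in data frame called toyMovies, so the counts below are not the answers for the real dataset.

```r
# Hypothetical stand-in for the movies data frame: three 0/1 genre columns.
toyMovies = data.frame(
  Comedy  = c(1, 0, 0, 1, 0, 0),
  Romance = c(0, 1, 1, 0, 1, 0),
  Drama   = c(0, 1, 0, 1, 1, 0)
)

# One-way table: the entry named "1" counts the rows flagged as comedies.
comedyCounts = table(toyMovies$Comedy)
numComedies = comedyCounts["1"]

# Two-way table: rows are Romance values, columns are Drama values.
# The cell ["1", "1"] counts rows flagged as romance AND drama.
both = table(toyMovies$Romance, toyMovies$Drama)
numRomanceDrama = both["1", "1"]

print(comedyCounts)
print(both)
```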

Quick Question

Run the cutree function again to create the cluster groups, but this time pick k = 2 clusters. It turns out that the algorithm groups all of the movies that only belong to one specific genre in one cluster (cluster 2), and puts all of the other movies in the other cluster (cluster 1). What is the genre that all of the movies in cluster 2 belong to?

Explanation You can redo the cluster grouping with just two clusters by running the following command:

clusterGroups = cutree(clusterMovies, k = 2)

Then, by using the tapply function just like we did in the video, you can see the average value of each genre in each cluster. It turns out that all of the movies in the second cluster belong to the drama genre. Alternatively, you can use colMeans or lapply, as explained in the notes for Video 7.
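The steps can be sketched on a tiny made-up dataset. The data frame toyMovies below is hypothetical and deliberately constructed so that the two groups are obvious; with the real data you would use the clusterMovies object from the video instead.

```r
# Hypothetical data: three movies flagged Action + Comedy, four flagged Drama only.
toyMovies = data.frame(
  Action = c(1, 1, 1, 0, 0, 0, 0),
  Comedy = c(1, 1, 1, 0, 0, 0, 0),
  Drama  = c(0, 0, 0, 1, 1, 1, 1)
)

distances = dist(toyMovies, method = "euclidean")
clusterMovies = hclust(distances, method = "ward.D")

# Cut the tree into two groups.
clusterGroups = cutree(clusterMovies, k = 2)

# Average of each genre flag within each cluster.
# A mean of 1 means every movie in that cluster has the genre.
dramaByCluster  = tapply(toyMovies$Drama, clusterGroups, mean)
actionByCluster = tapply(toyMovies$Action, clusterGroups, mean)

print(dramaByCluster)
print(actionByCluster)
```

In this toy example one cluster has a Drama mean of 1 and an Action mean of 0, which is the same pattern the question asks you to look for in the real data.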

Video 6: Getting the Data

In this video, we’ll be downloading our dataset from the MovieLens website. Please open the following link in a new window or tab of your browser to access the data: http://files.grouplens.org/datasets/movielens/ml-100k/u.item.

This video will show you how to load the data into R. 

An R script file with all of the commands used in this lecture can be downloaded here: Resource Unit6_Netflix (R).

Important Note: We’ll be using a text editor in this video to get the data into R. If you are on a Mac and are using TextEdit, the default file type is .rtf, so you will need to change the file type to txt. To do this, just go to Format → Make Plain Text, and the file will re-save as a txt file. Alternatively, depending on your operating system and web browser, you might just be able to save the file directly from the browser as a txt file.
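As a rough sketch of what loading this kind of file looks like, the example below reads pipe-delimited text with read.table. It assumes the file separates fields with the "|" character; the three lines, the shortened column layout, and the column names are made up for illustration, and the video remains the reference for the exact commands.

```r
# Three made-up lines in a pipe-delimited layout (hypothetical, shortened to
# ID | Title | Action | Comedy | Drama).
rawLines = c(
  "1|Movie A (1995)|0|1|0",
  "2|Movie B (1995)|1|0|0",
  "3|Movie C (1995)|0|0|1"
)

# sep = "|" splits fields on the pipe character; quote = "\"" keeps
# apostrophes in titles from being treated as quote characters.
toyMovies = read.table(text = rawLines, header = FALSE, sep = "|", quote = "\"")

# The file has no header row, so column names are assigned afterwards.
colnames(toyMovies) = c("ID", "Title", "Action", "Comedy", "Drama")

str(toyMovies)
```

With a saved text file, you would pass the file name as the first argument instead of using the text argument.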

Video 7: Hierarchical Clustering in R

Important Note: In this video, we use the “ward” method to do hierarchical clustering. This method was recently renamed in R to “ward.D”. If you are following along in R while watching the video, you will need to use the following command when doing the hierarchical clustering (“ward” is replaced with “ward.D”):

clusterMovies = hclust(distances, method = "ward.D")
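For context, here is a minimal sketch of where this command sits, using a hypothetical six-row data frame called toyMovies in place of the real movies data. The genre columns are selected by position, in the same style as movies[2:20].

```r
# Hypothetical data: a title column followed by three 0/1 genre columns.
toyMovies = data.frame(
  Title  = c("A", "B", "C", "D", "E", "F"),
  Action = c(1, 1, 0, 0, 1, 0),
  Comedy = c(0, 1, 0, 0, 1, 0),
  Drama  = c(0, 0, 1, 1, 0, 1)
)

# Distances are computed on the genre columns only (columns 2 to 4 here).
distances = dist(toyMovies[2:4], method = "euclidean")

# "ward.D" is the current name of the method the video calls "ward".
clusterMovies = hclust(distances, method = "ward.D")

print(clusterMovies$method)   # "ward.D"

# plot(clusterMovies) would draw the dendrogram.
```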

An Advanced Approach to Finding Cluster Centroids

In this video, we explain how you can find the cluster centroids by using the function “tapply” for each variable in the dataset. While this approach works and is familiar to us, it can be a little tedious when there are a lot of variables. An alternative approach is to use the colMeans function. With this approach, you only have one command for each cluster instead of one command for each variable. If you run the following command in your R console, you can get all of the column (variable) means for cluster 1:

colMeans(subset(movies[2:20], clusterGroups == 1))

You can repeat this for each cluster by changing the clusterGroups number. However, if you also have a lot of clusters, this approach is not that much more efficient than just using the tapply function.

A more advanced approach uses the “split” and “lapply” functions. The following command will split the data into subsets based on the clusters:

spl = split(movies[2:20], clusterGroups)

Then you can use spl to access the different clusters, because

spl[[1]]

is the same as

subset(movies[2:20], clusterGroups == 1)

so colMeans(spl[[1]]) will output the centroid of cluster 1. But an even easier approach uses the lapply function. The following command will output the cluster centroids for all clusters:

lapply(spl, colMeans)

The lapply function runs the second argument (colMeans) on each element of the first argument (each cluster subset in spl). So instead of using 19 tapply commands, or 10 colMeans commands, we can output our centroids with just two commands: one to define spl, and then the lapply command.
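The sketch below runs these commands on a small made-up example, so you can check that the subset and split approaches agree. The data frame toyMovies and the cluster assignments in clusterGroups are hypothetical; with the real data you would use movies[2:20] and the clusterGroups vector from cutree.

```r
# Hypothetical data: a title column followed by three 0/1 genre columns.
toyMovies = data.frame(
  Title  = c("A", "B", "C", "D", "E", "F"),
  Action = c(1, 1, 0, 0, 1, 0),
  Comedy = c(0, 1, 0, 0, 1, 0),
  Drama  = c(0, 0, 1, 1, 0, 1)
)

# Hypothetical cluster assignments (normally the output of cutree).
clusterGroups = c(1, 1, 2, 2, 1, 2)

# Approach 1: one colMeans command per cluster.
viaSubset = colMeans(subset(toyMovies[2:4], clusterGroups == 1))

# Approach 2: split once, then index the list by cluster.
spl = split(toyMovies[2:4], clusterGroups)
viaSplit = colMeans(spl[[1]])

# Approach 3: centroids of all clusters at once, returned as a list
# with one named numeric vector per cluster.
centroids = lapply(spl, colMeans)

print(viaSubset)
print(centroids)
```

In this toy example, cluster 1 has a Comedy mean of 2/3 because two of its three movies are flagged as comedies.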

Note that if the split function does not work as expected and you have an object called “split” in your current R session, you can remove that object with rm(split) so that the name refers to the split function again.

In this video, we use the spreadsheet ClusterMeans (ODS). This file can be opened in LibreOffice or OpenOffice. 
