15.071 | Spring 2017 | Graduate

The Analytics Edge

3 Logistic Regression

3.5 Assignment 3

Popularity of Music Records

The music industry is a well-developed market with global annual revenue of around $15 billion. The recording industry is highly competitive and is dominated by three big production companies, which together account for nearly 82% of total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable. 

Given the competitive nature of the recording industry, record labels face a fundamental decision problem: which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs (CSV) consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here’s a detailed description of the variables:

 

  • year = the year the song was released
  • songtitle = the title of the song
  • artistname = the name of the artist of the song
  • songID and artistID = identifying variables for the song and artist
  • timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
  • loudness = a continuous variable indicating the average amplitude of the audio in decibels
  • tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
  • key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
  • energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
  • pitch = a continuous variable that indicates the pitch of the song
  • timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
  • Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)

Problem 1.1 - Understanding the Data

Use the read.csv function to load the dataset “songs.csv” into R.

How many observations (songs) are there in total?

Exercise 1

 Numerical Response 

 

Explanation

First, navigate to the directory on your computer containing the file “songs.csv”. You can load the dataset by using the command:

songs = read.csv("songs.csv")

Then, you can count the number of songs by using str(songs), reading the number of observations off the first line of the output, or nrow(songs), which returns the number of rows in the data frame.


Problem 1.2 - Understanding the Data

How many songs does the dataset include for which the artist name is “Michael Jackson”?

Exercise 2

 Numerical Response 

 

Explanation

One way to compute this would be using table():

table(songs$artistname == "Michael Jackson")

We can also get this information using the dplyr package. We will first load the dplyr package:

library(dplyr)

and then filter the data frame to artistname == "Michael Jackson" and summarize the number of observations:

songs %>% filter(artistname == "Michael Jackson") %>% summarize(count = n())


Problem 2.1 - Creating Our Prediction Model

We wish to predict whether or not a song will make it to the Top 10. To do this, first use the filter function to split the data into a training set “SongsTrain” consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.

How many observations (songs) are in the training set?

Exercise 3

 Numerical Response 

 

Explanation

SongsTrain = songs %>% filter(year <= 2009)

SongsTest = songs %>% filter(year == 2010)

The training set has 7201 observations, which can be found by looking at the structure with str(SongsTrain) or by typing nrow(SongsTrain).


Problem 2.2 - Creating our Prediction Model

In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model.

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won’t use the variables “year”, “songtitle”, “artistname”, “songID”, or “artistID”.

We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. In this case, however, that would be tedious, since we have a large number of independent variables.

There is a nice trick to avoid doing so by using the symbol “.” that represents all the remaining variables. You can follow the steps below:

Step 1: we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.

nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

To remove these variables from your training and testing sets, type the following commands in your R console:

SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

Step 2: build a logistic regression model to predict Top10 using the training data. We can now use “.” in place of enumerating all the remaining independent variables in the following way:

SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)

(Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)

Looking at the summary of your model, excluding the intercept, how many variables are significant at the 5% significance level?

Exercise 4

 Numerical Response 

 

Explanation

To answer this question, you first need to run the three given commands to remove the variables that we won’t use in the model from the datasets:

nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

Then, you can create the logistic regression model with the following command:

SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)

Looking at the stars on the summary(SongsLog1) output, we can see that 20 variables have at least one star next to the p-values, which represents significance at the 5% level.


Problem 2.3 - Creating Our Prediction Model

Let’s now think about the variables in our dataset related to the confidence of the time signature, key, and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key, and tempo themselves). What does the model suggest?

Exercise 5

 The lower our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 

 The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 

Explanation

If you look at the output summary(SongsLog1), where SongsLog1 is the name of your logistic regression model, you can see that the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.


Problem 2.4 - Creating Our Prediction Model

In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does our model suggest in terms of complexity?

Exercise 6

 Mainstream listeners tend to prefer more complex songs 

 Mainstream listeners tend to prefer less complex songs 

Explanation

Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence leads to a lower predicted probability of a song being a hit. So mainstream listeners tend to prefer less complex songs.


Problem 2.5 - Creating Our Prediction Model

Songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”).

By inspecting the coefficient of the variable “loudness”, what does our model suggest?

Exercise 7

 Mainstream listeners prefer songs with heavy instrumentation 

 Mainstream listeners prefer songs with light instrumentation 

Explanation

The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation.


Problem 3.1 - Validating Our Model

Make predictions on the test set using our model. What is the accuracy of our model on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)

To get the accuracy after you make the predictions, you can use the table(variable1, variable2) command to generate a summary table that counts the number of observations for each possible combination of values in variable1 and variable2. You can also do so by using the group_by and summarize commands in the dplyr package.

Exercise 8

 Numerical Response 

 

Explanation

You can make predictions on the test set by using the command:

testPredict = predict(SongsLog1, newdata=SongsTest, type="response")

Then, you can create a confusion matrix with a threshold of 0.45 by using the table command:

table(SongsTest$Top10, testPredict >= 0.45)

Or, alternatively, use the group_by and summarize functions in the dplyr package:

SongsTest$Pred = testPredict >= 0.45

SongsTest %>% group_by(Top10, Pred) %>% summarize(count = n())

The accuracy of the model is (309+15)/(309+15+44+5) = 0.8686.
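The accuracy arithmetic can be double-checked directly from the confusion-matrix counts reported above:

```r
# Counts from table(SongsTest$Top10, testPredict >= 0.45):
#        FALSE TRUE
#   0      309    5
#   1       44   15
TN <- 309; FP <- 5; FN <- 44; TP <- 15
accuracy <- (TN + TP) / (TN + FP + FN + TP)
round(accuracy, 4)   # 0.8686
```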


Problem 3.2 - Validating Our Model

Let’s check if there’s any incremental benefit in using our model instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)

Exercise 9

 Numerical Response 

 

Explanation

You can compute the baseline accuracy by summarizing the outcome variable in the test set. One way to do this is with table():

table(SongsTest$Top10)

Another approach would be to use dplyr:

SongsTest %>% group_by(Top10) %>% summarize(count = n())

The baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.


Problem 3.3 - Validating Our Model

What is the True Positive Rate of our model on the test set, using a threshold of 0.45?

Exercise 10

 Numerical Response 

 

What is the False Positive Rate of our model on the test set, using a threshold of 0.45?

Exercise 11

 Numerical Response 

 

Explanation

Using the confusion matrix we obtained earlier with either table or group_by/summarize:

table(SongsTest$Top10, testPredict >= 0.45)

We can compute the True Positive Rate to be the number of correctly identified Top10 songs divided by the total number of Top10 songs: 15/(15+44) = 0.2542373, and the False Positive Rate to be the number of non-Top10 songs that were identified as Top10 divided by the total number of non-Top10 songs: 5/(309+5) = 0.01592357.
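With the same confusion-matrix counts, the two rates work out as follows:

```r
# Counts from table(SongsTest$Top10, testPredict >= 0.45):
TP <- 15; FN <- 44; FP <- 5; TN <- 309
TPR <- TP / (TP + FN)   # 15/59  = 0.2542373
FPR <- FP / (FP + TN)   # 5/314  = 0.01592357
```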


Predicting the Baseball World Series Champion

Last week, in the Moneyball lecture, we discussed how regular season performance is not strongly correlated with winning the World Series in baseball. In this homework question, we’ll use the same data to investigate how well we can predict the World Series winner at the beginning of the playoffs.

To begin, load the dataset baseball (CSV) into R using the read.csv function, and call the data frame “baseball”. This is the same data file we used during the Moneyball lecture, and the data comes from Baseball-Reference.com.

As a reminder, this dataset contains data concerning a baseball team’s performance in a given year. It has the following variables:

  • Team: A code for the name of the team
  • League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League)
  • Year: The year of the corresponding record
  • RS: The number of runs scored by the team in that year
  • RA: The number of runs allowed by the team in that year
  • W: The number of regular season wins by the team in that year
  • OBP: The on-base percentage of the team in that year
  • SLG: The slugging percentage of the team in that year
  • BA: The batting average of the team in that year
  • Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no)
  • RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best)
  • RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1.
  • G: The number of games a team played in that year
  • OOBP: The team’s opponents’ on-base percentage in that year
  • OSLG: The team’s opponents’ slugging percentage in that year

Problem 1.1 - Limiting to Teams Making the Playoffs

Each row in the baseball dataset represents a team in a particular year.

How many team/year pairs are there in the whole dataset?

Exercise 1

 Numerical Response 

 

Explanation

You can read the dataset into R by using the following command:

baseball = read.csv("baseball.csv")

Then nrow(baseball) or str(baseball) both show that there are 1232 team/year pairs.


Problem 1.2 - Limiting to Teams Making the Playoffs

Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.

Exercise 2

 Numerical Response 

 

Explanation

table(baseball$Year) contains 47 years (1972, 1981, 1994, and 1995 are missing). You can count the number of years in the table, or the command length(table(baseball$Year)) directly provides the answer.


Problem 1.3 - Limiting to Teams Making the Playoffs

Because we’re only analyzing teams that made the playoffs, use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so your subsetted data frame should still be called “baseball”). How many team/year pairs are included in the new dataset?

Exercise 3

 Numerical Response 

 

Explanation

baseball = subset(baseball, Playoffs == 1) limits the dataset, and the nrow() or str() functions can be used to identify that 244 team/year pairs remain.


Problem 1.4 - Limiting to Teams Making the Playoffs

Through the years, different numbers of teams have been invited to the playoffs. Which of the following has been the number of teams making the playoffs in some season? Select all that apply.

Exercise 4

 2 

 4 

 6 

 8 

 10 

 12 

 

Explanation

Using table(baseball$Year), we can see at least one season had 2, 4, 8, and 10 contenders. A fancier approach would be to use table(table(baseball$Year)).


Problem 2.1 - Adding an Important Predictor

It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.

We start by storing the output of the table() function that counts the number of playoff teams from each year:

PlayoffTable = table(baseball$Year)

You can output the table with the following command:

PlayoffTable

We will use this stored table to look up the number of teams in the playoffs in the year of each team/year pair.

Just as we can use the names() function to get the names of a data frame’s columns, we can use it to get the names of the entries in a table. What best describes the output of names(PlayoffTable)?

Exercise 5

 Vector of years stored as numbers (type num) 

 Vector of years stored as strings (type chr) 

 Vector of frequencies stored as numbers (type num) 

 Vector of frequencies stored as strings (type chr) 

Explanation

From the call str(names(PlayoffTable)) we see PlayoffTable has names of type chr, which are the years of the teams in the dataset.


Problem 2.2 - Adding an Important Predictor

Given a vector of names, indexing the table returns the corresponding vector of frequencies. Which function call returns the number of playoff teams in 1990 and 2001? (HINT: If you are not sure how these commands work, go ahead and try them out in your R console!)

Exercise 6

 PlayoffTable(1990, 2001) 

 PlayoffTable(c(1990, 2001)) 

 PlayoffTable("1990", "2001") 

 PlayoffTable(c("1990", "2001")) 

 PlayoffTable[1990, 2001] 

 PlayoffTable[c(1990, 2001)] 

 PlayoffTable["1990", "2001"] 

 PlayoffTable[c("1990", "2001")] 

Explanation

Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. We build the vector of years to be passed with the c() function. Because the names of PlayoffTable are strings and not numbers, we need to pass “1990” and “2001”.
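The name-based lookup can be illustrated on a small table; the vector yrs below is made up purely for illustration:

```r
# Toy illustration of name-based table indexing (yrs is a made-up vector):
yrs <- c(1990, 1990, 1990, 2001, 2001)
tab <- table(yrs)

names(tab)               # "1990" "2001" -- the names are strings
tab[c("1990", "2001")]   # frequencies: 3 and 2
# tab[c(1990, 2001)] would index by POSITION (returning NA here), not by name
```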


Problem 2.3 - Adding an Important Predictor

Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame. Which of the following function calls accomplishes this? (HINT: Test out the functions if you are not sure what they do.)

Exercise 7

 baseball$NumCompetitors = PlayoffTable(baseball$Year) 

 baseball$NumCompetitors = PlayoffTable[baseball$Year] 

 baseball$NumCompetitors = PlayoffTable(as.character(baseball$Year)) 

 baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)] 

Explanation

Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. as.character() is needed to convert the Year variable in the dataset to a string, which we know from the previous parts is needed to look up elements in a table. If you’re not sure what a function does, remember you can look it up with the ? function. For instance, you could type ?as.character to look up information about as.character.


Problem 2.4 - Adding an Important Predictor

Add the NumCompetitors variable to your baseball data frame. How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?

Exercise 8

 Numerical Response 

 

Explanation

You can add the NumCompetitors variable to the baseball data frame with the following command:

baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]

Then you can obtain the number of team/year pairs with 8 teams in the playoffs by running table(baseball$NumCompetitors).


Problem 3.1 - Bivariate Models for Predicting World Series Winner

In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:

baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)

WorldSeries takes the value 1 if a team won the World Series in the indicated year and 0 otherwise. How many observations do we have in our dataset where a team did NOT win the World Series?

Exercise 9

 Numerical Response 

 

Explanation

You can create the WorldSeries variable by running the command:

baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)

Then, if you create the table:

table(baseball$WorldSeries)

You can see that there are 197 teams that did not win the World Series.


Problem 3.2 - Bivariate Models for Predicting World Series Winner

When we’re not sure which of our variables are useful in predicting a particular outcome, it’s often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We’ll define an independent variable as significant if there is at least one star at the end of the coefficient row for that variable (this is equivalent to the p-value column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models. (Select all that apply.)

Exercise 10

 Year 

 RS 

 RA 

 W 

 OBP 

 SLG 

 BA 

 RankSeason 

 OOBP 

 OSLG 

 NumCompetitors 

 League 

 

Explanation

The results come from building each bivariate model and looking at its summary. For instance, the result for the variable Year can be obtained by running summary(glm(WorldSeries ~ Year, data=baseball, family="binomial")). You can save time on repeated model building by using the up arrow in your R terminal. The W and SLG variables were both nearly significant, with p = 0.0577 and 0.0504, respectively.
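To avoid typing twelve nearly identical glm() calls, one option is to loop over the candidate variable names and build each formula with paste. This is a sketch, assuming the baseball data frame with the WorldSeries and NumCompetitors variables from the earlier steps is loaded:

```r
# Fit each bivariate model in a loop and print its coefficient table
vars <- c("Year", "RS", "RA", "W", "OBP", "SLG", "BA",
          "RankSeason", "OOBP", "OSLG", "NumCompetitors", "League")
for (v in vars) {
  f <- as.formula(paste("WorldSeries ~", v))
  m <- glm(f, data = baseball, family = binomial)
  cat("\n==", v, "==\n")
  print(coef(summary(m)))   # estimates, std. errors, z values, p-values
}
```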


Problem 4.1 - Multivariate Models for Predicting World Series Winner

In this section, we’ll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?

Exercise 11

 Numerical Response 

 

Explanation

You can create a model with all of the significant variables from the bivariate models (Year, RA, RankSeason, and NumCompetitors) by using the following command:

LogModel = glm(WorldSeries ~ Year + RA + RankSeason + NumCompetitors, data=baseball, family=binomial)

Looking at summary(LogModel), you can see that none of the variables are significant in the multivariate model!


Problem 4.2 - Multivariate Models for Predicting World Series Winner

Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)? Select all that apply.

Exercise 12

 Year/RA 

 Year/RankSeason 

 Year/NumCompetitors 

 RA/RankSeason 

 RA/NumCompetitors 

 RankSeason/NumCompetitors 

 

Explanation

To test the correlation between two variables, use a command like cor(baseball$Year, baseball$RA). While every pair was at least moderately correlated, the only strongly correlated pair was Year/NumCompetitors, with correlation coefficient 0.914.

As a shortcut, you can compute all pair-wise correlations between these variables with:

cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])


Problem 4.3 - Multivariate Models for Predicting World Series Winner

Build all six of the two variable models listed in the previous problem. Together with the four bivariate models, you should have 10 different logistic regression models. Which model has the best AIC value (the minimum AIC value)?

Exercise 13

 Year 

 RA 

 RankSeason 

 NumCompetitors 

 Year/RA 

 Year/RankSeason 

 Year/NumCompetitors 

 RA/RankSeason 

 RA/NumCompetitors 

 RankSeason/NumCompetitors 

Explanation

The two-variable models can be built with the following commands:

Model1 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)

Model2 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)

Model3 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)

Model4 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)

Model5 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)

Model6 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)

None of the models with two independent variables had both variables significant, so none seems more promising than a simple bivariate model. Indeed, the model with the lowest AIC value is the one with just NumCompetitors as the independent variable.
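The ten AIC values can be compared in one pass rather than reading each summary separately. This is a sketch, assuming the baseball data frame from the earlier steps is loaded:

```r
# Compute the AIC of each of the ten candidate models and sort them
formulas <- c("WorldSeries ~ Year", "WorldSeries ~ RA",
              "WorldSeries ~ RankSeason", "WorldSeries ~ NumCompetitors",
              "WorldSeries ~ Year + RA", "WorldSeries ~ Year + RankSeason",
              "WorldSeries ~ Year + NumCompetitors",
              "WorldSeries ~ RA + RankSeason",
              "WorldSeries ~ RA + NumCompetitors",
              "WorldSeries ~ RankSeason + NumCompetitors")
aics <- sapply(formulas, function(f)
  AIC(glm(as.formula(f), data = baseball, family = binomial)))
sort(aics)   # the smallest AIC identifies the best of the ten models
```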

This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!

