15.071 | Spring 2017 | Graduate

The Analytics Edge

2 Linear Regression

2.3 Moneyball: The Power of Sports Analytics

Introduction to Baseball Video

If you are unfamiliar with the game of baseball, please watch this short video clip for a quick introduction to the game. You don’t need to be a baseball expert to understand this lecture, but basic knowledge of the game will be helpful to you.

TruScribe. “Baseball Rules of Engagement.” March 27, 2012. YouTube. This video is from TrueScribeVideos and is not covered by our Creative Commons license.

Quick Question

 

If a baseball team scores 713 runs and allows 614 runs, how many games do we expect the team to win?

Using the linear regression model constructed during the lecture, enter the number of games we expect the team to win:

Exercise 1

Explanation

Our linear regression model was

Wins = 80.88 + 0.1058*(Run Difference)

Here, the run difference is 713 - 614 = 99, so our prediction is

Wins = 80.88 + 0.1058*99 = 91.35, or about 91 games.
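The arithmetic above can be reproduced in R. This is a minimal sketch that uses the rounded coefficients quoted in the explanation; the function name `predict_wins` is a hypothetical helper and is not part of the lecture script.

```r
# Coefficients as quoted in the explanation (rounded values)
intercept <- 80.88
slope <- 0.1058

# Hypothetical helper: expected wins from runs scored and runs allowed
predict_wins <- function(runs_scored, runs_allowed) {
  run_difference <- runs_scored - runs_allowed
  intercept + slope * run_difference
}

wins <- predict_wins(713, 614)
print(round(wins))  # about 91 games
```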

Quick Question

 

If a baseball team’s OBP is 0.311 and SLG is 0.405, how many runs do we expect the team to score?

Using the linear regression model constructed during the lecture (the one that uses OBP and SLG as independent variables), enter the number of runs we expect the team to score:

Exercise 1

Explanation

Our linear regression model was:

Runs Scored = -804.63 + 2737.77*(OBP) + 1584.91*(SLG)

Here, OBP is 0.311 and SLG is 0.405, so our prediction is:

Runs Scored = -804.63 + 2737.77*0.311 + 1584.91*0.405 = 689 runs
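As a quick check, the same prediction can be computed in R. This sketch assumes the rounded coefficients shown above; `predict_runs_scored` is a hypothetical name chosen for illustration.

```r
# Hypothetical helper using the Runs Scored coefficients quoted above
predict_runs_scored <- function(obp, slg) {
  -804.63 + 2737.77 * obp + 1584.91 * slg
}

runs_scored <- predict_runs_scored(obp = 0.311, slg = 0.405)
print(round(runs_scored))  # about 689 runs
```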

If a baseball team’s opponents’ OBP (OOBP) is 0.297 and opponents’ SLG (OSLG) is 0.370, how many runs do we expect the team to allow?

Using the linear regression model discussed during the lecture (the one on the last slide of the previous video), enter the number of runs we expect the team to allow:

Exercise 2

Explanation

Our linear regression model was:

Runs Allowed = -837.38 + 2913.60*(OOBP) + 1514.29*(OSLG)

Here, OOBP is 0.297 and OSLG is 0.370, so our prediction is:

Runs Allowed = -837.38 + 2913.60*(0.297) + 1514.29*(0.370) = 588 runs
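The runs-allowed prediction can be sketched in R in the same way. Purely as an illustration, the sketch then treats the OBP/SLG values from the previous exercise as belonging to the same hypothetical team and feeds both predictions into the wins model; that pairing is an assumption made here, not something stated in the exercises. All function names are hypothetical.

```r
# Hypothetical helpers built from the rounded coefficients quoted in this section
predict_runs_scored <- function(obp, slg) {
  -804.63 + 2737.77 * obp + 1584.91 * slg
}
predict_runs_allowed <- function(oobp, oslg) {
  -837.38 + 2913.60 * oobp + 1514.29 * oslg
}
predict_wins <- function(run_difference) {
  80.88 + 0.1058 * run_difference
}

runs_allowed <- predict_runs_allowed(oobp = 0.297, oslg = 0.370)
print(round(runs_allowed))  # about 588 runs

# Assumption: both sets of statistics describe one hypothetical team
runs_scored <- predict_runs_scored(obp = 0.311, slg = 0.405)
expected_wins <- predict_wins(runs_scored - runs_allowed)
print(expected_wins)
```

Chaining the models this way shows how player-level statistics can be turned into an expected win total, which is the idea the lecture builds toward.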

Quick Question

Suppose you are the General Manager of a baseball team, and you are selecting two players for your team. You have a budget of $1,500,000, and you have the choice between the following players:

Player Name | OBP | SLG | Salary
Eric Chavez | 0.338 | 0.540 | $1,400,000
Jeremy Giambi | 0.391 | 0.450 | $1,065,000
Frank Menechino | 0.369 | 0.374 | $295,000
Greg Myers | 0.313 | 0.447 | $800,000
Carlos Pena | 0.361 | 0.500 | $300,000

Given your budget and the player statistics, which two players would you select?

Explanation

We would select Jeremy Giambi and Carlos Pena, since together they cost $1,365,000, which fits within the budget, and they give the highest contribution to Runs Scored. We would not select Eric Chavez, since his $1,400,000 salary leaves only $100,000 of the budget, which is not enough to sign a second player, and although he has the highest SLG, there are players with better OBP. We would not select Frank Menechino, since even though he has a high OBP, his SLG is low. We would not select Greg Myers, since he is dominated by Carlos Pena in both OBP and SLG but has a much higher salary.
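The selection can also be checked by brute force in R. This is a sketch under one assumption: a pair's value is taken to be the sum of each player's OBP and SLG terms from the Runs Scored model quoted earlier in this section (the intercept is the same for every pair, so it is ignored). The variable names are illustrative only.

```r
players <- data.frame(
  name   = c("Eric Chavez", "Jeremy Giambi", "Frank Menechino",
             "Greg Myers", "Carlos Pena"),
  obp    = c(0.338, 0.391, 0.369, 0.313, 0.361),
  slg    = c(0.540, 0.450, 0.374, 0.447, 0.500),
  salary = c(1400000, 1065000, 295000, 800000, 300000),
  stringsAsFactors = FALSE
)
budget <- 1500000

# Assumed scoring rule: OBP and SLG terms of the Runs Scored model
players$contribution <- 2737.77 * players$obp + 1584.91 * players$slg

# Enumerate all 10 possible pairs (each column holds two row indices)
pairs <- combn(nrow(players), 2)
cost  <- players$salary[pairs[1, ]] + players$salary[pairs[2, ]]
value <- players$contribution[pairs[1, ]] + players$contribution[pairs[2, ]]

# Keep only pairs within budget, then pick the highest combined contribution
affordable <- cost <= budget
best <- which(affordable)[which.max(value[affordable])]
best_pair <- players$name[pairs[, best]]
print(best_pair)
```

With only five players the enumeration is trivial, but the same pattern makes the budget constraint explicit: every pair that includes Eric Chavez is filtered out before contributions are compared.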

Quick Question

In 2012 and 2013, there were 10 teams in the MLB playoffs: the six division winners (the team with the most wins in each baseball division) and four “wild card” teams. The playoffs begin with the four wild card teams playing each other; the two winners advance, leaving eight teams. These eight teams are then paired off, and each pair plays a series of games. The four winners are paired again and play to determine which two teams will meet in the World Series.

We can assign rankings to the teams as follows:

 

  • Rank 1: the team that won the World Series
  • Rank 2: the team that lost the World Series
  • Rank 3: the two teams that lost to the teams in the World Series
  • Rank 4: the four teams that made it past the wild card round, but lost to the above four teams
  • Rank 5: the two teams that lost the wild card round

In your R console, create a corresponding rank vector by typing

teamRank = c(1,2,3,3,4,4,4,4,5,5)

In this quick question, we’ll see how well these rankings correlate with the regular season wins of the teams. In 2012, the ranking of the teams and their regular season wins were as follows:

  • Rank 1: San Francisco Giants (Wins = 94)
  • Rank 2: Detroit Tigers (Wins = 88)
  • Rank 3: New York Yankees (Wins = 95), and St. Louis Cardinals (Wins = 88)
  • Rank 4: Baltimore Orioles (Wins = 93), Oakland A’s (Wins = 94), Washington Nationals (Wins = 98), Cincinnati Reds (Wins = 97)
  • Rank 5: Texas Rangers (Wins = 93), and Atlanta Braves (Wins = 94) 

Create a vector in R called wins2012 that contains the wins of each team in 2012, in order of rank (the vector should have 10 numbers).

In 2013, the ranking of the teams and their regular season wins were as follows:

  • Rank 1: Boston Red Sox (Wins = 97)
  • Rank 2: St. Louis Cardinals (Wins = 97)
  • Rank 3: Los Angeles Dodgers (Wins = 92), and Detroit Tigers (Wins = 93)
  • Rank 4: Tampa Bay Rays (Wins = 92), Oakland A’s (Wins = 96), Pittsburgh Pirates (Wins = 94), and Atlanta Braves (Wins = 96)
  • Rank 5: Cleveland Indians (Wins = 92), and Cincinnati Reds (Wins = 90) 

Create another vector in R called wins2013 that contains the wins of each team in 2013, in order of rank (the vector should have 10 numbers).

What is the correlation between teamRank and wins2012?

Exercise 1

Explanation

You should have typed the following three lines in your R console:

teamRank = c(1,2,3,3,4,4,4,4,5,5)

wins2012 = c(94,88,95,88,93,94,98,97,93,94)

cor(teamRank, wins2012)

The output of the last line is 0.3477129, which is the correlation between teamRank and wins2012.

What is the correlation between teamRank and wins2013?

Exercise 2

Explanation

You should have typed the following three lines in your R console:

teamRank = c(1,2,3,3,4,4,4,4,5,5)

wins2013 = c(97,97,92,93,92,96,94,96,92,90)

cor(teamRank, wins2013)

The output of the last line is -0.6556945, which is the correlation between teamRank and wins2013.

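Because Rank 1 is the best finish, a negative correlation (as in 2013) means that teams with more regular season wins tended to finish better in the playoffs, while a positive correlation (as in 2012) means the opposite. Since one of the correlations is positive and the other is negative, there does not seem to be a consistent pattern between regular season wins and winning the playoffs. We wouldn’t feel comfortable making a bet for this year given this data!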
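To see what cor is computing, the Pearson correlation can be written out by hand. The helper `pearson` below is a hypothetical function added for illustration; it is a sketch of the standard formula and should agree with R's built-in cor for these vectors.

```r
teamRank <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5)
wins2012 <- c(94, 88, 95, 88, 93, 94, 98, 97, 93, 94)
wins2013 <- c(97, 97, 92, 93, 92, 96, 94, 96, 92, 90)

# Hypothetical helper: Pearson correlation from deviations about the mean
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r2012 <- pearson(teamRank, wins2012)
r2013 <- pearson(teamRank, wins2013)
print(c(r2012, r2013))

# The hand-written version should match the built-in function
print(all.equal(r2012, cor(teamRank, wins2012)))
print(all.equal(r2013, cor(teamRank, wins2013)))
```

With only 10 teams per season, either value could easily arise by chance, which is consistent with the conclusion that the two seasons show no stable pattern.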

Quick Question

Which of the following is most likely to be a topic of Sabermetric research?

 
 
 

Explanation

Sabermetric research tries to take a quantitative approach to baseball. Predicting how many home runs the Oakland A's will hit next year is a very quantitative problem. While the other two topics could be areas of Sabermetric research, they are more qualitative.

While Moneyball made the use of analytics in sports very popular, baseball is not the only sport for which analytics is used. Analytics is currently used in almost every single sport, including basketball, soccer, cricket, and hockey.

Basketball: The study of analytics in basketball, called APBRmetrics, is very popular. There have been many books written in this area, including “Pro Basketball Forecast” by John Hollinger and “Basketball on Paper” by Dean Oliver. There are also several websites dedicated to the study of basketball analytics, including 82games.com. We’ll talk more about basketball during recitation.

Soccer: The soccer analytics community is currently growing, and new data is constantly being collected. Many argue that it is much harder to apply analytics to soccer, but there are several books and websites on the topic. Check out “The Numbers Game: Why Everything You Know about Football is Wrong” by Chris Anderson and David Sally, as well as the websites socceranalysts.com and soccermetrics.net.

Cricket: There are several websites dedicated to building models for evaluating player performance in cricket. Check out cricmetric.com and impactindexcricket.com.

Hockey: Analytics are used in hockey to track player performance and to better shape the composition of teams. Check out the websites hockeyanalytics.com and lighthousehockey.com.

Video 1: The Story of Moneyball

In this lecture, we will be using the dataset baseball (CSV). Download this dataset to follow along in R as we build regression models. This data comes from the Baseball Reference website.

A script file containing all of the R commands used in this lecture can be downloaded here: Unit2_Moneyball (R).

The slides from all videos in this Lecture Sequence can be downloaded here: The Power of Sports Analytics (PDF - 1.4MB).
