WEBVTT
00:00:04.391 --> 00:00:05.890
ALLISON O'HAIR:
In the previous video, we
00:00:05.890 --> 00:00:08.680
used data and linear
regression to show
00:00:08.680 --> 00:00:12.100
that if a team
scores at least 135
00:00:12.100 --> 00:00:15.460
more runs than their opponent
throughout the regular season,
00:00:15.460 --> 00:00:19.270
then we predict that that team
will win at least 95 games
00:00:19.270 --> 00:00:21.640
and make the playoffs.
00:00:21.640 --> 00:00:25.090
This means that we need to
know how many runs a team will
00:00:25.090 --> 00:00:28.390
score, which we will show
can be predicted using
00:00:28.390 --> 00:00:32.950
batting statistics, and how
many runs a team will allow,
00:00:32.950 --> 00:00:35.740
which we will show can be
predicted using fielding
00:00:35.740 --> 00:00:38.470
and pitching statistics.
00:00:38.470 --> 00:00:40.530
Let's start by creating
a linear regression
00:00:40.530 --> 00:00:43.200
model to predict runs scored.
00:00:43.200 --> 00:00:46.870
The Oakland A's are interested
in answering the question,
00:00:46.870 --> 00:00:49.710
how does a team score more runs?
00:00:49.710 --> 00:00:52.440
They discovered that
two baseball statistics
00:00:52.440 --> 00:00:55.980
were significantly more
important than anything else.
00:00:55.980 --> 00:00:59.460
These were on-base
percentage, or OBP,
00:00:59.460 --> 00:01:01.890
which is the percentage
of time a player gets
00:01:01.890 --> 00:01:06.300
on base, including walks,
and slugging percentage,
00:01:06.300 --> 00:01:10.260
or SLG, which measures how
far a player gets around
00:01:10.260 --> 00:01:15.680
the bases on his turn and
measures the power of a hitter.
00:01:15.680 --> 00:01:18.050
Most teams and
people in baseball
00:01:18.050 --> 00:01:22.970
focused instead on
batting average, or BA,
00:01:22.970 --> 00:01:25.220
which measures how
often a hitter gets
00:01:25.220 --> 00:01:27.380
on base by hitting the ball.
00:01:27.380 --> 00:01:31.070
This focuses on hits
instead of walks.
00:01:31.070 --> 00:01:34.700
The Oakland A's claimed
that on-base percentage
00:01:34.700 --> 00:01:36.650
was the most important.
00:01:36.650 --> 00:01:39.350
Slugging percentage
was important.
00:01:39.350 --> 00:01:42.710
And batting average
was overvalued.
00:01:42.710 --> 00:01:45.860
Let's see if we can use
linear regression in R
00:01:45.860 --> 00:01:48.500
to verify which
baseball statistics are
00:01:48.500 --> 00:01:51.720
more important to predict runs.
00:01:51.720 --> 00:01:55.200
In the previous video, we
created our dataset in R
00:01:55.200 --> 00:01:57.000
and called it Moneyball.
00:01:57.000 --> 00:01:59.670
Let's take a look at the
structure of our data again.
00:02:02.410 --> 00:02:05.230
Our dataset includes
many variables, including
00:02:05.230 --> 00:02:10.090
runs scored, or RS,
on-base percentage, OBP,
00:02:10.090 --> 00:02:15.680
slugging percentage, SLG,
and batting average, BA.
00:02:15.680 --> 00:02:19.130
We want to see if we can use
linear regression to predict
00:02:19.130 --> 00:02:22.880
runs scored using the
three hitting statistics,
00:02:22.880 --> 00:02:25.970
on-base percentage, slugging
percentage, and batting
00:02:25.970 --> 00:02:27.480
average.
00:02:27.480 --> 00:02:30.740
So let's build a linear
regression equation using
00:02:30.740 --> 00:02:34.400
the lm function to
predict runs scored
00:02:34.400 --> 00:02:41.300
using the independent
variables OBP, SLG, and BA.
00:02:41.300 --> 00:02:46.040
Our dataset is Moneyball.
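
NOTE
The model just described can be sketched in R roughly as follows. This is a
minimal sketch, not the lecture's own script: the data frame is a synthetic
stand-in so the example runs by itself, the numbers used to generate it are
arbitrary, and the lowercase name moneyball and the object name RunsReg are
assumptions. Only the column names RS, OBP, SLG, and BA come from the video.
```r
# Synthetic stand-in for the lecture's data frame; values are illustrative only.
set.seed(1)
n <- 200
BA  <- rnorm(n, mean = 0.260, sd = 0.012)
OBP <- BA + rnorm(n, mean = 0.065, sd = 0.008)        # OBP built on top of BA
SLG <- 1.2 * BA + rnorm(n, mean = 0.085, sd = 0.020)  # SLG also tied to BA
RS  <- -800 + 2500 * OBP + 1500 * SLG + rnorm(n, sd = 25)  # arbitrary generating rule
moneyball <- data.frame(RS, OBP, SLG, BA)
# Regress runs scored on the three hitting statistics.
RunsReg <- lm(RS ~ OBP + SLG + BA, data = moneyball)
# Coefficients, significance levels, and R squared appear in the summary.
summary(RunsReg)
```
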
00:02:46.040 --> 00:02:50.950
If we look at the summary
of our regression equation,
00:02:50.950 --> 00:02:53.580
we can see that all of
our independent variables
00:02:53.580 --> 00:02:55.470
are highly significant.
00:02:55.470 --> 00:02:59.290
And our R squared is 0.93.
00:02:59.290 --> 00:03:01.050
If we look at our
coefficients, we
00:03:01.050 --> 00:03:04.950
see that the coefficient for
batting average is negative.
00:03:04.950 --> 00:03:08.500
This says that all other
things being equal,
00:03:08.500 --> 00:03:13.020
a team with a higher batting
average will score fewer runs.
00:03:13.020 --> 00:03:15.660
But this is a bit
counterintuitive.
00:03:15.660 --> 00:03:19.230
What's going on here is a
case of multicollinearity.
00:03:19.230 --> 00:03:22.540
These three hitting statistics
are highly correlated.
00:03:22.540 --> 00:03:25.330
So it's hard to
interpret our model.
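
NOTE
One way to see the correlation mentioned here is to print the pairwise
correlations of the three hitting statistics. The sketch below uses synthetic
data generated so that the three columns move together; the data frame name
hitting and every number in it are assumptions, not values from the lecture.
```r
# Synthetic hitting statistics that share a common BA component.
set.seed(1)
n <- 200
BA  <- rnorm(n, mean = 0.260, sd = 0.012)
OBP <- BA + rnorm(n, mean = 0.065, sd = 0.008)
SLG <- 1.2 * BA + rnorm(n, mean = 0.085, sd = 0.020)
hitting <- data.frame(OBP, SLG, BA)
# Correlation matrix; off-diagonal entries near 1 signal multicollinearity.
round(cor(hitting), 2)
```
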
00:03:25.330 --> 00:03:28.230
Let's see if we can remove
batting average, the variable
00:03:28.230 --> 00:03:31.800
with the negative coefficient
and the least significance,
00:03:31.800 --> 00:03:37.180
to see if we can improve the
interpretability of our model.
00:03:37.180 --> 00:03:39.360
So let's rerun our
regression equation
00:03:39.360 --> 00:03:42.210
without the variable BA.
00:03:42.210 --> 00:03:45.580
If we look at the summary of
our new regression equation,
00:03:45.580 --> 00:03:49.750
we can see that our variables
are again highly significant.
00:03:49.750 --> 00:03:51.540
The coefficients
are all positive,
00:03:51.540 --> 00:03:53.580
as we would expect them to be.
00:03:53.580 --> 00:03:59.530
And our R squared is now 0.9296,
or about the same as before.
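
NOTE
The comparison between the two models can be sketched as below. The data are
synthetic and the object names full and reduced are hypothetical, so the
printed values will not match the R squared figures quoted in the video. The
point is only the mechanics of refitting without BA and comparing the fit.
```r
# Synthetic stand-in data; the generating rule deliberately leaves BA out.
set.seed(1)
n <- 200
BA  <- rnorm(n, mean = 0.260, sd = 0.012)
OBP <- BA + rnorm(n, mean = 0.065, sd = 0.008)
SLG <- 1.2 * BA + rnorm(n, mean = 0.085, sd = 0.020)
RS  <- -800 + 2500 * OBP + 1500 * SLG + rnorm(n, sd = 25)
moneyball <- data.frame(RS, OBP, SLG, BA)
# Three-variable model and the simpler model without BA.
full    <- lm(RS ~ OBP + SLG + BA, data = moneyball)
reduced <- lm(RS ~ OBP + SLG, data = moneyball)
# Dropping a predictor cannot raise R squared; check how little it falls.
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```
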
00:03:59.530 --> 00:04:03.180
So this model is simpler
with one fewer variable,
00:04:03.180 --> 00:04:05.640
but has about the same
predictive ability
00:04:05.640 --> 00:04:08.240
as the three-variable model.
00:04:08.240 --> 00:04:10.220
You can experiment
and see that if we
00:04:10.220 --> 00:04:13.340
had taken out on-base percentage
or slugging percentage
00:04:13.340 --> 00:04:16.160
instead of batting
average, our R squared
00:04:16.160 --> 00:04:19.380
would have gone down more.
00:04:19.380 --> 00:04:21.450
If we look at the
coefficients, we
00:04:21.450 --> 00:04:24.450
can see that the coefficient
for on-base percentage
00:04:24.450 --> 00:04:26.760
is significantly higher
than the coefficient
00:04:26.760 --> 00:04:29.250
for slugging percentage.
00:04:29.250 --> 00:04:32.370
Since these variables are
on about the same scale,
00:04:32.370 --> 00:04:35.550
this tells us that on-base
percentage is worth more
00:04:35.550 --> 00:04:37.870
than slugging percentage.
00:04:37.870 --> 00:04:41.650
So using linear regression, we're
able to verify the claims made
00:04:41.650 --> 00:04:45.730
in Moneyball that batting
average is overvalued,
00:04:45.730 --> 00:04:49.300
on-base percentage is the
most important, and slugging
00:04:49.300 --> 00:04:50.830
percentage is important.
00:04:50.830 --> 00:04:53.260
We can create a
very similar model
00:04:53.260 --> 00:04:56.680
to predict runs allowed
or opponent runs.
00:04:56.680 --> 00:05:00.190
This model uses pitching
statistics: opponents'
00:05:00.190 --> 00:05:05.590
on-base percentage, or OOBP, and
opponents' slugging percentage,
00:05:05.590 --> 00:05:07.810
or OSLG.
00:05:07.810 --> 00:05:10.810
The statistics are
computed in the same way
00:05:10.810 --> 00:05:13.810
as on-base percentage
and slugging percentage,
00:05:13.810 --> 00:05:16.510
but use the actions of
the opposing batters
00:05:16.510 --> 00:05:19.720
against our team's
pitchers and fielders.
00:05:19.720 --> 00:05:23.920
Using our dataset in R, we can
build a linear regression model
00:05:23.920 --> 00:05:27.700
to predict runs allowed using
opponents' on-base percentage
00:05:27.700 --> 00:05:30.190
and opponents'
slugging percentage.
00:05:30.190 --> 00:05:33.580
This is, again, a very strong
model with an R squared
00:05:33.580 --> 00:05:35.950
value of 0.91.
00:05:35.950 --> 00:05:39.230
And both variables
are significant.
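
NOTE
The runs-allowed model has the same shape as the runs-scored model. The
sketch below assumes the outcome column is called RA and uses a synthetic
data frame named pitching; both names, the object name RunsAllowedReg, and
all generated numbers are assumptions. Only OOBP and OSLG are named in the
video.
```r
# Synthetic opponent statistics; values are illustrative only.
set.seed(2)
n <- 200
OOBP <- rnorm(n, mean = 0.330, sd = 0.015)
OSLG <- rnorm(n, mean = 0.400, sd = 0.025)
RA   <- -800 + 2500 * OOBP + 1500 * OSLG + rnorm(n, sd = 25)  # arbitrary generating rule
pitching <- data.frame(RA, OOBP, OSLG)
# Regress runs allowed on the two opponent statistics.
RunsAllowedReg <- lm(RA ~ OOBP + OSLG, data = pitching)
summary(RunsAllowedReg)
```
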
00:05:39.230 --> 00:05:41.120
In the next video,
we'll show how
00:05:41.120 --> 00:05:44.870
we can apply these models
to predict whether or not
00:05:44.870 --> 00:05:47.560
a team will make the playoffs.