WEBVTT

00:00:09.500 --> 00:00:12.960
The topic of this recitation
is election forecasting,

00:00:12.960 --> 00:00:14.850
which is the art and
science of predicting

00:00:14.850 --> 00:00:17.690
the winner of an election
before any votes are actually

00:00:17.690 --> 00:00:22.910
cast using polling data
from likely voters.

00:00:22.910 --> 00:00:26.170
In this recitation, we are going
to look at the United States

00:00:26.170 --> 00:00:27.740
presidential election.

00:00:27.740 --> 00:00:29.170
In the United
States, a president

00:00:29.170 --> 00:00:30.800
is elected every four years.

00:00:30.800 --> 00:00:33.260
And while there are a number
of different political parties

00:00:33.260 --> 00:00:35.470
in the US, generally
there are only two

00:00:35.470 --> 00:00:37.260
competitive candidates.

00:00:37.260 --> 00:00:38.720
There's the
Republican candidate,

00:00:38.720 --> 00:00:40.780
who tends to be
more conservative,

00:00:40.780 --> 00:00:43.550
and the Democratic candidate,
who's more liberal.

00:00:43.550 --> 00:00:46.290
So for instance a recent
Republican president

00:00:46.290 --> 00:00:49.570
was George W. Bush, and a
recent Democratic president

00:00:49.570 --> 00:00:52.400
was Barack Obama.

00:00:52.400 --> 00:00:55.340
Now while in many countries
the leader of the country

00:00:55.340 --> 00:00:58.840
is elected using the
simple candidate who

00:00:58.840 --> 00:01:01.970
receives the largest number of
votes across the entire country

00:01:01.970 --> 00:01:04.540
is elected, in the
United States it's

00:01:04.540 --> 00:01:06.850
significantly more complicated.

00:01:06.850 --> 00:01:08.860
There are 50 states
in the United States,

00:01:08.860 --> 00:01:11.730
and each is assigned a
number of electoral votes

00:01:11.730 --> 00:01:13.539
based on its population.

00:01:13.539 --> 00:01:16.170
So for instance, the most
populous state, California,

00:01:16.170 --> 00:01:20.300
in 2012 had nearly 20 times
the number of electoral votes

00:01:20.300 --> 00:01:22.270
as the least populous states.

00:01:22.270 --> 00:01:23.920
And these number
of electoral votes

00:01:23.920 --> 00:01:25.850
are reassigned
periodically based

00:01:25.850 --> 00:01:28.920
on changes of populations
between states.

00:01:28.920 --> 00:01:32.180
Within a given state
in general, the system

00:01:32.180 --> 00:01:34.960
is winner take all in the sense
that the candidate who receives

00:01:34.960 --> 00:01:38.430
the most vote in that state
gets all of its electoral votes.

00:01:38.430 --> 00:01:40.310
And then across
the entire country,

00:01:40.310 --> 00:01:43.520
the candidate who receives
the most electoral votes

00:01:43.520 --> 00:01:46.920
wins the entire
presidential election.

00:01:46.920 --> 00:01:49.700
Now while it seems like a
somewhat subtle distinction,

00:01:49.700 --> 00:01:53.259
the electoral college versus
the simple popular vote model,

00:01:53.259 --> 00:01:55.600
it can have very
significant consequences

00:01:55.600 --> 00:01:57.400
on the outcome of the election.

00:01:57.400 --> 00:02:00.500
As an example, let's look at
the 2000 presidential election

00:02:00.500 --> 00:02:03.670
between George W.
Bush and Al Gore.

00:02:03.670 --> 00:02:06.340
As we can see on the
right here, Al Gore

00:02:06.340 --> 00:02:10.520
received more than 500,000 more
votes across the entire country

00:02:10.520 --> 00:02:14.060
than George W. Bush in
terms of the popular vote.

00:02:14.060 --> 00:02:15.820
But in terms of
the electoral vote,

00:02:15.820 --> 00:02:18.000
because of how those
votes were distributed,

00:02:18.000 --> 00:02:20.060
George Bush actually
won the election

00:02:20.060 --> 00:02:25.050
because he received five more
electoral votes than Al Gore.

00:02:25.050 --> 00:02:27.860
So our goal will be to
use polling data that's

00:02:27.860 --> 00:02:30.850
collected from likely
voters before the election

00:02:30.850 --> 00:02:32.960
to predict the
winner in each state,

00:02:32.960 --> 00:02:34.570
and therefore to
enable us to predict

00:02:34.570 --> 00:02:37.190
the winner of the
entire election

00:02:37.190 --> 00:02:39.730
in the electoral college system.

00:02:39.730 --> 00:02:42.260
While election prediction has
long attracted some attention,

00:02:42.260 --> 00:02:43.680
there's been a
particular interest

00:02:43.680 --> 00:02:46.540
in the problem for the
2012 presidential election,

00:02:46.540 --> 00:02:48.550
when then-New York
Times columnist Nate

00:02:48.550 --> 00:02:50.810
Silver took on the
task of predicting

00:02:50.810 --> 00:02:54.020
the winner in each state.

00:02:54.020 --> 00:02:55.829
To carry out this
prediction task,

00:02:55.829 --> 00:02:59.329
we're going to use some data
from RealClearPolitics.com that

00:02:59.329 --> 00:03:01.740
basically represents polling
data that was collected

00:03:01.740 --> 00:03:06.560
in the months leading up to
the 2004, 2008, and 2012 US

00:03:06.560 --> 00:03:08.400
presidential elections.

00:03:08.400 --> 00:03:10.870
Each row in the data
set represents a state

00:03:10.870 --> 00:03:13.150
in a particular election year.

00:03:13.150 --> 00:03:15.830
And the dependent variable,
which is called Republican,

00:03:15.830 --> 00:03:17.280
is a binary outcome.

00:03:17.280 --> 00:03:19.650
It's 1 if the Republican
won that state

00:03:19.650 --> 00:03:21.329
in that particular
election year,

00:03:21.329 --> 00:03:23.829
and a 0 if a Democrat won.

00:03:23.829 --> 00:03:25.610
The independent
variables, again,

00:03:25.610 --> 00:03:27.960
are related to polling
data in that state.

00:03:27.960 --> 00:03:31.870
So for instance, the Rasmussen
and SurveyUSA variables

00:03:31.870 --> 00:03:33.970
are related to two
major polls that

00:03:33.970 --> 00:03:37.040
are assigned across many
different states in the United

00:03:37.040 --> 00:03:38.140
States.

00:03:38.140 --> 00:03:40.720
And it represents the
percentage of voters

00:03:40.720 --> 00:03:42.950
who said they were
likely to vote Republican

00:03:42.950 --> 00:03:44.410
minus the percentage
who said they

00:03:44.410 --> 00:03:45.710
were likely to vote Democrat.

00:03:45.710 --> 00:03:49.520
So for instance, if the variable
SurveyUSA in our data set

00:03:49.520 --> 00:03:53.310
has value -6, it means
that 6% more voters

00:03:53.310 --> 00:03:55.420
said they were likely
to vote Democrat

00:03:55.420 --> 00:03:58.880
than said they were likely to
vote Republican in that state.

00:03:58.880 --> 00:04:00.430
We have two additional
variables that

00:04:00.430 --> 00:04:03.290
capture polling data from
a wider range of polls.

00:04:03.290 --> 00:04:06.420
Rasmussen and SurveyUSA are
definitely not the only polls

00:04:06.420 --> 00:04:09.100
that are run on a
state by state basis.

00:04:09.100 --> 00:04:12.240
DiffCount counts the number
of all the polls leading up

00:04:12.240 --> 00:04:16.000
to the election that predicted a
Republican winner in the state,

00:04:16.000 --> 00:04:19.160
minus the number of polls that
predicted a Democratic winner.

00:04:19.160 --> 00:04:22.420
And PropR, or
proportion Republican,

00:04:22.420 --> 00:04:24.850
has the proportion of all
those polls leading up

00:04:24.850 --> 00:04:29.160
to the election that
predicted a Republican winner.