WEBVTT

00:00:09.500 --> 00:00:11.280
In this recitation,
we're going to talk

00:00:11.280 --> 00:00:13.170
about predictive
coding-- an emerging

00:00:13.170 --> 00:00:17.820
use of text analytics in the
area of criminal justice.

00:00:17.820 --> 00:00:20.930
We'll start with the story of
Enron, the United States energy

00:00:20.930 --> 00:00:23.650
company based out of Houston,
Texas, that was involved

00:00:23.650 --> 00:00:26.370
in a number of electricity
production and distribution

00:00:26.370 --> 00:00:27.540
markets.

00:00:27.540 --> 00:00:30.560
In the early 2000s,
Enron was a hot company,

00:00:30.560 --> 00:00:33.860
with a market capitalization
exceeding $60 billion,

00:00:33.860 --> 00:00:37.340
and Fortune magazine ranked
it as the most innovative US

00:00:37.340 --> 00:00:39.700
company six years in a row.

00:00:39.700 --> 00:00:42.850
Now, all that changed
in 2001 with the news

00:00:42.850 --> 00:00:45.820
of widespread accounting
fraud at the firm.

00:00:45.820 --> 00:00:48.950
This massive fraud led to
Enron's bankruptcy, the largest

00:00:48.950 --> 00:00:52.280
ever at the time, and led to
Enron's accounting firm, Arthur

00:00:52.280 --> 00:00:53.840
Andersen, dissolving.

00:00:53.840 --> 00:00:56.470
To this day, Enron
remains a symbol

00:00:56.470 --> 00:00:59.750
of corporate greed
and corruption.

00:00:59.750 --> 00:01:02.150
Now, while Enron's
collapse stemmed largely

00:01:02.150 --> 00:01:04.260
from accounting
fraud, the firm also

00:01:04.260 --> 00:01:05.820
faced sanctions
for its involvement

00:01:05.820 --> 00:01:09.110
in the California
electricity crisis.

00:01:09.110 --> 00:01:12.160
California is the most populous
state in the United States.

00:01:12.160 --> 00:01:15.900
And in 2000 to 2001, it had
a number of power blackouts,

00:01:15.900 --> 00:01:19.220
despite having sufficient
generating capacity.

00:01:19.220 --> 00:01:21.970
It later surfaced that
Enron played a key role

00:01:21.970 --> 00:01:25.050
in this energy crisis
by artificially reducing

00:01:25.050 --> 00:01:27.880
power supply to
spike prices and then

00:01:27.880 --> 00:01:30.840
making a profit from
this market instability.

00:01:30.840 --> 00:01:34.160
The Federal Energy Regulatory
Commission, or FERC,

00:01:34.160 --> 00:01:37.000
investigated Enron's
involvement in the crisis.

00:01:37.000 --> 00:01:38.920
And this investigation
eventually

00:01:38.920 --> 00:01:42.979
led to a $1.52 billion settlement.

00:01:42.979 --> 00:01:45.440
FERC's investigation
into Enron will

00:01:45.440 --> 00:01:49.200
be the topic of
today's recitation.

00:01:49.200 --> 00:01:52.570
Now, Enron was a huge company,
and its corporate servers

00:01:52.570 --> 00:01:56.190
contained millions of emails
and other electronic files.

00:01:56.190 --> 00:01:58.000
Sifting through these
documents to find

00:01:58.000 --> 00:01:59.860
the ones relevant
to an investigation

00:01:59.860 --> 00:02:01.780
is no simple task.

00:02:01.780 --> 00:02:05.090
In law, this electronic
document retrieval process

00:02:05.090 --> 00:02:07.350
is called the
e-discovery problem,

00:02:07.350 --> 00:02:11.390
and relevant files are
called responsive documents.

00:02:11.390 --> 00:02:15.190
Traditionally, the e-discovery
problem has been solved

00:02:15.190 --> 00:02:17.910
by using keyword search--
in our case, perhaps,

00:02:17.910 --> 00:02:20.250
searching for phrases
like "electricity bid"

00:02:20.250 --> 00:02:23.110
or "energy schedule"--
followed by an expensive

00:02:23.110 --> 00:02:25.820
and time-consuming
manual review process,

00:02:25.820 --> 00:02:28.260
in which attorneys read
through thousands of documents

00:02:28.260 --> 00:02:31.030
to determine which
ones are responsive.

00:02:31.030 --> 00:02:34.780
However, predictive
coding is a new technique,

00:02:34.780 --> 00:02:37.450
in which attorneys manually
label some documents

00:02:37.450 --> 00:02:40.010
and then use text
analytics models trained

00:02:40.010 --> 00:02:41.950
on the manually
labeled documents

00:02:41.950 --> 00:02:44.150
to predict which of
the remaining documents

00:02:44.150 --> 00:02:46.480
are responsive.

00:02:46.480 --> 00:02:48.570
Now, as part of
its investigation,

00:02:48.570 --> 00:02:51.910
the FERC released hundreds
of thousands of emails

00:02:51.910 --> 00:02:55.370
from top executives at Enron,
creating the largest publicly

00:02:55.370 --> 00:02:57.480
available set of emails today.

00:02:57.480 --> 00:03:01.410
We will use this data set, called
the Enron Corpus, to perform

00:03:01.410 --> 00:03:04.330
predictive coding
in this recitation.

00:03:04.330 --> 00:03:06.980
Our data set contains
just two fields--

00:03:06.980 --> 00:03:10.030
email, which is the text
of the email in question,

00:03:10.030 --> 00:03:12.760
and responsive, which is
whether the email relates

00:03:12.760 --> 00:03:16.110
to energy schedules or bids.

00:03:16.110 --> 00:03:17.780
The labels for these
emails were made

00:03:17.780 --> 00:03:21.530
by attorneys as part of the
2010 Text Retrieval Conference

00:03:21.530 --> 00:03:25.050
Legal Track, a predictive
coding competition.