WEBVTT

00:00:04.490 --> 00:00:09.180
Let us discuss data sources
in the health care industry.

00:00:09.180 --> 00:00:14.800
So the industry is data-rich,
but data may be hard to access.

00:00:14.800 --> 00:00:17.950
Sometimes it involves
unstructured data

00:00:17.950 --> 00:00:21.280
like doctors' notes.

00:00:21.280 --> 00:00:24.660
Often the data is hard
to get due to differences

00:00:24.660 --> 00:00:26.520
in technology.

00:00:26.520 --> 00:00:31.170
Hospitals in southern
Massachusetts versus California

00:00:31.170 --> 00:00:36.110
might use different technologies
and different platforms.

00:00:36.110 --> 00:00:40.460
Finally, there are strong
privacy laws, HIPAA,

00:00:40.460 --> 00:00:42.540
around health care data sharing.

00:00:42.540 --> 00:00:43.420
So what is available?

00:00:48.170 --> 00:00:51.230
Claims data is a major source.

00:00:51.230 --> 00:00:54.520
Claims data are requests
for reimbursement submitted

00:00:54.520 --> 00:00:57.780
to insurance companies or
state-provided insurance

00:00:57.780 --> 00:01:00.320
from doctors, hospitals
and pharmacies.

00:01:03.160 --> 00:01:06.150
Another source of data is
the eligibility information

00:01:06.150 --> 00:01:08.660
for employees.

00:01:08.660 --> 00:01:12.320
And finally, demographic
information: gender and age.

00:01:15.539 --> 00:01:20.940
Let me give you some
examples of claims data.

00:01:20.940 --> 00:01:25.160
So this shows six
different claims.

00:01:25.160 --> 00:01:28.180
Let's consider this one.

00:01:28.180 --> 00:01:31.560
So this is the provider's name.

00:01:31.560 --> 00:01:35.200
The corresponding
diagnostic code.

00:01:35.200 --> 00:01:41.080
This is about upper
respiratory disorders.

00:01:41.080 --> 00:01:46.400
This is another code
associated with the diagnosis.

00:01:46.400 --> 00:01:52.640
This is the scientific
term for the diagnosis.

00:01:52.640 --> 00:01:55.950
The specific code again.

00:01:55.950 --> 00:02:01.620
This was an office visit, and
it's an established patient.

00:02:01.620 --> 00:02:03.760
The date.

00:02:03.760 --> 00:02:12.460
And the amount of money that
was claimed by the physician.

00:02:12.460 --> 00:02:14.400
Other claims are similar.

00:02:17.920 --> 00:02:26.290
As we see, the claims data is
a rich, structured data source.

00:02:26.290 --> 00:02:29.620
It is very high dimensional.

00:02:29.620 --> 00:02:34.470
For example, claims
involving diagnoses

00:02:34.470 --> 00:02:37.329
involve thousands
of different codes.

00:02:37.329 --> 00:02:40.870
Similarly with drugs, where
there are tens of thousands,

00:02:40.870 --> 00:02:43.000
and procedures.

00:02:43.000 --> 00:02:46.530
However, this collection
of data does not

00:02:46.530 --> 00:02:49.890
capture all aspects of a
person's treatment or health.

00:02:49.890 --> 00:02:53.480
Many things must be inferred.

00:02:53.480 --> 00:02:56.300
Unlike electronic
medical records,

00:02:56.300 --> 00:02:58.510
we do not know the
results of a test,

00:02:58.510 --> 00:03:00.660
only that the test
was administered.

00:03:00.660 --> 00:03:07.070
For example, we do not know
the results of a blood test,

00:03:07.070 --> 00:03:09.240
but we do know that the
blood test was administered.

00:03:15.060 --> 00:03:19.550
The specific exercise we are
going to see in this lecture

00:03:19.550 --> 00:03:25.350
is an analytics approach
to building models starting

00:03:25.350 --> 00:03:28.700
with 2.4 million people
over a three-year span.

00:03:33.150 --> 00:03:37.570
The observation period
was 2001 to 2003.

00:03:37.570 --> 00:03:40.270
This is where this
data is coming from.

00:03:40.270 --> 00:03:42.990
And then out of sample,
we make predictions

00:03:42.990 --> 00:03:46.590
for the period of 2003 and 2004.

00:03:46.590 --> 00:03:48.600
This was in the early
years of D2Hawkeye.

00:03:52.610 --> 00:03:56.990
Out of the 2.4 million people,
we included only people

00:03:56.990 --> 00:03:59.720
with data for at least 10
months in both periods,

00:03:59.720 --> 00:04:02.850
both in the observation
period and the results period.

00:04:02.850 --> 00:04:06.490
This decreased the
data to 400,000 people.