WEBVTT

00:00:04.810 --> 00:00:09.240
So let us explain
what claims data is.

00:00:09.240 --> 00:00:17.720
So medical claims are generated
when a patient visits a doctor.

00:00:17.720 --> 00:00:22.600
Medical claims include diagnosis
codes, procedure codes,

00:00:22.600 --> 00:00:25.370
as well as costs.

00:00:25.370 --> 00:00:31.070
Pharmacy claims involve drugs,
the quantity of these drugs,

00:00:31.070 --> 00:00:36.150
the prescribing doctor, as
well as the medication costs.

00:00:36.150 --> 00:00:39.500
Claims data are
electronically available, they

00:00:39.500 --> 00:00:43.490
are standardized, and they use
well-established codes.

00:00:43.490 --> 00:00:45.630
However, since
humans generate them,

00:00:45.630 --> 00:00:47.980
they are not 100% accurate.

00:00:47.980 --> 00:00:52.250
Under-reporting is also
common, in the sense

00:00:52.250 --> 00:00:55.730
that it's a tedious job
to record these claims,

00:00:55.730 --> 00:00:58.320
and as a result, people
often omit some of them.

00:00:58.320 --> 00:01:03.950
Also, claims for hospital
visits can be vague.

00:01:03.950 --> 00:01:07.430
In creating a data
set, our objective

00:01:07.430 --> 00:01:11.590
was to assess
health care quality.

00:01:11.590 --> 00:01:15.820
So we used a large health
insurance claims database,

00:01:15.820 --> 00:01:22.270
and we randomly selected
131 diabetes patients.

00:01:22.270 --> 00:01:27.360
The ages ranged from
35 to 55, and the costs

00:01:27.360 --> 00:01:32.340
were in the neighborhood
of $10,000 to $20,000.

00:01:32.340 --> 00:01:35.190
The period in which these
claims were recorded

00:01:35.190 --> 00:01:41.780
was September 1, 2003
to August 31, 2005.

00:01:41.780 --> 00:01:44.590
An expert physician
reviewed the claims

00:01:44.590 --> 00:01:48.020
and wrote descriptive
notes, like "ongoing use

00:01:48.020 --> 00:01:52.210
of narcotics"; "only on Avandia,
not a good first choice drug";

00:01:52.210 --> 00:01:55.140
"had regular visits,
mammogram, and immunizations";

00:01:55.140 --> 00:01:59.100
"was given home
testing supplies".

00:01:59.100 --> 00:02:02.810
After this review,
this expert physician

00:02:02.810 --> 00:02:07.520
rated the quality of care on a
two-point scale, poor or good.

00:02:07.520 --> 00:02:12.000
Examples included
"I'd say care was poor,

00:02:12.000 --> 00:02:13.080
poorly treated diabetes";

00:02:13.080 --> 00:02:17.900
"no eye exam, but overall
I'd say high quality".

00:02:17.900 --> 00:02:20.900
So based on these comments,
we extracted variables.

00:02:20.900 --> 00:02:24.070
The dependent variable
was the quality of care.

00:02:24.070 --> 00:02:27.720
The independent variables
involved the ongoing use

00:02:27.720 --> 00:02:32.150
of narcotics; only on Avandia,
not a good first choice drug;

00:02:32.150 --> 00:02:34.520
had regular visits,
mammogram, and immunizations;

00:02:34.520 --> 00:02:37.540
was given home testing supplies.

00:02:37.540 --> 00:02:39.660
Overall, the
independent variables

00:02:39.660 --> 00:02:41.900
involved diabetes
treatment variables,

00:02:41.900 --> 00:02:45.710
patient demographics, health
care utilization, providers,

00:02:45.710 --> 00:02:47.160
claims, and prescriptions.

00:02:50.720 --> 00:02:55.270
The dependent variable was
modeled as a binary variable --

00:02:55.270 --> 00:02:59.100
1 for low-quality care and
0 for high-quality care.

00:02:59.100 --> 00:03:01.770
This is by its nature
a categorical variable.

00:03:01.770 --> 00:03:05.040
It only takes two
possible values.

00:03:05.040 --> 00:03:08.530
We have seen linear
regression as a way

00:03:08.530 --> 00:03:11.740
of predicting
continuous outcomes.

00:03:11.740 --> 00:03:17.190
Of course, we can
utilize linear regression

00:03:17.190 --> 00:03:19.470
to predict quality of
care here, but then we

00:03:19.470 --> 00:03:22.710
have to round the
outcome to 0 or 1.

00:03:22.710 --> 00:03:28.260
Instead, we will
explain in this lecture

00:03:28.260 --> 00:03:31.090
how we can use logistic
regression, which

00:03:31.090 --> 00:03:33.290
is an extension of
linear regression,

00:03:33.290 --> 00:03:36.590
to environments where
the dependent variable is

00:03:36.590 --> 00:03:37.280
categorical.

00:03:37.280 --> 00:03:40.460
In our case, 0 or 1.