WEBVTT
00:00:04.500 --> 00:00:08.310
Let's now see how the baseline
method used by D2Hawkeye
00:00:08.310 --> 00:00:11.250
would perform on this data set.
00:00:11.250 --> 00:00:13.140
The baseline method
would predict
00:00:13.140 --> 00:00:16.320
that the cost bucket
for a patient in 2009
00:00:16.320 --> 00:00:19.860
will be the same
as it was in 2008.
00:00:19.860 --> 00:00:23.680
So let's create a classification
matrix to compute the accuracy
00:00:23.680 --> 00:00:26.830
for the baseline
method on the test set.
00:00:26.830 --> 00:00:31.800
So we'll use the table function,
where the actual outcomes are
00:00:31.800 --> 00:00:42.110
ClaimsTest$bucket2009,
and our predictions are
00:00:42.110 --> 00:00:53.530
ClaimsTest$bucket2008.
00:00:53.530 --> 00:00:57.570
The accuracy is the sum of the
diagonal, the observations that
00:00:57.570 --> 00:01:00.030
were classified
correctly, divided
00:01:00.030 --> 00:01:03.850
by the total number of
observations in our test set.
00:01:03.850 --> 00:01:26.780
So we want to add up 110138
+ 10721 + 2774 + 1539 + 104.
00:01:26.780 --> 00:01:29.039
And we want to divide
by the total number
00:01:29.039 --> 00:01:32.960
of observations in this
table, or the number of rows
00:01:32.960 --> 00:01:33.640
in ClaimsTest.
00:01:39.380 --> 00:01:44.380
So the accuracy of the
baseline method is 0.68.
00:01:44.380 --> 00:01:46.970
Now how about the penalty error?
00:01:46.970 --> 00:01:50.729
To compute this, we need to
first create a penalty matrix
00:01:50.729 --> 00:01:53.460
in R. Keep in mind
that we'll put
00:01:53.460 --> 00:01:57.560
the actual outcomes on the
left, and the predicted outcomes
00:01:57.560 --> 00:01:59.210
on the top.
00:01:59.210 --> 00:02:05.460
So we'll call it
PenaltyMatrix, which
00:02:05.460 --> 00:02:10.289
will be equal to a
matrix object in R.
00:02:10.289 --> 00:02:12.020
And then we need
to give the numbers
00:02:12.020 --> 00:02:18.650
that should fill up the
matrix: 0, 1, 2, 3, 4.
00:02:18.650 --> 00:02:21.740
That'll be the first row.
00:02:21.740 --> 00:02:26.590
And then 2, 0, 1, 2, 3.
00:02:26.590 --> 00:02:28.800
That'll be the second row.
00:02:28.800 --> 00:02:34.170
4, 2, 0, 1, 2 for the third row.
00:02:34.170 --> 00:02:39.110
6, 4, 2, 0, 1 for
the fourth row.
00:02:39.110 --> 00:02:45.720
And finally, 8, 6, 4,
2, 0 for the fifth row.
00:02:45.720 --> 00:02:49.130
And then after the
parentheses, type a comma,
00:02:49.130 --> 00:03:00.420
and then byrow = TRUE,
and then add nrow = 5.
00:03:00.420 --> 00:03:03.360
Close the parentheses,
and hit Enter.
00:03:03.360 --> 00:03:05.400
So what did we just create?
00:03:05.400 --> 00:03:09.110
Type PenaltyMatrix
and hit Enter.
00:03:12.260 --> 00:03:15.700
So with the previous command,
we filled up our matrix row
00:03:15.700 --> 00:03:17.620
by row.
00:03:17.620 --> 00:03:20.160
The actual outcomes
are on the left,
00:03:20.160 --> 00:03:22.950
and the predicted
outcomes are on the top.
00:03:22.950 --> 00:03:25.810
So as we saw in the
slides, the worst outcomes
00:03:25.810 --> 00:03:29.160
are when we predict
a low cost bucket,
00:03:29.160 --> 00:03:33.110
but the actual outcome
is a high cost bucket.
00:03:33.110 --> 00:03:34.900
We still give
ourselves a penalty
00:03:34.900 --> 00:03:37.670
when we predict a
high cost bucket
00:03:37.670 --> 00:03:43.090
and it's actually a low cost
bucket, but it's not as bad.
00:03:43.090 --> 00:03:46.820
So now to compute the penalty
error of the baseline method,
00:03:46.820 --> 00:03:49.570
we can multiply our
classification matrix
00:03:49.570 --> 00:03:51.700
by the penalty matrix.
00:03:51.700 --> 00:03:54.530
So go ahead and hit the
Up arrow to get back
00:03:54.530 --> 00:03:56.390
to where you created
the classification
00:03:56.390 --> 00:03:59.210
matrix with the table function.
00:03:59.210 --> 00:04:02.950
And we're going to surround
the entire table function
00:04:02.950 --> 00:04:08.150
by as.matrix to
convert it to a matrix
00:04:08.150 --> 00:04:11.610
so that we can multiply
it by our penalty matrix.
00:04:11.610 --> 00:04:14.730
So now at the end,
close the parentheses
00:04:14.730 --> 00:04:21.220
and then multiply by
PenaltyMatrix and hit Enter.
00:04:21.220 --> 00:04:24.100
So what this does is
it takes each number
00:04:24.100 --> 00:04:27.160
in the classification
matrix and multiplies it
00:04:27.160 --> 00:04:31.170
by the corresponding number
in the penalty matrix.
00:04:31.170 --> 00:04:33.360
So now to compute
the penalty error,
00:04:33.360 --> 00:04:35.750
we just need to sum
it up and divide
00:04:35.750 --> 00:04:38.930
by the number of
observations in our test set.
00:04:38.930 --> 00:04:41.740
So scroll up once,
and then we'll
00:04:41.740 --> 00:04:44.760
just surround our
entire previous command
00:04:44.760 --> 00:04:45.740
by the sum function.
00:04:51.960 --> 00:05:00.400
And we'll divide by the
number of rows in ClaimsTest
00:05:00.400 --> 00:05:02.150
and hit Enter.
00:05:02.150 --> 00:05:06.850
So the penalty error for
the baseline method is 0.74.
00:05:06.850 --> 00:05:09.180
In the next video,
our goal will be
00:05:09.180 --> 00:05:14.790
to create a CART model that
has an accuracy higher than 68%
00:05:14.790 --> 00:05:18.290
and a penalty error
lower than 0.74.