WEBVTT

00:00:15.088 --> 00:00:16.880
DAVID SONTAG: So I'll
begin today's lecture

00:00:16.880 --> 00:00:20.860
by giving a brief recap
of risk stratification.

00:00:20.860 --> 00:00:24.048
We didn't get to finish talking
about survival modeling on Thursday,

00:00:24.048 --> 00:00:25.840
and so I'll go a little
bit more into that,

00:00:25.840 --> 00:00:27.590
and I'll answer some
of the questions that

00:00:27.590 --> 00:00:30.735
arose during our discussions
and on Piazza since.

00:00:30.735 --> 00:00:32.860
And then the vast majority
of today's lecture we'll

00:00:32.860 --> 00:00:35.100
be talking about a new topic--

00:00:35.100 --> 00:00:37.600
in particular, physiological
time series modeling.

00:00:37.600 --> 00:00:40.400
I'll give two examples of
physiological time series

00:00:40.400 --> 00:00:43.810
modeling-- the first one
coming from monitoring patients

00:00:43.810 --> 00:00:46.050
in intensive care units,
and the second one

00:00:46.050 --> 00:00:47.800
asking a very different
type of question--

00:00:47.800 --> 00:00:53.620
that of diagnosing patients'
heart conditions using EKGs.

00:00:53.620 --> 00:00:55.390
And both of these
correspond to readings

00:00:55.390 --> 00:00:57.040
that you had for
today's lecture,

00:00:57.040 --> 00:00:59.200
and we'll go into much
more depth on

00:00:59.200 --> 00:01:01.640
those papers today, and
I'll provide much more color

00:01:01.640 --> 00:01:02.140
around them.

00:01:05.379 --> 00:01:07.862
So just to briefly remind you
where we were on Thursday,

00:01:07.862 --> 00:01:10.320
we talked about how one could
formalize risk stratification

00:01:10.320 --> 00:01:12.840
not as a classification
problem of what would happen,

00:01:12.840 --> 00:01:15.570
let's say, in some
predefined time period,

00:01:15.570 --> 00:01:17.640
but rather thinking about
risk stratification

00:01:17.640 --> 00:01:21.390
as a regression question,
or regression task.

00:01:21.390 --> 00:01:23.910
Given what you know about
a patient at time zero,

00:01:23.910 --> 00:01:26.380
predicting time to event--

00:01:26.380 --> 00:01:29.790
so for example, here the
event might be death, divorce,

00:01:29.790 --> 00:01:31.150
college graduation.

00:01:31.150 --> 00:01:35.850
And patient one-- that event
happened at time step nine.

00:01:35.850 --> 00:01:38.340
Patient two, that event
happened at time step 12.

00:01:38.340 --> 00:01:42.510
And for patient four, we don't
know when that event happened,

00:01:42.510 --> 00:01:45.240
because it was censored.

00:01:45.240 --> 00:01:47.710
In particular, after
time step seven,

00:01:47.710 --> 00:01:50.340
we no longer get to view
any of the patients' data,

00:01:50.340 --> 00:01:52.960
and so we don't know when
that red dot would be--

00:01:52.960 --> 00:01:55.360
sometime in the future or never.

00:01:55.360 --> 00:01:57.550
So this is what we mean by
right-censored data, which

00:01:57.550 --> 00:02:01.443
is precisely what survival
modeling is aiming to solve.

00:02:01.443 --> 00:02:03.235
Are there questions
about this setup first?

00:02:06.358 --> 00:02:08.259
AUDIENCE: You flipped the x on--

00:02:08.259 --> 00:02:09.759
DAVID SONTAG: Yeah,
I realized that.

00:02:09.759 --> 00:02:11.860
I flipped the x and the o
in today's presentation,

00:02:11.860 --> 00:02:14.720
but that's not relevant.

00:02:14.720 --> 00:02:18.370
So f of t is the
probability of death,

00:02:18.370 --> 00:02:20.742
or the event occurring
at time step t.

00:02:20.742 --> 00:02:22.450
And although in this
slide I'm showing it

00:02:22.450 --> 00:02:24.300
as an unconditional
model, in general,

00:02:24.300 --> 00:02:25.875
you should think about this
as a conditional density.

00:02:25.875 --> 00:02:28.333
So you might be conditioning
on some covariates or features

00:02:28.333 --> 00:02:31.810
that you have for that
patient at baseline.

00:02:31.810 --> 00:02:34.515
And very important
for survival modeling

00:02:34.515 --> 00:02:35.890
and for the next
things I'll tell

00:02:35.890 --> 00:02:39.790
you are the survival function,
denoted as capital S of t.

00:02:39.790 --> 00:02:45.170
And that's simply 1 minus the
cumulative density function.

00:02:45.170 --> 00:02:48.040
So it's the probability
that the event occurring,

00:02:48.040 --> 00:02:49.120
which is time--

00:02:49.120 --> 00:02:51.640
which is denoted
here as capital T,

00:02:51.640 --> 00:02:54.430
occurs greater
than some little t.

00:02:54.430 --> 00:02:56.770
So it's this function,
which is simply

00:02:56.770 --> 00:02:58.450
given to you by
the integral from t

00:02:58.450 --> 00:03:01.370
to infinity of the density.
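
NOTE
Restating the definition just given in standard notation, for reference: the survival function is the upper tail of the density,
S(t) = P(T > t) = \int_t^{\infty} f(u)\,du = 1 - F(t),
where F(t) is the cumulative distribution function of the event time T.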

00:03:01.370 --> 00:03:04.520
So in pictures,
this is the density.

00:03:04.520 --> 00:03:06.230
On the x-axis is time.

00:03:06.230 --> 00:03:08.040
The y-axis is the
density function.

00:03:08.040 --> 00:03:11.660
And this black curve is
what I'm denoting as f of t.

00:03:11.660 --> 00:03:18.390
And this white area is capital s
of c, the survival probability,

00:03:18.390 --> 00:03:21.250
or survival function.

00:03:21.250 --> 00:03:21.990
Yes?

00:03:21.990 --> 00:03:23.532
AUDIENCE: So I just
want to be clear.

00:03:23.532 --> 00:03:26.446
So if you were to
integrate the entire curve,

00:03:26.446 --> 00:03:31.717
[INAUDIBLE] by infinity you're
going to be [INAUDIBLE]..

00:03:31.717 --> 00:03:33.550
DAVID SONTAG: In the
way that I described it

00:03:33.550 --> 00:03:38.950
to here, yes, because we're
talking about the time

00:03:38.950 --> 00:03:41.050
to event.

00:03:41.050 --> 00:03:44.680
But often we might be in
scenarios where the event may

00:03:44.680 --> 00:03:47.662
never occur, and so that--

00:03:47.662 --> 00:03:49.870
you can formalize that in
a couple of different ways.

00:03:49.870 --> 00:03:52.060
You could put a point
mass at infinity,

00:03:52.060 --> 00:03:55.270
or you could simply say that
the integral from 0 to infinity

00:03:55.270 --> 00:03:57.730
is some quantity less than 1.

00:03:57.730 --> 00:03:59.440
And in the readings
that I'm referencing

00:03:59.440 --> 00:04:00.850
in the very bottom of
those slides-- it shows you

00:04:00.850 --> 00:04:03.292
how you can very easily
modify all of the frameworks

00:04:03.292 --> 00:04:05.500
I'm telling you about here
to deal with that scenario

00:04:05.500 --> 00:04:07.120
where the event may never occur.

00:04:07.120 --> 00:04:09.702
But for the purposes
of my presentation,

00:04:09.702 --> 00:04:11.410
you can assume that
the event will always

00:04:11.410 --> 00:04:13.000
occur at some point.

00:04:13.000 --> 00:04:17.350
It's a very minor modification
where you, in essence, divide

00:04:17.350 --> 00:04:21.700
the densities by a constant,
which accounts for the fact

00:04:21.700 --> 00:04:26.170
that it wouldn't integrate
to one otherwise.

00:04:26.170 --> 00:04:30.100
Now, a key question
that has to be solved

00:04:30.100 --> 00:04:33.400
when trying to use a parametric
approach to survival modeling

00:04:33.400 --> 00:04:35.240
is, what should that
f of t look like?

00:04:35.240 --> 00:04:38.070
What should that density
function look like?

00:04:38.070 --> 00:04:42.040
And what I'm showing you here
is a table of some very commonly

00:04:42.040 --> 00:04:43.990
used density functions.

00:04:43.990 --> 00:04:46.060
What you see in
these two columns--

00:04:46.060 --> 00:04:49.600
on the right hand column is the
density function f of t itself.

00:04:49.600 --> 00:04:52.180
Lambda denotes some
parameter of the model.

00:04:52.180 --> 00:04:54.520
t is the time.

00:04:54.520 --> 00:04:57.940
And on this second middle
column is the survival function.

00:04:57.940 --> 00:05:01.510
So this is obtained for these
particular parametric forms

00:05:01.510 --> 00:05:06.790
by an analytical solution
solving that integral from t

00:05:06.790 --> 00:05:08.320
to infinity.

00:05:08.320 --> 00:05:11.090
This is the analytic
solution for that.

00:05:11.090 --> 00:05:13.600
And so these go by common
names of exponential,

00:05:13.600 --> 00:05:15.880
Weibull, log-normal, and so on.

00:05:15.880 --> 00:05:17.950
And critically, all of
these have support only

00:05:17.950 --> 00:05:21.192
on the positive real numbers,
because the event can never

00:05:21.192 --> 00:05:22.150
occur at negative time.
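
NOTE
For reference, the first two rows of such a table typically read as follows; parameterizations vary slightly between sources, so treat these as one common convention rather than the exact slide:
Exponential:  f(t) = \lambda e^{-\lambda t},   S(t) = e^{-\lambda t}
Weibull:      f(t) = k\lambda(\lambda t)^{k-1} e^{-(\lambda t)^k},   S(t) = e^{-(\lambda t)^k}
Both have support only on t > 0, as noted above.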

00:05:24.690 --> 00:05:27.900
Now, we live in a
day and age where

00:05:27.900 --> 00:05:32.340
we no longer have to make
standard parametric assumptions

00:05:32.340 --> 00:05:33.600
for densities.

00:05:33.600 --> 00:05:36.990
We could, for example, try
to formalize the density

00:05:36.990 --> 00:05:41.110
as some output of some
deep neural network.

00:05:41.110 --> 00:05:44.833
So if we want to go beyond
the standard parametric assumptions,

00:05:44.833 --> 00:05:46.500
there are two ways
to try to do that.

00:05:46.500 --> 00:05:48.470
One way to do that would
be to say that we're

00:05:48.470 --> 00:05:50.120
going to model the post--

00:05:50.120 --> 00:05:56.660
the distribution, f of t, as one
of these things, where lambda

00:05:56.660 --> 00:05:58.640
or whatever the
parameters of the distribution

00:05:58.640 --> 00:06:00.470
are given by the
output of, let's

00:06:00.470 --> 00:06:03.890
say, a deep neural network
on the covariates x.

00:06:03.890 --> 00:06:05.413
So that would be one approach.

00:06:05.413 --> 00:06:06.830
A very different
approach would be

00:06:06.830 --> 00:06:10.130
a non-parametric distribution
where you say, OK, I'm

00:06:10.130 --> 00:06:12.200
going to define f of
t extremely flexibly,

00:06:12.200 --> 00:06:15.980
not as one of these forms.

00:06:15.980 --> 00:06:18.500
And there one runs into a
slightly different challenge,

00:06:18.500 --> 00:06:20.720
because as I'll show
you in the next slide,

00:06:20.720 --> 00:06:22.520
to do maximum
likelihood estimation

00:06:22.520 --> 00:06:24.860
of these distributions
from censored data,

00:06:24.860 --> 00:06:27.680
one needs to get-- one needs
to make use of this survival

00:06:27.680 --> 00:06:29.820
function, S of t.

00:06:29.820 --> 00:06:33.060
And so if your
f of t is complex,

00:06:33.060 --> 00:06:37.153
and you don't have a nice
analytic solution for S of t,

00:06:37.153 --> 00:06:38.820
then you're going to
have to somehow use

00:06:38.820 --> 00:06:41.400
a numerical approximation
of S of t during learning.

00:06:41.400 --> 00:06:43.200
So it's definitely
possible, but it's going

00:06:43.200 --> 00:06:44.492
to be a little bit more effort.

00:06:46.510 --> 00:06:49.010
So now here's where I'm going
to get into maximum likelihood

00:06:49.010 --> 00:06:51.530
estimation of these
distributions,

00:06:51.530 --> 00:06:54.350
and to define for you
the likelihood function,

00:06:54.350 --> 00:06:57.150
I'm going to break it down
into two different settings.

00:06:57.150 --> 00:06:59.300
The first setting
is an observation

00:06:59.300 --> 00:07:03.530
which is uncensored, meaning
we do observe when the event--

00:07:03.530 --> 00:07:05.710
death, for example-- occurs.

00:07:05.710 --> 00:07:08.720
And in that case, the
probability of the event--

00:07:08.720 --> 00:07:09.450
it's very simple.

00:07:09.450 --> 00:07:13.345
It's just probability of the
event occurring at capital--

00:07:13.345 --> 00:07:15.230
at capital T, random
variable T, equals

00:07:15.230 --> 00:07:16.580
a little t-- is just f of t.

00:07:16.580 --> 00:07:17.080
Done.

00:07:19.600 --> 00:07:22.870
However, what happens
if, for this data point,

00:07:22.870 --> 00:07:26.650
you don't observe when the event
occurred because of censoring?

00:07:26.650 --> 00:07:30.120
Well, of course, you could just
throw away that data point,

00:07:30.120 --> 00:07:32.380
not use it in your
estimation, but that's

00:07:32.380 --> 00:07:34.930
precisely what we mentioned
at the very beginning

00:07:34.930 --> 00:07:38.625
of last week's lecture-- that
the goal of survival modeling

00:07:38.625 --> 00:07:40.000
was to not do that,
because if we did

00:07:40.000 --> 00:07:44.440
that, it would introduce bias
into our estimation procedure.

00:07:44.440 --> 00:07:48.110
So we would like to be able
to use that observation

00:07:48.110 --> 00:07:50.770
that this data
point was censored,

00:07:50.770 --> 00:07:53.530
but the only information we
can get from that observation

00:07:53.530 --> 00:07:57.040
is that capital
T, the event time,

00:07:57.040 --> 00:08:00.490
must have occurred
some time larger

00:08:00.490 --> 00:08:03.040
than the observed-- the
time of censoring, which

00:08:03.040 --> 00:08:05.410
is little t here.

00:08:05.410 --> 00:08:09.130
So we don't know precisely
when capital T was, but we

00:08:09.130 --> 00:08:12.560
know it's something larger than
the observed censoring time

00:08:12.560 --> 00:08:13.660
little t.

00:08:13.660 --> 00:08:17.830
And that, remember, is precisely
what the survival function

00:08:17.830 --> 00:08:19.000
is capturing.

00:08:19.000 --> 00:08:20.680
So for a censored
observation, we're

00:08:20.680 --> 00:08:24.253
going to use capital S of
t within the likelihood.

00:08:24.253 --> 00:08:26.920
So now we can then combine these
two for censored and uncensored

00:08:26.920 --> 00:08:30.820
data, and what we get is the
following likelihood objective.

00:08:30.820 --> 00:08:33.880
This is-- I'm showing you here
the log likelihood objective.

00:08:33.880 --> 00:08:38.320
Recall from last week that
little b of i simply denotes

00:08:38.320 --> 00:08:40.570
is this observation
censored or not?

00:08:40.570 --> 00:08:44.290
So if bi is 1, it means
the time that you're given

00:08:44.290 --> 00:08:47.200
is the time of the
censoring event.

00:08:47.200 --> 00:08:49.540
And if bi is 0, it means
the time you're given

00:08:49.540 --> 00:08:51.787
is the time that
the event occurs.

00:08:51.787 --> 00:08:54.370
So here what we're going to do
is now sum over all of the data

00:08:54.370 --> 00:08:56.320
points in your data
set from little i

00:08:56.320 --> 00:09:02.500
equals 1 to little n of bi
times log of probability

00:09:02.500 --> 00:09:06.660
under the censored model
plus 1 minus bi times log

00:09:06.660 --> 00:09:08.410
of probability under
the uncensored model.

00:09:08.410 --> 00:09:10.510
And so this bi is just going
to switch on which of these two

00:09:10.510 --> 00:09:12.760
you're going to use for
that given data point.

00:09:12.760 --> 00:09:15.550
So the learning objective for
maximum likelihood estimation

00:09:15.550 --> 00:09:18.070
here is very similar
to what you're used to

00:09:18.070 --> 00:09:21.640
in learning distributions
with the big difference

00:09:21.640 --> 00:09:23.470
that, for censored
data, we're going

00:09:23.470 --> 00:09:29.080
to use the survival function
to estimate its probability.
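
NOTE
A minimal sketch of this censored log-likelihood, assuming an exponential density so that S(t) = exp(-lambda*t) has a closed form; the names (t, b, lam) mirror the slide, but the code itself is illustrative, not from the course:
import numpy as np
from scipy.optimize import minimize_scalar
def censored_log_likelihood(lam, t, b):
    # b[i] = 1 if observation i is censored, 0 if the event was observed
    log_f = np.log(lam) - lam * t   # log f(t): density term for uncensored points
    log_S = -lam * t                # log S(t): survival term for censored points
    return np.sum(b * log_S + (1 - b) * log_f)
t = np.array([2.0, 5.0, 3.5, 7.0, 1.2])   # toy observed times
b = np.array([0, 1, 0, 1, 0])             # toy censoring indicators
res = minimize_scalar(lambda lam: -censored_log_likelihood(lam, t, b),
                      bounds=(1e-6, 10.0), method="bounded")
print("estimated lambda:", res.x)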

00:09:29.080 --> 00:09:30.313
Are there any questions?

00:09:35.150 --> 00:09:37.270
And this, of course,
could then be

00:09:37.270 --> 00:09:39.760
optimized via your
favorite algorithm,

00:09:39.760 --> 00:09:42.400
whether it be stochastic
gradient descent,

00:09:42.400 --> 00:09:43.920
or second order
method, and so on.

00:09:43.920 --> 00:09:44.420
Yep?

00:09:44.420 --> 00:09:45.785
AUDIENCE: I have a
question about a kind

00:09:45.785 --> 00:09:46.452
of side project.

00:09:46.452 --> 00:09:48.620
You mentioned that we
could use [INAUDIBLE]..

00:09:48.620 --> 00:09:49.370
DAVID SONTAG: Yes.

00:09:49.370 --> 00:09:51.310
AUDIENCE: And then combine it
with the parametric approach.

00:09:51.310 --> 00:09:51.730
DAVID SONTAG: Yes.

00:09:51.730 --> 00:09:53.563
AUDIENCE: So is that
true that we just still

00:09:53.563 --> 00:09:55.705
have the parametric
assumption that we kind of map

00:09:55.705 --> 00:09:57.250
the input to the parameters?

00:09:57.250 --> 00:09:58.385
DAVID SONTAG: Exactly.

00:09:58.385 --> 00:09:59.260
That's exactly right.

00:09:59.260 --> 00:10:04.980
So consider the following
picture where for--

00:10:04.980 --> 00:10:08.500
this is time, t.

00:10:08.500 --> 00:10:11.812
And this is f of t.

00:10:11.812 --> 00:10:13.670
You can imagine
for any one patient

00:10:13.670 --> 00:10:15.907
you might have a
different function.

00:10:15.907 --> 00:10:18.490
You might-- but they might all
be of the same parametric form.

00:10:18.490 --> 00:10:21.332
So they might be
like that, or maybe

00:10:21.332 --> 00:10:22.540
they're shifted a little bit.

00:10:25.130 --> 00:10:27.070
So you think about
each of these three

00:10:27.070 --> 00:10:30.355
things as being from the
same parametric family

00:10:30.355 --> 00:10:33.310
of distributions, but
with different means.

00:10:33.310 --> 00:10:35.380
And in this case,
then the mean is

00:10:35.380 --> 00:10:37.603
given as the output of
the deep neural network.

00:10:37.603 --> 00:10:39.520
And so that would be the
way it would be used,

00:10:39.520 --> 00:10:41.812
and then one could just back
propagate in the usual way

00:10:41.812 --> 00:10:43.106
to do learning.
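
NOTE
To make the board sketch concrete, here is one hedged way to implement "the parameter is the output of a deep neural network," again for the exponential family; the network size and the softplus used to keep lambda positive are illustrative choices, not the course's implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
class ExponentialSurvivalNet(nn.Module):
    # maps covariates x to a positive rate lambda(x) for f(t) = lambda * exp(-lambda * t)
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, x):
        return F.softplus(self.net(x)).squeeze(-1)
def neg_censored_log_likelihood(lam, t, b):
    log_f = torch.log(lam) - lam * t   # uncensored contribution
    log_S = -lam * t                   # censored contribution
    return -(b * log_S + (1 - b) * log_f).mean()
# usage sketch: model = ExponentialSurvivalNet(d)
# loss = neg_censored_log_likelihood(model(x), t, b); loss.backward()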

00:10:43.106 --> 00:10:43.692
Yep?

00:10:43.692 --> 00:10:45.400
AUDIENCE: Can you
repeat what b sub i is?

00:10:45.400 --> 00:10:45.780
DAVID SONTAG: Excuse me?

00:10:45.780 --> 00:10:47.572
AUDIENCE: Could you
repeat what b sub i is?

00:10:47.572 --> 00:10:50.740
DAVID SONTAG: b sub i is just an
indicator whether the i-th data

00:10:50.740 --> 00:10:54.510
point was censored
or not censored.

00:10:54.510 --> 00:10:55.020
Yes?

00:10:55.020 --> 00:10:59.200
AUDIENCE: So [INAUDIBLE] equal
it's more a probability density

00:10:59.200 --> 00:11:00.130
function [INAUDIBLE].

00:11:00.130 --> 00:11:01.880
DAVID SONTAG: Cumulative
density function.

00:11:01.880 --> 00:11:05.750
AUDIENCE: Yeah, but
[INAUDIBLE] probability.

00:11:05.750 --> 00:11:10.350
No, for the [INAUDIBLE] it's
probability density function.

00:11:10.350 --> 00:11:11.820
DAVID SONTAG: Yes, so just to--

00:11:11.820 --> 00:11:13.200
AUDIENCE: [INAUDIBLE]

00:11:13.200 --> 00:11:15.113
DAVID SONTAG: Excuse me?

00:11:15.113 --> 00:11:16.530
AUDIENCE: Will
that be any problem

00:11:16.530 --> 00:11:18.440
to combine those
two types there?

00:11:18.440 --> 00:11:20.190
DAVID SONTAG: That's
a very good question.

00:11:20.190 --> 00:11:24.550
So the observation was that
you have two different types

00:11:24.550 --> 00:11:27.412
of probabilities used here.

00:11:27.412 --> 00:11:28.870
In this case, we're
using something

00:11:28.870 --> 00:11:32.200
like the cumulative
density, whereas here we're

00:11:32.200 --> 00:11:35.380
using the probability
density function.

00:11:35.380 --> 00:11:38.890
The question was, are these
two on different scales?

00:11:38.890 --> 00:11:40.540
Does it make sense
to combine them

00:11:40.540 --> 00:11:43.360
in this type of linear fashion
with the same weighting?

00:11:43.360 --> 00:11:45.490
And I think it does make sense.

00:11:45.490 --> 00:11:59.430
So think about a setting where
you have a very small time

00:11:59.430 --> 00:12:00.210
range.

00:12:00.210 --> 00:12:02.273
You're not exactly sure
when this event occurs.

00:12:02.273 --> 00:12:03.690
It's something in
this time range.

00:12:06.690 --> 00:12:10.440
In the setting of
the censored data,

00:12:10.440 --> 00:12:14.180
where that time range could
potentially be very large,

00:12:14.180 --> 00:12:17.650
your model is providing--

00:12:21.150 --> 00:12:23.670
your log probability
is somehow going

00:12:23.670 --> 00:12:28.590
to be much more flat,
because you're covering

00:12:28.590 --> 00:12:29.715
much more probability mass.

00:12:32.930 --> 00:12:34.830
And so that
observation, I think,

00:12:34.830 --> 00:12:37.200
intuitively is likely
to have a much--

00:12:37.200 --> 00:12:41.640
a bit of a smaller effect on
the overall learning algorithm.

00:12:41.640 --> 00:12:44.590
These observations-- you know
precisely where they are,

00:12:44.590 --> 00:12:51.280
and so as you deviate from that,
you incur the corresponding log

00:12:51.280 --> 00:12:53.740
loss penalty.

00:12:53.740 --> 00:12:55.918
But I do think
that it makes sense

00:12:55.918 --> 00:12:57.210
to have them in the same scale.

00:12:57.210 --> 00:12:59.977
If anyone in the room has done
work with [INAUDIBLE] modeling

00:12:59.977 --> 00:13:02.310
and has a different answer
to that, I'd love to hear it.

00:13:06.450 --> 00:13:09.510
Not today, but maybe
someone in the future

00:13:09.510 --> 00:13:11.260
will answer this
question differently.

00:13:11.260 --> 00:13:14.250
I'm going to move on for now.

00:13:14.250 --> 00:13:18.480
So the remaining question that
I want to talk about today

00:13:18.480 --> 00:13:22.020
is how one evaluates
survival models.

00:13:22.020 --> 00:13:27.240
So we talked about binary
classification a lot

00:13:27.240 --> 00:13:29.090
in the context of
risk stratification

00:13:29.090 --> 00:13:31.590
in the beginning, and we talked
about how area under the ROC

00:13:31.590 --> 00:13:34.900
curve is one measure of
classification performance,

00:13:34.900 --> 00:13:36.630
but here we're doing more--

00:13:36.630 --> 00:13:40.120
something more akin to
regression, not classification.

00:13:40.120 --> 00:13:43.180
A standard measure that's
used to measure performance

00:13:43.180 --> 00:13:46.240
is known as the C-statistic,
or concordance index.

00:13:46.240 --> 00:13:48.130
Those are one and the same--

00:13:48.130 --> 00:13:49.630
and is defined as follows.

00:13:49.630 --> 00:13:52.050
And it has a very
intuitive definition.

00:13:52.050 --> 00:13:55.300
It sums over pairs
of data points

00:13:55.300 --> 00:13:59.120
that can be compared
to one another,

00:13:59.120 --> 00:14:04.860
and it says, OK, what
is the likelihood

00:14:04.860 --> 00:14:08.550
of the event happening
for an event that

00:14:08.550 --> 00:14:11.100
occurs before an event--

00:14:11.100 --> 00:14:11.820
another event.

00:14:11.820 --> 00:14:16.020
And what you want is that the
likelihood of the event that,

00:14:16.020 --> 00:14:18.060
on average, in essence,
should occur later

00:14:18.060 --> 00:14:21.215
should be larger than the event
that should occur earlier.

00:14:21.215 --> 00:14:23.340
I'm going to first illustrate
it with this picture,

00:14:23.340 --> 00:14:24.840
and then I'll work
through the math.

00:14:24.840 --> 00:14:28.510
So here's the picture, and
then we'll talk about the math.

00:14:28.510 --> 00:14:31.960
So what I'm showing you here
is every single observation

00:14:31.960 --> 00:14:34.130
in your data set,
and they're sorted

00:14:34.130 --> 00:14:40.900
by either the censoring
time or the event time.

00:14:40.900 --> 00:14:45.130
So by black, I'm illustrating
uncensored data points.

00:14:45.130 --> 00:14:50.010
And by red, I'm denoting
censored data points.

00:14:50.010 --> 00:14:54.140
Now, here we see that
this data point--

00:14:54.140 --> 00:14:58.593
the event happened before this
data point's censoring event.

00:14:58.593 --> 00:15:00.260
Now, since this data
point was censored,

00:15:00.260 --> 00:15:03.030
it means its true event
time you could think about as

00:15:03.030 --> 00:15:05.700
sometime into the far future.

00:15:05.700 --> 00:15:11.330
So what we would want
is that the model

00:15:11.330 --> 00:15:20.050
gives that the probability that
this event happens by this time

00:15:20.050 --> 00:15:24.010
should be larger
than the probability

00:15:24.010 --> 00:15:29.190
that this event
happens by this time,

00:15:29.190 --> 00:15:31.320
because this actually
occurred first.

00:15:31.320 --> 00:15:33.842
And these two are comparable
to each other.

00:15:33.842 --> 00:15:35.550
On the other hand, it
wouldn't make sense

00:15:35.550 --> 00:15:39.450
to compare y2 and y4,
because both of these

00:15:39.450 --> 00:15:41.610
were censored data
points, and we don't know

00:15:41.610 --> 00:15:43.090
precisely when they occurred.

00:15:43.090 --> 00:15:45.090
So for example, it could
have very well happened

00:15:45.090 --> 00:15:50.280
that the event 2
happened after event 4.

00:15:50.280 --> 00:15:53.250
So what I'm showing you here
with each of these lines

00:15:53.250 --> 00:15:54.750
are the pairwise
comparisons that

00:15:54.750 --> 00:15:56.325
are actually possible to make.

00:15:56.325 --> 00:15:58.200
You can make pairwise
comparisons, of course,

00:15:58.200 --> 00:16:00.730
between any pair of events
that actually did occur,

00:16:00.730 --> 00:16:02.272
and you can make
pairwise comparisons

00:16:02.272 --> 00:16:06.690
between censored events and
events that occurred before it.
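
NOTE
A naive sketch of the pair-counting just described; it scores a model by predicted risk scores (higher score means the event is expected sooner), which is one common way to compute a concordance index but not necessarily the exact formula on the slide:
import numpy as np
def concordance_index(times, censored, risk_scores):
    # comparable pairs: i is uncensored, and j's observed time is later than i's
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if censored[i] == 1:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0    # model agrees i's event comes first
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5    # ties get half credit by convention
    return concordant / comparable
# e.g. concordance_index(np.array([9, 12, 7]), np.array([0, 0, 1]), scores)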

00:16:06.690 --> 00:16:11.500
Now, if you now look at this
formula, the formula in this

00:16:11.500 --> 00:16:15.100
indicate-- this is looking
at an indicator of survival

00:16:15.100 --> 00:16:18.080
functions between pairs of data
points, and which pairs of data

00:16:18.080 --> 00:16:18.580
points?

00:16:18.580 --> 00:16:21.170
It was precisely those
pairs of data points,

00:16:21.170 --> 00:16:24.700
which I'm showing comparisons
of with these blue lines here.

00:16:24.700 --> 00:16:28.210
So we're going to sum over i
such that bi is equal to 0,

00:16:28.210 --> 00:16:32.830
and remember that means it
is an uncensored data point.

00:16:32.830 --> 00:16:35.650
And then we look at--

00:16:35.650 --> 00:16:41.050
we look at yi compared to
all other yj--

00:16:41.050 --> 00:16:45.520
those that have a value greater than yi--
both censored and uncensored.

00:16:45.520 --> 00:16:51.860
Now, if your data had no
censored data points in it,

00:16:51.860 --> 00:16:56.310
then you can verify that,
in fact, this corresponds--

00:16:56.310 --> 00:16:58.310
so there's one other
assumption one has to make,

00:16:58.310 --> 00:16:59.610
which is that--

00:16:59.610 --> 00:17:02.817
suppose that your
outcome is binary.

00:17:02.817 --> 00:17:05.109
And in case you wonder
how you get a binary outcome

00:17:05.109 --> 00:17:09.760
from this, imagine that
your density function

00:17:09.760 --> 00:17:13.359
looked a little bit like this,
where it could occur either

00:17:13.359 --> 00:17:18.490
at time 1 or time 2.

00:17:18.490 --> 00:17:21.160
So something like that.

00:17:24.130 --> 00:17:29.300
So if the event can
occur at only two times,

00:17:29.300 --> 00:17:31.770
not a whole range
of times, then this

00:17:31.770 --> 00:17:35.570
is analogous to
a binary outcome.

00:17:35.570 --> 00:17:37.210
And so if you have
a binary outcome

00:17:37.210 --> 00:17:40.630
like this and no censoring,
then, in fact, that C-statistic

00:17:40.630 --> 00:17:42.763
is exactly equal to the
area under the ROC curve.

00:17:42.763 --> 00:17:44.930
So that just connects it a
little bit back to things

00:17:44.930 --> 00:17:45.400
we're used to.

00:17:45.400 --> 00:17:45.850
Yep?

00:17:45.850 --> 00:17:47.767
AUDIENCE: Just to make
sure that I understand.

00:17:47.767 --> 00:17:50.370
So y1 is going to be
we observed an event,

00:17:50.370 --> 00:17:53.920
and y2 is going to be we
know that no event occurred

00:17:53.920 --> 00:17:55.210
until that day?

00:17:55.210 --> 00:17:58.320
DAVID SONTAG: Every dot
corresponds to one event,

00:17:58.320 --> 00:17:59.370
either censored or not.

00:17:59.370 --> 00:18:00.070
AUDIENCE: Thank you.

00:18:00.070 --> 00:18:01.445
DAVID SONTAG: And
they're sorted.

00:18:01.445 --> 00:18:04.110
In this figure, they're
sorted by the time

00:18:04.110 --> 00:18:08.475
of either the censoring
or the event occurring.

00:18:14.420 --> 00:18:16.570
So I talked to--

00:18:16.570 --> 00:18:18.190
when I talked about
C-statistic, it--

00:18:18.190 --> 00:18:21.730
that's one way to measure
performance of your survival

00:18:21.730 --> 00:18:23.780
modeling, but you
might remember that I--

00:18:23.780 --> 00:18:25.780
that when we talked about
binary classification,

00:18:25.780 --> 00:18:27.363
we said how area
under the ROC curve

00:18:27.363 --> 00:18:29.080
in itself is very
limiting, and so we

00:18:29.080 --> 00:18:30.340
should think through
other performance

00:18:30.340 --> 00:18:31.215
metrics of relevance.

00:18:31.215 --> 00:18:33.652
So here are a few other
things that you could do.

00:18:33.652 --> 00:18:35.110
One thing you could
do is you could

00:18:35.110 --> 00:18:38.330
use the mean squared error.

00:18:38.330 --> 00:18:41.410
So again, thinking about
this as a regression problem.

00:18:41.410 --> 00:18:43.090
But of course, that
only makes sense

00:18:43.090 --> 00:18:45.430
for uncensored data points.

00:18:45.430 --> 00:18:47.563
So focusing just on the
uncensored data points,

00:18:47.563 --> 00:18:49.480
look to see how well
we're doing at predicting

00:18:49.480 --> 00:18:51.410
when the event occurs.

00:18:51.410 --> 00:18:55.270
The second thing one could
do, since you have the ability

00:18:55.270 --> 00:18:58.490
to define the likelihood
of an observation,

00:18:58.490 --> 00:19:02.260
censored or not censored,
one could hold out data,

00:19:02.260 --> 00:19:05.080
and look at the held-out
likelihood or log likelihood

00:19:05.080 --> 00:19:07.170
of that held-out data.

00:19:07.170 --> 00:19:08.760
And the third thing
you could do is

00:19:08.760 --> 00:19:12.270
you can-- after learning
using this survival modeling

00:19:12.270 --> 00:19:17.250
framework, one could then turn
it into a binary classification

00:19:17.250 --> 00:19:19.380
problem by, for
example, artificially

00:19:19.380 --> 00:19:24.060
choosing time ranges, like
greater than three months is 1.

00:19:24.060 --> 00:19:25.830
Less than three months is 0.

00:19:25.830 --> 00:19:27.460
That would be one
crude definition.

00:19:27.460 --> 00:19:29.010
And then once you've
done a reduction

00:19:29.010 --> 00:19:30.468
to a binary
classification problem,

00:19:30.468 --> 00:19:32.763
you could use all of
the existing performance

00:19:32.763 --> 00:19:35.430
metrics you're used to thinking
about for binary classification

00:19:35.430 --> 00:19:37.020
to evaluate the
performance there--

00:19:37.020 --> 00:19:40.550
things like positive
predictive value, for example.

00:19:40.550 --> 00:19:42.990
And you could, of course,
choose different reductions

00:19:42.990 --> 00:19:44.970
and get different
performance statistics out.
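
NOTE
A hedged sketch of the third reduction; the three-month cutoff is just the example from the lecture, and dropping points censored before the cutoff (whose label is ambiguous) is one of several reasonable choices:
import numpy as np
from sklearn.metrics import precision_score   # positive predictive value
def binarize_at_cutoff(times, censored, cutoff=90.0):
    # usable: uncensored points, or points censored at/after the cutoff (we know they survived past it)
    usable = (censored == 0) | (times >= cutoff)
    labels = (times >= cutoff).astype(int)    # 1 = event after cutoff, 0 = event by cutoff
    return labels[usable], usable
# labels, usable = binarize_at_cutoff(times, censored)
# ppv = precision_score(labels, predicted_labels[usable])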

00:19:44.970 --> 00:19:47.700
So this is just a
small subset of ways

00:19:47.700 --> 00:19:49.710
to try to evaluate
survival modeling,

00:19:49.710 --> 00:19:51.812
but it's a very,
very rich literature.

00:19:51.812 --> 00:19:53.520
And again, on the
bottom of these slides,

00:19:53.520 --> 00:19:54.978
I pointed you to
several references

00:19:54.978 --> 00:19:57.640
that you could go
to to learn more.

00:19:57.640 --> 00:19:59.710
The final comment
I wanted to make

00:19:59.710 --> 00:20:02.470
is that I only told
you about one estimator

00:20:02.470 --> 00:20:05.830
in today's lecture, and that's
known as the likelihood-based

00:20:05.830 --> 00:20:07.030
estimator.

00:20:07.030 --> 00:20:09.307
But there is a whole
other estimation approach

00:20:09.307 --> 00:20:11.890
for survival modelings, which
is very important to know about,

00:20:11.890 --> 00:20:14.218
that are called partial
likelihood estimators.

00:20:14.218 --> 00:20:16.510
And for those of you who have
heard of Cox proportional

00:20:16.510 --> 00:20:18.468
hazards models-- and I
know they were discussed

00:20:18.468 --> 00:20:19.750
in Friday's recitation--

00:20:19.750 --> 00:20:21.700
that's an example
of a class of model

00:20:21.700 --> 00:20:26.040
that's commonly used within this
partial likelihood estimator.

00:20:26.040 --> 00:20:28.540
Now, at a very intuitive level,
what this partial likelihood

00:20:28.540 --> 00:20:31.600
estimator is doing is it's
working with something

00:20:31.600 --> 00:20:33.100
like the C-statistic.

00:20:33.100 --> 00:20:38.890
So notice how the C-statistic
only looks at relative

00:20:38.890 --> 00:20:40.810
orderings of events--

00:20:40.810 --> 00:20:44.130
of their event occurrences.

00:20:44.130 --> 00:20:47.330
It doesn't care about exactly
when the event occurred or not.

00:20:47.330 --> 00:20:52.100
In some sense,
there's a constant

00:20:52.100 --> 00:20:55.910
in this
survival function,

00:20:55.910 --> 00:21:01.400
which could be divided out from
both sides of this inequality,

00:21:01.400 --> 00:21:05.330
and it wouldn't affect
anything about the statistic.

00:21:05.330 --> 00:21:07.488
And so one could think
about other ways of learning

00:21:07.488 --> 00:21:09.030
these models by
saying, well, we want

00:21:09.030 --> 00:21:10.770
to learn a survival
function such

00:21:10.770 --> 00:21:14.620
that it gets the ordering
correct between data points.

00:21:14.620 --> 00:21:17.760
Now, such a survival function
wouldn't do a very good job.

00:21:17.760 --> 00:21:20.400
There's no reason it would
do any good at getting

00:21:20.400 --> 00:21:23.700
the precise time of
when an event occurs,

00:21:23.700 --> 00:21:28.980
but if your goal were
to just figure out

00:21:28.980 --> 00:21:31.405
what is the sorted order
of patients by risk

00:21:31.405 --> 00:21:33.780
so that you're going to do an
intervention on the 10 most

00:21:33.780 --> 00:21:37.710
risky people, then getting that
order correct is going to be

00:21:37.710 --> 00:21:40.290
enough, and that's
precisely the intuition used

00:21:40.290 --> 00:21:42.190
behind these partial
likelihood estimators--

00:21:42.190 --> 00:21:44.940
so they focus on something
which is a little bit less

00:21:44.940 --> 00:21:47.040
than the original
goal, but in doing

00:21:47.040 --> 00:21:49.570
so, they can have much better
statistical complexity,

00:21:49.570 --> 00:21:51.445
meaning the amount of
data they need in order

00:21:51.445 --> 00:21:52.780
to fit these models well.

00:21:52.780 --> 00:21:54.390
And again, this is
a very rich topic.

00:21:54.390 --> 00:21:56.100
All I wanted to do is
give you a pointer to it

00:21:56.100 --> 00:21:57.940
so that you can go read more
about it if this is something

00:21:57.940 --> 00:21:58.732
of interest to you.

00:22:01.910 --> 00:22:06.590
So now moving on
into the recap, one

00:22:06.590 --> 00:22:09.110
of the most important points
that we discussed last week

00:22:09.110 --> 00:22:10.910
was about non-stationarity.

00:22:10.910 --> 00:22:13.250
And there was a question
posted to Piazza,

00:22:13.250 --> 00:22:14.990
which was really interesting,
which is how do you actually

00:22:14.990 --> 00:22:16.115
deal with non-stationarity.

00:22:16.115 --> 00:22:18.080
And I spoke a lot
about it existing,

00:22:18.080 --> 00:22:19.700
and I talked about
how to test for it,

00:22:19.700 --> 00:22:23.527
but I didn't say what
to do if you have it.

00:22:23.527 --> 00:22:25.610
So I thought this was such
an interesting question

00:22:25.610 --> 00:22:28.190
that I would also talk about
it a bit during lecture.

00:22:28.190 --> 00:22:32.540
So the short answer is, if
you have to have a solution

00:22:32.540 --> 00:22:36.280
that you deploy
tomorrow, then here's

00:22:36.280 --> 00:22:38.442
the hack that sometimes works.

00:22:38.442 --> 00:22:40.900
You take your most recent data,
like the last three months'

00:22:40.900 --> 00:22:42.358
data, and you hope
that there's not

00:22:42.358 --> 00:22:45.490
much non-stationarity
within last three months.

00:22:45.490 --> 00:22:47.800
You throw out all
the historical data,

00:22:47.800 --> 00:22:51.410
and you just train using
the most recent data.

00:22:51.410 --> 00:22:55.110
So a bit unsatisfying,
because you

00:22:55.110 --> 00:22:57.950
might have now extremely
little data left to learn with,

00:22:57.950 --> 00:23:02.885
but if you have enough volume,
it might be good enough.

00:23:02.885 --> 00:23:04.260
But the really
interesting question

00:23:04.260 --> 00:23:06.460
from a research perspective
is how could you optimally use

00:23:06.460 --> 00:23:07.540
that historical data.

00:23:07.540 --> 00:23:10.150
So here are three
different ways.

00:23:10.150 --> 00:23:14.420
So one way has to
do with imputation.

00:23:14.420 --> 00:23:17.713
Imagine that the way in which
your data was non-stationary

00:23:17.713 --> 00:23:19.130
was because there
were, let's say,

00:23:19.130 --> 00:23:24.110
parts of time when certain
features were just unavailable.

00:23:24.110 --> 00:23:27.452
I gave you this example last
week of laboratory test results

00:23:27.452 --> 00:23:29.660
across time, and I showed
you how there are sometimes

00:23:29.660 --> 00:23:31.202
these really big
blocks of time where

00:23:31.202 --> 00:23:34.810
no lab tests are available,
or very few are available.

00:23:34.810 --> 00:23:37.387
Well, luckily we live in a world
with high dimensional data,

00:23:37.387 --> 00:23:39.720
and what that means is there's
often a lot of redundancy

00:23:39.720 --> 00:23:40.930
in the data.

00:23:40.930 --> 00:23:45.840
So what you could imagine
doing is imputing features

00:23:45.840 --> 00:23:48.390
that you observed
to be missing, such

00:23:48.390 --> 00:23:50.520
that the missingness
properties, in fact,

00:23:50.520 --> 00:23:54.177
aren't changing as much
across time after imputation.

00:23:54.177 --> 00:23:56.010
And if you do that as
a pre-processing step,

00:23:56.010 --> 00:23:57.810
it may allow you
to make use of much

00:23:57.810 --> 00:24:00.690
more of the historical data.

00:24:00.690 --> 00:24:03.570
A different approach, which
is intimately tied to that,

00:24:03.570 --> 00:24:05.490
has to do with
transforming the data.

00:24:05.490 --> 00:24:07.770
Instead of imputing
it, transforming it

00:24:07.770 --> 00:24:10.710
into another representation
altogether, such that

00:24:10.710 --> 00:24:15.102
that representation is
invariant across time.

00:24:15.102 --> 00:24:16.560
And here I'm giving
you a reference

00:24:16.560 --> 00:24:19.380
to this paper by Ganin et al
from the Journal of Machine

00:24:19.380 --> 00:24:21.660
Learning Research 2016,
which talks about how

00:24:21.660 --> 00:24:24.815
to do domain-invariant
learning of neural networks,

00:24:24.815 --> 00:24:26.190
and that's one
approach to do so.

00:24:26.190 --> 00:24:28.482
And I view those two as being
very similar-- imputation

00:24:28.482 --> 00:24:30.210
and transformations.

00:24:30.210 --> 00:24:32.970
A second approach is
to re-weight the data

00:24:32.970 --> 00:24:36.230
to look like the current data.

00:24:36.230 --> 00:24:38.170
So imagine that you
go back in time,

00:24:38.170 --> 00:24:39.670
and you say, you know what?

00:24:39.670 --> 00:24:43.050
ICD-10 codes, for
some very weird reason--

00:24:43.050 --> 00:24:44.530
this is not true, by the way--

00:24:44.530 --> 00:24:47.260
ICD-10 codes in
this untrue world

00:24:47.260 --> 00:24:51.870
happen to be used between
March and April of 2003.

00:24:51.870 --> 00:24:57.190
And then they weren't
used again until 2015.

00:24:57.190 --> 00:24:59.630
So instead of throwing away
all of the previous data,

00:24:59.630 --> 00:25:02.630
we're going to
recognize that those--

00:25:02.630 --> 00:25:04.740
that three month
interval 10 years ago

00:25:04.740 --> 00:25:07.680
was actually drawn from a very
similar distribution as what

00:25:07.680 --> 00:25:09.200
we're going to be
testing on today.

00:25:09.200 --> 00:25:12.230
So we're going to weight those
data points up very much,

00:25:12.230 --> 00:25:14.330
and down weight the
data points that are

00:25:14.330 --> 00:25:16.760
less like the ones from today.

00:25:16.760 --> 00:25:19.790
That's the intuition behind
these re-weighting approaches,

00:25:19.790 --> 00:25:22.010
and we're going to talk
much more about that

00:25:22.010 --> 00:25:23.900
in the context of
causal inference,

00:25:23.900 --> 00:25:25.953
not because these two have
to do with each other,

00:25:25.953 --> 00:25:28.370
but they have-- they end up
using a very similar technique

00:25:28.370 --> 00:25:32.600
for how to deal with dataset
shift, or covariate shift.
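
NOTE
One common instantiation of this re-weighting idea (not something the lecture prescribes) is to train a classifier to distinguish recent from old data and use its probabilities as importance weights; the sketch below assumes feature matrices X_old and X_recent:
import numpy as np
from sklearn.linear_model import LogisticRegression
def importance_weights(X_old, X_recent):
    X = np.vstack([X_old, X_recent])
    y = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_recent))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_recent = clf.predict_proba(X_old)[:, 1]
    return p_recent / (1.0 - p_recent + 1e-8)   # ratio approximates p_recent(x) / p_old(x)
# weights = importance_weights(X_old, X_recent)
# model.fit(X_old, y_old, sample_weight=weights)  # assuming the downstream model accepts sample_weight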

00:25:32.600 --> 00:25:34.910
And the final technique
that I'll mention

00:25:34.910 --> 00:25:37.710
is based on online
learning algorithms.

00:25:37.710 --> 00:25:44.420
So the idea there is that there
might be cut points, change

00:25:44.420 --> 00:25:47.760
points across time.

00:25:47.760 --> 00:25:52.275
So maybe the data looks one
way up until this change point,

00:25:52.275 --> 00:25:53.900
and then suddenly
the data looks really

00:25:53.900 --> 00:25:55.590
different until
this change point,

00:25:55.590 --> 00:25:57.132
and then suddenly
the data looks very

00:25:57.132 --> 00:25:59.440
different on into the future.

00:25:59.440 --> 00:26:01.940
So here I'm showing you there
are two change points in which

00:26:01.940 --> 00:26:04.610
data set shift happens.

00:26:04.610 --> 00:26:06.900
What these online learning
algorithms do is they say,

00:26:06.900 --> 00:26:09.350
OK, suppose we were
forced to make predictions

00:26:09.350 --> 00:26:11.360
throughout this
time period using

00:26:11.360 --> 00:26:13.400
only the historical
data to make predictions

00:26:13.400 --> 00:26:15.200
at each point in time.

00:26:15.200 --> 00:26:18.650
Well, if we could
somehow recognize

00:26:18.650 --> 00:26:21.172
that there might
be these shifts,

00:26:21.172 --> 00:26:22.880
we could design
algorithms that are going

00:26:22.880 --> 00:26:25.910
to be robust to those shifts.

00:26:25.910 --> 00:26:28.040
And then one could try to
analyze-- mathematically

00:26:28.040 --> 00:26:30.350
analyze those algorithms
based on the amount of regret

00:26:30.350 --> 00:26:33.770
they would have relative to, for example,
an algorithm that knew exactly

00:26:33.770 --> 00:26:35.098
when those changes were.

00:26:35.098 --> 00:26:36.890
And of course, we don't
know precisely when

00:26:36.890 --> 00:26:38.970
those changes were.

00:26:38.970 --> 00:26:41.660
And so there's a whole field of
algorithms trying to do that,

00:26:41.660 --> 00:26:44.930
and here I'm just giving one
citation for a recent work.

00:26:47.680 --> 00:26:49.240
So to conclude risk
stratification--

00:26:49.240 --> 00:26:51.970
this is the last slide here.

00:26:51.970 --> 00:26:55.080
Maybe ask your
question after class.

00:26:55.080 --> 00:26:56.490
We've talked about
two approaches

00:26:56.490 --> 00:26:58.890
for formalizing risk
stratification-- first

00:26:58.890 --> 00:27:00.000
as binary classification.

00:27:00.000 --> 00:27:02.430
Second as regression.

00:27:02.430 --> 00:27:04.950
And in the regression
framework, one

00:27:04.950 --> 00:27:06.772
has to think about
censoring, which is why

00:27:06.772 --> 00:27:07.980
we call it survival modeling.

00:27:11.090 --> 00:27:16.550
Second, in our examples,
and again in your homework

00:27:16.550 --> 00:27:20.990
assignment that's
coming up next week,

00:27:20.990 --> 00:27:25.010
we'll see that
often the variables,

00:27:25.010 --> 00:27:29.850
the features that are most
predictive make a lot of sense.

00:27:29.850 --> 00:27:32.480
In the diabetes case, we said--

00:27:32.480 --> 00:27:36.740
we saw how patients having
comorbidities of diabetes,

00:27:36.740 --> 00:27:39.080
like hypertension, or
patients being obese

00:27:39.080 --> 00:27:42.370
were very predictive of
patients getting diabetes.

00:27:42.370 --> 00:27:46.120
So you might ask yourself, is
there something causal there?

00:27:46.120 --> 00:27:49.870
Are those features that are very
predictive in fact causing--

00:27:49.870 --> 00:27:52.180
what's causing the patient
to develop type 2 diabetes?

00:27:52.180 --> 00:27:55.580
Like, for example,
obesity causing diabetes.

00:27:55.580 --> 00:27:58.290
And this is where I
want to caution you.

00:27:58.290 --> 00:28:02.190
You shouldn't interpret these
very predictive features

00:28:02.190 --> 00:28:04.950
in a causal fashion,
particularly

00:28:04.950 --> 00:28:07.650
not when one starts to work
with high dimensional data,

00:28:07.650 --> 00:28:12.680
as we do in this course.

00:28:12.680 --> 00:28:15.290
The reason for that
is very subtle,

00:28:15.290 --> 00:28:18.200
and we'll talk about that in
the causal inference lectures,

00:28:18.200 --> 00:28:20.180
but I just wanted to
give you a pointer

00:28:20.180 --> 00:28:22.500
now that you shouldn't
think about it in that way.

00:28:22.500 --> 00:28:26.540
And you'll understand
why in just a few weeks.

00:28:26.540 --> 00:28:31.820
And finally we talked about ways
of dealing with missing data.

00:28:31.820 --> 00:28:35.620
I gave you one
feature representation

00:28:35.620 --> 00:28:39.407
for the diabetes case,
which was designed

00:28:39.407 --> 00:28:40.490
to deal with missing data.

00:28:40.490 --> 00:28:46.890
It said, was there any
diagnosis code 250.01

00:28:46.890 --> 00:28:49.280
in the last three months?

00:28:49.280 --> 00:28:50.650
And if there was, you have a 1.

00:28:50.650 --> 00:28:51.317
If you don't, 0.

00:28:51.317 --> 00:28:53.178
So it's designed to
recognize that you

00:28:53.178 --> 00:28:55.720
don't have information, perhaps,
for some large chunk of time

00:28:55.720 --> 00:28:58.100
in that window.

00:28:58.100 --> 00:29:01.520
But that missing data
could also be dangerous

00:29:01.520 --> 00:29:05.690
if that missingness itself is
subject to non-stationarity,

00:29:05.690 --> 00:29:09.290
which is then going to result in
your test distribution looking

00:29:09.290 --> 00:29:11.490
different from your
training distribution.

00:29:11.490 --> 00:29:14.450
And that's where approaches
that are based on imputation

00:29:14.450 --> 00:29:17.840
could actually be very valuable,
not because they improve

00:29:17.840 --> 00:29:20.240
your predictive accuracy
when everything goes right,

00:29:20.240 --> 00:29:22.820
but because they might improve
your predictive accuracy when

00:29:22.820 --> 00:29:24.565
things go wrong.

00:29:24.565 --> 00:29:26.690
And so one of your readings
for last week's lecture

00:29:26.690 --> 00:29:29.510
was actually an example of
that, where they used a Gaussian

00:29:29.510 --> 00:29:34.190
process model to impute much of
the missing data in a patient's

00:29:34.190 --> 00:29:36.200
continuous vital
signs, and then they

00:29:36.200 --> 00:29:38.210
used a recurrent neural
network to predict

00:29:38.210 --> 00:29:41.480
based on that imputed data.

00:29:41.480 --> 00:29:46.050
So in that case, there are
really two things going on.

00:29:46.050 --> 00:29:48.345
First is this robustness
to data set shift,

00:29:48.345 --> 00:29:49.720
but there's a
second thing, which

00:29:49.720 --> 00:29:51.220
is going on as well,
which has to do

00:29:51.220 --> 00:29:54.370
with a trade-off between
the amount of data you have

00:29:54.370 --> 00:29:58.960
and the complexity of
the prediction problem.

00:29:58.960 --> 00:30:00.520
By doing imputations,
sometimes you

00:30:00.520 --> 00:30:02.320
make your problem
look a bit simpler,

00:30:02.320 --> 00:30:05.170
and simpler algorithms might
succeed where otherwise they

00:30:05.170 --> 00:30:07.665
would fail because of not
having enough data.

00:30:07.665 --> 00:30:09.040
And that's something
that you saw

00:30:09.040 --> 00:30:12.150
in last week's reading.

00:30:12.150 --> 00:30:14.730
So I'm done with
risk stratification.

00:30:14.730 --> 00:30:18.030
I'll take a one minute breather
for everyone in the room,

00:30:18.030 --> 00:30:20.765
and then we'll start
with the main topic

00:30:20.765 --> 00:30:22.890
of this lecture, which is
physiological time-series

00:30:22.890 --> 00:30:23.390
modeling.

00:30:27.870 --> 00:30:28.720
Let's get started.

00:30:37.047 --> 00:30:38.880
So here's a baby that's
not doing very well.

00:30:42.050 --> 00:30:44.180
This baby is in the
intensive care unit.

00:30:48.050 --> 00:30:51.230
Maybe it was a premature infant.

00:30:51.230 --> 00:30:56.360
Maybe it's a baby who
has some chronic disease,

00:30:56.360 --> 00:30:59.510
and, of course, parents
are very worried.

00:30:59.510 --> 00:31:02.410
This baby is getting
very close monitoring.

00:31:02.410 --> 00:31:04.220
It's connected to lots
of different probes.

00:31:07.031 --> 00:31:10.160
Number one here is
illustrating a three-probe--

00:31:10.160 --> 00:31:13.670
three-lead ECG, which we'll be
talking about much more, which

00:31:13.670 --> 00:31:17.270
is measuring how
the baby's heart is doing.

00:31:17.270 --> 00:31:21.170
Over here, this number
three is something

00:31:21.170 --> 00:31:24.206
attached to the baby's foot,
which is measuring its--

00:31:24.206 --> 00:31:27.563
it's a pulse oximeter, which
is measuring the baby's oxygen

00:31:27.563 --> 00:31:29.480
saturation, the amount
of oxygen in the blood.

00:31:32.780 --> 00:31:35.900
Number four is a probe which
is measuring the baby's

00:31:35.900 --> 00:31:37.040
temperature and so on.

00:31:37.040 --> 00:31:40.168
And so we're taking
really close measurements

00:31:40.168 --> 00:31:41.960
of this baby, because
we want to understand

00:31:41.960 --> 00:31:43.400
how is this baby doing.

00:31:43.400 --> 00:31:47.120
We recognize that there might
be really sudden changes

00:31:47.120 --> 00:31:49.010
in the baby's state
of health that we

00:31:49.010 --> 00:31:52.650
want to be able to recognize
as early as possible.

00:31:52.650 --> 00:31:56.240
And so behind the scenes,
next to this baby,

00:31:56.240 --> 00:31:58.790
you'll, of course, have a
huge number of monitors,

00:31:58.790 --> 00:32:00.915
each of the monitors showing
the readouts from each

00:32:00.915 --> 00:32:03.200
of these different signals.

00:32:03.200 --> 00:32:07.870
And this type of data is really
prevalent in intensive care

00:32:07.870 --> 00:32:10.600
units, but you'll also
see in today's lecture

00:32:10.600 --> 00:32:12.760
how some aspects of
this data are now

00:32:12.760 --> 00:32:15.040
starting to make its way
to the home, as well.

00:32:15.040 --> 00:32:20.590
So for example, EKGs are now
available on Apple and Samsung

00:32:20.590 --> 00:32:24.250
watches to help understand--

00:32:24.250 --> 00:32:27.010
to help with the
diagnosis of arrhythmias,

00:32:27.010 --> 00:32:29.290
even for people at home.

00:32:29.290 --> 00:32:30.790
And so from this
type of data, there

00:32:30.790 --> 00:32:34.210
are a number of really important
use cases to think about.

00:32:34.210 --> 00:32:36.210
The first one is to
recognize that often we're

00:32:36.210 --> 00:32:39.030
getting really noisy
data, and we want to try

00:32:39.030 --> 00:32:40.710
to infer the true signal.

00:32:40.710 --> 00:32:43.170
So imagine, for example,
the temperature probe.

00:32:43.170 --> 00:32:47.100
The baby's true
temperature might be 98.5,

00:32:47.100 --> 00:32:50.640
but for whatever reason-- we'll
see a few reasons here today--

00:32:50.640 --> 00:32:53.197
maybe you're getting
an observation of 93.

00:32:53.197 --> 00:32:54.030
And you don't know.

00:32:54.030 --> 00:32:56.190
Is that actually the
true baby temperature?

00:32:56.190 --> 00:32:57.300
In which case we--

00:32:57.300 --> 00:32:59.250
it would be in a lot of trouble.

00:32:59.250 --> 00:33:01.080
Or is that an anomalous reading?

00:33:01.080 --> 00:33:03.288
So we'd like to be able to
distinguish between those two

00:33:03.288 --> 00:33:04.440
things.

00:33:04.440 --> 00:33:09.090
And in other cases, we are
interested in not necessarily

00:33:09.090 --> 00:33:12.030
fully understanding what's going
on with the baby along each

00:33:12.030 --> 00:33:15.660
of those axes, but we
just want to use that data

00:33:15.660 --> 00:33:17.880
for predictive purposes,
for risk stratification,

00:33:17.880 --> 00:33:19.367
for example.

00:33:19.367 --> 00:33:21.200
And so the type of
machine learning approach

00:33:21.200 --> 00:33:25.520
that we'll take here will depend
on the following three factors.

00:33:25.520 --> 00:33:28.350
First, do we have
labeled data available?

00:33:28.350 --> 00:33:30.470
For example, do we
know the ground truth

00:33:30.470 --> 00:33:34.130
of what the baby's
true temperature was,

00:33:34.130 --> 00:33:38.630
at least for a few of the
babies in the training set?

00:33:38.630 --> 00:33:39.680
Second.

00:33:39.680 --> 00:33:43.310
Do we have a good mechanistic
or statistical model

00:33:43.310 --> 00:33:46.113
of how this data might
evolve across time?

00:33:46.113 --> 00:33:47.780
We know a lot about
hearts, for example.

00:33:47.780 --> 00:33:49.655
Cardiology is one of
those fields of medicine

00:33:49.655 --> 00:33:51.500
where it's really well studied.

00:33:51.500 --> 00:33:53.360
There are good
simulators of hearts,

00:33:53.360 --> 00:33:54.950
and how they beat
across time, and how

00:33:54.950 --> 00:34:01.150
that affects the electrical
stimulation across the body.

00:34:01.150 --> 00:34:03.970
And if we have these
good mechanistic

00:34:03.970 --> 00:34:05.770
or statistical
models, that can often

00:34:05.770 --> 00:34:08.800
allow one to trade off not
having much labeled data,

00:34:08.800 --> 00:34:11.540
or just not having
much data period.

00:34:11.540 --> 00:34:13.850
And it's really
these three points

00:34:13.850 --> 00:34:16.429
which I want to illustrate
the extremes of in today's

00:34:16.429 --> 00:34:16.955
lecture--

00:34:16.955 --> 00:34:18.830
what do you do when you
don't have much data,

00:34:18.830 --> 00:34:20.000
and what you do
when-- what you can

00:34:20.000 --> 00:34:21.395
do when you have a ton of data.

00:34:21.395 --> 00:34:24.054
And I think it's going to
be really informative for us

00:34:24.054 --> 00:34:26.179
as we go out into the world
and will have to tackle

00:34:26.179 --> 00:34:27.304
each of those two settings.

00:34:30.159 --> 00:34:33.500
So here's an example of two
different babies with very

00:34:33.500 --> 00:34:35.150
different trajectories.

00:34:35.150 --> 00:34:38.449
On the x-axis here
is time--

00:34:38.449 --> 00:34:41.688
the units there--

00:34:41.688 --> 00:34:42.980
I think seconds, maybe minutes.

00:34:42.980 --> 00:34:46.130
The y-axis here is beats per
minute of the baby's heart

00:34:46.130 --> 00:34:50.630
rate, and you see in
some cases it's really

00:34:50.630 --> 00:34:51.938
fluctuating a lot up and down.

00:34:51.938 --> 00:34:54.230
In some cases, it's sort of
going in a similar-- in one

00:34:54.230 --> 00:34:58.700
direction, and in all cases,
the short term observations

00:34:58.700 --> 00:35:03.562
are very different from the
long range trajectories.

00:35:03.562 --> 00:35:05.020
So the first problem
that I want us

00:35:05.020 --> 00:35:10.450
to think about is one
of trying to understand,

00:35:10.450 --> 00:35:13.972
how do we deconvolve between the
truth of what's going on with,

00:35:13.972 --> 00:35:15.680
for example, the
patient's blood pressure

00:35:15.680 --> 00:35:20.163
or oxygen versus interventions
that are happening to them?

00:35:20.163 --> 00:35:21.580
So on the bottom
here, I'm showing

00:35:21.580 --> 00:35:24.750
examples of interventions.

00:35:24.750 --> 00:35:27.810
Here in this oxygen
uptake, we notice

00:35:27.810 --> 00:35:31.047
how between roughly 1,000
and 2,000 seconds suddenly

00:35:31.047 --> 00:35:32.255
there's no signal whatsoever.

00:35:32.255 --> 00:35:34.213
And that's an example of
what's called dropout.

00:35:36.520 --> 00:35:39.650
Over here, we see a
different type of--

00:35:39.650 --> 00:35:42.430
the effect of a
different intervention,

00:35:42.430 --> 00:35:44.770
which is due to a
probe recalibration.

00:35:44.770 --> 00:35:46.870
Now, at that time,
there was a drop out

00:35:46.870 --> 00:35:50.170
followed by a sudden
change in the values,

00:35:50.170 --> 00:35:52.720
and that's really happening
due to a recalibration step.

00:35:52.720 --> 00:35:55.710
And in both of
these cases, what's

00:35:55.710 --> 00:35:58.132
going on with the individual
might be relatively

00:35:58.132 --> 00:36:00.090
constant across time,
but what's being observed

00:36:00.090 --> 00:36:04.240
is dramatically affected
by those interventions.

00:36:04.240 --> 00:36:06.070
So we want to ask
the question, can we

00:36:06.070 --> 00:36:08.788
identify those
artifactual processes?

00:36:08.788 --> 00:36:11.080
Can we identify that these
interventions were happening

00:36:11.080 --> 00:36:12.080
at those points in time?

00:36:15.680 --> 00:36:18.000
And then, if we
could identify them,

00:36:18.000 --> 00:36:21.120
then we could potentially
subtract their effect out.

00:36:21.120 --> 00:36:27.210
So we could impute the
data, which we know-- now

00:36:27.210 --> 00:36:30.390
know to be missing, and then
have this much higher quality

00:36:30.390 --> 00:36:33.130
signal used for some
downstream predictive purpose,

00:36:33.130 --> 00:36:34.910
for example.

00:36:34.910 --> 00:36:37.510
And the second reason why
this can be really important

00:36:37.510 --> 00:36:40.660
is to tackle this problem
called alarm fatigue.

00:36:43.370 --> 00:36:47.030
Alarm fatigue is one of the
most important challenges facing

00:36:47.030 --> 00:36:48.500
medicine today.

00:36:48.500 --> 00:36:52.370
As we get better and better
in doing risk stratification,

00:36:52.370 --> 00:36:58.700
as we come up with more and
more diagnostic tools and tests,

00:36:58.700 --> 00:37:02.090
that means these red flags
are being raised more and more

00:37:02.090 --> 00:37:03.690
often.

00:37:03.690 --> 00:37:08.170
And each one of these has some
associated false positive rate

00:37:08.170 --> 00:37:09.800
for it.

00:37:09.800 --> 00:37:13.510
And so the more tests you have--

00:37:13.510 --> 00:37:15.250
suppose the false
positive rate is

00:37:15.250 --> 00:37:18.160
kept constant-- the more tests
you have, the more likely

00:37:18.160 --> 00:37:20.140
it is that at least
one of those alarms

00:37:20.140 --> 00:37:24.568
is going to be a false positive.
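
For a rough sense of scale-- a back-of-the-envelope
illustration, assuming the alarms fire independently--
with a 1% false positive rate per alarm and 50 alarm
checks, the chance of at least one false alarm is
1 - 0.99^50, or roughly 40%.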

00:37:24.568 --> 00:37:27.540
And so when you're in
an intensive care unit,

00:37:27.540 --> 00:37:29.500
there are alarms going
off all the time.

00:37:29.500 --> 00:37:31.630
And something that happens
is that nurses end up

00:37:31.630 --> 00:37:35.110
starting to ignore those
alarms, because so often

00:37:35.110 --> 00:37:37.480
those alarms are
false positives,

00:37:37.480 --> 00:37:39.700
are due to, for
example, artifacts

00:37:39.700 --> 00:37:41.835
like what I'm showing you here.

00:37:41.835 --> 00:37:43.960
And so if we had techniques,
such as the ones we'll

00:37:43.960 --> 00:37:47.680
talk about right now,
which could recognize when,

00:37:47.680 --> 00:37:50.470
for example, the sudden drop
in a patient's heart rate

00:37:50.470 --> 00:37:54.940
is due to an artifact and not
due to the patient's true heart

00:37:54.940 --> 00:37:56.958
rate dropping--

00:37:56.958 --> 00:37:58.500
if we had enough
confidence in that--

00:37:58.500 --> 00:37:59.958
in distinguishing
those two things,

00:37:59.958 --> 00:38:03.150
then we might decide not
to raise that red flag.

00:38:03.150 --> 00:38:06.430
And that might reduce the
amount of false alarms,

00:38:06.430 --> 00:38:09.150
and that then might reduce
the amount of alarm fatigue.

00:38:09.150 --> 00:38:11.850
And that could have a very
big impact on health care.

00:38:15.980 --> 00:38:19.150
So the technique which
we'll talk about today

00:38:19.150 --> 00:38:24.170
goes by the name of switching
linear dynamical systems.

00:38:24.170 --> 00:38:25.820
Who here has seen
a picture like this

00:38:25.820 --> 00:38:29.630
on-- this picture on
the bottom-- before?

00:38:29.630 --> 00:38:32.173
About half of the room.

00:38:32.173 --> 00:38:33.590
So for the other
half of the room,

00:38:33.590 --> 00:38:36.620
I'm going to give
a bit of a recap

00:38:36.620 --> 00:38:38.960
into probabilistic modeling.

00:38:38.960 --> 00:38:43.830
All of you are now familiar
with general probabilities.

00:38:43.830 --> 00:38:48.230
So you're used to thinking
about, for example,

00:38:48.230 --> 00:38:51.230
univariate Gaussian
distributions.

00:38:51.230 --> 00:38:54.050
We talked about how one
could model survival, which

00:38:54.050 --> 00:38:57.440
was an example of
such a distribution,

00:38:57.440 --> 00:38:59.088
but for today's
lecture, we're going

00:38:59.088 --> 00:39:01.130
to be thinking now about
multivariate probability

00:39:01.130 --> 00:39:01.820
distributions.

00:39:01.820 --> 00:39:05.870
In particular, we'll be thinking
about how a patient's state--

00:39:05.870 --> 00:39:08.120
let's say their true
blood pressure--

00:39:08.120 --> 00:39:09.990
evolves across time.

00:39:09.990 --> 00:39:14.570
And so now we're interested in
not just the random variable

00:39:14.570 --> 00:39:16.740
at one point in time, but
that same random variable

00:39:16.740 --> 00:39:18.782
at the second point in
time, third point in time,

00:39:18.782 --> 00:39:21.488
fourth point in time, fifth
point in time, and so on.

00:39:21.488 --> 00:39:23.030
So what I'm showing
you here is known

00:39:23.030 --> 00:39:26.270
as a graphical model, also
known as a Bayesian network.

00:39:26.270 --> 00:39:29.050
And it's one way of illustrating
a multivariate probability

00:39:29.050 --> 00:39:31.460
distribution that has particular
conditional independence

00:39:31.460 --> 00:39:33.490
properties.

00:39:33.490 --> 00:39:40.690
Specifically, in
this model, one node

00:39:40.690 --> 00:39:42.260
corresponds to one
random variable.

00:39:42.260 --> 00:39:46.840
So this is describing a
joint distribution on x1

00:39:46.840 --> 00:39:55.117
through x6, y1 through y6.

00:39:55.117 --> 00:39:56.700
So it's this
multivariate distribution

00:39:56.700 --> 00:40:00.570
on 12 random variables.

00:40:00.570 --> 00:40:03.600
The fact that this
is shaded in simply

00:40:03.600 --> 00:40:07.110
denotes that, at test time, when
we use these models, typically

00:40:07.110 --> 00:40:09.780
these y variables are observed.

00:40:09.780 --> 00:40:13.410
Whereas our goal is usually
to infer the x variables.

00:40:13.410 --> 00:40:16.950
Those are typically unobserved,
meaning that our typical task

00:40:16.950 --> 00:40:20.340
is one of doing posterior
inference to infer

00:40:20.340 --> 00:40:22.725
the x's given the y's.

00:40:25.470 --> 00:40:28.860
Now, associated with
this graph, I already

00:40:28.860 --> 00:40:31.740
told you the nodes correspond
to random variables.

00:40:31.740 --> 00:40:36.330
The graph tells us how is this
joint distribution factorized.

00:40:36.330 --> 00:40:41.130
In particular, it's
going to be factorized

00:40:41.130 --> 00:40:42.240
in the following way--

00:40:42.240 --> 00:40:45.210
as the product over
random variables

00:40:45.210 --> 00:40:49.000
of the probability of
the i-th random variable.

00:40:49.000 --> 00:40:51.840
I'm going to use z to just
denote a random variable.

00:40:51.840 --> 00:40:55.680
Think of z as the
union of x and y.

00:40:55.680 --> 00:40:59.610
zi conditioned on the parents--

00:40:59.610 --> 00:41:01.800
the values of the parents of zi.
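
Written out, with pa(z_i) denoting the parents of the i-th
random variable, the factorization being described is:

    p(z_1, \dots, z_n) = \prod_i p(z_i \mid \mathrm{pa}(z_i))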

00:41:05.820 --> 00:41:10.080
So I'm going to assume
this factorization,

00:41:10.080 --> 00:41:13.800
and in particular for this
graphical model, which

00:41:13.800 --> 00:41:15.870
goes by the name
of a Markov model,

00:41:15.870 --> 00:41:18.810
it has a very specific
factorization.

00:41:18.810 --> 00:41:22.180
And we're just going to read
it off from this definition.

00:41:22.180 --> 00:41:26.340
So we're going to go in
order-- first x1, then y1,

00:41:26.340 --> 00:41:28.410
then x2, then y2,
and so on, which

00:41:28.410 --> 00:41:36.630
is going based on
a root to children

00:41:36.630 --> 00:41:39.340
traversal of this graph.

00:41:39.340 --> 00:41:44.410
So the first random
variable is x1.

00:41:44.410 --> 00:41:50.230
Second variable is y1, and
what are the parents of y--

00:41:50.230 --> 00:41:51.757
sorry, what are
the parents of y1.

00:41:51.757 --> 00:41:52.840
Everyone can say out loud.

00:41:52.840 --> 00:41:54.070
AUDIENCE: x1.

00:41:54.070 --> 00:41:55.090
DAVID SONTAG: x1.

00:41:55.090 --> 00:42:01.450
So y1 in this factorization
is only going to depend on x1.

00:42:01.450 --> 00:42:02.740
Next we have x2.

00:42:02.740 --> 00:42:03.940
What are the parents of x2?

00:42:03.940 --> 00:42:05.390
Everyone say out loud?

00:42:05.390 --> 00:42:06.370
AUDIENCE: x1.

00:42:06.370 --> 00:42:07.840
DAVID SONTAG: x1.

00:42:07.840 --> 00:42:09.790
Then we have y2.

00:42:09.790 --> 00:42:11.633
What are the parents of y2.

00:42:11.633 --> 00:42:12.550
Everyone say out loud.

00:42:12.550 --> 00:42:14.080
AUDIENCE: x2.

00:42:14.080 --> 00:42:16.960
DAVID SONTAG: x2 and so on.

00:42:16.960 --> 00:42:20.920
So this joint
distribution is going

00:42:20.920 --> 00:42:23.560
to have a particularly
simple form, which

00:42:23.560 --> 00:42:26.280
is given by this
factorization shown here.

00:42:26.280 --> 00:42:28.420
And this factorization
corresponds one to one

00:42:28.420 --> 00:42:32.400
with the particular graph in
the way that I just told you.

00:42:32.400 --> 00:42:35.760
And in this way, we can define
a very complex probability

00:42:35.760 --> 00:42:39.900
distribution by a number of much
simpler conditional probability

00:42:39.900 --> 00:42:41.220
distributions.

00:42:41.220 --> 00:42:44.740
For example, if each of the
random variables were binary,

00:42:44.740 --> 00:42:48.840
then to describe
probability of y1 given x1,

00:42:48.840 --> 00:42:50.250
we only need two numbers.

00:42:50.250 --> 00:42:52.840
For each value of
x1, either 0 or 1,

00:42:52.840 --> 00:42:55.290
we give the probability
of y1 equals 1.

00:42:55.290 --> 00:42:59.530
And then, of course, the probability
of y1 equals 0 is just 1 minus that.

00:42:59.530 --> 00:43:02.290
So we can describe that very
complicated joint distribution

00:43:02.290 --> 00:43:07.200
by a number of much
smaller distributions.
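
For this particular chain, reading the factorization off the
graph in the way just described gives:

    p(x_{1:6}, y_{1:6}) = p(x_1)\, p(y_1 \mid x_1)
        \prod_{t=2}^{6} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)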

00:43:07.200 --> 00:43:10.700
Now, the reason why I'm
drawing it in this way

00:43:10.700 --> 00:43:13.940
is because we're making some
really strong assumptions

00:43:13.940 --> 00:43:18.020
about the temporal
dynamics in this problem.

00:43:18.020 --> 00:43:23.360
In particular, the
fact that x3 only

00:43:23.360 --> 00:43:27.720
has an arrow from
x2 and not from x1

00:43:27.720 --> 00:43:32.540
implies that x3 is
conditionally independent of x1

00:43:32.540 --> 00:43:34.400
if you knew x2's value.

00:43:34.400 --> 00:43:37.970
So in some sense, think
about this as cutting.

00:43:37.970 --> 00:43:40.700
If you're to take
x2 out of the model

00:43:40.700 --> 00:43:43.040
and remove all edges
incident on it,

00:43:43.040 --> 00:43:46.490
then x1 and x3 are now
separated from one another.

00:43:46.490 --> 00:43:48.110
They're independent.

00:43:48.110 --> 00:43:51.740
Now, for those of you who
do know graphical models,

00:43:51.740 --> 00:43:54.770
you'll recognize that that type
of independent statement that I

00:43:54.770 --> 00:43:56.480
made is only true
for Markov models,

00:43:56.480 --> 00:43:58.605
and the semantics
for Bayesian networks

00:43:58.605 --> 00:43:59.730
are a little bit different.

00:43:59.730 --> 00:44:02.058
But actually for this
model, it's-- they're one

00:44:02.058 --> 00:44:02.600
and the same.

00:44:05.910 --> 00:44:08.990
So we're going to make
the following assumptions

00:44:08.990 --> 00:44:12.890
for the conditional
distributions shown here.

00:44:12.890 --> 00:44:16.850
First, we're going to suppose
that xt is given to you

00:44:16.850 --> 00:44:19.490
by a Gaussian distribution.

00:44:19.490 --> 00:44:23.570
Remember xt-- t is
denoting a time step.

00:44:23.570 --> 00:44:26.815
Let's say 3-- it only
depends in this picture--

00:44:26.815 --> 00:44:28.190
the conditional
distribution only

00:44:28.190 --> 00:44:30.650
depends on the previous
time step's value, x2,

00:44:30.650 --> 00:44:32.310
or xt minus 1.

00:44:32.310 --> 00:44:34.850
So you'll notice how
I'm going to say here

00:44:34.850 --> 00:44:36.620
xt is going to
distribute as something,

00:44:36.620 --> 00:44:38.690
but the only random
variables in this something

00:44:38.690 --> 00:44:42.680
can be xt minus 1, according
to these assumptions.

00:44:42.680 --> 00:44:44.180
In particular, we're
going to assume

00:44:44.180 --> 00:44:47.930
that it's some Gaussian
distribution, whose mean is

00:44:47.930 --> 00:44:51.020
some linear transformation
of xt minus 1,

00:44:51.020 --> 00:44:55.240
and which has a fixed
covariance matrix q.

00:44:55.240 --> 00:45:00.310
So at each step of this process,
the next random variable

00:45:00.310 --> 00:45:03.700
is some random walk from
the previous random variable

00:45:03.700 --> 00:45:07.833
where you're moving according
to some Gaussian distribution.

00:45:07.833 --> 00:45:09.250
In a very similar
way, we're going

00:45:09.250 --> 00:45:17.410
to assume that yt is drawn also
as a Gaussian distribution,

00:45:17.410 --> 00:45:20.550
but now depending on xt.
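
As a minimal sketch of these two assumptions-- with A, Q and
the emission parameters (called C and R here) as illustrative
one-dimensional values, not anything taken from the paper--
sampling from such a linear dynamical system might look like
this in Python:

    import numpy as np

    rng = np.random.default_rng(0)
    A, Q = 0.99, 0.1   # transition coefficient and transition noise variance
    C, R = 1.0, 0.5    # emission coefficient and observation noise variance

    T = 100
    x = np.zeros(T)    # latent state, e.g. the "true" blood pressure
    y = np.zeros(T)    # what the monitor actually reports
    x[0] = rng.normal(0.0, 1.0)
    for t in range(T):
        if t > 0:
            # x_t ~ N(A * x_{t-1}, Q)
            x[t] = A * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
        # y_t ~ N(C * x_t, R)
        y[t] = C * x[t] + rng.normal(0.0, np.sqrt(R))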

00:45:20.550 --> 00:45:24.120
So I want you to think
about xt as the true state

00:45:24.120 --> 00:45:25.410
of the patient.

00:45:25.410 --> 00:45:28.590
It's a vector that's
summarizing their blood

00:45:28.590 --> 00:45:31.200
pressure, their
oxygen saturation,

00:45:31.200 --> 00:45:33.150
a whole bunch of
other parameters,

00:45:33.150 --> 00:45:35.460
or maybe even just one of those.

00:45:35.460 --> 00:45:39.300
And the y's are the observations
that you actually observe.

00:45:39.300 --> 00:45:41.890
So let's say x1 is the
patient's true blood pressure.

00:45:41.890 --> 00:45:43.980
y1 is the observed
blood pressure,

00:45:43.980 --> 00:45:47.010
what comes from your monitor.

00:45:47.010 --> 00:45:48.660
So then a reasonable
assumption would

00:45:48.660 --> 00:45:52.350
be that, well, if
all this were equal,

00:45:52.350 --> 00:45:53.910
if it was a true
observation, then

00:45:53.910 --> 00:45:55.750
y1 should be very close to x1.

00:45:55.750 --> 00:45:58.680
So you might assume that
this covariance matrix--

00:45:58.680 --> 00:46:01.460
the variance-- is
very, very small.

00:46:01.460 --> 00:46:07.280
y1 should be very close to x1
if it's a good observation.

00:46:07.280 --> 00:46:10.100
And of course, if it's
a noisy observation--

00:46:10.100 --> 00:46:15.680
like, for example, if the probe
was disconnected from the baby,

00:46:15.680 --> 00:46:19.790
then y1 should have
no relationship to x1.

00:46:19.790 --> 00:46:23.460
And that dependence on the
actual state of the world

00:46:23.460 --> 00:46:26.730
I'm denoting here by these
superscripts, s of t.

00:46:26.730 --> 00:46:28.730
I'm ignoring that right
now, and I'll bring that

00:46:28.730 --> 00:46:31.910
in in the next slide.

00:46:31.910 --> 00:46:36.230
Similarly, the relationship
between x2 and x1

00:46:36.230 --> 00:46:38.510
should be one which captures
some of the dynamics

00:46:38.510 --> 00:46:42.140
that I showed in the previous
slides, where I showed over

00:46:42.140 --> 00:46:46.040
here now this is the patient's
true heart rate evolving

00:46:46.040 --> 00:46:48.080
across time, let's say.

00:46:48.080 --> 00:46:51.800
Notice how, if you
look very locally,

00:46:51.800 --> 00:46:56.720
it looks like there are some
very, very big local dynamics.

00:46:56.720 --> 00:46:58.790
Whereas if you
look more globally,

00:46:58.790 --> 00:47:01.340
again, there's some smoothness,
but there are some-- again,

00:47:01.340 --> 00:47:03.590
it looks like some random
changes across time.

00:47:03.590 --> 00:47:10.070
And so those-- that
drift has to somehow

00:47:10.070 --> 00:47:13.550
be summarized in this model
by that matrix A.

00:47:13.550 --> 00:47:16.130
And I'll get into more detail
about that in just a moment.

00:47:18.750 --> 00:47:20.990
So what I just showed
you was an example

00:47:20.990 --> 00:47:23.360
of a linear dynamical
system, but it

00:47:23.360 --> 00:47:27.170
was assuming that there were
none of these events happening,

00:47:27.170 --> 00:47:30.082
none of these
artifacts happening.

00:47:30.082 --> 00:47:31.540
The actual model
that we were going

00:47:31.540 --> 00:47:33.040
to want to be able
to use then is

00:47:33.040 --> 00:47:34.330
going to also
incorporate the fact

00:47:34.330 --> 00:47:35.320
that there might be artifacts.

00:47:35.320 --> 00:47:36.640
And to model that,
we need to introduce

00:47:36.640 --> 00:47:38.473
additional random
variables corresponding to

00:47:38.473 --> 00:47:40.250
whether those artifacts
occurred or not.

00:47:40.250 --> 00:47:42.290
And so that's now this model.

00:47:42.290 --> 00:47:45.370
So I'm going to let these S's--

00:47:45.370 --> 00:47:47.850
these are other
random variables,

00:47:47.850 --> 00:47:51.310
which are denoting
artifactual events.

00:47:51.310 --> 00:47:52.970
They are also
evolving with time.

00:47:52.970 --> 00:47:55.420
For example, if there's an
artifactual event

00:47:55.420 --> 00:47:57.875
at three seconds, maybe there's
also an artifactual event

00:47:57.875 --> 00:47:58.720
at four seconds.

00:47:58.720 --> 00:48:00.887
And we'd like to model the
relationship between those.

00:48:00.887 --> 00:48:02.600
That's why you
have these arrows.

00:48:02.600 --> 00:48:08.180
And then the way that we
interpret the observations

00:48:08.180 --> 00:48:12.620
that we do get depends
on both the true value

00:48:12.620 --> 00:48:14.340
of what's going on
with the patient

00:48:14.340 --> 00:48:17.612
and whether there was an
artifactual event or not.

00:48:17.612 --> 00:48:19.070
And you'll notice
that there's also

00:48:19.070 --> 00:48:20.780
an edge going from
the artifactual events

00:48:20.780 --> 00:48:23.270
to the true values
to note the fact

00:48:23.270 --> 00:48:27.680
that those interventions
might actually

00:48:27.680 --> 00:48:29.030
be affecting the patient.

00:48:29.030 --> 00:48:31.040
For example, if you
give them a medication

00:48:31.040 --> 00:48:36.800
to change their blood
pressure, then that procedure

00:48:36.800 --> 00:48:39.895
is going to affect the next time
step's value of the patient's

00:48:39.895 --> 00:48:40.520
blood pressure.
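
To give a flavor of how the switching variable changes the
emission-- this is only an illustrative sketch, with s_t = 1
standing in for "probe dropped out," which is not necessarily
how the paper encodes its artifact states:

    import numpy as np

    def sample_observation(x_t, s_t, rng, R=0.5):
        if s_t == 1:
            # Artifact, e.g. probe dropout: the reading no longer tracks
            # the true state; model it as pure sensor noise around zero.
            return rng.normal(0.0, 1.0)
        # Normal operation: the reading is the true state plus small noise.
        return x_t + rng.normal(0.0, np.sqrt(R))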

00:48:44.360 --> 00:48:47.917
So when one wants
to learn this model,

00:48:47.917 --> 00:48:49.750
you have to ask yourself,
what types of data

00:48:49.750 --> 00:48:51.167
do you have available?

00:48:54.370 --> 00:48:59.680
Unfortunately, it's very hard
to get data on both the ground

00:48:59.680 --> 00:49:02.210
truth, what's going
on with the patient,

00:49:02.210 --> 00:49:06.530
and whether these artifacts
truly occurred or not.

00:49:06.530 --> 00:49:09.530
Instead, what we actually have
are just these observations.

00:49:09.530 --> 00:49:13.450
We get these very noisy blood
pressure draws across time.

00:49:13.450 --> 00:49:16.500
So what this paper does is
it uses a maximum likelihood

00:49:16.500 --> 00:49:18.797
estimation approach,
where it recognizes

00:49:18.797 --> 00:49:20.880
that we're going to be
learning from missing data.

00:49:20.880 --> 00:49:23.940
We're going to explicitly
think of these x's and the s's

00:49:23.940 --> 00:49:25.875
as latent variables.

00:49:25.875 --> 00:49:27.990
And we're going to
maximize the likelihood

00:49:27.990 --> 00:49:31.820
of the whole entire model,
marginalizing over x and s.

00:49:31.820 --> 00:49:34.485
So just maximizing the marginal
likelihood over the y's.

00:49:37.240 --> 00:49:39.740
Now, for those of you who have
studied unsupervised learning

00:49:39.740 --> 00:49:43.570
before, you might recognize
that as a very hard learning

00:49:43.570 --> 00:49:44.070
problem.

00:49:44.070 --> 00:49:47.780
In fact, it's-- that
likelihood is non-convex.

00:49:47.780 --> 00:49:51.990
And one could imagine all sorts
of a heuristics for learning,

00:49:51.990 --> 00:49:55.460
such as gradient descent,
or, as this paper uses,

00:49:55.460 --> 00:49:59.180
expectation maximization, and
because of that non-convexity,

00:49:59.180 --> 00:50:00.750
each of these
algorithms typically

00:50:00.750 --> 00:50:04.040
will only reach a local
maximum of the likelihood.

00:50:04.040 --> 00:50:08.420
So this paper uses EM,
which intuitively iterates

00:50:08.420 --> 00:50:14.420
between inferring those missing
variables-- so imputing the x's

00:50:14.420 --> 00:50:17.210
and the s's given
the current model,

00:50:17.210 --> 00:50:20.300
and doing posterior inference
to infer the missing

00:50:20.300 --> 00:50:22.760
variables given the
observed variables, using

00:50:22.760 --> 00:50:24.140
the current model.

00:50:24.140 --> 00:50:27.020
And then, once you've
imputed those variables,

00:50:27.020 --> 00:50:28.910
attempting to refit the model.

00:50:28.910 --> 00:50:30.920
So that's called the
m-step for maximization,

00:50:30.920 --> 00:50:32.900
which updates the model and
just iterates between those two

00:50:32.900 --> 00:50:33.400
things.
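
The E-step / M-step pattern is easiest to see on a much
simpler latent-variable model. Here is a toy example-- a
two-component 1-D Gaussian mixture, not the paper's switching
dynamical system-- just to illustrate the alternation:

    import numpy as np

    rng = np.random.default_rng(0)
    y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

    pi = 0.5                        # mixture weight of component 1
    mu = np.array([-1.0, 1.0])      # component means
    var = np.array([1.0, 1.0])      # component variances
    for _ in range(50):
        # E-step: posterior responsibility of component 1 for each point.
        p0 = (1 - pi) * np.exp(-(y - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(var[0])
        p1 = pi * np.exp(-(y - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(var[1])
        r = p1 / (p0 + p1)
        # M-step: refit the parameters given those soft assignments.
        pi = r.mean()
        mu = np.array([((1 - r) * y).sum() / (1 - r).sum(),
                       (r * y).sum() / r.sum()])
        var = np.array([((1 - r) * (y - mu[0]) ** 2).sum() / (1 - r).sum(),
                        (r * (y - mu[1]) ** 2).sum() / r.sum()])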

00:50:33.400 --> 00:50:36.590
That's one learning
algorithm which

00:50:36.590 --> 00:50:39.650
is guaranteed to reach a
local maximum of the likelihood

00:50:39.650 --> 00:50:42.830
under some regularity
assumptions.

00:50:42.830 --> 00:50:44.690
And so this paper
uses that algorithm,

00:50:44.690 --> 00:50:46.520
but you need to be
asking yourself,

00:50:46.520 --> 00:50:50.270
if all you ever
observe are the y's,

00:50:50.270 --> 00:50:54.830
then will this algorithm
ever recover anything

00:50:54.830 --> 00:50:56.600
close to the true model?

00:50:56.600 --> 00:50:58.310
For example, there
might be large amounts

00:50:58.310 --> 00:51:00.080
of non-identifiability here.

00:51:00.080 --> 00:51:04.490
It could be that you
could swap the meaning

00:51:04.490 --> 00:51:10.170
of the s's, and you'd get a
similar likelihood on the y's.

00:51:10.170 --> 00:51:14.010
That's where bringing in domain
knowledge becomes critical.

00:51:14.010 --> 00:51:17.670
So this is going to be an
example where we have no labeled

00:51:17.670 --> 00:51:22.948
data or very little labeled data.

00:51:22.948 --> 00:51:24.740
And we're going to do
unsupervised learning

00:51:24.740 --> 00:51:26.282
of this model, but
we're going to use

00:51:26.282 --> 00:51:28.790
a ton of domain knowledge
in order to constrain

00:51:28.790 --> 00:51:31.050
the model as much as possible.

00:51:31.050 --> 00:51:33.490
So what is that
domain knowledge?

00:51:33.490 --> 00:51:37.730
Well, first we're
going to use the fact

00:51:37.730 --> 00:51:47.200
that we know that a true heart
rate evolves in a fashion that

00:51:47.200 --> 00:51:53.530
can be very well modeled by
an autoregressive process.

00:51:53.530 --> 00:51:56.260
So the autoregressive process
that's used in this paper

00:51:56.260 --> 00:51:58.630
is used to model the
normal heart rate dynamics.

00:51:58.630 --> 00:52:01.060
In a moment, I'll tell you how
to model the abnormal heart

00:52:01.060 --> 00:52:03.370
rate observations.

00:52:03.370 --> 00:52:05.530
And intuitively-- I'll
first go over the intuition,

00:52:05.530 --> 00:52:06.850
then I'll give you the math.

00:52:06.850 --> 00:52:08.650
Intuitively what it
does is it recognizes

00:52:08.650 --> 00:52:14.060
that this complicated signal can
be decomposed into two pieces.

00:52:14.060 --> 00:52:18.020
The first piece shown here
is called a baseline signal,

00:52:18.020 --> 00:52:20.315
and that, if you
squint your eyes

00:52:20.315 --> 00:52:22.700
and you sort of ignore the
very local fluctuations,

00:52:22.700 --> 00:52:24.860
this is what you get out.

00:52:24.860 --> 00:52:27.230
And then you can
look at the residual

00:52:27.230 --> 00:52:32.330
of subtracting this signal,
subtracting this baseline

00:52:32.330 --> 00:52:33.710
from the signal.

00:52:33.710 --> 00:52:36.250
And what you get
out looks like this.

00:52:36.250 --> 00:52:39.770
Notice here it's around 0 mean.

00:52:39.770 --> 00:52:42.585
So it's a 0 mean signal with
some random fluctuations,

00:52:42.585 --> 00:52:44.210
and the fluctuations
are happening here

00:52:44.210 --> 00:52:47.210
at a much faster rate than--

00:52:47.210 --> 00:52:49.830
than for the original baseline.

00:52:49.830 --> 00:52:56.910
And so the sum of bt and
this residual

00:52:56.910 --> 00:53:00.200
is exactly equal
to the true heart rate.

00:53:00.200 --> 00:53:03.290
And each of these two things
we can model very well.

00:53:03.290 --> 00:53:08.210
This we can model by
a random walk with--

00:53:08.210 --> 00:53:10.970
which goes very
slowly, and this we

00:53:10.970 --> 00:53:15.297
can model by a random walk
which goes very quickly.

00:53:15.297 --> 00:53:17.630
And that is exactly what I'm
now going to show over here

00:53:17.630 --> 00:53:19.180
on the left hand side.

00:53:19.180 --> 00:53:22.880
bt, this baseline
signal, we're going

00:53:22.880 --> 00:53:26.540
to model as a Gaussian
distribution, which

00:53:26.540 --> 00:53:29.600
is parameterized as a function
of not just bt minus 1,

00:53:29.600 --> 00:53:32.480
but also bt minus
2, and bt minus 3.

00:53:32.480 --> 00:53:34.940
And so we're going to be
taking a weighted average

00:53:34.940 --> 00:53:39.560
of the previous few time steps,
where we're smoothing out,

00:53:39.560 --> 00:53:45.220
in essence, the observation--
the previous few observations.

00:53:45.220 --> 00:53:47.970
If you were to--

00:53:47.970 --> 00:53:50.310
if you're being a
keen observer, you'll

00:53:50.310 --> 00:53:53.790
notice that this is no
longer a first-order Markov model.

00:54:04.870 --> 00:54:11.460
For example, if this p1
and p2 are equal to 2,

00:54:11.460 --> 00:54:14.790
this then corresponds to a
second order Markov model,

00:54:14.790 --> 00:54:18.600
because each random variable
depends on the previous two

00:54:18.600 --> 00:54:24.530
time steps of the Markov chain.

00:54:24.530 --> 00:54:31.790
And so after-- so you would
model now bt by this process,

00:54:31.790 --> 00:54:34.880
and you would
probably be averaging

00:54:34.880 --> 00:54:36.920
over a large number
of previous time steps

00:54:36.920 --> 00:54:39.020
to get this smooth property.

00:54:39.020 --> 00:54:45.620
And then you'd model xt minus bt
by this autoregressive process,

00:54:45.620 --> 00:54:47.780
where you might,
for example, just

00:54:47.780 --> 00:54:50.313
be looking at just the
previous couple of time steps.

00:54:50.313 --> 00:54:51.980
And you recognize
that you're just doing

00:54:51.980 --> 00:54:55.600
much more random fluctuations.
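
A minimal sketch of that two-part generative story-- the
coefficients and noise levels below are made up for
illustration, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 2000
    b = np.zeros(T)        # slowly drifting baseline, e.g. around 150 bpm
    resid = np.zeros(T)    # fast, zero-mean residual on top of it
    b[:3] = 150.0
    for t in range(3, T):
        # baseline: smoothed average of its last few values plus tiny noise
        b[t] = (b[t - 1] + b[t - 2] + b[t - 3]) / 3 + rng.normal(0.0, 0.05)
        # residual: short-memory AR(2) with larger noise
        resid[t] = 0.5 * resid[t - 1] + 0.3 * resid[t - 2] + rng.normal(0.0, 1.0)
    heart_rate = b + resid   # x_t = b_t + residual_t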

00:54:55.600 --> 00:54:59.480
And then-- so that's how one
would now model normal heart

00:54:59.480 --> 00:55:00.650
rate dynamics.

00:55:00.650 --> 00:55:02.900
And again, it's just--

00:55:02.900 --> 00:55:04.730
this is an example of
a statistical model.

00:55:04.730 --> 00:55:06.110
There is no
mechanistic knowledge

00:55:06.110 --> 00:55:08.540
of hearts being
used here, but we

00:55:08.540 --> 00:55:13.710
can fit the data of normal
hearts pretty well using this.

00:55:13.710 --> 00:55:15.960
But the next question and
the most interesting one

00:55:15.960 --> 00:55:20.510
is, how does one now
model artifactual events?

00:55:20.510 --> 00:55:26.120
So for that, that's where some
mechanistic knowledge comes in.

00:55:26.120 --> 00:55:30.180
So one models that
the probe dropouts

00:55:30.180 --> 00:55:35.120
are given by recognizing
that, if a probe

00:55:35.120 --> 00:55:39.020
is removed from the baby, then
there should no longer be--

00:55:39.020 --> 00:55:41.253
or at least, after
a small amount of time,

00:55:41.253 --> 00:55:42.920
there should no longer
be any dependence

00:55:42.920 --> 00:55:44.450
on the true value of the baby.

00:55:44.450 --> 00:55:48.080
For example, the blood pressure,
once the blood pressure probe

00:55:48.080 --> 00:55:50.870
is removed, is no longer
related to the baby's true blood

00:55:50.870 --> 00:55:52.910
pressure.

00:55:52.910 --> 00:55:57.130
But there might be some delay
to that lack of dependence.

00:55:57.130 --> 00:55:59.450
And so-- and that is going
to be encoded in some domain

00:55:59.450 --> 00:55:59.950
knowledge.

00:55:59.950 --> 00:56:01.840
So for example, in
the temperature probe,

00:56:01.840 --> 00:56:04.480
when you remove the temperature
probe from the baby,

00:56:04.480 --> 00:56:07.682
it starts heating up again--
or rather, it starts cooling,

00:56:07.682 --> 00:56:09.640
assuming that the ambient
temperature is cooler

00:56:09.640 --> 00:56:11.280
than the baby's temperature.

00:56:11.280 --> 00:56:12.790
So you take it off the baby.

00:56:12.790 --> 00:56:14.170
It starts cooling down.

00:56:14.170 --> 00:56:15.692
How fast does it cool down?

00:56:15.692 --> 00:56:17.400
Well, you could assume
that it cools down

00:56:17.400 --> 00:56:20.320
with some exponential decay
from the baby's temperature.

00:56:20.320 --> 00:56:22.750
And this is something
that is very reasonable,

00:56:22.750 --> 00:56:24.490
and you could
imagine, maybe if you

00:56:24.490 --> 00:56:26.530
had labeled data for just
a few of the babies,

00:56:26.530 --> 00:56:28.780
you could try to fit the
parameters of the exponential

00:56:28.780 --> 00:56:30.840
very quickly.

00:56:30.840 --> 00:56:33.160
And in this way, now, we
parameterize the conditional

00:56:33.160 --> 00:56:39.040
distribution of the temperature
probe, given both the state

00:56:39.040 --> 00:56:42.220
and whether the artifact
occurred or not,

00:56:42.220 --> 00:56:45.710
using this very simple
exponential decay.
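
As a concrete sketch of that kind of constraint-- the decay
rate and temperatures below are made-up values, not the
paper's estimates:

    import math

    def probe_reading(seconds_since_dropout, baby_temp=37.0,
                      ambient_temp=24.0, k=0.01):
        # Once the probe is off the baby, its reading decays exponentially
        # from the baby's temperature toward the ambient temperature.
        return ambient_temp + (baby_temp - ambient_temp) * math.exp(
            -k * seconds_since_dropout)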

00:56:45.710 --> 00:56:49.957
And in this paper,

00:56:49.957 --> 00:56:51.790
they make analogous types

00:56:51.790 --> 00:56:54.588
of assumptions for all of
the other artifactual probes.

00:56:54.588 --> 00:56:56.380
You should think about
this as constraining

00:56:56.380 --> 00:56:58.757
these conditional distributions
I showed you here.

00:56:58.757 --> 00:57:01.090
They're no longer allowed to
be arbitrary distributions,

00:57:01.090 --> 00:57:03.910
and so that, when one does
now expectation maximization

00:57:03.910 --> 00:57:06.573
to try to maximize the marginal
likelihood of the data,

00:57:06.573 --> 00:57:07.990
you've now constrained
it in a way

00:57:07.990 --> 00:57:10.073
that hopefully moves you
toward identifiability

00:57:10.073 --> 00:57:11.310
of the learning problem.

00:57:11.310 --> 00:57:13.330
It makes all of the
difference in learning here.

00:57:18.130 --> 00:57:21.730
So in this paper,
their evaluation

00:57:21.730 --> 00:57:23.830
did a little bit of fine
tuning for each baby.

00:57:23.830 --> 00:57:26.650
In particular, they assumed
that the first 30 minutes

00:57:26.650 --> 00:57:31.150
near the start consists
of normal dynamics

00:57:31.150 --> 00:57:33.190
so that there
are no artifacts.

00:57:33.190 --> 00:57:34.750
That's, of course,
a big assumption,

00:57:34.750 --> 00:57:39.100
but they use that to try to
fine tune the dynamic model

00:57:39.100 --> 00:57:43.540
for each baby individually.

00:57:43.540 --> 00:57:45.070
And then they looked
at the ability

00:57:45.070 --> 00:57:47.357
to try to identify
artifactual processes.

00:57:47.357 --> 00:57:49.690
Now, I want to go a little
bit slowly through this plot,

00:57:49.690 --> 00:57:52.350
because it's quite interesting.

00:57:52.350 --> 00:57:57.990
So what I'm showing
you here is a ROC curve

00:57:57.990 --> 00:58:00.292
of the ability to
predict each of the four

00:58:00.292 --> 00:58:01.500
different types of artifacts.

00:58:01.500 --> 00:58:03.810
For example, at any
one point in time,

00:58:03.810 --> 00:58:05.990
was there a blood sample
being taken or not?

00:58:05.990 --> 00:58:07.890
At any one point
in time, was there

00:58:07.890 --> 00:58:12.270
a core temperature disconnect
of the core temperature probe?

00:58:12.270 --> 00:58:13.770
And to evaluate it,
they're assuming

00:58:13.770 --> 00:58:18.850
that they have some labeled data
for evaluation purposes only.

00:58:18.850 --> 00:58:22.110
And of course, you want to be
at the very far top left corner

00:58:22.110 --> 00:58:23.866
up here.

00:58:23.866 --> 00:58:27.820
And what we're showing here
are three different curves--

00:58:27.820 --> 00:58:31.120
the very faint
dotted line, which

00:58:31.120 --> 00:58:34.780
I'm going to trace out with
my cursor, is the baseline.

00:58:34.780 --> 00:58:39.068
Think of that as a
much worse algorithm.

00:58:41.640 --> 00:58:42.140
Sorry.

00:58:42.140 --> 00:58:44.523
That's that line over there.

00:58:44.523 --> 00:58:45.190
Everyone see it?

00:58:49.030 --> 00:58:52.110
And this approach are
the other two lines.

00:58:52.110 --> 00:58:54.800
Now, what's differentiating
those other two lines

00:58:54.800 --> 00:58:57.940
corresponds to the particular
type of approximate inference

00:58:57.940 --> 00:59:00.120
algorithm that's used.

00:59:00.120 --> 00:59:05.640
To do this posterior
inference, to infer

00:59:05.640 --> 00:59:10.290
the true value of the x's
given your noisy observations

00:59:10.290 --> 00:59:14.160
in the model given here is
actually a very hard inference

00:59:14.160 --> 00:59:15.920
problem.

00:59:15.920 --> 00:59:18.330
Mathematically, I
think one can show

00:59:18.330 --> 00:59:21.692
that it's an NP-hard
computational problem.

00:59:21.692 --> 00:59:23.650
And so they have to
approximate it in some way,

00:59:23.650 --> 00:59:26.010
and they use two different
approximations here.

00:59:26.010 --> 00:59:28.400
The first approximation
is based on what they're

00:59:28.400 --> 00:59:31.110
calling a Gaussian
sum approximation,

00:59:31.110 --> 00:59:33.420
and it's a deterministic
approximation.

00:59:33.420 --> 00:59:37.240
The second approximation is
based on a Monte Carlo method.

00:59:37.240 --> 00:59:40.290
And what you see here is that
the Gaussian sum approximation

00:59:40.290 --> 00:59:41.970
is actually dramatically better.

00:59:41.970 --> 00:59:43.920
So for example, in
this blood sample one,

00:59:43.920 --> 00:59:48.750
that the ROC curve looks like
this for the Gaussian sum

00:59:48.750 --> 00:59:49.640
approximation.

00:59:49.640 --> 00:59:51.390
Whereas for the Monte
Carlo approximation,

00:59:51.390 --> 00:59:54.510
it's actually
significantly lower.

00:59:54.510 --> 00:59:56.400
And this is just to
point out that, even

00:59:56.400 --> 01:00:03.660
in this setting, where
we have very little data,

01:00:03.660 --> 01:00:06.780
we're using a lot of domain
knowledge, the actual details

01:00:06.780 --> 01:00:09.053
of how one does the
math-- in particular,

01:00:09.053 --> 01:00:10.470
the approximate
inference-- can make

01:00:10.470 --> 01:00:13.047
a really big difference in the
performance of this system.

01:00:13.047 --> 01:00:14.880
And so it's something
that one should really

01:00:14.880 --> 01:00:16.047
think deeply about, as well.

01:00:18.666 --> 01:00:21.700
I'm going to skip that
slide, and then just mention

01:00:21.700 --> 01:00:23.170
very briefly this one.

01:00:23.170 --> 01:00:28.640
This is showing an
inference of the events.

01:00:28.640 --> 01:00:34.600
So here I'm showing you
three different observations.

01:00:34.600 --> 01:00:39.130
And on the bottom here,
I'm showing the prediction

01:00:39.130 --> 01:00:43.950
of when artifact-- two different
artifactual events happened.

01:00:43.950 --> 01:00:46.020
And these predictions
were actually quite good,

01:00:46.020 --> 01:00:48.180
using this model.

01:00:48.180 --> 01:00:52.210
So I'm done with that
first example, and--

01:00:52.210 --> 01:00:55.380
and the-- just to recap
the important points

01:00:55.380 --> 01:01:01.300
of that example, it was that
we had almost no labeled data.

01:01:01.300 --> 01:01:05.470
We're tackling this problem
using a cleverly chosen

01:01:05.470 --> 01:01:08.780
statistical model with some
domain knowledge built in,

01:01:08.780 --> 01:01:12.040
and that can go really far.

01:01:12.040 --> 01:01:14.500
So now we'll shift gears to
talk about a different type

01:01:14.500 --> 01:01:18.340
of problem involving
physiological data,

01:01:18.340 --> 01:01:22.570
and that's of detecting
atrial fibrillation.

01:01:22.570 --> 01:01:26.280
So what I'm showing you
here is an AliveCore device.

01:01:26.280 --> 01:01:27.850
I own one of these.

01:01:27.850 --> 01:01:30.540
So if you want to drop
by my E25-545 office,

01:01:30.540 --> 01:01:32.860
you can-- you can
play around with it.

01:01:32.860 --> 01:01:35.930
And if you attach it
to your mobile phone,

01:01:35.930 --> 01:01:43.800
it'll show you the electrical
conduction through your heart

01:01:43.800 --> 01:01:46.710
as measured through
your two fingers

01:01:46.710 --> 01:01:48.670
touching this device
shown over here.

01:01:48.670 --> 01:01:51.270
And from that, one can try to
detect whether the patient has

01:01:51.270 --> 01:01:52.990
atrial fibrillation.

01:01:52.990 --> 01:01:54.941
So what is atrial fibrillation?

01:01:58.617 --> 01:01:59.200
Good question.

01:01:59.200 --> 01:02:00.284
It's [INAUDIBLE].

01:02:04.240 --> 01:02:10.270
So this is from the
American Heart Association.

01:02:10.270 --> 01:02:13.810
They defined atrial fibrillation
as a quivering or irregular

01:02:13.810 --> 01:02:16.450
heartbeat, also
known as arrhythmia.

01:02:16.450 --> 01:02:18.220
And one of the big
challenges is that it

01:02:18.220 --> 01:02:21.030
could lead to blood clot,
stroke, heart failure, and so

01:02:21.030 --> 01:02:21.530
on.

01:02:21.530 --> 01:02:23.980
So here is how a
patient might describe

01:02:23.980 --> 01:02:26.020
having atrial fibrillation.

01:02:26.020 --> 01:02:28.180
My heart flip-flops,
skips beats,

01:02:28.180 --> 01:02:31.150
feels like it's banging
against my chest wall,

01:02:31.150 --> 01:02:33.790
particularly when I'm
carrying stuff up my stairs

01:02:33.790 --> 01:02:35.542
or bending down.

01:02:35.542 --> 01:02:37.250
Now let's try to look
at a picture of it.

01:02:48.040 --> 01:02:55.330
So this is a normal heartbeat.

01:02:55.330 --> 01:02:59.860
Hearts move-- pumping like this.

01:02:59.860 --> 01:03:03.130
And if you were to
look at the signal

01:03:03.130 --> 01:03:04.810
output of the EKG of
a normal heartbeat,

01:03:04.810 --> 01:03:05.620
it would look like this.

01:03:05.620 --> 01:03:07.735
And it's roughly corresponding
to the different--

01:03:07.735 --> 01:03:09.840
the signal is corresponding
to different cycles

01:03:09.840 --> 01:03:12.420
of the heartbeat.

01:03:12.420 --> 01:03:15.000
Now for a patient who
has atrial fibrillation,

01:03:15.000 --> 01:03:16.290
it looks more like this.

01:03:21.650 --> 01:03:25.677
So much more obviously abnormal,
at least in this figure.

01:03:25.677 --> 01:03:27.510
And if you look at the
corresponding signal,

01:03:27.510 --> 01:03:29.382
it also looks very different.

01:03:29.382 --> 01:03:31.590
So this is just to give you
some intuition about what

01:03:31.590 --> 01:03:33.577
I mean by atrial fibrillation.

01:03:36.990 --> 01:03:39.930
So what we're going to try
to do now is to detect it.

01:03:39.930 --> 01:03:44.090
So we're going to
take data like that

01:03:44.090 --> 01:03:48.580
and try to classify it into a
number of different categories.

01:03:48.580 --> 01:03:52.630
Now this is something which
has been studied for decades,

01:03:52.630 --> 01:03:57.430
and last year, 2017,
there was a competition

01:03:57.430 --> 01:04:01.450
run by Professor Roger
Mark, who is here

01:04:01.450 --> 01:04:04.390
at MIT, which is trying
to see, well, how could--

01:04:04.390 --> 01:04:06.460
how good are we at
trying to figure out

01:04:06.460 --> 01:04:09.940
which patients have different
types of heart rhythms

01:04:09.940 --> 01:04:11.780
based on data that
looks like this?

01:04:11.780 --> 01:04:13.300
So this is a normal
rhythm, which

01:04:13.300 --> 01:04:16.700
is also called a sinus rhythm.

01:04:16.700 --> 01:04:18.750
And over here it's atrial--

01:04:18.750 --> 01:04:22.120
this is an example of one patient
who has atrial fibrillation.

01:04:22.120 --> 01:04:25.200
This is another type of rhythm
that's not atrial fibrillation,

01:04:25.200 --> 01:04:26.590
but is abnormal.

01:04:26.590 --> 01:04:29.670
And this is a noisy recording--
for example, if a patient's--

01:04:29.670 --> 01:04:32.220
doesn't really have their
two fingers very well put

01:04:32.220 --> 01:04:35.180
on to the two leads
of the device.

01:04:35.180 --> 01:04:41.040
So given one of these

01:04:41.040 --> 01:04:42.760
signals, can we predict

01:04:42.760 --> 01:04:45.355
which category it came from?

01:04:45.355 --> 01:04:47.230
So if you looked at
this, you might recognize

01:04:47.230 --> 01:04:48.970
that they look a bit different.

01:04:48.970 --> 01:04:53.380
So could some of
you guess what might

01:04:53.380 --> 01:04:55.780
be predictive features
that differentiate

01:04:55.780 --> 01:04:59.440
one of these signals
from the other?

01:04:59.440 --> 01:05:00.303
In the back?

01:05:00.303 --> 01:05:01.720
AUDIENCE: The
presence and absence

01:05:01.720 --> 01:05:07.065
of one of the peaks-- the QRS
complex-- [INAUDIBLE].

01:05:07.065 --> 01:05:08.440
DAVID SONTAG: So
speak in English

01:05:08.440 --> 01:05:10.722
for people who don't know
what these terms mean.

01:05:10.722 --> 01:05:12.680
AUDIENCE: There is one
large peak, which can--

01:05:12.680 --> 01:05:16.730
probably we can consider one
mV and there is another peak,

01:05:16.730 --> 01:05:18.520
which is sort of like--

01:05:18.520 --> 01:05:20.630
they have reverse polarity
between normal rhythm

01:05:20.630 --> 01:05:21.310
and [INAUDIBLE].

01:05:21.310 --> 01:05:22.102
DAVID SONTAG: Good.

01:05:22.102 --> 01:05:23.820
So are you a cardiologist?

01:05:23.820 --> 01:05:24.710
AUDIENCE: No.

01:05:24.710 --> 01:05:26.440
DAVID SONTAG: No, OK.

01:05:26.440 --> 01:05:29.050
So what the student
suggested is one

01:05:29.050 --> 01:05:31.660
could look for sort
of these inversions

01:05:31.660 --> 01:05:34.670
to try to describe it a
little bit differently.

01:05:34.670 --> 01:05:41.290
So here you're suggesting
the lack of those inversions

01:05:41.290 --> 01:05:45.430
is predictive of
an abnormal rhythm.

01:05:45.430 --> 01:05:47.655
What about another feature
that could be predictive?

01:05:47.655 --> 01:05:48.155
Yep?

01:05:48.155 --> 01:05:49.840
AUDIENCE: The spacing
between the peaks

01:05:49.840 --> 01:05:52.030
is more irregular with the AF.

01:05:52.030 --> 01:05:53.740
DAVID SONTAG: The
spacing between beats

01:05:53.740 --> 01:05:56.853
is more irregular
with the AF rhythm.

01:05:56.853 --> 01:05:58.270
So you're sort of
looking at this.

01:05:58.270 --> 01:06:00.160
You see how here
this spacing is very

01:06:00.160 --> 01:06:01.538
different from this spacing.

01:06:01.538 --> 01:06:03.580
Whereas in the normal
rhythm, sort of the spacing

01:06:03.580 --> 01:06:05.690
looks pretty darn regular.

01:06:05.690 --> 01:06:07.060
All right, good.

01:06:07.060 --> 01:06:11.050
So if I was to show you
40 examples of these

01:06:11.050 --> 01:06:12.940
and then ask you to
classify some new ones,

01:06:12.940 --> 01:06:15.280
how well do you think
you'll be able to do?

01:06:15.280 --> 01:06:15.780
Pretty well?

01:06:20.970 --> 01:06:23.550
I would be surprised if
you couldn't do reasonably

01:06:23.550 --> 01:06:26.250
well at least distinguishing
between normal rhythm and AF

01:06:26.250 --> 01:06:30.510
rhythm, because there seem to be
some pretty clear signals here.

01:06:30.510 --> 01:06:32.580
Of course, as you get
into alternatives,

01:06:32.580 --> 01:06:34.848
then the story gets
much more complex.

01:06:34.848 --> 01:06:36.390
But let me dig in
a little bit deeper

01:06:36.390 --> 01:06:37.980
into what I mean by this.

01:06:37.980 --> 01:06:39.600
So let's define
some of these terms.

01:06:39.600 --> 01:06:44.430
Well, cardiologists have studied
this for a really long time,

01:06:44.430 --> 01:06:46.530
and they have-- so
what I'm showing

01:06:46.530 --> 01:06:49.380
you here is one heart cycle.

01:06:49.380 --> 01:06:53.220
And they've-- you can put names
to each of the peaks that you

01:06:53.220 --> 01:06:55.860
would see in a regular heart
cycle-- so that-- for example,

01:06:55.860 --> 01:06:59.250
that very high peak is
known as the R peak.

01:06:59.250 --> 01:07:03.060
And you could look at, for
example, the interval--

01:07:03.060 --> 01:07:06.720
so this is one beat.

01:07:06.720 --> 01:07:10.320
You could look at the interval
between the R peak of one beat

01:07:10.320 --> 01:07:13.050
and the R peak of
another beat, and define

01:07:13.050 --> 01:07:15.440
that to be the RR interval.
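
As a small sketch-- with made-up R-peak times-- the RR
intervals and one simple irregularity feature could be
computed like this:

    import numpy as np

    r_peak_times = np.array([0.0, 0.8, 1.6, 2.1, 3.4, 4.0])  # seconds, illustrative
    rr_intervals = np.diff(r_peak_times)      # time between successive R peaks
    irregularity = rr_intervals.std() / rr_intervals.mean()  # one possible feature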

01:07:15.440 --> 01:07:18.050
In a similar way,
one could take--

01:07:18.050 --> 01:07:21.060
one could find different
distinctive elements

01:07:21.060 --> 01:07:22.140
of the signal--

01:07:22.140 --> 01:07:23.032
by the way, each--

01:07:25.680 --> 01:07:28.110
each time step
corresponds to the heart

01:07:28.110 --> 01:07:30.410
being in a different position.

01:07:30.410 --> 01:07:33.860
For a healthy heart, these
are relatively deterministic.

01:07:33.860 --> 01:07:36.330
And so you could look at
other distances and derive

01:07:36.330 --> 01:07:38.010
features from those
distances, as well,

01:07:38.010 --> 01:07:40.160
just like we were talking
about, both within a beat

01:07:40.160 --> 01:07:42.220
and across beats.

01:07:42.220 --> 01:07:42.895
Yep?

01:07:42.895 --> 01:07:44.312
AUDIENCE: So what's
the difference

01:07:44.312 --> 01:07:46.090
between a segment and
an interval again?

01:07:48.333 --> 01:07:50.250
DAVID SONTAG: I don't
know what the difference

01:07:50.250 --> 01:07:51.420
between a segment
and an interval is.

01:07:51.420 --> 01:07:52.070
Does anyone else know?

01:07:52.070 --> 01:07:54.070
I mean, I guess the
interval is between probably

01:07:54.070 --> 01:07:56.490
the heads of peaks, whereas
segments might refer to

01:07:56.490 --> 01:07:59.193
within an interval.

01:07:59.193 --> 01:07:59.860
That's my guess.

01:07:59.860 --> 01:08:00.902
Does someone know better?

01:08:04.190 --> 01:08:05.630
For the purpose
of today's class,

01:08:05.630 --> 01:08:07.366
that's a good enough
understanding.

01:08:10.940 --> 01:08:14.060
The point is this
is well understood.

01:08:14.060 --> 01:08:16.093
One could derive
features from this.

01:08:16.093 --> 01:08:16.776
AUDIENCE: By us.

01:08:16.776 --> 01:08:17.609
DAVID SONTAG: By us.

01:08:20.180 --> 01:08:23.399
So what would a traditional
approach be to this problem?

01:08:23.399 --> 01:08:24.020
So this is--

01:08:24.020 --> 01:08:27.050
I'm pulling this figure
from a paper from 2002.

01:08:27.050 --> 01:08:30.200
What it'll do is it'll
take in that signal.

01:08:30.200 --> 01:08:32.960
It'll do some filtering of it.

01:08:32.960 --> 01:08:35.750
Then it'll run a peak
detection logic, which

01:08:35.750 --> 01:08:38.840
will find these
peaks, and then it'll

01:08:38.840 --> 01:08:43.939
measure intervals between
these peaks and within a beat.

01:08:43.939 --> 01:08:48.069
And it'll take
those computations

01:08:48.069 --> 01:08:49.760
or make some
decision based on it.

01:08:49.760 --> 01:08:51.590
So that's a
traditional algorithm,

01:08:51.590 --> 01:08:54.310
and they work pretty reasonably.

01:08:54.310 --> 01:08:56.560
And so what do I mean
by signal processing?

01:08:56.560 --> 01:08:58.790
Well, this is an
example of that.

01:08:58.790 --> 01:09:01.880
I encourage any of you to go
home today and try to code up

01:09:01.880 --> 01:09:03.140
a peak finding algorithm.

01:09:03.140 --> 01:09:06.819
It's not that hard, at
least not to get an OK one.

01:09:06.819 --> 01:09:11.149
You might imagine
keeping a running tab

01:09:11.149 --> 01:09:13.811
of what's the highest
signal you've seen so far.

01:09:13.811 --> 01:09:16.019
Then you look to see the
first time it drops--

01:09:16.019 --> 01:09:18.394
specifically, you look for
when the signal drops below,

01:09:18.394 --> 01:09:22.064
let's say, half of the

01:09:22.064 --> 01:09:22.939
recent maximum--

01:09:22.939 --> 01:09:26.689
the recent maximum value

01:09:26.689 --> 01:09:28.790
divided by 2.

01:09:28.790 --> 01:09:31.279
And then you reset.

01:09:31.279 --> 01:09:33.800
And you can imagine in this
way very quickly coding up

01:09:33.800 --> 01:09:37.755
a peak finding algorithm.
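
Here is a quick-and-dirty sketch of that idea-- purely
illustrative, and nothing like a production QRS detector:

    def find_peaks(signal):
        """Track a running maximum; once the signal falls below half of it,
        call the location of that maximum a peak and reset."""
        peaks = []
        running_max = float("-inf")
        running_argmax = None
        for i, value in enumerate(signal):
            if value > running_max:
                running_max, running_argmax = value, i
            elif running_argmax is not None and value < running_max / 2:
                peaks.append(running_argmax)
                running_max, running_argmax = value, i
        return peaks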

01:09:37.755 --> 01:09:39.380
And so this is just,
again, to give you

01:09:39.380 --> 01:09:43.130
some intuition behind what a
traditional approach would be.

01:09:43.130 --> 01:09:46.790
And then you can very
quickly see that that--

01:09:46.790 --> 01:09:49.729
once you start to look at
some intervals between peaks,

01:09:49.729 --> 01:09:52.880
that alone is often good
enough for predicting

01:09:52.880 --> 01:09:55.050
whether a patient has
atrial fibrillation.

01:09:55.050 --> 01:09:58.940
So this is a figure
taken from paper in 2001

01:09:58.940 --> 01:10:01.310
showing a single
patient's time series.

01:10:01.310 --> 01:10:04.940
So the x-axis is for
that single patient,

01:10:04.940 --> 01:10:07.250
their heart beats across time.

01:10:07.250 --> 01:10:09.830
The y-axis is just
showing the RR interval

01:10:09.830 --> 01:10:14.300
between the previous beat
and the current beat.

01:10:14.300 --> 01:10:18.080
And down here in the
bottom is the ground truth

01:10:18.080 --> 01:10:20.990
of whether the patient
is assessed to have--

01:10:20.990 --> 01:10:27.650
to be in-- to have a normal
rhythm or atrial fibrillation,

01:10:27.650 --> 01:10:30.630
which is noted as this
higher value here.

01:10:30.630 --> 01:10:33.830
So these are AF rhythms.

01:10:33.830 --> 01:10:34.710
This is normal.

01:10:34.710 --> 01:10:36.800
This is AF again.

01:10:36.800 --> 01:10:40.670
And what you can see is that
the RR interval actually

01:10:40.670 --> 01:10:41.640
gets you pretty far.

01:10:41.640 --> 01:10:44.210
You notice how it's
pretty high up here.

01:10:44.210 --> 01:10:46.130
Suddenly it drops.

01:10:46.130 --> 01:10:47.930
The RR interval
drops for a while,

01:10:47.930 --> 01:10:50.450
and that's when
the patient has AF.

01:10:50.450 --> 01:10:51.860
Then it goes up again.

01:10:51.860 --> 01:10:54.780
Then it drops again, and so on.

01:10:54.780 --> 01:10:56.780
And so it's not deterministic,
the relationship,

01:10:56.780 --> 01:10:59.143
but there's definitely a lot
of signal just from that.

01:10:59.143 --> 01:11:00.560
So you might say,
OK, well, what's

01:11:00.560 --> 01:11:02.480
the next thing we could do
to try to clean up the signal

01:11:02.480 --> 01:11:03.230
a little bit more?

01:11:03.230 --> 01:11:11.210
So flash backwards from 2001 to
1970 here at MIT, studied by--

01:11:11.210 --> 01:11:13.760
actually, no, this is not MIT.

01:11:13.760 --> 01:11:16.070
This is somewhere else, sorry.

01:11:16.070 --> 01:11:21.398
But still 1970-- where they
used a Markov model very

01:11:21.398 --> 01:11:23.690
similar to the Markov models
we were just talking about

01:11:23.690 --> 01:11:30.410
in the previous example to model
what a sequence of normal RR

01:11:30.410 --> 01:11:34.310
intervals looks like versus
what a sequence of abnormal,

01:11:34.310 --> 01:11:37.370
for example, AF RR
intervals looks like.

01:11:37.370 --> 01:11:39.590
And in that way,
one can recognize

01:11:39.590 --> 01:11:42.980
that, for any one
observation of an RR interval

01:11:42.980 --> 01:11:45.540
might not by itself be
perfectly predictive,

01:11:45.540 --> 01:11:47.480
but if you look at sort
of a sequence of them

01:11:47.480 --> 01:11:50.480
for a patient with
atrial fibrillation,

01:11:50.480 --> 01:11:53.420
there is some common
pattern to it.

01:11:53.420 --> 01:11:56.090
And you can-- one can detect it
by just looking at likelihood

01:11:56.090 --> 01:11:59.450
of that sequence under each
of these two different models,

01:11:59.450 --> 01:12:01.230
normal and abnormal.

01:12:01.230 --> 01:12:04.070
And that did pretty well--
even better than the previous

01:12:04.070 --> 01:12:05.310
approaches for--

01:12:05.310 --> 01:12:08.370
for predicting
atrial fibrillation.

01:12:08.370 --> 01:12:11.790
This is the paper I
wanted to say from MIT.

01:12:11.790 --> 01:12:15.880
Now 1991, this is also
from Roger Mark's group.

01:12:15.880 --> 01:12:19.480
Now this is a neural network
based approach, where it says,

01:12:19.480 --> 01:12:22.108
OK, we're going to take
a bunch of these things.

01:12:22.108 --> 01:12:24.150
We're going to derive a
bunch of these intervals,

01:12:24.150 --> 01:12:25.890
and then we're going to throw
that through a black box

01:12:25.890 --> 01:12:27.432
supervised machine
learning algorithm

01:12:27.432 --> 01:12:30.240
to predict whether a
patient has AF or not.

01:12:30.240 --> 01:12:32.220
So these are very--

01:12:32.220 --> 01:12:34.890
first of all, there are
some simple approaches here

01:12:34.890 --> 01:12:36.540
that work reasonably well.

01:12:36.540 --> 01:12:42.280
Using neural networks in this
domain is not a new thing,

01:12:42.280 --> 01:12:44.140
but where are we as a field?

01:12:44.140 --> 01:12:46.920
So as I mentioned, there was
this competition last year,

01:12:46.920 --> 01:12:48.887
and what I'm showing
you here-- the citation

01:12:48.887 --> 01:12:50.470
is from one of the
winning approaches.

01:12:50.470 --> 01:12:52.845
And this winning approach
really brings the two paradigms

01:12:52.845 --> 01:12:53.910
together.

01:12:53.910 --> 01:12:57.600
It extracts a large number
of expert derived features--

01:12:57.600 --> 01:12:59.342
so shown here.

01:12:59.342 --> 01:13:01.050
And these are exactly
the types of things

01:13:01.050 --> 01:13:06.390
you might think, like
proportion, median RR

01:13:06.390 --> 01:13:11.417
interval of regular rhythms,
max RR irregularity measure.

01:13:11.417 --> 01:13:13.500
And there's just a whole
range of different things

01:13:13.500 --> 01:13:16.160
that you can imagine manually
deriving from the data.

01:13:16.160 --> 01:13:17.910
And you throw all
of these features

01:13:17.910 --> 01:13:21.840
into a machine
learning algorithm,

01:13:21.840 --> 01:13:25.040
maybe a random forest, maybe a
neural network, doesn't matter.

01:13:25.040 --> 01:13:27.180
And what you get out is a
slightly better algorithm

01:13:27.180 --> 01:13:28.555
than if you
had just come up

01:13:28.555 --> 01:13:30.510
with a simple rule on your own.

01:13:30.510 --> 01:13:33.470
That was the winning
algorithm then.

01:13:33.470 --> 01:13:36.970
And in the summary paper, they
conjectured that, well, maybe

01:13:36.970 --> 01:13:39.357
it's the case that they were--

01:13:39.357 --> 01:13:41.440
they'd expected that
convolutional neural networks

01:13:41.440 --> 01:13:42.443
would win.

01:13:42.443 --> 01:13:44.860
And they were surprised that
none of the winning solutions

01:13:44.860 --> 01:13:47.070
involved convolutional
neural networks.

01:13:47.070 --> 01:13:50.297
And they conjectured that may be
the reason why is because maybe

01:13:50.297 --> 01:13:52.630
with these 8,000 patients
that they had [INAUDIBLE] that

01:13:52.630 --> 01:13:56.590
just wasn't enough to give the
more complex models advantage.

01:13:56.590 --> 01:14:00.370
So flip forward now to
this year and the article

01:14:00.370 --> 01:14:05.840
that you read in your
readings in Nature Medicine,

01:14:05.840 --> 01:14:07.420
where the Stanford
group now showed

01:14:07.420 --> 01:14:10.540
how a convolutional neural
network approach, which

01:14:10.540 --> 01:14:13.960
is, in many ways, extremely
naive-- all it does

01:14:13.960 --> 01:14:17.870
is it takes the
sequence data in.

01:14:17.870 --> 01:14:20.710
It makes no attempt at trying
to understand the underlying

01:14:20.710 --> 01:14:23.800
physiology, and just
predicts from that--

01:14:23.800 --> 01:14:25.647
can do really, really well.

01:14:25.647 --> 01:14:27.230
And so there are
couple of differences

01:14:27.230 --> 01:14:29.590
that I want to emphasize
to the previous work.

01:14:29.590 --> 01:14:31.360
First, the sensor is different.

01:14:31.360 --> 01:14:35.580
Whereas the previous work
used this AliveCore sensor,

01:14:35.580 --> 01:14:37.420
in this paper from
Stanford, they're

01:14:37.420 --> 01:14:40.870
using a different sensor
called the Zio patch, which

01:14:40.870 --> 01:14:44.110
is attached to the human
body and conceivably

01:14:44.110 --> 01:14:45.580
much less noisy.

01:14:45.580 --> 01:14:47.560
So that's one big difference.

01:14:47.560 --> 01:14:49.810
The second big difference
is that there's dramatically

01:14:49.810 --> 01:14:50.770
more data.

01:14:50.770 --> 01:14:52.510
Instead of 8,000
patients to train from,

01:14:52.510 --> 01:14:54.790
now they have over
90,000 records

01:14:54.790 --> 01:14:58.060
from 50,000 different
patients to train from.

01:14:58.060 --> 01:14:59.740
The third major
difference is that now,

01:14:59.740 --> 01:15:02.740
rather than just trying to
classify into four categories--

01:15:02.740 --> 01:15:06.723
normal, abnormal,
other, or noisy--

01:15:06.723 --> 01:15:08.140
now we're going
to try to classify

01:15:08.140 --> 01:15:09.880
into 14 different categories.

01:15:09.880 --> 01:15:12.850
We're, in essence, breaking
apart that other class

01:15:12.850 --> 01:15:15.610
into much finer grain
detail of different types

01:15:15.610 --> 01:15:17.780
of abnormal rhythms.

01:15:17.780 --> 01:15:20.110
And so here are some of
those other abnormal rhythms,

01:15:20.110 --> 01:15:28.140
things like complete
heart block,

01:15:28.140 --> 01:15:31.650
and a bunch of other
names I can't pronounce.

01:15:31.650 --> 01:15:34.472
And from each one of these,
they gathered a lot of data.

01:15:34.472 --> 01:15:35.430
And that actually did--

01:15:35.430 --> 01:15:36.870
so it's not described
in the paper,

01:15:36.870 --> 01:15:38.160
but I've talked to
the authors, and they

01:15:38.160 --> 01:15:40.690
they gathered this data
in a very interesting way.

01:15:40.690 --> 01:15:42.720
So they did
their training iteratively.

01:15:42.720 --> 01:15:44.460
They looked to see
where their errors were,

01:15:44.460 --> 01:15:46.752
and then they went and gathered
more data from patients

01:15:46.752 --> 01:15:48.180
with that subcategory.

01:15:48.180 --> 01:15:51.930
So many of these
other categories

01:15:51.930 --> 01:15:54.267
might
be underrepresented

01:15:54.267 --> 01:15:56.100
in the general population,
but they actually

01:15:56.100 --> 01:15:57.810
gathered a lot of
patients of that type

01:15:57.810 --> 01:16:00.520
in their data set for
training purposes.

01:16:00.520 --> 01:16:02.700
And so I think those
three things ended up

01:16:02.700 --> 01:16:05.320
making a very big difference.

01:16:05.320 --> 01:16:07.050
So what is their
convolutional network?

01:16:07.050 --> 01:16:10.180
Well, first of all,
it's a 1D signal.

01:16:10.180 --> 01:16:12.180
So it's a little bit
different from the convnets

01:16:12.180 --> 01:16:13.380
you typically see
in computer vision,

01:16:13.380 --> 01:16:15.088
and I'll show you an
illustration of that

01:16:15.088 --> 01:16:16.080
in the next slide.

01:16:16.080 --> 01:16:17.430
It's a very deep model.

01:16:17.430 --> 01:16:20.100
So it's 34 layers.

01:16:20.100 --> 01:16:23.010
So the input comes in on the
very top in this picture.

01:16:23.010 --> 01:16:26.730
It's passed through
a number of layers.

01:16:26.730 --> 01:16:30.210
Each layer consists of
convolution followed

01:16:30.210 --> 01:16:33.600
by rectified linear
units, and there is

01:16:33.600 --> 01:16:35.790
subsampling at every
other layer so that you

01:16:35.790 --> 01:16:38.010
go from a very wide signal--

01:16:38.010 --> 01:16:39.645
so a very long--

01:16:39.645 --> 01:16:40.770
I can't remember how long--

01:16:40.770 --> 01:16:43.830
1 second long signal
summarized down

01:16:43.830 --> 01:16:47.165
into a much
smaller number of dimensions,

01:16:47.165 --> 01:16:49.290
which then goes into a
fully connected layer

01:16:49.290 --> 01:16:52.770
at the bottom to make
your predictions.

01:16:52.770 --> 01:16:55.590
And then they also have
these shortcut connections,

01:16:55.590 --> 01:16:58.770
which allow you to pass
information from earlier layers

01:16:58.770 --> 01:17:00.630
down to the very
end of the network,

01:17:00.630 --> 01:17:02.255
or even into
intermediate layers.

01:17:02.255 --> 01:17:04.380
And for those of you who
are familiar with residual

01:17:04.380 --> 01:17:06.850
networks, it's the same idea.

01:17:06.850 --> 01:17:08.340
So what is a 1D convolution?

01:17:08.340 --> 01:17:10.270
Well, it looks a
little bit like this.

01:17:10.270 --> 01:17:12.960
So this is the signal.

01:17:12.960 --> 01:17:15.570
I'm going to just approximate
it by a bunch of 1's and 0's.

01:17:15.570 --> 01:17:16.560
I'll say this is a 1.

01:17:16.560 --> 01:17:17.360
This is a 0.

01:17:17.360 --> 01:17:18.480
This is a 1, 1, so on.

01:17:21.620 --> 01:17:25.280
A convolutional network has
a filter associated with it.

01:17:25.280 --> 01:17:28.070
That filter is then
applied in a 1D manner.

01:17:28.070 --> 01:17:29.630
It's applied in
a linear fashion.

01:17:29.630 --> 01:17:32.240
You just take a dot product
of the filter's values

01:17:32.240 --> 01:17:35.150
with the values of the
signal at each point in time.

01:17:35.150 --> 01:17:38.130
So it looks a little
bit like this,

01:17:38.130 --> 01:17:39.450
and this is what you get out.

01:17:39.450 --> 01:17:42.330
So this is the convolution
of a single filter

01:17:42.330 --> 01:17:44.760
with the whole signal.

01:17:44.760 --> 01:17:47.140
And the computation I did
there-- so for example,

01:17:47.140 --> 01:17:49.860
this first number came
from taking the dot product

01:17:49.860 --> 01:17:51.360
of the first three numbers--

01:17:51.360 --> 01:17:53.370
1, 0, 1-- with the filter.

01:17:53.370 --> 01:18:01.548
So it's 1 times 2 plus 0 times
3 plus 1 times 1, which is 3.

01:18:01.548 --> 01:18:03.090
And so each of the
subsequent numbers

01:18:03.090 --> 01:18:04.900
was computed in the same way.

01:18:04.900 --> 01:18:09.060
And I usually have you figure
out what this last one is,

01:18:09.060 --> 01:18:12.440
but I'll leave that
for you to do at home.

01:18:12.440 --> 01:18:14.097
And that's what a
1D convolution is.

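NOTE
A minimal sketch of that 1D convolution computation. From the worked example, the
filter is [2, 3, 1] and the signal starts 1, 0, 1, 1; the remaining signal values
below are made up. Conv nets compute a sliding dot product (cross-correlation),
which is what np.correlate does in "valid" mode.

import numpy as np

signal = np.array([1, 0, 1, 1, 0, 0, 1, 1])    # the 1's and 0's from the board (tail made up)
filt = np.array([2, 3, 1])                     # the length-3 filter

out = np.correlate(signal, filt, mode="valid") # sliding dot product: no padding, no flipping
print(out)                                     # first entry: 1*2 + 0*3 + 1*1 = 3

# The same thing written out explicitly:
manual = [int(signal[i:i + 3] @ filt) for i in range(len(signal) - 2)]
print(manual)
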
01:18:14.097 --> 01:18:16.680
And so they do this
for lots of different filters.

01:18:16.680 --> 01:18:19.155
Each of those filters might
be of varying lengths,

01:18:19.155 --> 01:18:21.030
and each of those will
detect different types

01:18:21.030 --> 01:18:23.040
of signal patterns.

01:18:23.040 --> 01:18:25.800
And in this way, after
having many layers of these,

01:18:25.800 --> 01:18:28.320
one can, in an
automatic fashion,

01:18:28.320 --> 01:18:31.080
extract many of the same types
of signals used in that earlier

01:18:31.080 --> 01:18:32.997
work, but also be much
more flexible to detect

01:18:32.997 --> 01:18:34.420
some new ones, as well.

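NOTE
A small, hedged sketch of what such an architecture looks like in code (PyTorch-style
Python). The channel counts, filter width, and input length are made up, and the real
34-layer network has details not shown here; the sketch only illustrates the pieces
just described: 1D convolutions, rectified linear units, subsampling via striding,
shortcut (residual) connections, and a final fully connected layer.

import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """Conv -> ReLU -> Conv on a 1D signal, plus a shortcut; stride 2 subsamples."""
    def __init__(self, channels, kernel_size=15, stride=2):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, stride=stride, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, stride=1, padding=pad)
        self.relu = nn.ReLU()
        # Shortcut: a strided 1x1 convolution so its output length matches the main path.
        self.shortcut = nn.Conv1d(channels, channels, kernel_size=1, stride=stride)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + self.shortcut(x))

model = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=15, padding=7),   # lift the raw single-lead signal to 32 channels
    ResidualBlock1D(32),
    ResidualBlock1D(32),                           # the real network stacks many more such blocks
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(32, 14),                             # one score per rhythm class
)

x = torch.randn(4, 1, 2000)   # a batch of 4 ECG segments, 2,000 samples each (made-up length)
print(model(x).shape)         # torch.Size([4, 14])
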
01:18:34.420 --> 01:18:37.120
Hold your question,
because I need to wrap up.

01:18:37.120 --> 01:18:38.710
So in the paper
that you read, they

01:18:38.710 --> 01:18:41.902
talked about how
they evaluated this.

01:18:41.902 --> 01:18:44.110
And so I'm not going to go
into much depth in it now.

01:18:44.110 --> 01:18:46.330
I just want to point out
two different metrics

01:18:46.330 --> 01:18:47.320
that they used.

01:18:47.320 --> 01:18:48.910
So the first metric
they used was

01:18:48.910 --> 01:18:52.690
what they called a
sequential error metric.

01:18:52.690 --> 01:18:55.990
What that looked at is you
had this very long sequence

01:18:55.990 --> 01:19:00.670
for each patient, and
they labeled different one

01:19:00.670 --> 01:19:02.350
second intervals
of that sequence

01:19:02.350 --> 01:19:05.690
into abnormal,
normal, and so on.

01:19:05.690 --> 01:19:07.113
So you could ask,
how good are we

01:19:07.113 --> 01:19:08.780
at labeling each of
the different points

01:19:08.780 --> 01:19:09.600
along the sequence?

01:19:09.600 --> 01:19:11.720
And that's the sequence metric.

01:19:11.720 --> 01:19:14.510
The second
metric is the set metric,

01:19:14.510 --> 01:19:16.520
and that looks at,
if the patient has

01:19:16.520 --> 01:19:19.730
something that's abnormal
anywhere, did you detect it?

01:19:19.730 --> 01:19:22.040
So that's, in essence,
taking an OR of

01:19:22.040 --> 01:19:23.510
each of those 1
second intervals,

01:19:23.510 --> 01:19:25.310
and then looking
across patients.

01:19:25.310 --> 01:19:27.410
And from a clinical
diagnostic perspective,

01:19:27.410 --> 01:19:29.510
the set metric might be
most useful, but then

01:19:29.510 --> 01:19:31.340
when you want to
introspect and understand

01:19:31.340 --> 01:19:34.370
where that is happening,
then the sequential metric is

01:19:34.370 --> 01:19:35.600
important.

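NOTE
A minimal sketch (not the paper's evaluation code) of the difference between the two
metrics. The records, labels, and predictions below are made up: the sequence metric
scores every labeled interval, while the set metric takes the OR over intervals and
asks whether each rhythm present anywhere in a record was detected.

# Per-interval ground truth and predictions for two made-up records.
truth = {
    "record_1": ["normal", "normal", "afib", "normal"],
    "record_2": ["normal", "normal", "normal", "normal"],
}
preds = {
    "record_1": ["normal", "afib", "afib", "normal"],
    "record_2": ["normal", "normal", "afib", "normal"],
}

# Sequence metric: compare interval by interval.
pairs = [(t, p) for r in truth for t, p in zip(truth[r], preds[r])]
print("sequence-level accuracy:", sum(t == p for t, p in pairs) / len(pairs))  # 6 of 8 correct

# Set metric: collapse each record to the set of rhythms seen anywhere in it.
tp = fp = fn = 0
for r in truth:
    true_set, pred_set = set(truth[r]), set(preds[r])
    tp += len(true_set & pred_set)
    fp += len(pred_set - true_set)
    fn += len(true_set - pred_set)
precision, recall = tp / (tp + fp), tp / (tp + fn)
print("set-level F1:", 2 * precision * recall / (precision + recall))
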
01:19:35.600 --> 01:19:38.300
And the key take home message
from the paper is that,

01:19:38.300 --> 01:19:41.240
if you compared the model's
predictions-- this is, I think,

01:19:41.240 --> 01:19:44.990
using an F1 metric--

01:19:44.990 --> 01:19:49.790
to what you would get from
a panel of cardiologists,

01:19:49.790 --> 01:19:53.510
these models are doing as well
as, if not better than, these panels

01:19:53.510 --> 01:19:54.500
of cardiologists.

01:19:54.500 --> 01:19:56.930
So this is extremely exciting.

01:19:56.930 --> 01:19:58.700
This is technology--
or variants of this

01:19:58.700 --> 01:20:02.240
is technology that you're
going to see deployed now.

01:20:02.240 --> 01:20:04.760
So for those of you who have
purchased these Apple watches,

01:20:04.760 --> 01:20:07.220
these Samsung watches, I don't
know exactly what they're

01:20:07.220 --> 01:20:08.637
using, but I
wouldn't be surprised

01:20:08.637 --> 01:20:10.580
if they're using
techniques similar to this.

01:20:10.580 --> 01:20:12.390
And you're going to see much
more of that in the future.

01:20:12.390 --> 01:20:14.030
So this is going to be
really the first example

01:20:14.030 --> 01:20:15.447
in this course so
far of something

01:20:15.447 --> 01:20:18.280
that's really been deployed.

01:20:18.280 --> 01:20:20.660
And so in summary,
we're very often

01:20:20.660 --> 01:20:22.450
in the realm of not enough data.

01:20:22.450 --> 01:20:24.860
And in this lecture today,
we gave two examples

01:20:24.860 --> 01:20:26.030
of how you can deal with that.

01:20:26.030 --> 01:20:31.340
First, you can try to use
mechanistic and statistical

01:20:31.340 --> 01:20:38.150
models to work
in settings where

01:20:38.150 --> 01:20:39.590
you don't have much data.

01:20:39.590 --> 01:20:42.333
And at the other extreme,
you do have a lot of data,

01:20:42.333 --> 01:20:44.000
and you can try to
ignore the underlying physiology, and just

01:20:44.000 --> 01:20:45.292
use these black box approaches.

01:20:45.292 --> 01:20:46.930
That's all for today.