WEBVTT

00:00:14.710 --> 00:00:16.460
PROFESSOR: So I'm going
to begin by trying

00:00:16.460 --> 00:00:20.540
to build some intuition for how
one might be able to do staging

00:00:20.540 --> 00:00:24.400
from cross-sectional
data, and we'll

00:00:24.400 --> 00:00:27.910
return to this question of
combined staging subtyping only

00:00:27.910 --> 00:00:28.480
much later.

00:00:31.880 --> 00:00:34.760
So imagine that we had data
that lived in one dimension.

00:00:34.760 --> 00:00:37.840
Here, each data point
is an individual,

00:00:37.840 --> 00:00:40.610
we observe their data at
just one point in time,

00:00:40.610 --> 00:00:44.250
and suppose we knew exactly
which biomarker to look at.

00:00:44.250 --> 00:00:44.750
Right?

00:00:44.750 --> 00:00:46.333
So I gave you an
example of that here,

00:00:46.333 --> 00:00:50.280
when you might look at some
antibody expression level,

00:00:50.280 --> 00:00:54.878
and that might be what
I call biomarker A.

00:00:54.878 --> 00:00:56.920
And if you knew exactly
which biomarker to look at,

00:00:56.920 --> 00:01:00.730
you might just put each
person along this line

00:01:00.730 --> 00:01:04.629
and you might conjecture that
maybe on one side of the line,

00:01:04.629 --> 00:01:07.120
this is the early disease, and
on the other side of the line,

00:01:07.120 --> 00:01:09.480
maybe that's the late disease.

00:01:09.480 --> 00:01:11.371
Why might that be a
reasonable conjecture?

00:01:14.020 --> 00:01:16.088
What would be an
alternative conjecture?

00:01:24.500 --> 00:01:26.000
Why don't you talk
to your neighbors

00:01:26.000 --> 00:01:27.500
and see if you guys
can come up with

00:01:27.500 --> 00:01:29.840
some alternative conjectures.

00:01:29.840 --> 00:01:30.800
Let's go.

00:01:37.050 --> 00:01:38.150
All right, that's enough.

00:01:38.150 --> 00:01:39.830
So hopefully simple
questions, so I

00:01:39.830 --> 00:01:41.250
won't give you too much time.

00:01:41.250 --> 00:01:43.280
All right, so what would
be another conjecture?

00:01:43.280 --> 00:01:47.197
So again, our goal is we have
one observation per individual,

00:01:47.197 --> 00:01:49.530
each individual is in some
unknown stage of the disease,

00:01:49.530 --> 00:01:51.863
we would like to be able to
sort individuals and turn it

00:01:51.863 --> 00:01:55.872
into early and late
stages of the disease.

00:01:55.872 --> 00:01:58.080
I give you one conjecture
of how to do that, sorting,

00:01:58.080 --> 00:02:00.203
what would be another
reasonable conjecture?

00:02:00.203 --> 00:02:00.870
Raise your hand.

00:02:03.420 --> 00:02:04.030
Yep?

00:02:04.030 --> 00:02:04.790
AUDIENCE: That there's
the different--

00:02:04.790 --> 00:02:07.270
that they have different
types of the same diseases.

00:02:07.270 --> 00:02:09.919
They all have the same
disease and it could--

00:02:09.919 --> 00:02:13.103
just one of the subtypes
might be sort of the--

00:02:13.103 --> 00:02:13.770
PROFESSOR: Yeah.

00:02:13.770 --> 00:02:18.230
So you're going back to the
example I gave here where you

00:02:18.230 --> 00:02:20.300
could conflate these things.

00:02:20.300 --> 00:02:22.240
I want to stick with
a simpler story,

00:02:22.240 --> 00:02:25.903
let's suppose there's only
one subtype of the disease.

00:02:25.903 --> 00:02:28.070
What would be another way
to sort the patients given

00:02:28.070 --> 00:02:32.740
this data where the data is
these points that you see here?

00:02:32.740 --> 00:02:34.261
Yeah?

00:02:34.261 --> 00:02:37.140
AUDIENCE: For any disease in
the middle range, and then

00:02:37.140 --> 00:02:40.418
as you [INAUDIBLE]

00:02:40.418 --> 00:02:41.960
PROFESSOR: OK, so
maybe early disease

00:02:41.960 --> 00:02:49.010
is right over here, and when
things get bad, the patient--

00:02:49.010 --> 00:02:51.170
this biomarker starts
to become abnormal,

00:02:51.170 --> 00:02:53.060
and abnormality,
for whatever reason,

00:02:53.060 --> 00:02:55.100
might be sort of to the
right or to the left.

00:02:59.190 --> 00:03:02.990
Now I think that is a
conjecture one could have.

00:03:02.990 --> 00:03:04.730
I would argue that
that's perhaps

00:03:04.730 --> 00:03:06.230
not a very natural
conjecture given

00:03:06.230 --> 00:03:08.960
what we know about common
biomarkers that are measured

00:03:08.960 --> 00:03:11.270
from the human body
and the way that they

00:03:11.270 --> 00:03:13.965
respond to disease progression.

00:03:13.965 --> 00:03:16.340
Unless you're in the situation
of having multiple disease

00:03:16.340 --> 00:03:18.675
subtypes where, for example,
going to the right

00:03:18.675 --> 00:03:20.300
might correspond to
one disease subtype

00:03:20.300 --> 00:03:22.175
and going to the left
might correspond

00:03:22.175 --> 00:03:24.640
to another disease subtype.

00:03:24.640 --> 00:03:26.273
What would be
another conjecture?

00:03:31.696 --> 00:03:33.390
You guys are missing
the easy one.

00:03:33.390 --> 00:03:34.140
Yeah, in the back.

00:03:34.140 --> 00:03:35.420
AUDIENCE: Well, it
might just be one

00:03:35.420 --> 00:03:37.350
where the high values
are [INAUDIBLE] stage

00:03:37.350 --> 00:03:38.780
and low values are later ones?

00:03:38.780 --> 00:03:39.710
PROFESSOR: Exactly.

00:03:39.710 --> 00:03:44.350
So this might be early disease
and that might be late disease.

00:03:44.350 --> 00:03:46.580
AUDIENCE: It says vice
versa on this slide.

00:03:46.580 --> 00:03:48.030
PROFESSOR: Oh, does it really?

00:03:48.030 --> 00:03:49.044
Oh shoot.

00:03:49.044 --> 00:03:50.200
[LAUGHTER]

00:03:50.200 --> 00:03:54.840
AUDIENCE: [INAUDIBLE]

00:03:54.840 --> 00:03:56.300
PROFESSOR: Right, right, OK, OK.

00:03:56.300 --> 00:03:56.630
Thank you.

00:03:56.630 --> 00:03:58.547
Next time I'll take out
that lower vice versa.

00:03:58.547 --> 00:03:59.085
[LAUGHTER]

00:03:59.085 --> 00:04:00.710
That's why you guys
aren't saying that.

00:04:00.710 --> 00:04:02.420
OK.

00:04:02.420 --> 00:04:04.670
OK, so this is good.

00:04:04.670 --> 00:04:06.590
Now I think we're
all on the same page,

00:04:06.590 --> 00:04:09.170
and we had some idea of what
are some of the assumptions

00:04:09.170 --> 00:04:10.670
that one might need
to make in order

00:04:10.670 --> 00:04:11.880
to actually do anything here.

00:04:11.880 --> 00:04:16.019
Like for example,
we are making some--

00:04:16.019 --> 00:04:20.089
we'll probably have to make some
assumption about continuity,

00:04:20.089 --> 00:04:23.300
that there might be some gradual
progression of the biomarker

00:04:23.300 --> 00:04:26.660
relevance from early to late,
and it might be getting larger,

00:04:26.660 --> 00:04:29.000
it might be getting smaller.

00:04:29.000 --> 00:04:32.390
If it's indeed the scenario that
we talked about earlier where

00:04:32.390 --> 00:04:34.460
we said like early
disease might be here

00:04:34.460 --> 00:04:38.450
and late disease might be going
to either side, in that case,

00:04:38.450 --> 00:04:41.840
I think one could easily argue
that with just information

00:04:41.840 --> 00:04:45.110
we have here, disease
progression-- disease stage

00:04:45.110 --> 00:04:46.770
is unidentifiable, right?

00:04:46.770 --> 00:04:50.240
Because you wouldn't
know where would it--

00:04:50.240 --> 00:04:53.840
where should you-- where should
that transition point be?

00:04:53.840 --> 00:04:55.787
So here, here, here,
here, here, here.

00:04:55.787 --> 00:04:57.370
In fact, the same
problem arises here.

00:04:57.370 --> 00:04:58.960
Like you don't know,
is it early disease--

00:04:58.960 --> 00:05:00.620
is it going this way or
is it going that way?

00:05:00.620 --> 00:05:02.990
What would be one way to
try to disentangle this just

00:05:02.990 --> 00:05:05.540
to try to get us all on
the same page, right?

00:05:05.540 --> 00:05:08.828
So suppose it was only
going this direction

00:05:08.828 --> 00:05:10.370
or going that
direction, how could we

00:05:10.370 --> 00:05:11.453
figure out which is which?

00:05:20.470 --> 00:05:21.370
Yeah?

00:05:21.370 --> 00:05:28.260
AUDIENCE: Maybe we had data
on low key and other data

00:05:28.260 --> 00:05:32.023
about how much time we had taken

00:05:32.023 --> 00:05:32.690
PROFESSOR: Yeah.

00:05:32.690 --> 00:05:33.398
No, that's great.

00:05:33.398 --> 00:05:38.230
So maybe we have data on
let's say death information,

00:05:38.230 --> 00:05:40.480
or even just age.

00:05:40.480 --> 00:05:44.140
And if we started from a
very, very rough assumption

00:05:44.140 --> 00:05:53.050
that disease stage let's say
grows monotonically with age,

00:05:53.050 --> 00:05:55.060
then--

00:05:55.060 --> 00:05:57.310
and if you had made an
additional assumption

00:05:57.310 --> 00:06:01.215
that the disease stages are--

00:06:01.215 --> 00:06:02.590
that the people
who are coming in

00:06:02.590 --> 00:06:07.120
are uniformly drawn from across
disease stages, with those two

00:06:07.120 --> 00:06:09.460
assumptions alone, then you
could, for example, look

00:06:09.460 --> 00:06:12.970
at the average age of
individuals over here

00:06:12.970 --> 00:06:15.580
and the average age of
individuals over here,

00:06:15.580 --> 00:06:17.980
and you'd say, the one
with the larger average age

00:06:17.980 --> 00:06:21.810
is the late disease one.

00:06:21.810 --> 00:06:26.090
Or you could look at time to
death if you had for each--

00:06:26.090 --> 00:06:27.900
for each data
point you also knew

00:06:27.900 --> 00:06:30.210
how long until that
individual died,

00:06:30.210 --> 00:06:31.920
you could look at
average time to death

00:06:31.920 --> 00:06:33.330
for these individuals
versus those individuals

00:06:33.330 --> 00:06:34.913
and try to tease it
apart in that way.
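
A minimal sketch of that idea in code, assuming one biomarker value and one age per person (all numbers here are made up), using exactly the two assumptions above: stage grows monotonically with age, and patients are sampled roughly uniformly across stages.

```python
import numpy as np

# Hypothetical cross-sectional data: one biomarker value and one age per person.
biomarker = np.array([0.2, 0.9, 1.4, 2.1, 2.8, 3.5])
age = np.array([52, 58, 55, 63, 67, 71])

# Split the line at the median biomarker value and compare average ages.
left = biomarker <= np.median(biomarker)
mean_age_left, mean_age_right = age[left].mean(), age[~left].mean()

# Under the monotonicity and uniform-sampling assumptions, the side with the
# larger average age is conjectured to be the late-disease side.
late_side = "right" if mean_age_right > mean_age_left else "left"
print(late_side)
```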

00:06:37.270 --> 00:06:39.436
That's what you meant.

00:06:39.436 --> 00:06:41.930
OK, so I'm just trying to give
you some intuition for how

00:06:41.930 --> 00:06:43.550
this might be possible.

00:06:43.550 --> 00:06:45.635
What about if your
data looked like this?

00:06:51.700 --> 00:06:53.530
So now you have two biomarkers.

00:06:53.530 --> 00:06:58.040
So we've only gone up
by one dimension,

00:06:58.040 --> 00:07:00.500
and we want to figure out
where's early, where's late?

00:07:03.118 --> 00:07:05.160
Already starts to become
more challenging, right?

00:07:07.780 --> 00:07:10.702
So the intuition that
I want you to have

00:07:10.702 --> 00:07:12.160
is that we're going
to have to make

00:07:12.160 --> 00:07:16.090
some assumptions about
disease progression,

00:07:16.090 --> 00:07:17.970
such as the ones we
were just discussing,

00:07:17.970 --> 00:07:19.750
and we also have to
get lucky in some way.

00:07:19.750 --> 00:07:22.750
So for example, one
way of getting lucky

00:07:22.750 --> 00:07:25.490
would be to have a
real lot of data.

00:07:25.490 --> 00:07:27.310
So if you had a
ton, ton of data,

00:07:27.310 --> 00:07:30.820
and you made an additional
assumption that your data lives

00:07:30.820 --> 00:07:33.832
in some low dimensional manifold
where on one side of the manifold

00:07:33.832 --> 00:07:35.790
is early disease and the
other side of the manifold

00:07:35.790 --> 00:07:38.170
is late disease,
then you might be

00:07:38.170 --> 00:07:43.858
able to discover that
manifold from this data,

00:07:43.858 --> 00:07:46.150
and you might conjecture that
the manifold is something

00:07:46.150 --> 00:07:49.120
like that, that trajectory
that I'm outlining there

00:07:49.120 --> 00:07:50.800
with my hand.

00:07:50.800 --> 00:07:52.870
But for you to be able
to do that, of course,

00:07:52.870 --> 00:07:55.240
you need to have
enough data, all right?

00:07:55.240 --> 00:07:58.570
So it's going to be now
a trade-off: just

00:07:58.570 --> 00:08:01.360
having cross-sectional
data might be OK so long

00:08:01.360 --> 00:08:03.010
as you have a real
lot of that data

00:08:03.010 --> 00:08:04.630
so you can sort of
fill in the spaces

00:08:04.630 --> 00:08:08.300
and really identify
that manifold.
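
To make the manifold idea concrete, here is a small sketch on synthetic data; Isomap is just one manifold-learning method one might try, and neither this particular choice nor the synthetic trajectory comes from the case study discussed later.

```python
import numpy as np
from sklearn.manifold import Isomap

# Synthetic dense cross-sectional data: many patients, two biomarkers,
# generated along a curved 1-D trajectory purely for illustration.
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=2000)           # unobserved "progression" parameter
X = np.column_stack([t, np.sin(3 * t)])    # curved manifold in biomarker space
X += 0.02 * rng.standard_normal(X.shape)   # measurement noise

# Recover a 1-D coordinate along the manifold.  This orders the patients,
# but it does not say which end is early and which is late -- that still
# needs extra information such as age or time to death.
embedding = Isomap(n_neighbors=10, n_components=1).fit_transform(X)
order = np.argsort(embedding[:, 0])
```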

00:08:08.300 --> 00:08:11.210
A different approach
might be, well maybe you

00:08:11.210 --> 00:08:13.460
don't have just pure
cross-sectional data,

00:08:13.460 --> 00:08:17.000
maybe you have two or
maybe three samples

00:08:17.000 --> 00:08:18.290
from each patient.

00:08:18.290 --> 00:08:19.915
And then you can
color code this.

00:08:19.915 --> 00:08:24.875
So you might say, OK,
green is patient 1--

00:08:24.875 --> 00:08:28.220
or patient A, we'll call it,
and this is the first time

00:08:28.220 --> 00:08:31.040
point from patient A, second
time point from patient A,

00:08:31.040 --> 00:08:34.009
third and fourth time
points from patient A. Red

00:08:34.009 --> 00:08:37.039
is patient B, and you have
two time points for patient B,

00:08:37.039 --> 00:08:38.539
and blue here is
patient C, and you

00:08:38.539 --> 00:08:44.120
have 1, 2, 3 time points
from patient C. OK?

00:08:44.120 --> 00:08:46.580
Now again, it's not
very dense data,

00:08:46.580 --> 00:08:49.337
we can't really draw
curves out, but now

00:08:49.337 --> 00:08:51.170
we can start to get a
sense of the ordering.

00:08:51.170 --> 00:08:55.280
And again, now we can-- even
though we don't-- we're not

00:08:55.280 --> 00:08:57.240
in a dense setting like
we were here, here,

00:08:57.240 --> 00:08:59.657
we'd still nonetheless be able
to figure out that probably

00:08:59.657 --> 00:09:02.558
the manifold looks a little
bit like this, right?

00:09:05.480 --> 00:09:09.250
And so again, I'm just
trying to build intuition

00:09:09.250 --> 00:09:11.260
around when disease
progression modeling

00:09:11.260 --> 00:09:14.030
for cross-sectional
data might be possible,

00:09:14.030 --> 00:09:17.080
but this is a wide open field.

00:09:17.080 --> 00:09:18.880
And so today, I'll
be telling you

00:09:18.880 --> 00:09:20.585
about a few algorithms
that try to build

00:09:20.585 --> 00:09:22.960
on some of these intuitions
for doing disease progression

00:09:22.960 --> 00:09:27.090
modeling, but they will
break down very, very easily.

00:09:27.090 --> 00:09:29.730
They'll break down
when these assumptions

00:09:29.730 --> 00:09:32.550
I gave you don't hold,
they'll break down

00:09:32.550 --> 00:09:34.530
when your data is
high dimensional,

00:09:34.530 --> 00:09:37.073
they'll break down
when your data looks

00:09:37.073 --> 00:09:39.240
like this where you don't
just have a single subtype

00:09:39.240 --> 00:09:41.440
but perhaps multiple subtypes.

00:09:41.440 --> 00:09:44.530
And so this is a really very
active area of research,

00:09:44.530 --> 00:09:47.430
and it's an area that I think
we can make a lot of progress

00:09:47.430 --> 00:09:51.530
on in the field in the
next several years.

00:09:51.530 --> 00:09:57.060
So I'll begin with one case
study coming from my own work

00:09:57.060 --> 00:09:59.100
where we developed an
algorithm for learning

00:09:59.100 --> 00:10:03.590
from cross-sectional
data, and we evaluated it

00:10:03.590 --> 00:10:08.460
in the context of chronic
obstructive pulmonary disease

00:10:08.460 --> 00:10:10.030
or COPD.

00:10:10.030 --> 00:10:13.890
COPD is a condition of
the lungs typically caused

00:10:13.890 --> 00:10:20.260
by air pollution or smoking,
and it has a reasonably good

00:10:20.260 --> 00:10:22.210
staging mechanism.

00:10:22.210 --> 00:10:25.960
One uses what's called a
spirometry device in order

00:10:25.960 --> 00:10:29.080
to measure the lung function
of an individual at any one point

00:10:29.080 --> 00:10:29.780
in time.

00:10:29.780 --> 00:10:32.230
So for example, you take
this spirometry device,

00:10:32.230 --> 00:10:38.230
you stick it in your
mouth, and you breathe in,

00:10:38.230 --> 00:10:43.550
and then you exhale,
and one measure

00:10:43.550 --> 00:10:47.770
is how long it takes in
order to exhale all your air,

00:10:47.770 --> 00:10:49.970
and that is going to
be a measure of how

00:10:49.970 --> 00:10:52.310
good your lungs are.

00:10:52.310 --> 00:10:57.500
And so then one can take
that measure of your function

00:10:57.500 --> 00:11:02.990
and one can stage how
severe the person's COPD is,

00:11:02.990 --> 00:11:06.020
and that goes by what's
called the GOLD criteria.

00:11:06.020 --> 00:11:09.140
So for example, in
stage 1 of the COPD,

00:11:09.140 --> 00:11:13.460
common treatments
involve just vaccinations and

00:11:13.460 --> 00:11:16.250
using a short-acting
bronchodilator only when

00:11:16.250 --> 00:11:18.140
needed.

00:11:18.140 --> 00:11:22.360
When the disease stage gets
much more severe, like stage 4,

00:11:22.360 --> 00:11:25.460
then often treatment
is recommended

00:11:25.460 --> 00:11:28.040
to be inhaled
glucocorticosteroids

00:11:28.040 --> 00:11:31.260
if there are repeated
exacerbations of the disease,

00:11:31.260 --> 00:11:36.920
long-term oxygen if respiratory
failure occurs, and so on.

00:11:36.920 --> 00:11:39.710
And so this is a disease that's
reasonably well-understood

00:11:39.710 --> 00:11:44.160
because there exists a
good staging mechanism.

00:11:44.160 --> 00:11:46.180
And I would argue
that when we want

00:11:46.180 --> 00:11:48.340
to understand how to
do disease staging

00:11:48.340 --> 00:11:51.550
in a data-driven
fashion, we should first

00:11:51.550 --> 00:11:54.933
start by working with
either synthetic data,

00:11:54.933 --> 00:11:57.100
or we should start with
working with a disease where

00:11:57.100 --> 00:12:01.103
we have some idea of what the
actual true disease staging is.

00:12:01.103 --> 00:12:03.520
And that way, we can look to
see what our algorithms would

00:12:03.520 --> 00:12:05.110
recover in those
scenarios, and does

00:12:05.110 --> 00:12:08.110
it align with what
we would expect

00:12:08.110 --> 00:12:10.030
either from the way
the data was generated

00:12:10.030 --> 00:12:12.810
or from the existing
medical literature.

00:12:12.810 --> 00:12:14.170
And that's why we chose COPD.

00:12:14.170 --> 00:12:17.200
Because it is
well-understood, and there's

00:12:17.200 --> 00:12:20.440
a wealth of literature
on it, and because we

00:12:20.440 --> 00:12:23.650
have data on it which is much
messier than the type of data

00:12:23.650 --> 00:12:26.038
that went into the original
studies, and we could ask,

00:12:26.038 --> 00:12:27.580
could we come to
the same conclusions

00:12:27.580 --> 00:12:28.720
as those original studies?

00:12:35.180 --> 00:12:38.540
So in this work, we're
going to use data

00:12:38.540 --> 00:12:41.860
from the electronic
medical record.

00:12:41.860 --> 00:12:44.500
We're only going to look
at a subset of the EMR,

00:12:44.500 --> 00:12:47.200
in particular,
diagnosis codes that

00:12:47.200 --> 00:12:50.223
are recorded for a patient
at any point in time,

00:12:50.223 --> 00:12:52.390
and we're going to assume
that we do not have access

00:12:52.390 --> 00:12:53.800
to spirometry data at all.

00:12:53.800 --> 00:12:57.470
So we don't have any obvious
way of staging the patient's

00:12:57.470 --> 00:12:57.970
disease.

00:13:00.412 --> 00:13:01.870
The general approach
is going to be

00:13:01.870 --> 00:13:09.840
to build a generative model
for disease progression.

00:13:09.840 --> 00:13:13.290
At a very high level,
this is a Markov model.

00:13:13.290 --> 00:13:19.380
It's a model that specifies the
distribution of the patient's

00:13:19.380 --> 00:13:21.600
data, which is shown
here in the bottom,

00:13:21.600 --> 00:13:25.750
as it evolves over time,

00:13:25.750 --> 00:13:28.180
according to a number
of hidden variables

00:13:28.180 --> 00:13:30.550
that are shown in the top,
these S variables that

00:13:30.550 --> 00:13:33.640
denote disease stages, and
these X variables that denote

00:13:33.640 --> 00:13:35.140
comorbidities that
the patient might

00:13:35.140 --> 00:13:39.780
have at that point in time.
These X and S variables

00:13:39.780 --> 00:13:41.550
are always assumed
to be unobserved.

00:13:41.550 --> 00:13:44.040
So if you were to clump them
together into one variable,

00:13:44.040 --> 00:13:47.860
this would look exactly
like a hidden Markov model.

00:13:47.860 --> 00:13:50.800
And moreover, we're
not going to assume

00:13:50.800 --> 00:13:53.320
that we have a lot
of longitudinal data

00:13:53.320 --> 00:13:54.520
for a patient.

00:13:54.520 --> 00:13:58.720
In particular, COPD evolves over
10 to 20 years, and the data

00:13:58.720 --> 00:14:02.140
that we'll be learning from
here has data only over one

00:14:02.140 --> 00:14:04.360
to three-year time range.

00:14:04.360 --> 00:14:08.200
The challenge will be, can
we take data in this one

00:14:08.200 --> 00:14:11.975
to three-year time range and
somehow stitch it together

00:14:11.975 --> 00:14:13.600
across large numbers
of patients to get

00:14:13.600 --> 00:14:15.933
a picture of what the 20-year
progression of the disease

00:14:15.933 --> 00:14:17.332
might look like?

00:14:17.332 --> 00:14:18.790
The way that we're
going to do that

00:14:18.790 --> 00:14:22.372
is by learning the parameters
of this probabilistic model.

00:14:22.372 --> 00:14:23.830
And then from the
parameters, we're

00:14:23.830 --> 00:14:27.310
going to either infer the
patient's actual disease stage

00:14:27.310 --> 00:14:30.760
and thus sort them, or
actually simulate data

00:14:30.760 --> 00:14:33.370
from this model to see what
a 20-year trajectory might

00:14:33.370 --> 00:14:35.180
look like.

00:14:35.180 --> 00:14:35.930
Is the goal clear?

00:14:38.660 --> 00:14:39.160
All right.

00:14:39.160 --> 00:14:40.577
So now what I'm
going to do is I'm

00:14:40.577 --> 00:14:43.090
going to step into
this model piece

00:14:43.090 --> 00:14:45.460
by piece to tell you what
each of these components

00:14:45.460 --> 00:14:49.390
are, and I'll start out with a
very topmost piece shown here

00:14:49.390 --> 00:14:51.790
by the red box.

00:14:51.790 --> 00:14:56.280
So this is the model of the
patient's disease progression

00:14:56.280 --> 00:14:58.060
at any one point in time.

00:14:58.060 --> 00:15:00.180
So this variable,
S1, for example,

00:15:00.180 --> 00:15:06.810
might denote the patient's
disease stage on March 2011;

00:15:06.810 --> 00:15:10.470
S2 might denote the patient's
disease stage April 2011;

00:15:10.470 --> 00:15:12.870
S capital T might denote
the patient's disease stage

00:15:12.870 --> 00:15:14.850
June 2012.

00:15:14.850 --> 00:15:18.420
So we're going to have
one random variable

00:15:18.420 --> 00:15:22.620
for each observation of the
patient's data that we have.

00:15:22.620 --> 00:15:24.870
And notice that the observations
of the patient's data

00:15:24.870 --> 00:15:27.610
might be at very
irregular time intervals,

00:15:27.610 --> 00:15:30.240
and that's going to be OK
with this approach, OK?

00:15:30.240 --> 00:15:33.750
So notice that there is a
one-month gap between S1

00:15:33.750 --> 00:15:42.220
and S2, but a four-month gap
between St minus 1 and St, OK?

00:15:42.220 --> 00:15:44.350
So we're going to model
the patient's disease stage

00:15:44.350 --> 00:15:45.725
at the point in
time when we have

00:15:45.725 --> 00:15:49.510
an observation for the patient.

00:15:49.510 --> 00:15:54.820
S denotes a discrete
disease stage in this model.

00:15:54.820 --> 00:16:00.910
So S might be a value from
1 up to 4, maybe 1 up to 10

00:16:00.910 --> 00:16:07.150
where 1 is denoting an early
disease stage and 4 or 10

00:16:07.150 --> 00:16:08.920
might denote a much
later disease stage.

00:16:11.430 --> 00:16:13.980
If we have a sequence of
observations per patient--

00:16:13.980 --> 00:16:16.650
for example, we might have
an observation on March

00:16:16.650 --> 00:16:19.620
and then in April, we're
going to denote the disease

00:16:19.620 --> 00:16:23.010
stage by S1 and S2, what this
model is going to talk about

00:16:23.010 --> 00:16:26.340
is the probability distribution
of transitioning from whatever

00:16:26.340 --> 00:16:28.560
the disease stage
at S1 is to whatever

00:16:28.560 --> 00:16:31.200
the disease stage at S2.

00:16:31.200 --> 00:16:34.140
Now because the time
intervals between stages

00:16:34.140 --> 00:16:39.260
are not homogeneous, we have to
have a transition distribution

00:16:39.260 --> 00:16:43.290
that takes into
consideration that time gap.

00:16:43.290 --> 00:16:46.180
And to do that, we use what's
known as a continuous time

00:16:46.180 --> 00:16:48.700
Markov process.

00:16:48.700 --> 00:16:54.790
Formally, we say that the
transition distribution--

00:16:54.790 --> 00:16:59.560
so the probability of
transitioning from stage i

00:16:59.560 --> 00:17:06.710
at time t minus 1 to
state j at time t,

00:17:06.710 --> 00:17:11.960
given as input the difference
in time intervals--

00:17:11.960 --> 00:17:14.060
the difference in
time points delta--

00:17:14.060 --> 00:17:18.390
so delta is the number of months
between the two observations.

00:17:18.390 --> 00:17:20.569
So this conditional
distribution is

00:17:20.569 --> 00:17:26.599
going to be given by the
matrix exponential of this time

00:17:26.599 --> 00:17:35.560
interval times a matrix Q.

00:17:35.560 --> 00:17:38.890
And then here, the matrix
Q gives us the parameters

00:17:38.890 --> 00:17:40.360
that we want to learn.
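
Written out, the transition model just described is P(S_t = j | S_{t-1} = i, delta) = [exp(delta * Q)]_{ij}. Here is a minimal sketch with a hypothetical rate matrix Q; in a continuous-time Markov process, Q has nonnegative off-diagonal rates and rows that sum to zero, so each row of the matrix exponential sums to one.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical rate matrix Q for 3 disease stages (units: transitions per month).
# Rows sum to zero; here the last stage is absorbing and there are no backward rates.
Q = np.array([[-0.10, 0.09, 0.01],
              [ 0.00, -0.05, 0.05],
              [ 0.00,  0.00, 0.00]])

def transition_matrix(Q, delta_months):
    """P(S_t = j | S_{t-1} = i, delta) = [exp(delta * Q)]_{ij}."""
    return expm(delta_months * Q)

print(transition_matrix(Q, 1.0))  # one month between observations
print(transition_matrix(Q, 4.0))  # four months: more mass moves to later stages
```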

00:17:40.360 --> 00:17:42.070
So let me contrast
this to things

00:17:42.070 --> 00:17:45.700
that you might
already be used to.

00:17:45.700 --> 00:17:51.720
In a typical hidden
Markov model,

00:17:51.720 --> 00:17:57.125
you might have
S t minus 1 goes to St--

00:17:57.125 --> 00:18:00.730
or St minus 1 goes to St,
and you might imagine just

00:18:00.730 --> 00:18:07.570
parametrizing St given St
minus 1 just by a lookup table.

00:18:07.570 --> 00:18:11.600
So for example, if
the number of states--

00:18:11.600 --> 00:18:17.130
for each random variable is
3, then you would have a 3

00:18:17.130 --> 00:18:23.076
by 3 table where for
each state St minus 1,

00:18:23.076 --> 00:18:25.110
you have some
probability of transition

00:18:25.110 --> 00:18:30.730
to the corresponding state
St, so this might be something

00:18:30.730 --> 00:18:40.510
like 0.9, 0.9,
0.9, where notice,

00:18:40.510 --> 00:18:43.060
I'm having a very large
value along the diagonal,

00:18:43.060 --> 00:18:46.260
because if, let's say,
a very small period--

00:18:46.260 --> 00:18:48.940
so a priori, we might
believe that patients

00:18:48.940 --> 00:18:51.070
stay in the same
disease stage, and then

00:18:51.070 --> 00:18:53.230
one might imagine
that the probably

00:18:53.230 --> 00:18:59.080
transitioning from state
1 at time t minus 1

00:18:59.080 --> 00:19:08.075
to state 2 at time t might
be something like 0.09,

00:19:08.075 --> 00:19:10.900
and the probability of skipping
state 2, going directly

00:19:10.900 --> 00:19:12.880
to state 3 from state
1 might be something

00:19:12.880 --> 00:19:15.880
much smaller like 0.01, OK?

00:19:15.880 --> 00:19:18.130
And we might say something
like that the probability--

00:19:18.130 --> 00:19:19.672
we might imagine
that the probability

00:19:19.672 --> 00:19:21.310
of going in a
backwards direction,

00:19:21.310 --> 00:19:25.660
going from stage 2
at time t minus 1

00:19:25.660 --> 00:19:30.160
to let's say stage 1 at time
t, that might be 0 all right?

00:19:30.160 --> 00:19:35.810
So you might imagine that
actually this is the model,

00:19:35.810 --> 00:19:39.260
and what that's saying is
something like you never

00:19:39.260 --> 00:19:42.560
go in the backwards
direction, and you're

00:19:42.560 --> 00:19:44.690
more likely to
transition to the state

00:19:44.690 --> 00:19:47.270
immediately adjacent
to the current stage

00:19:47.270 --> 00:19:49.783
and very unlikely
to skip a stage.
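
As a concrete version of that lookup table, this is the kind of 3-by-3 transition matrix being sketched; only the 0.9 / 0.09 / 0.01 / 0 pattern in the first row comes from the description above, and the remaining entries are filled in just so every row sums to 1.

```python
import numpy as np

# Rows index S_{t-1}, columns index S_t; rows sum to 1.  Staying put is likely,
# advancing one stage is less likely, skipping a stage is rare, and moving
# backwards is disallowed (probability 0 below the diagonal).
P = np.array([[0.90, 0.09, 0.01],
              [0.00, 0.90, 0.10],
              [0.00, 0.00, 1.00]])
assert np.allclose(P.sum(axis=1), 1.0)
```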

00:19:49.783 --> 00:19:51.200
So this would be
an example of how

00:19:51.200 --> 00:19:54.230
you would parametrize the
transition distribution

00:19:54.230 --> 00:19:58.450
in a typical discrete
time Markov model,

00:19:58.450 --> 00:20:00.970
and the story here is going
to be different specifically

00:20:00.970 --> 00:20:03.530
because the time intervals
are not all the same.

00:20:03.530 --> 00:20:07.840
So intuitively, if a lot of
time has passed between the two

00:20:07.840 --> 00:20:11.120
observations, then we want
to allow for an accelerated

00:20:11.120 --> 00:20:11.620
process.

00:20:11.620 --> 00:20:13.203
We want to allow for
the fact that you

00:20:13.203 --> 00:20:15.280
might want to skip many
different stages to go

00:20:15.280 --> 00:20:17.738
to your next time step, to go
to the stage of the next time

00:20:17.738 --> 00:20:19.800
step, because so
much time has passed.

00:20:19.800 --> 00:20:22.960
And that intuitively is what
this scaling of this matrix

00:20:22.960 --> 00:20:25.270
Q by delta corresponds to.

00:20:25.270 --> 00:20:27.730
So the number of parameters
in this parameterization

00:20:27.730 --> 00:20:29.950
is actually identical to
the number of parameters

00:20:29.950 --> 00:20:31.300
in this parametrization, right?

00:20:31.300 --> 00:20:36.790
So you have a matrix Q which
is given to you in essence

00:20:36.790 --> 00:20:39.160
by the number of
states squared--

00:20:39.160 --> 00:20:40.667
really, the number of states--

00:20:40.667 --> 00:20:42.250
there's an additional
redundancy there

00:20:42.250 --> 00:20:45.780
because it has to sum up to
1, but that's irrelevant.

00:20:45.780 --> 00:20:47.980
And so the same
story here, but we're

00:20:47.980 --> 00:20:50.500
going to now
parametrize the process

00:20:50.500 --> 00:20:56.552
by in some sense the
infinitesimally small time

00:20:56.552 --> 00:20:57.760
probability of transitioning.

00:20:57.760 --> 00:21:00.250
So if you were to take the
derivative of this transition

00:21:00.250 --> 00:21:03.430
distribution as the
time interval shrinks,

00:21:03.430 --> 00:21:10.153
and then you were to integrate
over the time interval that

00:21:10.153 --> 00:21:12.820
was observed and the probability
of transitioning from any state

00:21:12.820 --> 00:21:16.420
to any other state with
that infinitesimally small

00:21:16.420 --> 00:21:18.340
probability transitioning,
what you get out

00:21:18.340 --> 00:21:20.790
is exactly this form.

00:21:20.790 --> 00:21:25.240
And I'll leave-- this paper
is in the optional readings

00:21:25.240 --> 00:21:27.240
for today's lecture, and
you can read through it

00:21:27.240 --> 00:21:32.080
to get more intuition about the
continuous time Markov process.

00:21:32.080 --> 00:21:33.804
Any questions so far?

00:21:33.804 --> 00:21:34.760
Yep?

00:21:34.760 --> 00:21:37.260
AUDIENCE: Those Q are the
same for both versions or--

00:21:37.260 --> 00:21:37.950
PROFESSOR: Yes.

00:21:37.950 --> 00:21:41.970
And this model Q is essentially
the same for all patients.

00:21:41.970 --> 00:21:45.120
And you might imagine, if there
were disease subtypes, which

00:21:45.120 --> 00:21:47.190
there aren't in this
approach, that Q might

00:21:47.190 --> 00:21:48.450
be different for each subtype.

00:21:48.450 --> 00:21:51.000
For example, you might
transition between stages

00:21:51.000 --> 00:21:54.820
much more quickly for some
subtypes than for others.

00:21:54.820 --> 00:21:55.968
Other questions?

00:21:59.880 --> 00:22:00.730
Yep?

00:22:00.730 --> 00:22:03.625
AUDIENCE: So-- OK, so
Q you said had like--

00:22:03.625 --> 00:22:06.058
it's just like a screen
number used beforehand you

00:22:06.058 --> 00:22:08.100
kind of like specified
these stages that you pick

00:22:08.100 --> 00:22:08.600
[INAUDIBLE]

00:22:08.600 --> 00:22:09.755
PROFESSOR: Correct.

00:22:09.755 --> 00:22:10.255
Yes.

00:22:10.255 --> 00:22:15.220
So you pre-specify the number of
stages that you want to model,

00:22:15.220 --> 00:22:19.925
and there are many ways to
try to choose that parameter.

00:22:19.925 --> 00:22:22.050
For example, you could look
at held-out likelihood

00:22:22.050 --> 00:22:23.730
under this model,
which is learned

00:22:23.730 --> 00:22:26.100
for different numbers of stages.

00:22:26.100 --> 00:22:28.830
You could use typical
model selection techniques

00:22:28.830 --> 00:22:31.410
from machine learning
as another approach

00:22:31.410 --> 00:22:35.010
where you try to penalize
complexity in some way.

00:22:35.010 --> 00:22:37.582
Or, what we found here, because
of some of the other things

00:22:37.582 --> 00:22:39.540
that I'm about to tell
you, it doesn't actually

00:22:39.540 --> 00:22:41.220
matter that much.

00:22:41.220 --> 00:22:44.520
So similarly to when one does
hard [INAUDIBLE] clustering

00:22:44.520 --> 00:22:47.580
or even K-means clustering
or even learning

00:22:47.580 --> 00:22:49.680
a probabilistic
topic model, if you

00:22:49.680 --> 00:22:52.590
use a very small number of
topics or number of clusters,

00:22:52.590 --> 00:22:54.960
you tend to learn very
coarse-grained topics

00:22:54.960 --> 00:22:55.920
or clusters.

00:22:55.920 --> 00:22:57.530
If you use very many more--

00:22:57.530 --> 00:22:59.280
if you use a much
larger number of topics,

00:22:59.280 --> 00:23:01.830
you tend to learn much
more fine-grained topics.

00:23:01.830 --> 00:23:03.720
Same story is going
to happen here.

00:23:03.720 --> 00:23:05.638
If you use a small
number of disease stages,

00:23:05.638 --> 00:23:07.680
you're going to learn very
coarse-grained notions

00:23:07.680 --> 00:23:09.990
of disease stages; if you
use more disease stages,

00:23:09.990 --> 00:23:12.085
you're going to learn
a fine-grained notion;

00:23:12.085 --> 00:23:13.710
but the overall
sorting of the patients

00:23:13.710 --> 00:23:16.680
is going to end up
being very similar.

00:23:16.680 --> 00:23:18.180
But to make that
statement, we're

00:23:18.180 --> 00:23:20.513
going to need to make some
additional assumptions, which

00:23:20.513 --> 00:23:23.297
I'm going to show
you in a few minutes.

00:23:23.297 --> 00:23:24.130
Any other questions?

00:23:24.130 --> 00:23:27.200
These are great questions.

00:23:27.200 --> 00:23:28.100
Yep?

00:23:28.100 --> 00:23:30.800
AUDIENCE: So do we know
the staging of the disease

00:23:30.800 --> 00:23:31.440
because I

00:23:31.440 --> 00:23:33.320
PROFESSOR: No, and
that's critical here.

00:23:33.320 --> 00:23:37.060
So I'm assuming that these
variables-- these S's are all

00:23:37.060 --> 00:23:39.290
hidden variables here.

00:23:39.290 --> 00:23:41.980
And the way that we're
going to learn this model

00:23:41.980 --> 00:23:43.730
is by maximum
likelihood estimation

00:23:43.730 --> 00:23:46.090
where we marginalize over
the hidden variables,

00:23:46.090 --> 00:23:50.870
just like you would do
in any EM type algorithm.

00:23:50.870 --> 00:23:52.064
Any other questions?

00:23:55.382 --> 00:23:56.810
All right, so what
I've just shown

00:23:56.810 --> 00:24:00.180
you is the topmost
part of the model,

00:24:00.180 --> 00:24:02.530
now I'm going to talk
about a horizontal slice.

00:24:02.530 --> 00:24:06.110
So I'm going to talk about
one of these time points.

00:24:09.988 --> 00:24:15.840
So if you were to look
at the translation--

00:24:15.840 --> 00:24:18.720
the rotation of one of
those time points, what you

00:24:18.720 --> 00:24:20.730
would get out is this model.

00:24:20.730 --> 00:24:24.360
These X's are also
hidden variables,

00:24:24.360 --> 00:24:28.133
and we have pre-specified them
to characterize different axes

00:24:28.133 --> 00:24:30.300
by which we want to understand
the patient's disease

00:24:30.300 --> 00:24:32.655
progression.

00:24:32.655 --> 00:24:34.405
So in Thursday's
lecture, we characterized

00:24:34.405 --> 00:24:43.070
the patient's disease
subtype by just a single number,

00:24:43.070 --> 00:24:46.325
and similarly in this example,
the stage is just a single number,

00:24:46.325 --> 00:24:48.200
but we might want to
understand what's really

00:24:48.200 --> 00:24:50.150
unique about each subtype.

00:24:50.150 --> 00:24:53.730
So for example--
sorry, what's really

00:24:53.730 --> 00:24:56.080
unique about each disease stage.

00:24:56.080 --> 00:24:59.760
So for example, how is the
patient's endocrine function

00:24:59.760 --> 00:25:01.440
in that disease stage?

00:25:01.440 --> 00:25:09.645
How is the patient's psychiatric
status in that disease stage?

00:25:12.350 --> 00:25:15.170
Has the patient developed lung
cancer yet in that disease

00:25:15.170 --> 00:25:16.790
stage?

00:25:16.790 --> 00:25:18.270
And so on.

00:25:18.270 --> 00:25:20.720
And so we're going
to ask that we

00:25:20.720 --> 00:25:22.890
want to be able to read
out from this model

00:25:22.890 --> 00:25:24.940
according to these
axes, and this

00:25:24.940 --> 00:25:27.470
will become very clear at
the end of this section

00:25:27.470 --> 00:25:29.150
where I show you a
simulation of what

00:25:29.150 --> 00:25:33.500
20 years looks like for COPD
according to these quantities.

00:25:33.500 --> 00:25:35.930
When does the patient
typically develop diabetes,

00:25:35.930 --> 00:25:38.450
when does the patient
typically become depressed,

00:25:38.450 --> 00:25:41.730
when does the patient typically
develop cancer, and so on.

00:25:41.730 --> 00:25:43.310
So these are the
quantities in which

00:25:43.310 --> 00:25:45.290
we want to be able
to really talk

00:25:45.290 --> 00:25:48.650
about what happens to a patient
at any one disease stage,

00:25:48.650 --> 00:25:50.780
but the challenge is, we
never actually observe

00:25:50.780 --> 00:25:53.150
these quantities in
the data that we have.

00:25:53.150 --> 00:25:56.450
Rather, all we observe are
things like laboratory test

00:25:56.450 --> 00:25:58.820
results or diagnosis
codes or procedures

00:25:58.820 --> 00:26:00.770
that have been performed
and so on, which

00:26:00.770 --> 00:26:03.190
I'm going to call the clinical
findings in the bottom.

00:26:03.190 --> 00:26:05.930
And as we've been discussing
throughout this course,

00:26:05.930 --> 00:26:08.300
one could think about
things as diagnosis codes

00:26:08.300 --> 00:26:10.760
as giving you information
about the disease

00:26:10.760 --> 00:26:12.560
status of the patient,
but they're not

00:26:12.560 --> 00:26:14.660
one and the same
as the diagnosis,

00:26:14.660 --> 00:26:16.760
because there's so much
noise and bias that

00:26:16.760 --> 00:26:20.015
goes into the assigning of
diagnosis codes for patients.

00:26:22.600 --> 00:26:26.820
And so the way that we're
going to model the raw data

00:26:26.820 --> 00:26:28.470
as a function of
these hidden variables

00:26:28.470 --> 00:26:31.050
that we want to
characterize is using what's

00:26:31.050 --> 00:26:33.165
known as a noisy-OR network.

00:26:35.710 --> 00:26:37.620
So we're going to
suppose that there

00:26:37.620 --> 00:26:41.880
is some generative distribution
where the observations you

00:26:41.880 --> 00:26:44.730
see-- for example,
diagnosis codes

00:26:44.730 --> 00:26:48.510
are likely to be observed as a
function of whether the patient

00:26:48.510 --> 00:26:51.450
has these phenotypes
or comorbidities

00:26:51.450 --> 00:26:53.970
with some probability, and that
probability can be specified

00:26:53.970 --> 00:26:56.550
by these edge weights.

00:26:56.550 --> 00:27:00.930
So for example, a
diagnosis code for diabetes

00:27:00.930 --> 00:27:03.540
is very likely to be
observed in the patient data

00:27:03.540 --> 00:27:06.270
if the patient
truly has diabetes,

00:27:06.270 --> 00:27:07.860
but of course, it
may not be recorded

00:27:07.860 --> 00:27:10.170
in the data for every
single visit the patient has

00:27:10.170 --> 00:27:12.060
to a clinician, there
might be some visits

00:27:12.060 --> 00:27:14.790
to clinicians that have nothing
to do with the patient's

00:27:14.790 --> 00:27:17.730
endocrine function and
diabetes-- the diagnosis

00:27:17.730 --> 00:27:19.920
code might not be
recorded for that visit.

00:27:19.920 --> 00:27:21.600
So it's going to
be a noisy process,

00:27:21.600 --> 00:27:24.240
and that noise rate is going
to be captured by that edge.
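
A minimal sketch of the noisy-OR likelihood being described; the comorbidity vector, the per-edge probabilities, and the small leak term here are all hypothetical.

```python
import numpy as np

def noisy_or_prob(x, fail, leak=0.01):
    """P(finding is recorded | comorbidities x) under a noisy-OR model.

    x:    binary vector, x[i] = 1 if comorbidity i is present
    fail: fail[i] = probability that comorbidity i, when present, still fails
          to produce the finding (i.e. 1 minus the edge weight)
    leak: probability the finding is recorded with no active parent
    """
    return 1.0 - (1.0 - leak) * np.prod(np.where(x == 1, fail, 1.0))

x = np.array([1, 0, 0])            # e.g. only the "diabetes" comorbidity is on
fail = np.array([0.3, 0.9, 0.95])  # hypothetical per-parent failure probabilities
print(noisy_or_prob(x, fail))      # probability the diabetes code shows up
```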

00:27:31.100 --> 00:27:33.080
So part of the
learning algorithm

00:27:33.080 --> 00:27:35.940
is going to be to learn those
transition distributions--

00:27:35.940 --> 00:27:38.570
for example, that Q matrix
I showed you in the earlier

00:27:38.570 --> 00:27:41.570
slide, but the other
role of the learning algorithm

00:27:41.570 --> 00:27:43.700
is to learn all
of the parameters

00:27:43.700 --> 00:27:46.490
of this noisy-OR distribution,
namely these edge weights.

00:27:46.490 --> 00:27:49.820
So that's going to be discovered
as part of the learning

00:27:49.820 --> 00:27:51.800
algorithm.

00:27:51.800 --> 00:27:53.930
And a key question
that you have to ask

00:27:53.930 --> 00:27:58.040
me is, if I know I want
to read out from the model

00:27:58.040 --> 00:28:01.390
according to these axes,
but these axes are never--

00:28:01.390 --> 00:28:03.140
I'm never assuming
that they're explicitly

00:28:03.140 --> 00:28:06.380
observed in the data, how
do I ground the learning

00:28:06.380 --> 00:28:09.050
algorithm to give meaning
to these hidden variables?

00:28:09.050 --> 00:28:12.350
Because otherwise if we left
them otherwise unconstrained

00:28:12.350 --> 00:28:14.960
and you did maximum
likelihood estimation just

00:28:14.960 --> 00:28:17.360
like in any factor
analysis-type model,

00:28:17.360 --> 00:28:18.920
you might discover
some factors here,

00:28:18.920 --> 00:28:21.040
but they might not be the
factors you care about,

00:28:21.040 --> 00:28:24.260
and if the learning problem
was not identifiable,

00:28:24.260 --> 00:28:26.300
as is often the case in
unsupervised learning,

00:28:26.300 --> 00:28:29.660
then you might not discover
what you're interested in.

00:28:29.660 --> 00:28:32.690
So to ground the
hidden variables,

00:28:32.690 --> 00:28:36.740
we introduced-- we used a
technique that you already

00:28:36.740 --> 00:28:41.480
saw in an earlier lecture
from lecture 8 called anchors.

00:28:41.480 --> 00:28:43.700
So a domain expert
is going to specify

00:28:43.700 --> 00:28:48.980
for each one of the comorbidities
one or more anchors, which

00:28:48.980 --> 00:28:52.010
are observations, which
we are going to conjecture

00:28:52.010 --> 00:28:57.590
could only have arisen from the
corresponding hidden variable.

00:28:57.590 --> 00:28:59.180
So notice here
that this diagnosis

00:28:59.180 --> 00:29:01.430
code, which is for
type 2 diabetes,

00:29:01.430 --> 00:29:05.270
has only an edge from X1.

00:29:05.270 --> 00:29:07.097
That is an assumption
that we're making

00:29:07.097 --> 00:29:08.180
in the learning algorithm.

00:29:08.180 --> 00:29:10.160
We are actually
explicitly zeroing

00:29:10.160 --> 00:29:14.210
out all of the other edges from
all of the other comorbidities

00:29:14.210 --> 00:29:16.230
to O1.

00:29:16.230 --> 00:29:18.680
We're not going to pre-specify
what this edge weight is,

00:29:18.680 --> 00:29:21.140
we're going to allow for the
fact that this might be noisy,

00:29:21.140 --> 00:29:23.907
it's not always observed even
if the patient has diabetes,

00:29:23.907 --> 00:29:25.490
but we're going to
say, this could not

00:29:25.490 --> 00:29:29.830
be explained by any of
the other comorbidities.

00:29:29.830 --> 00:29:32.830
And so for each one of the
comorbidities or phenotypes

00:29:32.830 --> 00:29:34.510
that we want to
model, we're going

00:29:34.510 --> 00:29:37.270
to specify some small
number of anchors

00:29:37.270 --> 00:29:40.060
which correspond to a type
of sparsity assumption

00:29:40.060 --> 00:29:42.470
on that graph.
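
One way to picture that sparsity assumption is as a mask on the noisy-OR edge-weight matrix; this is only an illustration, and the sizes, indices, and placeholder values are hypothetical.

```python
import numpy as np

# 3 comorbidities (rows) and 4 observed findings (columns).
# anchor_parent[o] gives the only comorbidity allowed to explain finding o,
# or -1 if finding o is not an anchor and may have edges from any comorbidity.
anchor_parent = np.array([0, -1, 1, -1])

mask = np.ones((3, 4), dtype=bool)
for o, parent in enumerate(anchor_parent):
    if parent >= 0:
        mask[:, o] = False
        mask[parent, o] = True  # keep only the edge from the anchor's parent

# During learning, edge weights are only updated where mask is True; masked-out
# entries stay at zero, which is exactly the anchor assumption.
W = np.where(mask, 0.5, 0.0)  # 0.5 is just a placeholder initialization
```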

00:29:42.470 --> 00:29:46.210
And these are the anchors
that we chose for asthma, we

00:29:46.210 --> 00:29:48.670
chose a diagnosis code
corresponding to asthma;

00:29:48.670 --> 00:29:50.890
for lung cancer, we
chose several diagnosis

00:29:50.890 --> 00:29:53.320
codes correspond to lung
cancer; for obesity, we

00:29:53.320 --> 00:29:54.820
chose a diagnosis
code corresponding

00:29:54.820 --> 00:29:57.340
to morbid obesity; and so on.

00:29:57.340 --> 00:29:58.960
And so these are
ways that we're going

00:29:58.960 --> 00:30:01.180
to give meaning to
the hidden variables,

00:30:01.180 --> 00:30:03.580
but as you'll see in
just a few minutes,

00:30:03.580 --> 00:30:05.980
it is not going to pre-specify
too much of the model.

00:30:05.980 --> 00:30:07.897
The model's still going
to learn a whole bunch

00:30:07.897 --> 00:30:10.510
of other interesting things.

00:30:10.510 --> 00:30:14.340
By the way, the way that we
actually came up with this set

00:30:14.340 --> 00:30:17.290
was by an iterative process.

00:30:17.290 --> 00:30:21.840
We specified some of the hidden
variables to have anchors,

00:30:21.840 --> 00:30:24.780
but we also left some of
them to be unanchored,

00:30:24.780 --> 00:30:27.090
meaning free variables.

00:30:27.090 --> 00:30:29.282
We did our learning
algorithm, and just

00:30:29.282 --> 00:30:30.740
like you would do
in a topic model,

00:30:30.740 --> 00:30:33.257
we discovered that there
were some phenotypes that

00:30:33.257 --> 00:30:35.340
really seemed to be
characterized by the patient's

00:30:35.340 --> 00:30:36.060
disease--

00:30:36.060 --> 00:30:38.940
that seemed to characterize a
patient's disease progression.

00:30:38.940 --> 00:30:43.380
Then in order to really dig
deeper, working collaboratively

00:30:43.380 --> 00:30:46.150
with a domain expert, we
specified anchors for those

00:30:46.150 --> 00:30:47.580
and we iterated,
and in this way,

00:30:47.580 --> 00:30:50.310
we discovered the full set
of interesting variables

00:30:50.310 --> 00:30:51.380
that we wanted to model.

00:30:51.380 --> 00:30:51.880
Yep?

00:30:51.880 --> 00:30:55.240
AUDIENCE: Did you measure how
good an anchor these were?

00:30:55.240 --> 00:30:58.042
Like are some comorbidities
better anchors than others?

00:30:58.042 --> 00:30:58.750
PROFESSOR: Great.

00:30:58.750 --> 00:31:01.042
You'll see-- I think we'll
answer that question in just

00:31:01.042 --> 00:31:03.360
a few minutes when I show
you what the graph looks

00:31:03.360 --> 00:31:04.752
like that's learned.

00:31:04.752 --> 00:31:05.400
Yep?

00:31:05.400 --> 00:31:06.900
AUDIENCE: Were all
the other weights

00:31:06.900 --> 00:31:09.490
in that X to O network 0?

00:31:09.490 --> 00:31:12.103
They weren't part of it here.

00:31:12.103 --> 00:31:13.520
So it looks like
a pretty sparse--

00:31:13.520 --> 00:31:14.590
PROFESSOR: They're
explicitly nonzero, actually,

00:31:14.590 --> 00:31:15.920
it's opposite.

00:31:15.920 --> 00:31:22.350
So for an anchor, we say that
it only has a single parent.

00:31:22.350 --> 00:31:24.180
Everything that's
not an anchor can

00:31:24.180 --> 00:31:27.060
have arbitrarily many parents.

00:31:27.060 --> 00:31:28.166
Is that clear?

00:31:28.166 --> 00:31:29.080
OK.

00:31:29.080 --> 00:31:30.120
Yeah?

00:31:30.120 --> 00:31:32.900
AUDIENCE: Do the anchors that
you have in that linear table,

00:31:32.900 --> 00:31:35.360
you iterated yourself on
that or did the doctors say

00:31:35.360 --> 00:31:37.757
that these are the [INAUDIBLE]?

00:31:37.757 --> 00:31:39.590
PROFESSOR: We started
out with just a subset

00:31:39.590 --> 00:31:40.676
of these conditions.

00:31:40.676 --> 00:31:43.232
As things that we
wanted to model--

00:31:43.232 --> 00:31:44.690
things that we
wanted to understand

00:31:44.690 --> 00:31:46.327
what happens along
disease progression

00:31:46.327 --> 00:31:48.410
according to these axes,
but just a subset of them

00:31:48.410 --> 00:31:49.670
originally.

00:31:49.670 --> 00:31:53.240
And then we included a few
additional hidden variables

00:31:53.240 --> 00:31:55.460
that didn't have any
anchors associated to them,

00:31:55.460 --> 00:31:57.510
and after doing unsupervised
learning and just

00:31:57.510 --> 00:32:01.197
a preliminary development stage,
they discovered some topics

00:32:01.197 --> 00:32:03.530
and we realized, oh shoot,
we should have included those

00:32:03.530 --> 00:32:06.920
in there, and then we added them
in with corresponding anchors.

00:32:09.407 --> 00:32:11.740
And so you could think about
this as an exploratory data

00:32:11.740 --> 00:32:14.200
analysis pipeline.

00:32:14.200 --> 00:32:14.700
Yep?

00:32:14.700 --> 00:32:18.120
AUDIENCE: Is there a chance
that these aren't anchors?

00:32:18.120 --> 00:32:20.375
PROFESSOR: Yes.

00:32:20.375 --> 00:32:22.500
So there's definitely the
chance that these may not

00:32:22.500 --> 00:32:23.917
be anchors related
to the question

00:32:23.917 --> 00:32:25.150
was asked a second ago.

00:32:25.150 --> 00:32:31.040
So for example, there
might be some chance

00:32:31.040 --> 00:32:33.800
that the morbid
obesity diagnosis

00:32:33.800 --> 00:32:37.610
code might be
coded for a patient

00:32:37.610 --> 00:32:45.258
more often for a patient
who has, let's say, asthma--

00:32:45.258 --> 00:32:46.175
this is a bad example.

00:32:50.070 --> 00:32:53.270
And in which case, that would
correspond to there truly

00:32:53.270 --> 00:32:56.450
existing an edge from
asthma to this anchor, which

00:32:56.450 --> 00:32:58.980
would be a violation
of the anchor assumption.

00:32:58.980 --> 00:33:03.680
All right, so we chose
these to make that unlikely,

00:33:03.680 --> 00:33:05.880
but it could happen.

00:33:05.880 --> 00:33:07.890
And it's not easily testable.

00:33:07.890 --> 00:33:10.280
So this is another example
of an untestable assumption,

00:33:10.280 --> 00:33:12.230
just like we saw lots
of other examples

00:33:12.230 --> 00:33:15.735
already in today's lecture and
the causal inference lectures.

00:33:15.735 --> 00:33:17.360
If we had some ground
truth data-- like

00:33:17.360 --> 00:33:19.652
if we had done chart review
for some number of patients

00:33:19.652 --> 00:33:21.540
and we actually label
these conditions,

00:33:21.540 --> 00:33:23.570
then we could test
that anchor assumption.

00:33:23.570 --> 00:33:26.195
But here, we're assuming that we
don't actually know the ground

00:33:26.195 --> 00:33:27.686
truth of these conditions.

00:33:27.686 --> 00:33:29.680
AUDIENCE: Is there a
reason why you choose

00:33:29.680 --> 00:33:31.390
such high-level comorbidities?

00:33:31.390 --> 00:33:33.528
Like I imagine you
could go more specific.

00:33:33.528 --> 00:33:34.945
Even, say, the
diabetes, you could

00:33:34.945 --> 00:33:38.000
try to subtype the diabetes
based on this other model, sort

00:33:38.000 --> 00:33:41.442
of use that as a single
layer, but it seems to--

00:33:41.442 --> 00:33:43.900
at least this model seems to
choose [INAUDIBLE] high level.

00:33:43.900 --> 00:33:45.000
I was just curious
of the reason.

00:33:45.000 --> 00:33:45.625
PROFESSOR: Yes.

00:33:45.625 --> 00:33:47.890
So that was a design
choice that we made.

00:33:47.890 --> 00:33:50.330
There are many, many
directions for follow-up work,

00:33:50.330 --> 00:33:53.720
one of which would be to use
a hierarchical model here.

00:33:53.720 --> 00:33:55.290
But we hadn't gone
that direction.

00:33:55.290 --> 00:33:57.230
Another obvious direction
for follow-up work

00:33:57.230 --> 00:34:01.070
would be to do something within
the subtyping with this staging

00:34:01.070 --> 00:34:03.050
by introducing another
random variable, which

00:34:03.050 --> 00:34:04.672
is, let's say, the
disease subtype,

00:34:04.672 --> 00:34:06.380
and making everything
a function of that.

00:34:11.080 --> 00:34:14.273
OK, so I've talked
about the vertical slice

00:34:14.273 --> 00:34:15.940
and I've talked about
the topmost slice,

00:34:15.940 --> 00:34:17.565
but what I still need
to tell you about

00:34:17.565 --> 00:34:23.874
is how these phenotypes relate
to the observed disease stage.

00:34:27.139 --> 00:34:28.325
So for this, we use--

00:34:33.090 --> 00:34:36.150
I don't remember the exact
technical terminology--

00:34:36.150 --> 00:34:38.639
a factored Markov model?

00:34:38.639 --> 00:34:41.130
Is that right, Pete?

00:34:41.130 --> 00:34:42.540
Factorized Markov
model-- I mean,

00:34:42.540 --> 00:34:44.707
this is a term that existed
in the graphical model's

00:34:44.707 --> 00:34:47.290
literature, but I don't
remember right now.

00:34:47.290 --> 00:34:51.580
So what we're saying is that
each of these Markov chains--

00:34:51.580 --> 00:35:06.880
so each of these X1 up to Xt--

00:35:12.500 --> 00:35:16.770
so this, will say, is the
first one I call diabetes.

00:35:19.730 --> 00:35:23.180
This is the second one which
I'll say is depression.

00:35:30.573 --> 00:35:32.990
We're going to assume that
each one of these Markov chains

00:35:32.990 --> 00:35:35.630
is conditionally independent
of each other given the disease

00:35:35.630 --> 00:35:36.710
stage.

00:35:36.710 --> 00:35:45.590
So it's the disease stage
which ties everything together.

00:35:45.590 --> 00:35:57.160
So the graphical model looks
like this, and so on, OK?

00:35:57.160 --> 00:36:00.130
So in particular, there are
no edges between, let's say,

00:36:00.130 --> 00:36:03.010
the diabetes variable and
the depression variable.

00:36:03.010 --> 00:36:05.470
All correlations
between these conditions

00:36:05.470 --> 00:36:09.760
are assumed to be mediated by
the disease stage variable.
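
Putting the pieces together, that conditional-independence assumption amounts to a joint distribution that factorizes roughly as below (a sketch of the structure only; X_{k,t} is comorbidity k at visit t, O_t is the vector of observed findings at visit t, and delta_t is the time gap feeding the continuous-time transition):

$$
p(S_{1:T}, X_{1:T}, O_{1:T}) \;=\; p(S_1)\prod_{t=2}^{T} p(S_t \mid S_{t-1}, \delta_t)\;
\prod_{k=1}^{K}\Big[p(X_{k,1}\mid S_1)\prod_{t=2}^{T} p(X_{k,t}\mid X_{k,t-1}, S_t)\Big]\;
\prod_{t=1}^{T} p(O_t \mid X_{1,t},\ldots,X_{K,t})
$$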

00:36:09.760 --> 00:36:12.970
And that's a critical
assumption that we had to make.

00:36:12.970 --> 00:36:14.110
Does anyone know why?

00:36:17.540 --> 00:36:21.242
What would go wrong if we
didn't make that assumption?

00:36:21.242 --> 00:36:22.700
So for example,
what would go wrong

00:36:22.700 --> 00:36:28.560
if we had something
look like this, x1--

00:36:28.560 --> 00:36:31.580
what was my notation?

00:36:31.580 --> 00:36:41.060
X1,1, X1,2, X1,3, and suppose
we had edges between them,

00:36:41.060 --> 00:36:42.740
a complete graph, and
we had, let's say,

00:36:42.740 --> 00:36:49.130
also the S variable with
edges to everything?

00:36:49.130 --> 00:36:51.808
What would happen
in that case where

00:36:51.808 --> 00:36:54.350
we're not assuming that the X's
are conditionally independent

00:36:54.350 --> 00:36:55.160
given S?

00:37:02.180 --> 00:37:07.200
So I want you to think about
this in terms of distributions.

00:37:07.200 --> 00:37:11.520
So remember, we're
going to learn

00:37:11.520 --> 00:37:13.500
how to do disease
progression through learning

00:37:13.500 --> 00:37:14.760
the parameters of this model.

00:37:14.760 --> 00:37:17.093
And so if we set this up and--
if we set up the learning

00:37:17.093 --> 00:37:19.230
problem in a way which
is unidentifiable,

00:37:19.230 --> 00:37:20.220
then we're going to
be screwed, we're

00:37:20.220 --> 00:37:21.678
not going to able
to learn anything

00:37:21.678 --> 00:37:23.420
about disease progression.

00:37:23.420 --> 00:37:25.570
So what would happen
in this situation?

00:37:30.120 --> 00:37:31.830
Someone who hasn't
spoken today ideally.

00:37:38.760 --> 00:37:40.872
So any of you remember
from, let's say,

00:37:40.872 --> 00:37:42.330
perhaps an earlier
machine learning

00:37:42.330 --> 00:37:45.420
class what types
of distributions

00:37:45.420 --> 00:37:47.160
a complete graph--

00:37:47.160 --> 00:37:49.658
a complete Bayesian
network could represent?

00:37:56.990 --> 00:38:01.140
So the answer is
all distributions,

00:38:01.140 --> 00:38:03.300
because it corresponds
to any factorization

00:38:03.300 --> 00:38:05.220
of the joint distribution.

00:38:05.220 --> 00:38:08.670
And so if you allowed
these x variables

00:38:08.670 --> 00:38:11.020
to be fully connected to
each other-- so for example,

00:38:11.020 --> 00:38:13.620
saying that depression depends
on diabetes in addition

00:38:13.620 --> 00:38:16.050
to the stage, then in
fact, you don't even

00:38:16.050 --> 00:38:17.580
need this stage
variable in here.

00:38:17.580 --> 00:38:20.340
The marginal-- you can
fit any distribution

00:38:20.340 --> 00:38:23.460
on these X variables even
without the S variable at all.

00:38:23.460 --> 00:38:26.160
And so the model could
learn to simply ignore

00:38:26.160 --> 00:38:30.240
the S variable, which would
be exactly not our goal,

00:38:30.240 --> 00:38:34.050
because our goal is to learn
something about the disease

00:38:34.050 --> 00:38:37.690
stage, and in fact,
we're going to be wanting

00:38:37.690 --> 00:38:41.430
to make assumptions on
the progression of disease

00:38:41.430 --> 00:38:44.990
stage, which is going
to help us learn.

00:38:44.990 --> 00:38:48.650
So by assuming conditional
independence between these X

00:38:48.650 --> 00:38:51.770
variables, it's going to
force all of the correlations

00:38:51.770 --> 00:38:54.230
to have to be mediated
by that S variable,

00:38:54.230 --> 00:38:56.805
and it's going to remove some
of that unidentifiability that

00:38:56.805 --> 00:38:59.840
would otherwise exist.

00:38:59.840 --> 00:39:01.760
It's this subtle but
very important point.
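
As a sketch of the factorization this buys us (writing S_t for the stage at time t, X_{t,i} for the i-th comorbidity, O_t for the observed diagnosis codes, and taking X_{0,i} = 0; the exact parameterization of each factor comes later in the lecture), the conditional-independence assumption corresponds roughly to:

```latex
p(S_{1:T}, X_{1:T}, O_{1:T})
  = p(S_1) \prod_{t=2}^{T} p(S_t \mid S_{t-1})
    \prod_{t=1}^{T} \prod_{i} p(X_{t,i} \mid S_t, X_{t-1,i})
    \prod_{t=1}^{T} p(O_t \mid X_t)
```

So any correlation between two comorbidities at the same time has to be mediated by S_t. With a complete graph over the X's, extra factors of the form p(X_{t,i} | S_t, X_{t,1}, ..., X_{t,i-1}) could absorb all of that correlation on their own, and S_t could be ignored.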

00:39:05.520 --> 00:39:07.580
So the way that we're
going to parametrize

00:39:07.580 --> 00:39:09.570
the conditional distribution--
so first of all,

00:39:09.570 --> 00:39:12.470
I'm going to assume
these X's are all binary.

00:39:12.470 --> 00:39:14.420
So either the
patient has diabetes

00:39:14.420 --> 00:39:16.900
or they don't have diabetes.

00:39:16.900 --> 00:39:18.490
I'm going to suppose that--

00:39:18.490 --> 00:39:21.520
and this is, again, another
assumption we're making,

00:39:21.520 --> 00:39:27.430
I'm going to suppose that once
you already have a comorbidity,

00:39:27.430 --> 00:39:28.430
then you always have it.

00:39:28.430 --> 00:39:31.632
So for example, once this is
1, then all subsequent ones

00:39:31.632 --> 00:39:32.590
are also going to be 1.

00:39:35.190 --> 00:39:37.710
Hold the questions
for just a second.

00:39:37.710 --> 00:39:39.390
I'm also going to
make an assumption

00:39:39.390 --> 00:39:42.420
that later stages of the disease
are more likely to develop

00:39:42.420 --> 00:39:43.680
the comorbidity.

00:39:43.680 --> 00:39:46.860
So in particular, one can
formalize that mathematically

00:39:46.860 --> 00:39:53.775
as probability of X--

00:39:56.360 --> 00:40:04.155
I'll just say Xt
being 1 given S--

00:40:04.155 --> 00:40:16.190
I'll say St equals little
s, comma, Xt minus 1 equals

00:40:16.190 --> 00:40:22.010
0, and suppose
that this is larger

00:40:22.010 --> 00:40:29.420
than or equal to probability
of Xt equals 1 given St equals

00:40:29.420 --> 00:40:45.200
S prime and Xt minus 1 equals
0 for all S prime less than S,

00:40:45.200 --> 00:40:46.190
OK?

00:40:46.190 --> 00:40:48.580
So I'm saying, as you get
further along in the disease

00:40:48.580 --> 00:40:52.450
stage, you're more
likely to observe

00:40:52.450 --> 00:40:54.280
one of these complications.

00:40:56.828 --> 00:40:58.620
And again, this is an
assumption that we're

00:40:58.620 --> 00:41:00.780
putting into the
learning algorithm,

00:41:00.780 --> 00:41:04.525
but what we found is that
these types of assumptions

00:41:04.525 --> 00:41:06.900
are really critical in order
to learn disease progression

00:41:06.900 --> 00:41:11.270
models when you don't have
a large amount of data.

00:41:11.270 --> 00:41:16.350
And note that this is
just a linear inequality

00:41:16.350 --> 00:41:19.300
on the parameters of the model.

00:41:19.300 --> 00:41:24.690
And so one can use a convex
optimization algorithm

00:41:24.690 --> 00:41:27.525
during learning--

00:41:27.525 --> 00:41:29.400
during the maximum
likelihood estimation step

00:41:29.400 --> 00:41:32.610
with this algorithm, we
just put a linear inequality

00:41:32.610 --> 00:41:34.500
into the convex
optimization problem

00:41:34.500 --> 00:41:37.170
to enforce this constraint.
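
As a minimal sketch of what that can look like (the counts here are hypothetical expected counts from an E step, and cvxpy is just one convex-optimization package one could use; this is not the paper's actual implementation):

```python
# Maximum likelihood for p[s] = P(X_t = 1 | S_t = s, X_{t-1} = 0), subject to the
# linear constraint that the probability is non-decreasing in the disease stage.
import numpy as np
import cvxpy as cp

# Hypothetical expected counts from the E step, for one comorbidity across 6 stages:
n1 = np.array([2.0, 5.0, 9.0, 20.0, 35.0, 50.0])     # times it turned on in stage s
n0 = np.array([98.0, 95.0, 91.0, 80.0, 65.0, 50.0])  # times it stayed off in stage s

p = cp.Variable(len(n1))
log_lik = cp.sum(cp.multiply(n1, cp.log(p)) + cp.multiply(n0, cp.log(1 - p)))
problem = cp.Problem(cp.Maximize(log_lik), [p[:-1] <= p[1:]])  # monotone in stage
problem.solve()
print("monotone estimates:", np.round(p.value, 3))
```

The constraint p[:-1] <= p[1:] is exactly the linear inequality above, so the objective stays concave and the estimation step remains a tractable convex problem.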

00:41:37.170 --> 00:41:38.640
There are a couple of questions.

00:41:38.640 --> 00:41:40.890
AUDIENCE: Is there generally
like a quick way to check

00:41:40.890 --> 00:41:43.965
whether a model is
unidentifiable or--

00:41:48.510 --> 00:41:51.480
PROFESSOR: So there
are ways to try

00:41:51.480 --> 00:41:54.960
to detect to see if a
model is unidentifiable.

00:41:54.960 --> 00:41:56.440
It's beyond the
scope of the class,

00:41:56.440 --> 00:41:58.980
but I'll just briefly mention
one of the techniques.

00:42:04.630 --> 00:42:08.530
So one could-- so you can ask
the identifiability question

00:42:08.530 --> 00:42:10.570
by looking at moments
of the distribution.

00:42:10.570 --> 00:42:12.430
For example, you
could talk about it

00:42:12.430 --> 00:42:14.320
as a function of
all of the observed

00:42:14.320 --> 00:42:17.227
moments of the distribution
that you get from the data.

00:42:17.227 --> 00:42:19.810
Now the observed data here are
not the S's and X's, but rather

00:42:19.810 --> 00:42:20.728
the O's.

00:42:20.728 --> 00:42:22.270
So you look at the
joint distribution

00:42:22.270 --> 00:42:26.860
on the O's, and then you
can ask questions about--

00:42:26.860 --> 00:42:27.820
if I was to--

00:42:27.820 --> 00:42:30.280
so suppose I was to choose
a random set of parameters

00:42:30.280 --> 00:42:33.040
in the model, is
there any way to do

00:42:33.040 --> 00:42:35.440
a perturbation of the
parameters in the model which

00:42:35.440 --> 00:42:39.220
leave the observed marginal
distribution on the O's

00:42:39.220 --> 00:42:41.290
identical?

00:42:41.290 --> 00:42:44.830
And often when you're in the
setting of non-identifiability,

00:42:44.830 --> 00:42:50.712
you can take the gradient
of a function and see--

00:42:50.712 --> 00:42:53.170
and you could sort of find that
there is some wiggle space,

00:42:53.170 --> 00:42:56.590
and then you show that
OK, there are actually--

00:42:56.590 --> 00:42:59.530
this objective function is
actually unidentifiable.

00:42:59.530 --> 00:43:01.690
Now that type of
technique is widely

00:43:01.690 --> 00:43:03.940
used when studying what are
known as method of moments

00:43:03.940 --> 00:43:05.730
algorithms or
estimation algorithms

00:43:05.730 --> 00:43:08.650
in learning latent variable models,
but they would be much, much

00:43:08.650 --> 00:43:13.892
harder to apply in this type of
setting because first of all,

00:43:13.892 --> 00:43:15.350
these are much more
complex models,

00:43:15.350 --> 00:43:17.330
and estimating the
corresponding moments

00:43:17.330 --> 00:43:21.350
is going to be very hard because
they're very high dimensional.

00:43:21.350 --> 00:43:25.088
And second, because they're--

00:43:25.088 --> 00:43:27.130
I'm actually conflating
two different things when

00:43:27.130 --> 00:43:28.760
I talk about identifiability.

00:43:28.760 --> 00:43:33.110
One statement is the infinite
data identifiability,

00:43:33.110 --> 00:43:37.670
and the second question
is your ability

00:43:37.670 --> 00:43:42.230
to actually learn a good model
from a small amount of data,

00:43:42.230 --> 00:43:44.660
which is a sample complexity.

00:43:44.660 --> 00:43:48.440
And these constraints
that I'm putting in,

00:43:48.440 --> 00:43:51.470
even if they don't affect
the actual identifiability

00:43:51.470 --> 00:43:53.415
of the model, they
could be extremely

00:43:53.415 --> 00:43:55.790
important for improving the
sample complexity of the learning

00:43:55.790 --> 00:43:57.637
algorithm.

00:43:57.637 --> 00:43:58.720
Is there another question?

00:44:03.715 --> 00:44:05.090
So we evaluated
this using a data

00:44:05.090 --> 00:44:10.050
set of almost 4,000
patients where,

00:44:10.050 --> 00:44:16.310
again, each patient we observed
for only a couple of years--

00:44:16.310 --> 00:44:18.310
one to three years.

00:44:18.310 --> 00:44:20.370
And the observations
that we observed

00:44:20.370 --> 00:44:23.220
were 264 diagnosis
codes, the presence

00:44:23.220 --> 00:44:25.950
or absence of each of
those diagnosis codes

00:44:25.950 --> 00:44:28.080
during any three-month interval.

00:44:28.080 --> 00:44:32.760
Overall, there were almost
200,000 observations

00:44:32.760 --> 00:44:35.172
of diagnosis codes
in this data set.

00:44:35.172 --> 00:44:36.630
The learning
algorithm that we used

00:44:36.630 --> 00:44:38.650
was expectation maximization.

00:44:38.650 --> 00:44:40.900
Remember, there are a number
of hidden variables here,

00:44:40.900 --> 00:44:42.330
and so if you want to
maximize the likely--

00:44:42.330 --> 00:44:44.372
if you want to learn the
parameters that maximize

00:44:44.372 --> 00:44:46.230
the likelihood of
those observations O,

00:44:46.230 --> 00:44:48.522
then you have to marginalize
over those hidden variables,

00:44:48.522 --> 00:44:53.010
and EM is one way to try to
find a local optimum of that

00:44:53.010 --> 00:44:59.500
likelihood function, with the
key caveat that one has to do

00:44:59.500 --> 00:45:02.620
approximate inference
during the E step here,

00:45:02.620 --> 00:45:05.500
because this model
is not tractable,

00:45:05.500 --> 00:45:07.330
there's no closed form--

00:45:07.330 --> 00:45:10.090
for example, dynamic programming
algorithm for doing posterior

00:45:10.090 --> 00:45:13.420
inference in this model
given its complexity.

00:45:13.420 --> 00:45:15.520
And so what we used
was a Gibbs sampler

00:45:15.520 --> 00:45:17.690
to do approximate inference
within that E step,

00:45:17.690 --> 00:45:19.090
and we used--

00:45:19.090 --> 00:45:21.460
we did block sampling
of the Markov chains

00:45:21.460 --> 00:45:23.590
where we combined
a Gibbs sampler

00:45:23.590 --> 00:45:25.420
with a dynamic programming
algorithm, which

00:45:25.420 --> 00:45:27.400
improved the mixing
rate of the Markov chain

00:45:27.400 --> 00:45:30.460
for those of you who are
familiar with those concepts.
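
To make the overall loop concrete, here is a toy sketch of Monte Carlo EM on a two-component Bernoulli mixture standing in for the much richer COPD model: the E step draws samples of the hidden stage given the observations (in the real model this is where the block Gibbs sampler over S and X goes), and the M step re-estimates parameters from those sampled completions. All names and numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N patients, D binary observations (e.g. presence of diagnosis codes),
# generated from K = 2 hidden "stages".
N, D, K = 500, 10, 2
true_pi = np.array([0.6, 0.4])
true_theta = np.stack([np.full(D, 0.1), np.full(D, 0.7)])   # P(O_d = 1 | S = k)
S_true = rng.choice(K, size=N, p=true_pi)
O = (rng.random((N, D)) < true_theta[S_true]).astype(float)

# Random initialisation of the parameters we want to learn.
pi = np.full(K, 1.0 / K)
theta = rng.uniform(0.3, 0.7, size=(K, D))

for _ in range(50):
    # E step (sampling-based): draw each hidden S_i from P(S_i | O_i, params).
    log_post = np.log(pi) + O @ np.log(theta).T + (1 - O) @ np.log(1 - theta).T
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    S_samp = np.array([rng.choice(K, p=post[i]) for i in range(N)])

    # M step: maximise the complete-data log-likelihood of the sampled completions.
    for k in range(K):
        mask = S_samp == k
        pi[k] = mask.mean()
        theta[k] = (O[mask].sum(axis=0) + 1.0) / (mask.sum() + 2.0)  # smoothed

print("estimated mixing weights:", np.round(pi, 2))
print("estimated per-stage rates:", np.round(theta.mean(axis=1), 2))
```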

00:45:30.460 --> 00:45:34.520
And in the M step of the
learning algorithm, when

00:45:34.520 --> 00:45:38.150
one has to learn the
parameters of the distribution,

00:45:38.150 --> 00:45:40.070
the only complex
part of this model

00:45:40.070 --> 00:45:42.440
is the continuous
time Markov process,

00:45:42.440 --> 00:45:44.330
and there's actually
been previous literature

00:45:44.330 --> 00:45:47.540
from the physics community
which shows how you can really--

00:45:47.540 --> 00:45:49.610
which gives you analytic
closed-form solutions

00:45:49.610 --> 00:45:53.480
for that M step of that
continuous time Markov process.

00:45:53.480 --> 00:45:55.790
Now if I were to do
this again today,

00:45:55.790 --> 00:45:58.530
I would have done it a
little bit differently.

00:45:58.530 --> 00:46:00.680
I would still think about
modeling this problem

00:46:00.680 --> 00:46:03.230
in a very similar
way, but I would

00:46:03.230 --> 00:46:06.530
do learning using a
variational lower bound

00:46:06.530 --> 00:46:09.350
of the likelihood with
a recognition network

00:46:09.350 --> 00:46:12.590
in order to very quickly
get you a lower bound

00:46:12.590 --> 00:46:14.330
in the likelihood.

00:46:14.330 --> 00:46:16.580
And for those of
you who are familiar

00:46:16.580 --> 00:46:18.200
with variational
autoencoders, that's

00:46:18.200 --> 00:46:20.167
precisely the idea
that is used there

00:46:20.167 --> 00:46:21.750
for learning variational
autoencoders.

00:46:21.750 --> 00:46:22.880
So that's the way I
would approach this

00:46:22.880 --> 00:46:23.880
if I was to do it again.
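
In symbols, the idea is to train a recognition network q_phi(S, X | O) jointly with the model p_theta and to maximize the standard variational lower bound (the notation here is generic, not specific to any one implementation):

```latex
\log p_\theta(O) \;\ge\;
  \mathbb{E}_{q_\phi(S, X \mid O)}\!\left[
    \log p_\theta(O, X, S) - \log q_\phi(S, X \mid O)
  \right]
```

So a single forward pass through the recognition network replaces the expensive per-patient Gibbs sampling in the E step.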

00:46:26.230 --> 00:46:29.730
There's just one or two other
extensions I want to mention.

00:46:29.730 --> 00:46:32.500
The first one is
something which we--

00:46:32.500 --> 00:46:35.570
one more customization
we made for COPD,

00:46:35.570 --> 00:46:38.900
which is that we enforced
monotonic stage progression.

00:46:38.900 --> 00:46:41.930
So we said that--

00:46:41.930 --> 00:46:45.195
so here I talked about
a type of monotonicity

00:46:45.195 --> 00:46:47.070
in terms of the conditional
distribution of X

00:46:47.070 --> 00:46:54.540
given S, but one could also
put an assumption in the--

00:46:54.540 --> 00:46:58.020
I already talked about
that, but one could also

00:46:58.020 --> 00:47:01.005
put an assumption on P of S--

00:47:04.520 --> 00:47:07.890
S of t given S of t minus
1, which is implicitly

00:47:07.890 --> 00:47:09.870
an assumption on
Q, and I gave you

00:47:09.870 --> 00:47:12.333
a hint of how one might
do that over here when

00:47:12.333 --> 00:47:15.000
I said that you might put 0's to
the left-hand side, meaning you

00:47:15.000 --> 00:47:17.017
can never go to the left.

00:47:17.017 --> 00:47:18.600
And indeed, we did
something like that

00:47:18.600 --> 00:47:20.903
here as well, which is
another type of constraint.

00:47:24.220 --> 00:47:27.130
And finally, we regularized
the learning problem

00:47:27.130 --> 00:47:30.130
by asking that
the graph involving

00:47:30.130 --> 00:47:32.890
the conditions,
the comorbidities,

00:47:32.890 --> 00:47:35.440
and the diagnosis
codes be sparse,

00:47:35.440 --> 00:47:40.090
by putting a beta prior
on those edge weights.
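
Concretely, if w denotes one of those edge weights (a probability in [0, 1]), a Beta(a, b) prior contributes a term of the following form to the log-posterior being maximized; choosing a < 1 (an illustrative choice, not necessarily the paper's hyperparameters) concentrates mass near zero and so favors sparse graphs:

```latex
\log p(w) \;=\; (a - 1)\log w + (b - 1)\log(1 - w) + \text{const}
```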

00:47:40.090 --> 00:47:42.340
So here's what one learned.

00:47:42.340 --> 00:47:43.900
So the first thing
I'm going to do

00:47:43.900 --> 00:47:46.570
is I'm going to show you the--

00:47:46.570 --> 00:47:48.640
we talked about how
we specified anchors,

00:47:48.640 --> 00:47:51.550
but I told you that the anchors
weren't the whole story.

00:47:51.550 --> 00:47:54.340
That we were able to infer
much more interesting things

00:47:54.340 --> 00:47:57.130
about the hidden variables
given all of the observations

00:47:57.130 --> 00:47:58.420
we have.

00:47:58.420 --> 00:48:01.630
So here I'm showing you
several of the phenotypes that

00:48:01.630 --> 00:48:03.400
were learned by this
unsupervised learning

00:48:03.400 --> 00:48:04.150
algorithm.

00:48:04.150 --> 00:48:06.480
First, the phenotype
for kidney disease.

00:48:06.480 --> 00:48:10.660
In red here, I'm showing
you the anchor variables

00:48:10.660 --> 00:48:15.010
that we chose for kidney
disease, and what you'll notice

00:48:15.010 --> 00:48:16.250
are a couple of things.

00:48:16.250 --> 00:48:18.880
First, the weight, which
you should think about

00:48:18.880 --> 00:48:22.270
as being proportional
in some way

00:48:22.270 --> 00:48:25.360
to how often you would see
that diagnosis code given

00:48:25.360 --> 00:48:27.550
that the patient
had kidney disease,

00:48:27.550 --> 00:48:30.730
the weights are all far
less than one, all right?

00:48:30.730 --> 00:48:34.690
So there is some noise in this
process of when you observe

00:48:34.690 --> 00:48:37.340
a diagnosis code for a patient.

00:48:37.340 --> 00:48:39.090
The second thing you
observe is that there

00:48:39.090 --> 00:48:46.080
are a number of other diagnosis
codes that are observed to be--

00:48:46.080 --> 00:48:52.050
which are explained by this
kidney disease comorbidity,

00:48:52.050 --> 00:48:57.980
such as anemia, urinary
tract infections, and so on,

00:48:57.980 --> 00:49:01.220
and that aligns well with what's
known in the medical literature

00:49:01.220 --> 00:49:03.470
about kidney disease.

00:49:03.470 --> 00:49:06.050
Look at another example
for lung cancer.

00:49:06.050 --> 00:49:08.240
In red here I'm
showing you the anchors

00:49:08.240 --> 00:49:10.250
that we had pre-specified
for these, which

00:49:10.250 --> 00:49:14.030
mean that these diagnosis
codes could only be explained

00:49:14.030 --> 00:49:19.970
by the lung cancer
comorbidity, and these

00:49:19.970 --> 00:49:22.550
are the noise rates that
are learned for them,

00:49:22.550 --> 00:49:24.085
and that's everything else.

00:49:24.085 --> 00:49:26.660
And here's one more
example of lung infection

00:49:26.660 --> 00:49:29.390
where there was only a single
anchor that we specified

00:49:29.390 --> 00:49:32.900
for pneumonia, and you see
all of the other things that

00:49:32.900 --> 00:49:34.640
are automatically
associated with that

00:49:34.640 --> 00:49:36.914
by the unsupervised
learning algorithm.

00:49:36.914 --> 00:49:37.808
Yep?

00:49:37.808 --> 00:49:41.180
AUDIENCE: So how
do you [INAUDIBLE]

00:49:41.180 --> 00:49:42.282
for the comorbidity, right?

00:49:42.282 --> 00:49:42.990
PROFESSOR: Right.

00:49:42.990 --> 00:49:45.490
So that's what the unsupervised
learning algorithm is doing.

00:49:45.490 --> 00:49:47.850
So these weights are
learned, and I'm showing you

00:49:47.850 --> 00:49:51.180
something like a point
estimate of the parameters

00:49:51.180 --> 00:49:53.460
that are learned by
the learning algorithm.

00:49:53.460 --> 00:49:54.630
AUDIENCE: And so we--

00:49:54.630 --> 00:49:56.922
PROFESSOR: Just like if you're
learning a Markov model,

00:49:56.922 --> 00:50:00.190
you learned some transition and
[INAUDIBLE], same thing here.

00:50:00.190 --> 00:50:00.690
All right.

00:50:00.690 --> 00:50:02.190
And this should
look a lot like what

00:50:02.190 --> 00:50:04.770
you would see when you do topic
modeling on a text corpus,

00:50:04.770 --> 00:50:05.270
right?

00:50:05.270 --> 00:50:08.880
You would discover a topic--
this is analogous to a topic.

00:50:08.880 --> 00:50:11.390
It's a discrete topic,
meaning it either occurs

00:50:11.390 --> 00:50:13.170
or it doesn't occur
for a patient.

00:50:13.170 --> 00:50:15.838
And you would discover some
word topic distribution.

00:50:15.838 --> 00:50:17.880
This is analogous to that
word topic distribution

00:50:17.880 --> 00:50:20.430
for a topic in a topic model.

00:50:24.090 --> 00:50:25.850
So one could then use
the model to answer

00:50:25.850 --> 00:50:28.280
a couple of the original
questions we set out to solve.

00:50:28.280 --> 00:50:32.750
The first one is given
a patient's data,

00:50:32.750 --> 00:50:35.880
which I'm illustrating
here on the bottom,

00:50:35.880 --> 00:50:37.910
I have artificially
separated out

00:50:37.910 --> 00:50:41.150
into three different
comorbidities,

00:50:41.150 --> 00:50:43.280
and a star denotes
an observation

00:50:43.280 --> 00:50:45.780
of that data type.

00:50:45.780 --> 00:50:47.910
But this was
artificially done by us,

00:50:47.910 --> 00:50:50.180
it was not given to
the learning algorithm.

00:50:50.180 --> 00:50:54.890
One can infer, when
the patient initiated--

00:50:54.890 --> 00:50:58.610
started with each one of these
comorbidities, and also, when--

00:50:58.610 --> 00:51:00.585
so for the full
three-year time range

00:51:00.585 --> 00:51:02.210
that we have data
for the patient, what

00:51:02.210 --> 00:51:04.670
stage of the disease was the
patient in at any one

00:51:04.670 --> 00:51:05.340
point in time?

00:51:05.340 --> 00:51:08.450
So this model infers that the
patient starts out in stage 1,

00:51:08.450 --> 00:51:12.230
and about half a year
through the data collection

00:51:12.230 --> 00:51:15.290
process, transitioned
into stage 2 of COPD.

00:51:18.400 --> 00:51:20.860
Another thing that one
could do using this model

00:51:20.860 --> 00:51:25.240
is to simulate from the model
and answer the question of what

00:51:25.240 --> 00:51:27.700
would, let's say, a 20-year
trajectory of the disease look

00:51:27.700 --> 00:51:28.986
like?

00:51:28.986 --> 00:51:31.070
And here, I'm showing
a 10-year trajectory.

00:51:31.070 --> 00:51:33.280
And again, only one
to three years of data

00:51:33.280 --> 00:51:36.700
was used for any one
patient during learning.

00:51:36.700 --> 00:51:41.670
So this is the first time
we see those axes,

00:51:41.670 --> 00:51:43.620
those comorbidities
really start to show up

00:51:43.620 --> 00:51:46.440
as being important as a way
of reading out from the model.

00:51:46.440 --> 00:51:49.690
Here, we've thrown away those
O's, those diagnosis codes

00:51:49.690 --> 00:51:53.357
altogether, we only care about
what we conjecture is truly

00:51:53.357 --> 00:51:55.440
happening to the patient,
those X variables, which

00:51:55.440 --> 00:51:58.300
are unobserved during training.

00:51:58.300 --> 00:52:04.130
So what we conjecture is that
kidney disease is very uncommon

00:52:04.130 --> 00:52:08.810
in stage 1 of the disease,
and increases slowly as you

00:52:08.810 --> 00:52:13.160
transition from stage 2, stage
3, to stage 4 of the disease,

00:52:13.160 --> 00:52:15.645
and then really bumps
up towards stage 5

00:52:15.645 --> 00:52:16.770
and stage 6 of the disease.

00:52:16.770 --> 00:52:18.320
So you should read
this as saying

00:52:18.320 --> 00:52:22.280
that in stage 6 of
the disease, over 60%

00:52:22.280 --> 00:52:24.650
of people have kidney disease.

00:52:24.650 --> 00:52:27.050
Now the time interval is here.

00:52:27.050 --> 00:52:30.860
So how I've chosen these--

00:52:30.860 --> 00:52:35.420
where to put these
triangles, I've

00:52:35.420 --> 00:52:39.050
chosen them based on the
average amount of time

00:52:39.050 --> 00:52:42.560
it takes to transition from
one stage to the next stage

00:52:42.560 --> 00:52:45.490
according to the learned
parameters of the model.

00:52:45.490 --> 00:52:48.160
And so you see that
stages 1, 2, 3,

00:52:48.160 --> 00:52:50.890
and 4 take a long period of--

00:52:50.890 --> 00:52:54.440
amount of time to transition
between those four stages,

00:52:54.440 --> 00:52:56.280
and then there's a
very small amount

00:52:56.280 --> 00:52:58.030
of time between
transitioning from stage 5

00:52:58.030 --> 00:53:01.162
to stage 6 on average.

00:53:01.162 --> 00:53:02.370
So that's for kidney disease.

00:53:02.370 --> 00:53:06.400
One could also read this
out for other comorbidities.

00:53:06.400 --> 00:53:09.120
So in orange here--
in yellow here

00:53:09.120 --> 00:53:11.910
is diabetes, in black here is
musculoskeletal conditions,

00:53:11.910 --> 00:53:15.300
and in red here is
cardiovascular disease.

00:53:15.300 --> 00:53:18.530
And so one of the
interesting inferences

00:53:18.530 --> 00:53:20.030
made by this
learning algorithm is

00:53:20.030 --> 00:53:25.700
that even in stage 1 of COPD,
very early in the trajectory,

00:53:25.700 --> 00:53:27.800
we are seeing patients
with large amounts

00:53:27.800 --> 00:53:29.318
of cardiovascular disease.

00:53:29.318 --> 00:53:30.860
And again, this is
something that one

00:53:30.860 --> 00:53:32.277
can look at the
medical literature

00:53:32.277 --> 00:53:35.220
to see does it align
with what we expect?

00:53:35.220 --> 00:53:39.740
And it does, so even in patients
with mild to moderate COPD,

00:53:39.740 --> 00:53:43.310
the leading cause of morbidity
is cardiovascular disease.

00:53:43.310 --> 00:53:45.770
Again, this is
just a sanity check

00:53:45.770 --> 00:53:49.610
that what this model is learning
for a common disease actually

00:53:49.610 --> 00:53:54.020
aligns with the
medical knowledge.

00:53:54.020 --> 00:53:58.970
So that's all I want to say
about this probabilistic model

00:53:58.970 --> 00:54:00.680
approach to disease
progression modeling

00:54:00.680 --> 00:54:02.060
from cross-sectional data.

00:54:02.060 --> 00:54:03.560
I want you to hold
your questions

00:54:03.560 --> 00:54:05.435
so I can get through
the rest of the material

00:54:05.435 --> 00:54:07.760
and you can ask me after class.

00:54:07.760 --> 00:54:10.740
So next I want to talk about
these pseudo-time methods,

00:54:10.740 --> 00:54:13.500
which are a very different
approach for trying

00:54:13.500 --> 00:54:17.300
to align patients into
early-to-late disease stage.

00:54:17.300 --> 00:54:19.700
These approaches were really
popularized in the last five

00:54:19.700 --> 00:54:25.340
years due to the explosion
in single-cell sequencing

00:54:25.340 --> 00:54:28.660
experiments in the
biology community.

00:54:28.660 --> 00:54:31.900
Single-cell sequencing is
a way to really understand

00:54:31.900 --> 00:54:36.010
not just what is the
average gene expression,

00:54:36.010 --> 00:54:39.520
but on a cell-by-cell
basis can we

00:54:39.520 --> 00:54:43.270
understand what is
expressed in each cell.

00:54:43.270 --> 00:54:45.690
So at a very high level,
the way this works is

00:54:45.690 --> 00:54:51.270
you take a solid tissue, you
do a number of procedures

00:54:51.270 --> 00:54:54.090
in order to isolate
out individual cells

00:54:54.090 --> 00:54:56.670
from that tissue,
then you're going

00:54:56.670 --> 00:55:00.960
to extract the RNA from
those individual cells,

00:55:00.960 --> 00:55:03.360
you go through another
complex procedure

00:55:03.360 --> 00:55:07.350
which somehow barcodes
each of the RNA

00:55:07.350 --> 00:55:10.920
from each individual cell,
mixes them all together,

00:55:10.920 --> 00:55:14.070
does sequencing of it,
and then deconvolves

00:55:14.070 --> 00:55:17.610
it so that you can see
what was the original RNA

00:55:17.610 --> 00:55:20.355
expression for each of
the individual cells.

00:55:23.442 --> 00:55:28.450
Now the goal of these
pseudo-time algorithms

00:55:28.450 --> 00:55:31.420
is to take that type
of data and then

00:55:31.420 --> 00:55:37.510
to attempt to align
cells to some trajectory.

00:55:37.510 --> 00:55:40.780
So if you look at the very
top of this figure-- part (a)--

00:55:40.780 --> 00:55:45.100
that's the picture that
you should have in your mind.

00:55:45.100 --> 00:55:50.180
In the real world, cells
are evolving with time--

00:55:50.180 --> 00:56:00.940
for example, B cells will have
a well-characterized evolution

00:56:00.940 --> 00:56:04.300
between different
cellular states,

00:56:04.300 --> 00:56:07.010
and what we'd like
to understand,

00:56:07.010 --> 00:56:09.342
given that you have
cross-sectional data-- so you

00:56:09.342 --> 00:56:11.800
can imagine-- imagine you have
a whole collection of cells,

00:56:11.800 --> 00:56:14.080
each one in a different
part, a different stage

00:56:14.080 --> 00:56:18.350
of differentiation,
could you somehow

00:56:18.350 --> 00:56:21.140
order them into where they
were in different stages

00:56:21.140 --> 00:56:22.448
of differentiation?

00:56:22.448 --> 00:56:23.240
So that's the goal.

00:56:23.240 --> 00:56:25.040
We want to take this--

00:56:25.040 --> 00:56:27.350
so there exists some
true ordering that I'm

00:56:27.350 --> 00:56:29.960
showing from dark to light.

00:56:29.960 --> 00:56:32.343
The capture process
is going to ignore

00:56:32.343 --> 00:56:34.760
what the ordering information
was, because all we're doing

00:56:34.760 --> 00:56:37.790
is getting a collection of cells
that are in different stages.

00:56:37.790 --> 00:56:40.910
And then we're going to
use this pseudo-time method

00:56:40.910 --> 00:56:44.720
to try to re-sort them so
that you could figure out,

00:56:44.720 --> 00:56:46.670
oh, these were the
cells in the early stage

00:56:46.670 --> 00:56:48.740
and these are the cells
in the late stage.

00:56:48.740 --> 00:56:50.730
And of course,
there's an analogy

00:56:50.730 --> 00:56:52.730
here to the pictures I
showed you in the earlier

00:56:52.730 --> 00:56:54.520
part of the lecture.

00:56:54.520 --> 00:56:58.270
Once you have this alignment
of cells into stages,

00:56:58.270 --> 00:57:00.880
then you can answer some really
exciting scientific questions.

00:57:00.880 --> 00:57:05.380
For example, you could ask,
of a variety of different genes,

00:57:05.380 --> 00:57:08.650
which genes are expressed
at which point in time.

00:57:08.650 --> 00:57:11.050
So you might see that
gene a is very highly

00:57:11.050 --> 00:57:15.700
expressed very early in
this cell's differentiation

00:57:15.700 --> 00:57:18.875
and is not expressed very
much towards the end,

00:57:18.875 --> 00:57:20.875
and that might give you
new biological insights.

00:57:23.570 --> 00:57:26.120
So these methods
could immediately

00:57:26.120 --> 00:57:28.850
be applied, I believe, to
disease progression modeling

00:57:28.850 --> 00:57:32.870
where I want you to think about
each cell as now a patient,

00:57:32.870 --> 00:57:38.360
and that patient has a number
of observations for this data.

00:57:38.360 --> 00:57:41.660
The observations are the gene
expression for that cell,

00:57:41.660 --> 00:57:43.580
but in our data,
the observations

00:57:43.580 --> 00:57:46.640
might be symptoms that we
observe for the patient,

00:57:46.640 --> 00:57:48.618
for example.

00:57:48.618 --> 00:57:50.660
And then the goal is,
given those cross-sectional

00:57:50.660 --> 00:57:54.050
observations, to sort them.

00:57:54.050 --> 00:57:56.060
And once you have
that sorting, then you

00:57:56.060 --> 00:57:58.400
could answer
scientific questions,

00:57:58.400 --> 00:58:01.820
such as, I mentioned, of a
number of different genes,

00:58:01.820 --> 00:58:04.430
which genes are expressed when.

00:58:04.430 --> 00:58:08.060
So here, I'm showing you
sort of the density of when

00:58:08.060 --> 00:58:09.860
this particular
gene is expressed

00:58:09.860 --> 00:58:12.610
as a function of pseudo-time.

00:58:12.610 --> 00:58:14.517
Analogously for disease
progression modeling,

00:58:14.517 --> 00:58:16.350
you should think of
that as being a symptom.

00:58:16.350 --> 00:58:19.340
You could ask, OK, suppose
there are some true progression

00:58:19.340 --> 00:58:25.580
of the disease, when do patients
typically develop diabetes

00:58:25.580 --> 00:58:27.360
or cardiovascular symptoms?

00:58:27.360 --> 00:58:29.825
And so for cardiovascular,
going back to the COPD example,

00:58:29.825 --> 00:58:32.450
you might imagine that there's
a peak very early in the disease

00:58:32.450 --> 00:58:36.510
stage; for diabetes, it might
be in a later disease stage.

00:58:36.510 --> 00:58:37.210
All right?

00:58:37.210 --> 00:58:39.220
So that-- is the analogy clear?

00:58:42.060 --> 00:58:47.970
So this community, which
has been developing methods

00:58:47.970 --> 00:58:51.090
for studying single-cell
gene expression data,

00:58:51.090 --> 00:58:55.770
has just exploded in
the last 10 years.

00:58:55.770 --> 00:58:58.140
So I lost count of how many
different methods there

00:58:58.140 --> 00:59:03.420
are, but if I had to guess, I'd
say 50 to 200 different methods

00:59:03.420 --> 00:59:06.047
for this problem.

00:59:06.047 --> 00:59:08.380
There was a paper, which is
one of the optional readings

00:59:08.380 --> 00:59:11.980
for today's lecture, that just
came out earlier this month

00:59:11.980 --> 00:59:14.860
which looks at a comparison
of these different trajectory

00:59:14.860 --> 00:59:18.220
inference methods,
and this picture

00:59:18.220 --> 00:59:19.943
gives a really
interesting illustration

00:59:19.943 --> 00:59:21.610
of what are some of
the assumptions made

00:59:21.610 --> 00:59:23.030
by these algorithms.

00:59:23.030 --> 00:59:25.990
So for example,
the first question,

00:59:25.990 --> 00:59:28.660
when you try to figure out which
method of these tons of methods

00:59:28.660 --> 00:59:31.450
to use is, do you expect
multiple disconnected

00:59:31.450 --> 00:59:33.808
trajectories?

00:59:33.808 --> 00:59:35.600
What might be a reason
why you would expect

00:59:35.600 --> 00:59:38.315
multiple disconnected
trajectories for disease

00:59:38.315 --> 00:59:39.190
progression modeling?

00:59:41.890 --> 00:59:43.160
TAs should not answer.

00:59:46.770 --> 00:59:48.695
AUDIENCE: [INAUDIBLE]

00:59:48.695 --> 00:59:53.120
PROFESSOR: Different
subtypes would be an example.

00:59:53.120 --> 00:59:56.360
So suppose the answer
is no, as we've

00:59:56.360 --> 01:00:00.395
been assuming for most of this
lecture, then you might ask,

01:00:00.395 --> 01:00:03.020
OK, there might only be a single
trajectory, because we're only

01:00:03.020 --> 01:00:04.940
assuming that a single
disease subtype,

01:00:04.940 --> 01:00:09.090
but do we expect a
particular topology?

01:00:09.090 --> 01:00:12.290
Now everything that we've been
talking about up until now

01:00:12.290 --> 01:00:14.750
has been a linear
topology, meaning

01:00:14.750 --> 01:00:16.520
there's a linear
progression, there's

01:00:16.520 --> 01:00:19.260
some notion of early
and late stage,

01:00:19.260 --> 01:00:21.020
but in fact, the
linear trajectory

01:00:21.020 --> 01:00:23.120
may not be realistic.

01:00:23.120 --> 01:00:25.130
Maybe the trajectory
looks a little bit more

01:00:25.130 --> 01:00:27.420
like this bifurcation.

01:00:27.420 --> 01:00:31.370
Maybe patients look the same
very early in the disease

01:00:31.370 --> 01:00:34.280
stage, but then
suddenly something

01:00:34.280 --> 01:00:36.470
might happen which causes
some patients to go

01:00:36.470 --> 01:00:39.350
this way and some
patients to go that way.

01:00:39.350 --> 01:00:44.260
Any idea what that might
be in a clinical setting?

01:00:44.260 --> 01:00:45.177
AUDIENCE: A treatment?

01:00:45.177 --> 01:00:46.677
PROFESSOR: Treatments,
that's great.

01:00:46.677 --> 01:00:47.550
All right?

01:00:47.550 --> 01:00:50.340
So maybe these patients
got t equals 0,

01:00:50.340 --> 01:00:52.875
and maybe these patients
got t equals 1, and maybe

01:00:52.875 --> 01:00:54.250
for whatever reason
we don't even

01:00:54.250 --> 01:00:56.208
have good data on what
treatments patients got,

01:00:56.208 --> 01:00:58.600
so we don't actually observe
the treatment, right?

01:00:58.600 --> 01:01:01.630
Then you might want to be able
to discover that bifurcation

01:01:01.630 --> 01:01:05.410
directly from the data, then
that might suggest, going back

01:01:05.410 --> 01:01:07.270
to the original source
of the data, to ask,

01:01:07.270 --> 01:01:09.970
what differentiated these
patients at this point in time?

01:01:09.970 --> 01:01:11.530
And you might
discover, oh, there

01:01:11.530 --> 01:01:13.530
was something in the data
that we didn't record,

01:01:13.530 --> 01:01:18.420
such as treatment, all right?

01:01:18.420 --> 01:01:19.810
So there are a
variety of methods

01:01:19.810 --> 01:01:23.980
to try to infer these
pseudo-times under a variety

01:01:23.980 --> 01:01:25.240
of different assumptions.

01:01:25.240 --> 01:01:26.740
What I'll do in the
next few minutes

01:01:26.740 --> 01:01:31.470
is just give you an inkling of
how two of the methods work.

01:01:31.470 --> 01:01:35.130
And I chose these to be
representative examples.

01:01:35.130 --> 01:01:38.330
The first example
is an approach based

01:01:38.330 --> 01:01:42.530
on building a minimum
spanning tree.

01:01:42.530 --> 01:01:45.140
And this algorithm
I'm going to describe

01:01:45.140 --> 01:01:46.910
goes by the name of Monocle.

01:01:46.910 --> 01:01:50.870
It was published in 2014 in
this paper by Trapnell et al,

01:01:50.870 --> 01:01:53.750
but it builds very heavily
on an earlier published

01:01:53.750 --> 01:01:58.210
paper from 2003 that
I'm mostly citing here.

01:01:58.210 --> 01:02:00.580
So the way that this
algorithm works is as follows.

01:02:00.580 --> 01:02:03.670
It starts with, as
we've been assuming

01:02:03.670 --> 01:02:06.310
all along, cross-sectional
data, which lives

01:02:06.310 --> 01:02:08.140
in some high-dimensional space.

01:02:08.140 --> 01:02:10.480
I'm drawing that in
the top-left here.

01:02:10.480 --> 01:02:16.323
Each data point corresponds
to some patient or some cell.

01:02:16.323 --> 01:02:17.740
The first step of
the algorithm is

01:02:17.740 --> 01:02:19.270
to do dimensionality reduction.

01:02:19.270 --> 01:02:21.010
And there are many ways to
do dimensionality reduction.

01:02:21.010 --> 01:02:22.900
You could do principal
components analysis,

01:02:22.900 --> 01:02:25.067
or, for example, you could
do independent components

01:02:25.067 --> 01:02:25.570
analysis.

01:02:25.570 --> 01:02:27.927
This paper uses the
independent component analysis.

01:02:27.927 --> 01:02:29.385
What ICA is going
to do, it's going

01:02:29.385 --> 01:02:31.718
to attempt to find a number
of different components that

01:02:31.718 --> 01:02:35.260
seem to be as independent
from one another as possible.

01:02:35.260 --> 01:02:37.420
Then you're going to
represent the data now

01:02:37.420 --> 01:02:41.140
in this low-dimensional space,
and in many of these papers,

01:02:41.140 --> 01:02:44.328
it's quite astonishing to me,
they actually use dimension 2.

01:02:44.328 --> 01:02:46.620
So they'll go all the way
down to two-dimensional space

01:02:46.620 --> 01:02:49.490
where you can actually
plot all of the data.

01:02:49.490 --> 01:02:51.990
It's not at all obvious to me
why you would want to do that,

01:02:51.990 --> 01:02:53.698
and for clinical data,
I think that might

01:02:53.698 --> 01:02:56.610
be a very poor choice.

01:02:56.610 --> 01:03:02.070
Then what they do is they build
a minimum spanning tree on all

01:03:02.070 --> 01:03:04.920
of the patients or cells.

01:03:04.920 --> 01:03:06.750
So the way that one
does that is you

01:03:06.750 --> 01:03:11.220
create a graph by drawing
an edge between every pair

01:03:11.220 --> 01:03:14.070
of nodes where the
weight of the edge

01:03:14.070 --> 01:03:17.625
is the Euclidean distance
between those two points.

01:03:20.130 --> 01:03:23.190
And then-- so for example, there
is this edge from here to here,

01:03:23.190 --> 01:03:26.130
there's an edge from
here to here and so on.

01:03:26.130 --> 01:03:28.740
And then given that
weighted graph,

01:03:28.740 --> 01:03:32.050
we're going to find the minimum
spanning tree of that graph,

01:03:32.050 --> 01:03:34.590
and what I'm showing you here
is the minimum spanning tree

01:03:34.590 --> 01:03:38.100
of the corresponding graph, OK?

01:03:38.100 --> 01:03:39.920
Next, what one
will do is go look

01:03:39.920 --> 01:03:45.440
for the longest
path in that tree.

01:03:45.440 --> 01:03:48.670
Remember, finding the
longest path in a graph--

01:03:48.670 --> 01:03:50.610
in an arbitrary
graph has a name,

01:03:50.610 --> 01:03:53.030
it's called the longest
path problem-- a relative

01:03:53.030 --> 01:03:54.897
of the traveling salesman
problem-- and it's NP-hard.

01:03:54.897 --> 01:03:56.230
How do we get around that here?

01:03:56.230 --> 01:03:58.230
Well we're not-- this is
not an arbitrary graph,

01:03:58.230 --> 01:04:01.190
this is actually a tree.

01:04:01.190 --> 01:04:03.460
So here's that poor--

01:04:03.460 --> 01:04:08.420
here's an algorithm for
finding the longest path.

01:04:08.420 --> 01:04:11.360
I won't talk about that.

01:04:11.360 --> 01:04:14.990
So one finds this longest
path in a tree--

01:04:14.990 --> 01:04:17.820
in the tree, and then
what one does is one says,

01:04:17.820 --> 01:04:21.200
OK, one side of the
path corresponds to,

01:04:21.200 --> 01:04:23.750
let's say, early disease stage
and the other side of the path

01:04:23.750 --> 01:04:25.958
corresponds to
late disease stage,

01:04:25.958 --> 01:04:27.500
and it allows for
the fact that there

01:04:27.500 --> 01:04:28.820
might be some bifurcation.

01:04:28.820 --> 01:04:30.480
So for example, you
see here that there

01:04:30.480 --> 01:04:33.530
is a bifurcation over here.

01:04:33.530 --> 01:04:35.120
And as we discussed
earlier, you have

01:04:35.120 --> 01:04:37.158
to have some way
of differentiating

01:04:37.158 --> 01:04:39.200
what the beginning is and
what the end should be,

01:04:39.200 --> 01:04:41.325
and that's where some side
information might become

01:04:41.325 --> 01:04:43.280
useful.
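
Putting those steps together, here is a rough sketch of the pipeline on synthetic data (networkx, scikit-learn, and the helper names are illustrative choices, not the authors' code): reduce to two ICA components, build the complete Euclidean-distance graph, take its minimum spanning tree, and use the two-sweep farthest-point trick, which is valid on trees, to find the longest path that serves as the pseudo-time backbone.

```python
import numpy as np
import networkx as nx
from scipy.spatial import distance_matrix
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 cells (or patients) x 50 features

# 1. Dimensionality reduction; the Monocle paper reduces to 2 ICA components.
Z = FastICA(n_components=2, random_state=0, max_iter=1000).fit_transform(X)

# 2. Complete graph whose edge weights are Euclidean distances in the reduced space.
D = distance_matrix(Z, Z)
G = nx.Graph()
for i in range(len(Z)):
    for j in range(i + 1, len(Z)):
        G.add_edge(i, j, weight=D[i, j])

# 3. Minimum spanning tree of that weighted graph.
T = nx.minimum_spanning_tree(G)

# 4. Longest (weighted) path in the tree: in a tree, the node farthest from any
#    starting node is one endpoint of the longest path, so two sweeps suffice.
def farthest_node(tree, source):
    lengths = nx.single_source_dijkstra_path_length(tree, source, weight="weight")
    return max(lengths, key=lengths.get)

a = farthest_node(T, 0)
b = farthest_node(T, a)
backbone = nx.shortest_path(T, a, b, weight="weight")   # the pseudo-time backbone

# Cells on the backbone get an ordering directly; the remaining cells can be
# attached to their nearest backbone node (branches are where bifurcations show up).
print("number of cells on the backbone:", len(backbone))
```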

01:04:43.280 --> 01:04:45.590
So here's an illustration
of applying that method

01:04:45.590 --> 01:04:47.490
to some real data.

01:04:47.490 --> 01:04:52.640
So every point here
is a cell after doing

01:04:52.640 --> 01:04:54.200
dimensionality reduction.

01:04:54.200 --> 01:04:56.510
The edges between
those points correspond

01:04:56.510 --> 01:04:59.780
to the edges of the
minimum spanning tree,

01:04:59.780 --> 01:05:01.947
and now what the
authors have done

01:05:01.947 --> 01:05:04.280
is they've actually used some
side information that they

01:05:04.280 --> 01:05:07.490
had in order to color
each of the nodes

01:05:07.490 --> 01:05:11.390
based on what part of the
cell differentiation process

01:05:11.390 --> 01:05:14.480
is believed-- that cell
is believed to be in.

01:05:14.480 --> 01:05:18.140
And what one discovers
is that, in fact, this

01:05:18.140 --> 01:05:21.410
is very sensible, that
all of these points

01:05:21.410 --> 01:05:25.160
are in a much
earlier disease stage

01:05:25.160 --> 01:05:27.110
than [INAUDIBLE]
than these points,

01:05:27.110 --> 01:05:33.050
and this is a
sensible bifurcation.

01:05:33.050 --> 01:05:35.600
Next I want to talk about a
slightly different approach

01:05:35.600 --> 01:05:36.298
to--

01:05:36.298 --> 01:05:38.090
this is the whole story,
by the way, right?

01:05:38.090 --> 01:05:41.542
It's conceptually a very,
very simple approach.

01:05:41.542 --> 01:05:43.750
Next I want to talk about
a different approach, which

01:05:43.750 --> 01:05:48.430
now tries to return back to
the probabilistic approaches

01:05:48.430 --> 01:05:51.490
that we had earlier
in the lecture.

01:05:51.490 --> 01:05:57.293
This new approach is going to
be based on Gaussian processes.

01:05:57.293 --> 01:05:59.460
So Gaussian processes have
come up a couple of times

01:05:59.460 --> 01:06:01.020
in lecture, but
I've never actually

01:06:01.020 --> 01:06:02.280
formally defined them for you.

01:06:02.280 --> 01:06:04.655
So in order for what I'm going
to say next to make sense,

01:06:04.655 --> 01:06:08.230
I'm going to formally define for
you what a Gaussian process is.

01:06:08.230 --> 01:06:14.120
A Gaussian process mu for a
collection of time points,

01:06:14.120 --> 01:06:17.850
T1 through T capital
N, is defined

01:06:17.850 --> 01:06:23.310
by the joint distribution of
mu at those time points,

01:06:23.310 --> 01:06:26.125
which is a Gaussian
distribution.

01:06:26.125 --> 01:06:27.750
So we're going to
say that the function

01:06:27.750 --> 01:06:31.290
value for these N different
time points is just

01:06:31.290 --> 01:06:33.880
a Gaussian, which for the
purpose of today's lecture

01:06:33.880 --> 01:06:35.910
I'm going to assume
is zero mean,

01:06:35.910 --> 01:06:38.190
and where the covariance
function is given

01:06:38.190 --> 01:06:43.170
to you by this capital K, it's a
covariance function of the time

01:06:43.170 --> 01:06:46.180
points-- of the input points.

01:06:46.180 --> 01:06:47.940
And so if you look--

01:06:47.940 --> 01:06:52.050
this has to be a matrix
of dimension capital N

01:06:52.050 --> 01:06:56.830
by capital N. And if you look
at the (i1, i2) entry--

01:06:56.830 --> 01:07:00.090
if you look at that
entry of that matrix,

01:07:00.090 --> 01:07:03.660
we're defining it to be given
by the following kernel

01:07:03.660 --> 01:07:04.620
function.

01:07:04.620 --> 01:07:09.510
It looks at the exponential
of the negative Euclidean

01:07:09.510 --> 01:07:13.770
distance squared between
those two time points.

01:07:13.770 --> 01:07:15.450
Intuitively what
this is saying is

01:07:15.450 --> 01:07:17.760
that if you have
two time points that

01:07:17.760 --> 01:07:20.700
are very close to
one another, then

01:07:20.700 --> 01:07:23.730
this kernel function is
going to be very large.

01:07:23.730 --> 01:07:28.710
If you have two time points that
are very far from one another,

01:07:28.710 --> 01:07:32.140
then this is very large-- it's
a very large negative number,

01:07:32.140 --> 01:07:34.260
and so this is going
to be very small.

01:07:34.260 --> 01:07:36.010
So the kernel function
for two inputs that

01:07:36.010 --> 01:07:37.900
are very far from
one another is very small;

01:07:37.900 --> 01:07:39.358
the kernel function
for inputs that

01:07:39.358 --> 01:07:41.020
are very close to
each other is large;

01:07:41.020 --> 01:07:42.910
and thus, what
we're saying here is

01:07:42.910 --> 01:07:46.960
that there's going to be some
correlation between nearby data

01:07:46.960 --> 01:07:48.702
points.

01:07:48.702 --> 01:07:50.660
And that's the way in which
we're going to specify

01:07:50.660 --> 01:07:53.480
a distribution of functions.

01:07:53.480 --> 01:07:55.820
If one were to sample
from this Gaussian

01:07:55.820 --> 01:07:58.760
with a covariance function
specified in the way I did,

01:07:58.760 --> 01:08:03.130
what one gets out is something
that looks like this.

01:08:03.130 --> 01:08:07.560
So I'm assuming
here that every--

01:08:07.560 --> 01:08:11.227
that these curves
look really dense,

01:08:11.227 --> 01:08:13.810
and that's because I'm assuming
the N is extremely large here.

01:08:13.810 --> 01:08:15.643
If N were small, let's
say 3, there'd only

01:08:15.643 --> 01:08:18.660
be three time points here, OK?

01:08:18.660 --> 01:08:23.340
And so you can make this
distribution of functions

01:08:23.340 --> 01:08:28.229
be arbitrarily complex by
playing with this little l--

01:08:28.229 --> 01:08:34.950
so for example, if you made
a little l be very small,

01:08:34.950 --> 01:08:37.319
then what you get are these
really spiky functions

01:08:37.319 --> 01:08:39.630
that I'm showing you
in a very light color.

01:08:39.630 --> 01:08:42.210
If you make a little
l be very large,

01:08:42.210 --> 01:08:45.920
you get these very
smooth functions, right?

01:08:45.920 --> 01:08:48.620
So this is a way to get a
function-- this is a way

01:08:48.620 --> 01:08:50.810
to get a distribution
over functions

01:08:50.810 --> 01:08:55.130
just by sampling from
this Gaussian process.
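
Here is a small sketch of exactly that: build the squared-exponential kernel matrix over a dense grid of time points and draw samples from the resulting multivariate Gaussian, once with a small lengthscale l and once with a large one (the specific values, and the exact placement of l in the kernel, are illustrative assumptions).

```python
import numpy as np

def se_kernel(t, lengthscale):
    """K[i, j] = exp(-(t_i - t_j)^2 / (2 * lengthscale^2))."""
    d = t[:, None] - t[None, :]
    return np.exp(-(d ** 2) / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)                  # a dense grid of time points
for l in (0.3, 3.0):                             # small l -> spiky, large l -> smooth
    K = se_kernel(t, l) + 1e-8 * np.eye(len(t))  # jitter for numerical stability
    mu = rng.multivariate_normal(np.zeros(len(t)), K)
    print(f"lengthscale {l}: first sampled values", np.round(mu[:3], 2))
```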

01:08:55.130 --> 01:08:57.770
What this paper from
Campbell and Yau, published

01:08:57.770 --> 01:09:01.870
two years ago in
Computational Biology, does

01:09:01.870 --> 01:09:07.790
is to assume that the
observations that you have

01:09:07.790 --> 01:09:13.700
are drawn from a Gaussian
distribution whose mean is

01:09:13.700 --> 01:09:16.740
given to you by the
Gaussian process.

01:09:16.740 --> 01:09:27.130
So if you think back to the
story that we drew earlier,

01:09:27.130 --> 01:09:33.229
suppose that the data lived
in just one dimension,

01:09:33.229 --> 01:09:36.149
and suppose we actually knew
the sorting of patients.

01:09:36.149 --> 01:09:39.130
So we actually
knew which patients

01:09:39.130 --> 01:09:45.200
are very early in time, which
patients are very late in time.

01:09:45.200 --> 01:09:51.380
You might imagine that that
single biomarker, biomarker A,

01:09:51.380 --> 01:09:55.820
you might imagine that the
function which tells you

01:09:55.820 --> 01:09:58.820
what the biomarker's value
is as a function of time

01:09:58.820 --> 01:10:01.640
might be something
like this, right?

01:10:01.640 --> 01:10:04.970
Or maybe it's something
like that, OK?

01:10:04.970 --> 01:10:06.770
So it might be a function
that's increasing

01:10:06.770 --> 01:10:09.920
or a function that's decreasing.

01:10:09.920 --> 01:10:12.740
And this function
is precisely what

01:10:12.740 --> 01:10:17.330
this mu, this Gaussian
process is meant to model.

01:10:17.330 --> 01:10:20.120
The only difference is that
now, one can model the Gaussian

01:10:20.120 --> 01:10:23.250
process-- instead of just
being a single dimension,

01:10:23.250 --> 01:10:25.500
one could imagine having
several different dimensions.

01:10:25.500 --> 01:10:29.990
So this P denotes the
number of dimensions, right?

01:10:29.990 --> 01:10:31.490
Which corresponds,
in some sense,

01:10:31.490 --> 01:10:33.650
to the number of
synthetic biomarkers

01:10:33.650 --> 01:10:36.860
that you might conjecture exist.

01:10:36.860 --> 01:10:40.370
Now here, we truly don't
know the sorting of patients

01:10:40.370 --> 01:10:43.020
into early versus late stage.

01:10:43.020 --> 01:10:47.270
And so the time points
T are themselves

01:10:47.270 --> 01:10:49.610
assumed to be latent
variables that

01:10:49.610 --> 01:10:52.130
are drawn from a truncated
normal distribution that

01:10:52.130 --> 01:10:53.490
looks like this.

01:10:53.490 --> 01:10:55.580
So you might make
some assumption

01:10:55.580 --> 01:10:58.970
that the time intervals
for when a patient comes in

01:10:58.970 --> 01:11:02.022
might be-- maybe patients
typically come in right

01:11:02.022 --> 01:11:03.480
in the middle of
the disease stage,

01:11:03.480 --> 01:11:05.355
or maybe you're assuming
it's something flat,

01:11:05.355 --> 01:11:07.750
and so patients come in
throughout the disease stage.

01:11:07.750 --> 01:11:10.760
But the time point
itself is latent.

01:11:10.760 --> 01:11:13.200
So now the generative process
for the data is as follows.

01:11:13.200 --> 01:11:15.860
You first sample
a time point from

01:11:15.860 --> 01:11:18.620
this truncated normal
distribution, then

01:11:18.620 --> 01:11:21.280
you look to see--

01:11:21.280 --> 01:11:23.030
oh, and you sample
from the very beginning

01:11:23.030 --> 01:11:27.500
your sample this
curve mu, and then you

01:11:27.500 --> 01:11:30.260
look to see, what is the value
of mu for the sample time

01:11:30.260 --> 01:11:33.440
point, and that gives you
the expected value you should

01:11:33.440 --> 01:11:36.740
expect to see for that patient.

01:11:36.740 --> 01:11:41.670
And the one, then,
jointly optimizes

01:11:41.670 --> 01:11:44.730
this to try to find the most--

01:11:44.730 --> 01:11:48.450
the curve, the curve mu
which has highest posterior

01:11:48.450 --> 01:11:52.470
probability, and that is how
you read out from the model

01:11:52.470 --> 01:11:57.815
both what the latent
progression looks

01:11:57.815 --> 01:11:59.940
like, and if you look at
the posterior distribution

01:11:59.940 --> 01:12:02.065
over the T's that are
inferred for each individual,

01:12:02.065 --> 01:12:05.070
you get the inferred
location along the trajectory

01:12:05.070 --> 01:12:07.550
for each individual.

01:12:07.550 --> 01:12:09.200
And I'll stop there.

01:12:09.200 --> 01:12:11.200
I'll post the slides
online for this last piece,

01:12:11.200 --> 01:12:13.890
but I'll let you read
the paper on your own.