WEBVTT

00:00:14.825 --> 00:00:15.950
PETER SZOLOVITS: All right.

00:00:15.950 --> 00:00:17.900
Let's get started.

00:00:17.900 --> 00:00:20.120
Good afternoon.

00:00:20.120 --> 00:00:24.910
So last time, I started
talking about the use

00:00:24.910 --> 00:00:29.770
of natural language processing
to process clinical data.

00:00:29.770 --> 00:00:32.259
And things went a
little bit slowly.

00:00:32.259 --> 00:00:36.010
And so we didn't get through
a lot of the material.

00:00:36.010 --> 00:00:39.680
I'm going to try to
rush a bit more today.

00:00:39.680 --> 00:00:44.690
And as a result, I have
a lot of stuff to cover.

00:00:44.690 --> 00:00:49.510
So if you remember,
last time, I started

00:00:49.510 --> 00:00:54.010
by saying that a
lot of the NLP work

00:00:54.010 --> 00:00:57.040
involves coming up with
phrases that one might

00:00:57.040 --> 00:01:01.900
be interested in to help
identify the kinds of data

00:01:01.900 --> 00:01:06.320
that you want, and then just
looking for those in text.

00:01:06.320 --> 00:01:07.810
So that's a very simple method.

00:01:07.810 --> 00:01:10.460
But it's one that
works reasonably well.

00:01:10.460 --> 00:01:13.150
And then Kat Liao was
here to talk about some

00:01:13.150 --> 00:01:16.880
of the applications
of that kind of work

00:01:16.880 --> 00:01:20.410
in what she's been doing
in cohort selection.

00:01:20.410 --> 00:01:22.060
So what I want to
talk about today

00:01:22.060 --> 00:01:24.820
is more sophisticated
versions of that,

00:01:24.820 --> 00:01:29.110
and then move on to more
contemporary approaches

00:01:29.110 --> 00:01:31.360
to natural language processing.

00:01:31.360 --> 00:01:35.860
So this is a paper
that was given

00:01:35.860 --> 00:01:39.040
to you as one of the
optional readings last time.

00:01:39.040 --> 00:01:42.250
And it's work from
David Sontag's lab,

00:01:42.250 --> 00:01:46.250
where they said, well, how do
we make this more sophisticated?

00:01:46.250 --> 00:01:47.650
So they start the same way.

00:01:47.650 --> 00:01:52.270
They say, OK, Dr.
Liao, let's say,

00:01:52.270 --> 00:01:56.830
give me terms that are very
good indicators that I have

00:01:56.830 --> 00:01:59.950
the right kind of
patient, if I find them

00:01:59.950 --> 00:02:01.880
in the patient's notes.

00:02:01.880 --> 00:02:04.940
So these are things with
high predictive value.

00:02:04.940 --> 00:02:09.990
So you don't want to use a term
like sick, because that's going

00:02:09.990 --> 00:02:11.500
to find way too many people.

00:02:11.500 --> 00:02:13.090
But you want to
find something that

00:02:13.090 --> 00:02:17.980
is very specific and that
has a high predictive value,

00:02:17.980 --> 00:02:20.860
so that you are going to
find the right person.

00:02:20.860 --> 00:02:24.120
And then what they
did is they built

00:02:24.120 --> 00:02:28.990
a model that tries to
predict the presence

00:02:28.990 --> 00:02:33.640
of that word in the
text from everything

00:02:33.640 --> 00:02:36.600
else in the medical record.

00:02:36.600 --> 00:02:42.080
So now, this is an example of a
silver-standard way of training

00:02:42.080 --> 00:02:47.420
a model that says, well, I don't
have the energy or the time

00:02:47.420 --> 00:02:50.030
to get doctors to
look through thousands

00:02:50.030 --> 00:02:52.110
and thousands of records.

00:02:52.110 --> 00:02:55.350
But if I select these
anchors well enough,

00:02:55.350 --> 00:02:59.300
then I'm going to get a high
yield of correct responses

00:02:59.300 --> 00:03:00.410
from those.

00:03:00.410 --> 00:03:01.970
And then I train
a machine learning

00:03:01.970 --> 00:03:11.000
model that learns to
identify those same terms,

00:03:11.000 --> 00:03:14.420
or those same records that
have those terms in them.

00:03:14.420 --> 00:03:16.430
And by the way, from
that, we're going

00:03:16.430 --> 00:03:18.890
to learn a whole
bunch of other terms

00:03:18.890 --> 00:03:23.040
that are proxies for the
ones that we started with.

00:03:23.040 --> 00:03:27.410
So this is a way of
enlarging that set of terms

00:03:27.410 --> 00:03:30.140
automatically.

00:03:30.140 --> 00:03:32.930
And so there are a bunch
of technical details

00:03:32.930 --> 00:03:38.210
that you can find out
about by reading the paper.

00:03:38.210 --> 00:03:40.970
They used a relatively
simple representation,

00:03:40.970 --> 00:03:45.140
which is essentially a
bag-of-words representation.

00:03:45.140 --> 00:03:48.890
They then sort of
masked the three words

00:03:48.890 --> 00:03:52.760
around the word that
actually is the one they're

00:03:52.760 --> 00:03:54.710
trying to predict
just to get rid

00:03:54.710 --> 00:03:59.090
of short-term
syntactic correlations.

00:03:59.090 --> 00:04:02.660
And then they built an
L2-regularized logistic

00:04:02.660 --> 00:04:06.890
regression model that said, what
are the features that predict

00:04:06.890 --> 00:04:08.840
the occurrence of this word?

00:04:08.840 --> 00:04:12.380
And then they expanded
the search vocabulary

00:04:12.380 --> 00:04:14.870
to include those
features as well.
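
NOTE
A minimal sketch of the anchor idea described above, under stated assumptions: the helper names, the toy notes, and the anchor word "metformin" are all invented for illustration, and plain gradient descent stands in for whatever solver the paper used. It masks the three words on each side of the anchor, builds a bag-of-words vector, fits an L2-regularized logistic regression to predict the anchor, and reads off the highest-weighted words as candidate search terms.
```python
import math
# Hypothetical helper names; only the window of 3 comes from the talk.
def mask_anchor(tokens, anchor, window=3):
    """Drop the anchor and the `window` tokens on each side of every mention."""
    hidden = set()
    for i, tok in enumerate(tokens):
        if tok == anchor:
            hidden.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    return [tok for i, tok in enumerate(tokens) if i not in hidden]
def bag_of_words(tokens, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]
def train_l2_logistic(X, y, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient descent on the L2-regularized logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = (p - yi) / n
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
# Toy notes, invented for illustration; the label is "does the note mention the anchor?"
notes = [
    "patient takes metformin daily for glucose control a1c elevated".split(),
    "a1c elevated insulin started and metformin continued at home dose".split(),
    "fracture of left wrist after fall cast applied".split(),
    "cough and fever chest xray clear no insulin".split(),
]
anchor = "metformin"
labels = [1.0 if anchor in note else 0.0 for note in notes]
masked = [mask_anchor(note, anchor) for note in notes]
vocab = sorted({tok for note in masked for tok in note})
X = [bag_of_words(note, vocab) for note in masked]
weights, bias = train_l2_logistic(X, labels)
# Words with the largest positive weights are candidates for the expanded vocabulary.
ranked = sorted(zip(weights, vocab), reverse=True)
expanded_terms = [word for weight, word in ranked[:3] if weight > 0]
```
On this toy data, words that co-occur with the anchor outside the masked window (such as "a1c") end up with positive weights, which is the sense in which the model learns proxies for the anchor.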

00:04:14.870 --> 00:04:16.610
And again, there
are tons of details

00:04:16.610 --> 00:04:21.079
about how to discretize
continuous values

00:04:21.079 --> 00:04:24.000
and things like that that
you can find out about.

00:04:24.000 --> 00:04:25.760
So you build a
phenotype estimator

00:04:25.760 --> 00:04:29.330
from the anchors and
the chosen predictors.

00:04:29.330 --> 00:04:32.030
They calculated a
calibration score

00:04:32.030 --> 00:04:34.640
for each of these
other predictors that

00:04:34.640 --> 00:04:37.790
told you how well it predicted.

00:04:37.790 --> 00:04:41.360
And then you can build
a joint estimator

00:04:41.360 --> 00:04:43.280
that uses all of these.

00:04:43.280 --> 00:04:45.920
And the bottom line is
that they did very well.

00:04:45.920 --> 00:04:51.440
So in order to
evaluate this, they

00:04:51.440 --> 00:04:55.220
looked at eight different
phenotypes for which

00:04:55.220 --> 00:04:58.970
they had human judgment data.

00:04:58.970 --> 00:05:00.980
And so this tells
you that they're

00:05:00.980 --> 00:05:07.130
getting AUCs of
between 0.83 and 0.95

00:05:07.130 --> 00:05:10.730
for these different phenotypes.

00:05:10.730 --> 00:05:13.170
So that's quite good.

00:05:13.170 --> 00:05:16.850
They, in fact, were estimating
not only these eight phenotypes

00:05:16.850 --> 00:05:19.040
but 40-something.

00:05:19.040 --> 00:05:22.890
I don't remember the exact
number, much larger number.

00:05:22.890 --> 00:05:25.400
But they didn't
have validated data

00:05:25.400 --> 00:05:27.620
against which to
test the others.

00:05:27.620 --> 00:05:30.470
But the expectation is that
if it does well on these,

00:05:30.470 --> 00:05:33.450
it probably does well
on the others as well.

00:05:33.450 --> 00:05:35.810
So this was a very nice idea.

00:05:35.810 --> 00:05:38.990
And just to illustrate, if
you start with something

00:05:38.990 --> 00:05:41.840
like diabetes as a
phenotype and you say,

00:05:41.840 --> 00:05:44.330
well, I'm going to
look for anchors

00:05:44.330 --> 00:05:48.710
that are an ICD-9 code of
250, diabetes mellitus,

00:05:48.710 --> 00:05:51.440
or I'm going to
look at medication

00:05:51.440 --> 00:05:55.110
history for diabetic therapy--

00:05:55.110 --> 00:06:00.170
so those are the silver-standard
goals that I'm looking at.

00:06:00.170 --> 00:06:03.710
And those, in fact, have a high
predictive value for somebody

00:06:03.710 --> 00:06:05.430
being in the cohort.

00:06:05.430 --> 00:06:08.870
And then they identify
all these other features

00:06:08.870 --> 00:06:12.230
that predict those,
and therefore, in turn,

00:06:12.230 --> 00:06:15.230
predict appropriate
selectors for the phenotype

00:06:15.230 --> 00:06:17.070
that they're interested in.

00:06:17.070 --> 00:06:19.910
And if you look at the
paper again, what you see

00:06:19.910 --> 00:06:24.380
is that this
outperforms, over time,

00:06:24.380 --> 00:06:28.970
the standard supervised baseline
that they're comparing against,

00:06:28.970 --> 00:06:32.270
where you're getting much
higher accuracy early

00:06:32.270 --> 00:06:35.510
in a patient's visit to
be able to identify them

00:06:35.510 --> 00:06:39.660
as belonging to this cohort.

00:06:39.660 --> 00:06:45.110
I'm going to come back later to
look at another similar attempt

00:06:45.110 --> 00:06:50.280
to generalize from a core using
a different set of techniques.

00:06:50.280 --> 00:06:54.410
So you should see that in
about 45 minutes, I hope.

00:06:57.380 --> 00:06:59.850
Well, context is important.

00:06:59.850 --> 00:07:02.810
So if you look at a sentence
like Mr. Huntington was treated

00:07:02.810 --> 00:07:06.290
for Huntington's disease at
Huntington Hospital, located

00:07:06.290 --> 00:07:10.670
on Huntington Avenue, each
of those mentions of the word

00:07:10.670 --> 00:07:13.250
Huntington is different.

00:07:13.250 --> 00:07:16.850
And for example, if you're
interested in eliminating

00:07:16.850 --> 00:07:19.850
personally identifiable
health information

00:07:19.850 --> 00:07:23.270
from a record like
this, then certainly you

00:07:23.270 --> 00:07:26.540
want to get rid of the
Mr. Huntington part.

00:07:26.540 --> 00:07:29.150
You don't want to get rid
of Huntington's disease,

00:07:29.150 --> 00:07:33.680
because that's a
medically relevant fact.

00:07:33.680 --> 00:07:37.940
And you probably do want to
get rid of Huntington Hospital

00:07:37.940 --> 00:07:40.850
and its location on
Huntington Avenue,

00:07:40.850 --> 00:07:44.390
although those are not
necessarily something

00:07:44.390 --> 00:07:46.580
that you're prohibited
from retaining.

00:07:46.580 --> 00:07:50.600
So for example, if you're
trying to do quality studies

00:07:50.600 --> 00:07:52.850
among different
hospitals, then it

00:07:52.850 --> 00:07:56.480
would make sense to retain the
name of the hospital, which

00:07:56.480 --> 00:08:00.230
is not considered identifying
of the individual.

00:08:00.230 --> 00:08:05.420
So we, in fact, did a study
back in the mid 2000s,

00:08:05.420 --> 00:08:10.040
where we were trying to build
an improved de-identifier.

00:08:12.580 --> 00:08:14.500
And here's the way
we went about it.

00:08:14.500 --> 00:08:17.990
This is a kind of kitchen
sink approach that says,

00:08:17.990 --> 00:08:23.330
OK, take the text, tokenize it.

00:08:23.330 --> 00:08:25.400
Look at every single token.

00:08:25.400 --> 00:08:27.900
And derive things from it.

00:08:27.900 --> 00:08:30.350
So the words that
make up the token,

00:08:30.350 --> 00:08:33.200
the part of speech,
how it's capitalized,

00:08:33.200 --> 00:08:36.169
whether there's
punctuation around it,

00:08:36.169 --> 00:08:40.549
which document
section is it in--

00:08:40.549 --> 00:08:44.059
many databases have sort
of conventional document

00:08:44.059 --> 00:08:44.660
structure.

00:08:44.660 --> 00:08:48.260
If you've looked at the
MIMIC discharge summaries,

00:08:48.260 --> 00:08:51.530
for example, there's a
kind of prototypical way

00:08:51.530 --> 00:08:54.620
in which that flows
from beginning to end.

00:08:54.620 --> 00:08:57.920
And you can use that
structural information.

00:08:57.920 --> 00:09:01.890
We then identified a bunch of
patterns and thesaurus terms.

00:09:01.890 --> 00:09:06.650
So we looked up, in the
UMLS, words and phrases

00:09:06.650 --> 00:09:10.580
to see if they matched some
clinically meaningful term.

00:09:10.580 --> 00:09:13.010
We had patterns that
identified things

00:09:13.010 --> 00:09:17.390
like phone numbers and social
security numbers and addresses

00:09:17.390 --> 00:09:19.140
and so on.
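
NOTE
A small sketch of the kind of pattern matching mentioned above, assuming US-style phone and social security number formats. These regular expressions are illustrative assumptions, not the patterns used in the study, and they would miss many real-world variants.
```python
import re
# Illustrative US-style patterns only; assumptions, not the rules from the original system.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
def find_phi(text):
    """Return (label, matched_text) pairs for every pattern hit, ordered by position."""
    hits = []
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]
sample = "Call 617-555-0123 to reach the clinic; SSN on file is 123-45-6789."
found = find_phi(sample)
```
Each hit can then be turned into a binary feature on the tokens it covers, alongside the other token features.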

00:09:19.140 --> 00:09:22.620
And then we did
parsing of the text.

00:09:22.620 --> 00:09:24.740
So in those days,
we used something

00:09:24.740 --> 00:09:27.740
called the Link
Grammar Parser, though it

00:09:27.740 --> 00:09:30.470
doesn't make a whole lot
of difference which parser.

00:09:30.470 --> 00:09:34.790
But you get either a
constituency or dependency

00:09:34.790 --> 00:09:39.420
parse, which gives you
relationships among the words.

00:09:39.420 --> 00:09:44.300
And so it allows you to
include, as features,

00:09:44.300 --> 00:09:47.180
the way in which a word
that you're looking at

00:09:47.180 --> 00:09:49.920
relates to other
words around it.

00:09:49.920 --> 00:09:54.110
And so what we did is we
said, OK, the lexical context

00:09:54.110 --> 00:09:57.860
includes all of the
above kind of information

00:09:57.860 --> 00:10:02.720
for all of the words that
are either literally adjacent

00:10:02.720 --> 00:10:05.630
or within n words of the
original word that you're

00:10:05.630 --> 00:10:10.730
focusing on, or that are
linked within k links

00:10:10.730 --> 00:10:13.590
through the parse to that word.
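
NOTE
A rough sketch of the lexical-context idea, assuming a handful of surface features and a window of n words on each side. The feature names are hypothetical, and the parse-link context (words within k links) is left out because it would need a real parser.
```python
import string
def token_features(token):
    """Surface features for one token (a small, assumed subset of those listed in the talk)."""
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_punct": any(ch in string.punctuation for ch in token),
    }
def context_features(tokens, index, n=2):
    """Features of the target token plus those of every token within n positions of it."""
    features = {f"0:{name}": value for name, value in token_features(tokens[index]).items()}
    for offset in range(-n, n + 1):
        pos = index + offset
        if offset == 0 or pos < 0 or pos >= len(tokens):
            continue
        for name, value in token_features(tokens[pos]).items():
            features[f"{offset:+d}:{name}"] = value
    return features
tokens = "Mr. Huntington was treated for Huntington's disease".split()
name_context = context_features(tokens, 1)
disease_context = context_features(tokens, 5)
```
The two Huntington mentions get different feature sets: one has "mr." immediately to its left, the other has "disease" immediately to its right, which is the kind of signal a classifier can use to tell them apart.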

00:10:13.590 --> 00:10:17.900
So this gives you a very
large set of features.

00:10:17.900 --> 00:10:23.070
And of course, parsing
is not a solved problem.

00:10:23.070 --> 00:10:27.860
And so this is an
example from that story

00:10:27.860 --> 00:10:29.780
that I showed you last time.

00:10:29.780 --> 00:10:36.530
And if you see, it comes
up with 24 ambiguous parses

00:10:36.530 --> 00:10:39.710
of this sentence.

00:10:39.710 --> 00:10:44.960
So there are technical problems
about how to deal with that.

00:10:44.960 --> 00:10:47.030
Today, you could use
a different parser.

00:10:47.030 --> 00:10:49.700
The Stanford
Parser, for example,

00:10:49.700 --> 00:10:51.650
probably does a better
job than the one

00:10:51.650 --> 00:10:58.010
we were using 14 years
ago and gives you at least

00:10:58.010 --> 00:11:00.080
more definitive answers.

00:11:00.080 --> 00:11:02.700
And so you could
use that instead.

00:11:02.700 --> 00:11:04.460
And so if you look
at what we did,

00:11:04.460 --> 00:11:09.100
we said, well, here
is the text "Mr."

00:11:09.100 --> 00:11:15.080
And here are all the ways that
you can look it up in the UMLS.

00:11:15.080 --> 00:11:18.330
And it turns out to
be very ambiguous.

00:11:18.330 --> 00:11:22.960
So M-R stands not
only for mister,

00:11:22.960 --> 00:11:25.480
but it also stands for
Magnetic Resonance.

00:11:25.480 --> 00:11:28.740
And it stands for a whole
bunch of other things.

00:11:28.740 --> 00:11:31.820
And so you get huge
amounts of ambiguity.

00:11:31.820 --> 00:11:36.410
"Blind" turns out also to
give you various ambiguities.

00:11:36.410 --> 00:11:41.000
So it maps here to four
different concept-unique

00:11:41.000 --> 00:11:43.250
identifiers.

00:11:43.250 --> 00:11:46.010
"Is" is OK.

00:11:46.010 --> 00:11:49.250
"79-year-old" is OK.

00:11:49.250 --> 00:11:56.300
And then "male," again, maps to
five different concept-unique

00:11:56.300 --> 00:11:57.570
identifiers.

00:11:57.570 --> 00:12:00.590
So there are all these
problems of over-generation

00:12:00.590 --> 00:12:02.580
from this database.

00:12:02.580 --> 00:12:05.450
And here's some more, but
I'm going to skip over that.

00:12:05.450 --> 00:12:07.340
And then the learning
model, in our case,

00:12:07.340 --> 00:12:11.240
was a support vector machine for
this project, in which we just

00:12:11.240 --> 00:12:14.060
said, well, throw in all the--

00:12:14.060 --> 00:12:15.440
you know, it's
the kill them all,

00:12:15.440 --> 00:12:19.370
and God will sort them
out kind of approach.

00:12:19.370 --> 00:12:21.500
So we just threw in
all these features

00:12:21.500 --> 00:12:23.450
and said, oh, support
vector machines

00:12:23.450 --> 00:12:26.990
are really good at picking
out exactly what are the best

00:12:26.990 --> 00:12:27.950
features.

00:12:27.950 --> 00:12:30.320
And so we just relied on that.

00:12:30.320 --> 00:12:34.580
And sure enough, so you wind
up with literally millions

00:12:34.580 --> 00:12:36.440
of features.

00:12:36.440 --> 00:12:38.750
But sure enough, it
worked pretty well.

00:12:38.750 --> 00:12:41.550
And so Stat De-ID
was our program.

00:12:41.550 --> 00:12:44.600
And you see that on real
discharge summaries,

00:12:44.600 --> 00:12:49.370
we're getting precision
and recall on PHI

00:12:49.370 --> 00:12:53.510
up around 98.5%
and 95.25%,

00:12:53.510 --> 00:12:56.480
which was much better than
the previous state of the art,

00:12:56.480 --> 00:13:00.680
which had been based on
rules and dictionaries

00:13:00.680 --> 00:13:03.560
as a way of
de-identifying things.

00:13:03.560 --> 00:13:08.090
So this was a successful
example of that approach.

00:13:08.090 --> 00:13:13.160
And of course, this is usable
not only for de-identification.

00:13:13.160 --> 00:13:16.910
But it's also usable
for entity recognition.

00:13:16.910 --> 00:13:19.400
Because instead of
selecting entities

00:13:19.400 --> 00:13:22.720
that are personally
identifiable health information,

00:13:22.720 --> 00:13:26.830
you could train it to select
entities that are diseases

00:13:26.830 --> 00:13:30.370
or that are medications or
that are various other things.

00:13:30.370 --> 00:13:35.470
And so this was, in the
2000s, a pretty typical way

00:13:35.470 --> 00:13:38.620
for people to approach
these kinds of problems.

00:13:38.620 --> 00:13:40.080
And it's still used today.

00:13:40.080 --> 00:13:43.060
There are tools around
that let you do this.

00:13:43.060 --> 00:13:45.620
And they work
reasonably effectively.

00:13:45.620 --> 00:13:47.980
They're not state of
the art at the moment,

00:13:47.980 --> 00:13:50.680
but they're simpler than
many of today's state

00:13:50.680 --> 00:13:54.380
of the art methods.

00:13:54.380 --> 00:13:56.690
So here's another approach.

00:13:56.690 --> 00:14:01.760
This was something we published
a few years ago, where

00:14:01.760 --> 00:14:06.510
we started working with
some psychiatrists and said,

00:14:06.510 --> 00:14:09.560
could we predict
30-day readmission

00:14:09.560 --> 00:14:14.780
for a psychiatric patient with
any degree of reliability?

00:14:14.780 --> 00:14:16.550
That's a hard prediction.

00:14:16.550 --> 00:14:19.340
Willie is currently
running an experiment

00:14:19.340 --> 00:14:23.030
where we're asking
psychiatrists to predict that.

00:14:23.030 --> 00:14:26.240
And it turns out, they're
barely better than chance

00:14:26.240 --> 00:14:27.890
at that prediction.

00:14:27.890 --> 00:14:30.800
So it's not an easy task.

00:14:30.800 --> 00:14:35.960
And what we did is we said,
well, let's use topic modeling.

00:14:35.960 --> 00:14:40.580
And so we had this
cohort of patients,

00:14:40.580 --> 00:14:42.320
close to 5,000 patients.

00:14:42.320 --> 00:14:45.320
About 10% of them
were readmitted

00:14:45.320 --> 00:14:47.390
with a psych diagnosis.

00:14:47.390 --> 00:14:50.720
And almost 3,000 of
them were readmitted

00:14:50.720 --> 00:14:52.790
with other diagnoses.

00:14:52.790 --> 00:14:54.860
So one thing this
tells you right away

00:14:54.860 --> 00:14:58.465
is that if you're dealing
with psychiatric patients,

00:14:58.465 --> 00:15:01.700
they come and go to the
hospital frequently.

00:15:01.700 --> 00:15:05.150
And this is not good for the
hospital's bottom line because

00:15:05.150 --> 00:15:10.710
of reimbursement policies of
insurance companies and so on.

00:15:10.710 --> 00:15:17.820
So of the 4,700, only 1,240 were
not readmitted within 30 days.

00:15:17.820 --> 00:15:21.500
So there's very
frequent bounce-back.

00:15:21.500 --> 00:15:27.560
So we said, well, let's try
building a baseline model using

00:15:27.560 --> 00:15:31.190
a support vector machine from
baseline clinical features

00:15:31.190 --> 00:15:34.430
like age, gender,
public health insurance

00:15:34.430 --> 00:15:37.460
as a proxy for
socioeconomic status.

00:15:37.460 --> 00:15:41.210
So if you're on Medicaid,
you're probably poor.

00:15:41.210 --> 00:15:44.220
And if you have
private insurance,

00:15:44.220 --> 00:15:48.830
then you're probably an MIT
employee and/or better off.

00:15:48.830 --> 00:15:53.390
So that's a frequently used
proxy. And then a comorbidity index

00:15:53.390 --> 00:15:55.640
that tells you sort
of how sick you

00:15:55.640 --> 00:15:59.660
are from things other than
your psychiatric problems.

00:15:59.660 --> 00:16:01.160
And then we said,
well, what if we

00:16:01.160 --> 00:16:05.270
add to that model
common words from notes.

00:16:05.270 --> 00:16:10.200
So we said, let's do
a TF-IDF calculation.

00:16:10.200 --> 00:16:14.150
So this is term frequency
times the log of the inverse document

00:16:14.150 --> 00:16:15.510
frequency.

00:16:15.510 --> 00:16:18.410
So it's sort of, how
specific is a term

00:16:18.410 --> 00:16:22.100
to identify a particular
kind of condition?
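
NOTE
A minimal sketch of one common TF-IDF variant: term frequency within a document times the log of (number of documents / number of documents containing the term). The study's exact weighting is not given in the talk, so the formula and the toy notes here are assumptions.
```python
import math
from collections import Counter
def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
# Invented toy notes.
docs = [
    "patient reports alcohol withdrawal".split(),
    "patient reports paranoia".split(),
    "patient denies alcohol use".split(),
]
scores = tf_idf(docs)
```
A word that appears in every note, like "patient" here, scores zero, while a word specific to one note, like "paranoia", scores highest in that note.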

00:16:22.100 --> 00:16:28.580
And we take the 1,000 most
informative words, and so there

00:16:28.580 --> 00:16:29.440
are a lot of these.

00:16:29.440 --> 00:16:33.170
So if you use the 1,000
most informative words

00:16:33.170 --> 00:16:37.400
for each of these nearly
5,000 patients,

00:16:37.400 --> 00:16:42.860
you wind up with something like
66,000 words, unique words,

00:16:42.860 --> 00:16:47.750
that are informative
for some patient.

00:16:47.750 --> 00:16:50.250
But if you limit
yourself to the top 10,

00:16:50.250 --> 00:16:53.180
then it only uses 18,000 words.

00:16:53.180 --> 00:16:55.490
And if you limit
yourself to the top one,

00:16:55.490 --> 00:16:58.490
then it uses about 3,000 words.
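
NOTE
A sketch of why the vocabulary shrinks as k drops, assuming the vocabulary is the union of each patient's k highest-scoring words. The scores below are made up and stand in for real TF-IDF weights.
```python
def top_k_vocabulary(per_patient_scores, k):
    """Union of each patient's k highest-scoring terms (ties broken alphabetically)."""
    vocabulary = set()
    for scores in per_patient_scores:
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        vocabulary.update(ranked[:k])
    return vocabulary
# Made-up scores for three hypothetical patients.
per_patient_scores = [
    {"alcohol": 0.9, "withdrawal": 0.7, "ativan": 0.4},
    {"paranoia": 0.8, "psychosis": 0.6, "alcohol": 0.2},
    {"alcohol": 0.5, "lithium": 0.45, "mania": 0.3},
]
sizes = {k: len(top_k_vocabulary(per_patient_scores, k)) for k in (1, 2, 3)}
```
Because patients share their top words, the union grows much more slowly than k times the number of patients.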

00:16:58.490 --> 00:17:01.550
And then we said, well, instead
of doing individual words,

00:17:01.550 --> 00:17:04.670
let's do a latent
Dirichlet allocation.

00:17:04.670 --> 00:17:09.140
So topic modeling on all of
the words, as a bag of words--

00:17:09.140 --> 00:17:13.980
so no sequence information,
just the collection of words.

00:17:13.980 --> 00:17:19.640
And so we calculated
75 topics using

00:17:19.640 --> 00:17:22.400
LDA on all these notes.

00:17:22.400 --> 00:17:26.839
So just to remind
you, the LDA process

00:17:26.839 --> 00:17:30.800
is a model that says
every document consists

00:17:30.800 --> 00:17:34.670
of a certain mixture of topics,
and each of those topics

00:17:34.670 --> 00:17:38.390
probabilistically
generates certain words.
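
NOTE
A sketch of the generative story that LDA assumes, with a hypothetical two-topic model whose words and probabilities are invented. Full LDA also draws each document's topic mixture from a Dirichlet prior, and fitting the model means inverting this process, which is not shown here.
```python
import random
# Hypothetical two-topic model; the words and probabilities are invented for illustration.
TOPIC_WORDS = {
    "alcohol": {"alcohol": 0.5, "withdrawal": 0.3, "drinking": 0.2},
    "psychosis": {"psychosis": 0.5, "paranoia": 0.3, "symptoms": 0.2},
}
def generate_document(topic_mixture, length, rng):
    """Sample words the way LDA assumes: pick a topic per word, then a word from that topic."""
    topics = list(topic_mixture)
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=[topic_mixture[t] for t in topics])[0]
        vocab = list(TOPIC_WORDS[topic])
        word = rng.choices(vocab, weights=[TOPIC_WORDS[topic][w] for w in vocab])[0]
        words.append(word)
    return words
rng = random.Random(0)
mostly_alcohol = generate_document({"alcohol": 0.9, "psychosis": 0.1}, 20, rng)
only_psychosis = generate_document({"alcohol": 0.0, "psychosis": 1.0}, 20, rng)
```
Word order carries no meaning in this model, which matches the bag-of-words assumption.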

00:17:38.390 --> 00:17:42.680
And so you can build
a model like this,

00:17:42.680 --> 00:17:46.250
and then solve it using
complicated techniques.

00:17:46.250 --> 00:17:52.218
And you'd wind up with topics,
in this study, as follows.

00:17:52.218 --> 00:17:52.760
I don't know.

00:17:52.760 --> 00:17:54.080
Can you read these?

00:17:54.080 --> 00:17:57.170
They may be too small.

00:17:57.170 --> 00:18:01.550
So these are
unsupervised topics.

00:18:01.550 --> 00:18:03.290
And if you look
at the first one,

00:18:03.290 --> 00:18:06.650
it says patient, alcohol,
withdrawal, depression,

00:18:06.650 --> 00:18:11.450
drinking, and Ativan,
ETOH, drinks, medications,

00:18:11.450 --> 00:18:16.730
clinic, inpatient, diagnosis,
days, hospital, substance,

00:18:16.730 --> 00:18:18.320
use, treatment, program, name.

00:18:18.320 --> 00:18:24.110
That's a de-identified
name. Use, abuse, problem, number.

00:18:24.110 --> 00:18:26.990
And we had our experts
look at these topics.

00:18:26.990 --> 00:18:28.970
And they said, oh,
well, that topic

00:18:28.970 --> 00:18:33.380
is related to alcohol abuse,
which seems reasonable.

00:18:33.380 --> 00:18:36.590
And then you see, on
the bottom, psychosis,

00:18:36.590 --> 00:18:41.090
thought features, paranoid
psychosis, paranoia symptoms,

00:18:41.090 --> 00:18:43.100
psychiatric, et cetera.

00:18:43.100 --> 00:18:45.930
And they said, OK,
that's a psychosis topic.

00:18:45.930 --> 00:18:49.760
So in retrospect, you can
assign meaning to these topics.

00:18:49.760 --> 00:18:53.900
But in fact, they're generated
without any a priori notion

00:18:53.900 --> 00:18:55.010
of what they ought to be.

00:18:55.010 --> 00:18:58.490
They're just a
statistical summarization

00:18:58.490 --> 00:19:03.980
of the common co-occurrences
of words in these documents.

00:19:03.980 --> 00:19:11.390
But what you find is that if you
use the baseline model, which

00:19:11.390 --> 00:19:15.320
used just the demographic
and clinical variables,

00:19:15.320 --> 00:19:19.220
and you say, what's the
difference in survival,

00:19:19.220 --> 00:19:23.030
in this case, in
time to readmission

00:19:23.030 --> 00:19:28.370
between one set and
another in this cohort,

00:19:28.370 --> 00:19:30.920
the answer is that
they're pretty similar.

00:19:30.920 --> 00:19:34.130
Whereas, if you use
a model that predicts

00:19:34.130 --> 00:19:37.160
based on the baseline
and 75 topics,

00:19:37.160 --> 00:19:40.010
the 75 topics that
we identified,

00:19:40.010 --> 00:19:42.260
you get a much
bigger separation.

00:19:42.260 --> 00:19:44.990
And of course, this is
statistically significant.

00:19:44.990 --> 00:19:47.150
And it tells you
that this technique

00:19:47.150 --> 00:19:50.780
is useful for being
able to improve

00:19:50.780 --> 00:19:54.290
the prediction of
a cohort that's

00:19:54.290 --> 00:19:57.020
more likely to be readmitted
from a cohort that's

00:19:57.020 --> 00:19:59.320
less likely to be readmitted.

00:19:59.320 --> 00:20:01.020
It's not a terrific prediction.

00:20:01.020 --> 00:20:06.960
So the AUC for this model
was only on the order of 0.7.

00:20:06.960 --> 00:20:10.490
So you know, it's not like 0.99.

00:20:10.490 --> 00:20:16.040
But nevertheless, it
provides useful information.

00:20:16.040 --> 00:20:20.780
The same group of psychiatrists
that we worked with also

00:20:20.780 --> 00:20:25.370
did a study with a much larger
cohort but much less rich data.

00:20:25.370 --> 00:20:28.820
So they got all
of the discharges

00:20:28.820 --> 00:20:33.120
from two medical centers
over a period of 12 years.

00:20:33.120 --> 00:20:38.960
So they had 845,000
discharges from 458,000

00:20:38.960 --> 00:20:40.610
unique individuals.

00:20:40.610 --> 00:20:44.480
And they were looking for
suicide or other causes

00:20:44.480 --> 00:20:46.910
of death in these
patients to see

00:20:46.910 --> 00:20:49.910
if they could predict
whether somebody

00:20:49.910 --> 00:20:52.100
is likely to try
to harm themselves,

00:20:52.100 --> 00:20:54.800
or whether they're likely
to die accidentally,

00:20:54.800 --> 00:20:59.880
which sometimes can't be
distinguished from suicide.

00:20:59.880 --> 00:21:03.480
So the censoring problems
that David talked about

00:21:03.480 --> 00:21:05.190
are very much present in this.

00:21:05.190 --> 00:21:07.800
Because you lose
track of people.

00:21:07.800 --> 00:21:10.110
It's a highly
imbalanced data set.

00:21:10.110 --> 00:21:15.990
Because out of the
458,000 patients, only 235

00:21:15.990 --> 00:21:19.950
committed suicide, which is, of
course, probably a good thing

00:21:19.950 --> 00:21:22.410
from a societal point
of view but makes

00:21:22.410 --> 00:21:24.360
the data analysis hard.

00:21:24.360 --> 00:21:28.230
On the other hand, all-cause
mortality was about 18%

00:21:28.230 --> 00:21:30.750
during nine years
of follow-up.

00:21:30.750 --> 00:21:33.090
So that's not so imbalanced.

00:21:33.090 --> 00:21:35.340
And then what they
did is they curated

00:21:35.340 --> 00:21:39.390
a list of 3,000 terms
that correspond to what,

00:21:39.390 --> 00:21:43.080
in the psychiatric literature,
is called positive valence.

00:21:43.080 --> 00:21:47.790
So these are concepts like joy
and happiness and good stuff,

00:21:47.790 --> 00:21:51.720
as opposed to negative valence,
like depression and sorrow

00:21:51.720 --> 00:21:53.610
and all that stuff.

00:21:53.610 --> 00:21:58.740
And they said, well, we can
use these types of terms

00:21:58.740 --> 00:22:02.980
in order to help distinguish
among these patients.

00:22:02.980 --> 00:22:07.650
And what they found is that, if
you plot the Kaplan-Meier curve

00:22:07.650 --> 00:22:14.280
for different quartiles of
risk for these patients,

00:22:14.280 --> 00:22:16.800
you see that there's a
pretty big difference

00:22:16.800 --> 00:22:19.020
between the different quartiles.
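
NOTE
A minimal sketch of the Kaplan-Meier estimator behind curves like these: at each event time, survival is multiplied by (1 - events / number still at risk), and censored subjects simply leave the risk set. The follow-up data and the two groups below are invented.
```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each event time; events[i] is False if subject i was censored."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
# Invented follow-up data (days) for two hypothetical risk groups.
low_risk = kaplan_meier([100, 200, 300, 400], [False, True, False, False])
high_risk = kaplan_meier([50, 100, 150, 400], [True, True, True, False])
```
Running this per risk quartile and plotting the resulting step functions gives the kind of separation being described.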

00:22:19.020 --> 00:22:23.460
And you can certainly
identify the people

00:22:23.460 --> 00:22:27.030
who are more likely to commit
suicide from the people who

00:22:27.030 --> 00:22:29.280
are less likely to do so.

00:22:29.280 --> 00:22:33.930
This curve is for suicide
or accidental death.

00:22:33.930 --> 00:22:36.660
So this is a much
larger data set,

00:22:36.660 --> 00:22:39.090
and therefore the
error bars are smaller.

00:22:39.090 --> 00:22:43.060
But you see the same
kind of separation here.

00:22:43.060 --> 00:22:46.290
So these are all
useful techniques.

00:22:46.290 --> 00:22:48.930
Now I'll turn to another approach.

00:22:48.930 --> 00:22:52.920
This was work by one of
my students, Yuan Luo,

00:22:52.920 --> 00:22:56.070
who was working with some
lymphoma pathologists

00:22:56.070 --> 00:22:57.630
at Mass General.

00:22:57.630 --> 00:23:00.390
And so the approach
they took was

00:23:00.390 --> 00:23:06.590
to say, well, if you read a
pathology report about somebody

00:23:06.590 --> 00:23:10.340
with lymphoma, can we
tell what type of lymphoma

00:23:10.340 --> 00:23:13.190
they had from the
pathology report

00:23:13.190 --> 00:23:16.460
if we blank out the part of
the pathology report that

00:23:16.460 --> 00:23:22.340
says, "I, the pathologist, think
this person has non-Hodgkin's

00:23:22.340 --> 00:23:24.770
lymphoma," or something?

00:23:24.770 --> 00:23:28.880
So from the rest of the context,
can we make that prediction?

00:23:28.880 --> 00:23:33.620
Now, Yuan took a kind of
interesting, slightly odd

00:23:33.620 --> 00:23:35.900
approach to it,
which is to treat

00:23:35.900 --> 00:23:38.420
this as an unsupervised
learning problem

00:23:38.420 --> 00:23:41.220
rather than as a supervised
learning problem.

00:23:41.220 --> 00:23:45.110
So he literally
masked the real answer

00:23:45.110 --> 00:23:48.590
and said, if we just treat
everything except what

00:23:48.590 --> 00:23:52.310
gives away the
answer as just data,

00:23:52.310 --> 00:23:57.030
can we essentially cluster that
data in some interesting way

00:23:57.030 --> 00:24:02.540
so that we re-identify the
different types of lymphoma?

00:24:02.540 --> 00:24:05.210
Now, the reason this
turns out to be important

00:24:05.210 --> 00:24:07.580
is because lymphoma
pathologists keep

00:24:07.580 --> 00:24:11.870
arguing about how to
classify lymphomas.

00:24:11.870 --> 00:24:15.920
And every few years, they
revise the classification rules.

00:24:15.920 --> 00:24:19.380
And so part of his
objective was to say,

00:24:19.380 --> 00:24:24.320
let's try to provide an
unbiased, data-driven method

00:24:24.320 --> 00:24:28.370
that may help identify
appropriate characteristics

00:24:28.370 --> 00:24:32.570
by which to classify
these different lymphomas.

00:24:32.570 --> 00:24:37.760
So his approach was a tensor
factorization approach.

00:24:40.265 --> 00:24:42.560
You often see data
sets like this

00:24:42.560 --> 00:24:47.180
that are, say, patient
by characteristic.

00:24:47.180 --> 00:24:49.085
So in this case,
laboratory measurements--

00:24:49.085 --> 00:24:53.180
so systolic/diastolic blood
pressure, sodium, potassium,

00:24:53.180 --> 00:24:54.170
et cetera.

00:24:54.170 --> 00:24:57.980
That's a very vanilla
matrix encoding of data.

00:24:57.980 --> 00:25:00.350
And then if you add a
third dimension to it,

00:25:00.350 --> 00:25:02.600
like this is at the
time of admission,

00:25:02.600 --> 00:25:06.890
30 minutes later, 60 minutes
later, 90 minutes later,

00:25:06.890 --> 00:25:09.750
now you have a
three-dimensional tensor.

00:25:09.750 --> 00:25:14.180
And so just like you can
do matrix factorization, as

00:25:14.180 --> 00:25:19.400
in the picture above, where
we say, my matrix of data,

00:25:19.400 --> 00:25:26.130
I'm going to assume is generated
by a product of two matrices,

00:25:26.130 --> 00:25:28.890
which are smaller in dimension.

00:25:28.890 --> 00:25:31.610
And you can train
this by saying,

00:25:31.610 --> 00:25:34.940
I want entries in
these two matrices

00:25:34.940 --> 00:25:37.860
that minimize the
reconstruction error.

00:25:37.860 --> 00:25:41.510
So if I multiply these
matrices together,

00:25:41.510 --> 00:25:46.350
then I get back my
original matrix plus error.

00:25:46.350 --> 00:25:48.290
And I want to
minimize that error,

00:25:48.290 --> 00:25:51.860
usually root mean square, or
mean square error, or something

00:25:51.860 --> 00:25:53.220
like that.
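
NOTE
Editorial sketch, not from the lecture slides: a minimal illustration of fitting a
factorization by minimizing squared reconstruction error. It assumes the simplest case,
a rank-1 model fitted by alternating least squares; the function names and the toy
matrix are hypothetical, and the lecture does not say which optimizer was used.
```python
# Hypothetical toy example: rank-1 factorization X ~ u v^T by alternating least squares.
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]
def squared_error(X, u, v):
    # Sum of squared reconstruction errors over all entries.
    return sum((X[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))
def factorize_rank1(X, iterations=20):
    m, n = len(X), len(X[0])
    v = [1.0] * n  # arbitrary non-zero starting point
    u = [0.0] * m
    for _ in range(iterations):
        # With v fixed, each u[i] has a closed-form least-squares solution.
        vv = sum(x * x for x in v)
        u = [sum(X[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # With u fixed, do the same for each v[j].
        uu = sum(x * x for x in u)
        v = [sum(X[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v
X = outer([1.0, 2.0, 3.0], [4.0, 5.0])  # exactly rank 1, so the error can reach ~0
u, v = factorize_rank1(X)
print(squared_error(X, u, v))
```
Real data is rarely exactly low rank, so in practice the error levels off above zero.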

00:25:53.220 --> 00:25:57.230
Well, you can play the
same game for a tensor

00:25:57.230 --> 00:26:02.900
by having a so-called core
tensor, which identifies

00:26:02.900 --> 00:26:14.660
the subset of characteristics
that subdivide

00:26:14.660 --> 00:26:18.050
that dimension of your data.

00:26:18.050 --> 00:26:20.630
And then what you
do is the same game.

00:26:20.630 --> 00:26:26.810
You have matrices corresponding
to each of the dimensions.

00:26:26.810 --> 00:26:29.090
And if you multiply
this core tensor

00:26:29.090 --> 00:26:32.240
by each of these
matrices, you reconstruct

00:26:32.240 --> 00:26:34.460
the original tensor.

00:26:34.460 --> 00:26:37.730
And you can train it again to
minimize the reconstruction

00:26:37.730 --> 00:26:40.100
loss.
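
NOTE
Editorial sketch, not from the lecture slides: the reconstruction step described above,
written out as the standard Tucker-style formula. The names G, A, B, C and the toy
axes (patients, features, time points) are assumptions chosen for illustration.
```python
# Hypothetical sketch of Tucker-style reconstruction:
# X[i][j][k] = sum over p, q, r of G[p][q][r] * A[i][p] * B[j][q] * C[k][r]
def reconstruct(G, A, B, C):
    P, Q, R = len(G), len(G[0]), len(G[0][0])
    return [[[sum(G[p][q][r] * A[i][p] * B[j][q] * C[k][r]
                  for p in range(P) for q in range(Q) for r in range(R))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]
# Tiny made-up example: a 1x1x1 core and three factor matrices with one column each.
G = [[[2.0]]]
A = [[1.0], [2.0]]          # 2 patients x 1 latent group
B = [[1.0], [0.0], [3.0]]   # 3 features x 1 latent group
C = [[1.0], [10.0]]         # 2 time points x 1 latent group
X = reconstruct(G, A, B, C)
print(X[1][2][1])  # 2 * 2 * 3 * 10
```
Training would adjust G, A, B, and C to shrink the gap between this reconstruction and the observed tensor.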

00:26:40.100 --> 00:26:43.130
So there are, again,
a few more tricks.

00:26:43.130 --> 00:26:45.810
Because this is
dealing with language.

00:26:45.810 --> 00:26:50.660
And so this is a typical report
from one of these lymphoma

00:26:50.660 --> 00:26:55.580
pathologists that says
immunohistochemical stains show

00:26:55.580 --> 00:26:58.850
that the follicles-- blah,
blah, blah, blah, blah--

00:26:58.850 --> 00:27:01.760
so lots and lots of details.

00:27:01.760 --> 00:27:05.000
And so he needed a
representation that

00:27:05.000 --> 00:27:08.780
could be put into
this matrix tensor,

00:27:08.780 --> 00:27:13.610
this tensor factorization form.

00:27:13.610 --> 00:27:16.460
And what he did is to
say, well, let's see.

00:27:16.460 --> 00:27:18.770
If we look at a
statement like this,

00:27:18.770 --> 00:27:22.550
immuno stains show that
large atypical cells

00:27:22.550 --> 00:27:28.520
are strongly positive for
CD30, negative for these other

00:27:28.520 --> 00:27:31.590
surface expressions.

00:27:31.590 --> 00:27:35.480
So the sentence tells us
relationships among procedures,

00:27:35.480 --> 00:27:39.350
types of cells, and
immunologic factors.

00:27:39.350 --> 00:27:43.010
And for feature choice,
we can use words.

00:27:43.010 --> 00:27:45.770
Or we can use UMLS concepts.

00:27:45.770 --> 00:27:48.590
Or we can find various
kinds of mappings.

00:27:48.590 --> 00:27:53.900
But he decided that
in order to retain

00:27:53.900 --> 00:27:57.770
the syntactic relationships
here, what he would do

00:27:57.770 --> 00:28:01.760
is to use a graphical
representation that

00:28:01.760 --> 00:28:06.650
came out of, again, parsing
all of these sentences.

00:28:06.650 --> 00:28:11.780
And so what you get is that
this creates one graph that

00:28:11.780 --> 00:28:17.750
talks about the strongly
positive for CD30,

00:28:17.750 --> 00:28:20.550
large atypical cells, et cetera.

00:28:20.550 --> 00:28:24.470
And then you can factor
this into subgraphs.

00:28:24.470 --> 00:28:27.860
And then you also have
to identify frequently

00:28:27.860 --> 00:28:29.570
occurring subgraphs.

00:28:29.570 --> 00:28:32.630
So for example,
large atypical cells

00:28:32.630 --> 00:28:36.380
appears here, and also appears
there, and of course will

00:28:36.380 --> 00:28:38.230
appear in many other places.

00:28:38.230 --> 00:28:38.877
Yeah?

00:28:38.877 --> 00:28:42.595
AUDIENCE: Is this parsing
domain- and language-agnostic?

00:28:42.595 --> 00:28:43.970
For example, did
they incorporate

00:28:43.970 --> 00:28:45.512
some sort of medical
information here

00:28:45.512 --> 00:28:47.330
or some sort of linguistic--

00:28:47.330 --> 00:28:49.390
PETER SZOLOVITS: So in
this particular study,

00:28:49.390 --> 00:28:53.620
he was using the Stanford
Parser with some tricks.

00:28:53.620 --> 00:28:55.780
So the Stanford
Parser doesn't know

00:28:55.780 --> 00:28:57.310
a lot of the medical words.

00:28:57.310 --> 00:29:03.640
And so he basically marked
these things as noun phrases.

00:29:03.640 --> 00:29:05.980
And then the
Stanford Parser also

00:29:05.980 --> 00:29:10.200
doesn't do well with
long lists like the set

00:29:10.200 --> 00:29:14.470
of immune features.

00:29:14.470 --> 00:29:18.100
And so he would recognize
those as a pattern,

00:29:18.100 --> 00:29:21.520
substitute a single
made-up word for them,

00:29:21.520 --> 00:29:24.850
and that made the parser
work much better on it.

00:29:24.850 --> 00:29:27.250
So there were a whole
bunch of little tricks

00:29:27.250 --> 00:29:29.500
like that in order to adapt it.

00:29:29.500 --> 00:29:33.160
But it was not a model
trained specifically on this.

00:29:33.160 --> 00:29:36.790
I think it's trained on
Wall Street Journal corpus

00:29:36.790 --> 00:29:37.810
or something like that.

00:29:37.810 --> 00:29:39.667
So it's general English.

00:29:39.667 --> 00:29:42.250
AUDIENCE: Those are things that
he did manually as opposed to,

00:29:42.250 --> 00:29:44.447
say, [INAUDIBLE]?

00:29:44.447 --> 00:29:45.280
PETER SZOLOVITS: No.

00:29:45.280 --> 00:29:47.500
He did it algorithmically,
but he didn't

00:29:47.500 --> 00:29:50.230
learn which algorithms to use.

00:29:50.230 --> 00:29:52.300
He made them up by hand.

00:29:52.300 --> 00:29:54.340
But then, of course,
it's a big corpus.

00:29:54.340 --> 00:29:56.590
And he ran these
programs over it

00:29:56.590 --> 00:29:58.810
that did those transformations.

00:29:58.810 --> 00:30:01.420
So he calls it
two-phase parsing.

00:30:01.420 --> 00:30:05.950
There's a reference to his
paper on the first slide

00:30:05.950 --> 00:30:08.560
in this section if you're
interested in the details.

00:30:08.560 --> 00:30:11.470
It's described there.

00:30:11.470 --> 00:30:16.000
So what he wound
up with is a tensor

00:30:16.000 --> 00:30:20.890
that has patients on one
axis, the words appearing

00:30:20.890 --> 00:30:23.680
in the text on another axis.

00:30:23.680 --> 00:30:27.730
So he's still using a
bag-of-words representation.

00:30:27.730 --> 00:30:30.370
But the third axis is
these language concept

00:30:30.370 --> 00:30:33.650
subgraphs that we
were talking about.

00:30:33.650 --> 00:30:36.790
And then he does tensor
factorization on this.

00:30:36.790 --> 00:30:40.360
And what's interesting
is that it works

00:30:40.360 --> 00:30:42.620
much better than I expected.

00:30:42.620 --> 00:30:49.540
So if you look at his technique,
which he called SANTF,

00:30:49.540 --> 00:30:55.450
the precision and recall
are about 0.72 and 0.854

00:30:55.450 --> 00:31:00.400
macro-average and
0.754 micro-average,

00:31:00.400 --> 00:31:04.510
which is much better than
the non-negative matrix

00:31:04.510 --> 00:31:09.460
factorization results, which
only use patient by word

00:31:09.460 --> 00:31:14.860
or patient by subgraph, or,
in fact, one where you simply

00:31:14.860 --> 00:31:19.180
do patient and concatenate
the subgraphs and the words

00:31:19.180 --> 00:31:20.740
in one dimension.

00:31:20.740 --> 00:31:24.160
So that means that this is
actually taking advantage

00:31:24.160 --> 00:31:27.090
of the three-way relationship.

00:31:27.090 --> 00:31:31.620
If you read papers from
about 15, 20 years ago,

00:31:31.620 --> 00:31:35.680
people got very excited about
the idea of bi-clustering,

00:31:35.680 --> 00:31:40.140
which is, in modern terms,
the equivalent of matrix

00:31:40.140 --> 00:31:41.590
factorization.

00:31:41.590 --> 00:31:45.600
So it says given two
dimensions of data,

00:31:45.600 --> 00:31:48.570
and I want to
cluster things, but I

00:31:48.570 --> 00:31:50.820
want to cluster
them in such a way

00:31:50.820 --> 00:31:53.160
that the clustering
of one dimension

00:31:53.160 --> 00:31:56.370
helps the clustering
of the other dimension.

00:31:56.370 --> 00:32:00.870
So this is a formal way of doing
that relatively efficiently.

00:32:00.870 --> 00:32:04.170
And tensor factorization is
essentially tri-clustering.

00:32:07.190 --> 00:32:13.320
So now I'm going to turn to
the last of today's big topics,

00:32:13.320 --> 00:32:15.080
which is language modeling.

00:32:15.080 --> 00:32:18.140
And this is really where
the action is nowadays

00:32:18.140 --> 00:32:21.210
in natural language
processing in general.

00:32:21.210 --> 00:32:24.020
I would say that the
natural language processing

00:32:24.020 --> 00:32:28.010
on clinical data is
somewhat behind the state

00:32:28.010 --> 00:32:32.270
of the art in natural
language processing overall.

00:32:32.270 --> 00:32:34.520
There are fewer corpora
that are available.

00:32:34.520 --> 00:32:37.220
There are fewer
people working on it.

00:32:37.220 --> 00:32:40.460
And so we're catching up.

00:32:40.460 --> 00:32:44.000
But I'm going to lead
into this somewhat gently.

00:32:44.000 --> 00:32:47.660
So what does it mean
to model a language?

00:32:47.660 --> 00:32:50.270
I mean, you could imagine
saying it's coming up

00:32:50.270 --> 00:32:55.550
with a set of parsing rules that
define the syntactic structure

00:32:55.550 --> 00:32:56.840
of the language.

00:32:56.840 --> 00:33:00.110
Or you could imagine
saying, as we suggested

00:33:00.110 --> 00:33:03.350
last time, coming up
with a corresponding set

00:33:03.350 --> 00:33:07.060
of semantic rules
that say a concept

00:33:07.060 --> 00:33:10.970
or terms in the language
correspond to certain concepts

00:33:10.970 --> 00:33:13.790
and that they are
combinatorially,

00:33:13.790 --> 00:33:17.420
functionally combined
as the syntax directs,

00:33:17.420 --> 00:33:20.870
in order to give us a
semantic representation.

00:33:20.870 --> 00:33:24.090
So we don't know how to do
either of those very well.

00:33:24.090 --> 00:33:26.810
And so the current,
the contemporary idea

00:33:26.810 --> 00:33:30.200
about language
modeling is to say,

00:33:30.200 --> 00:33:33.230
given a sequence of tokens,
predict the next token.

00:33:35.750 --> 00:33:38.900
If you could do that
perfectly, presumably you

00:33:38.900 --> 00:33:41.480
would have a good
language model.

00:33:41.480 --> 00:33:43.790
So obviously, you
can't do it perfectly.

00:33:43.790 --> 00:33:47.910
Because we don't always
say the same word

00:33:47.910 --> 00:33:52.630
after some sequence of
previous words when we speak.

00:33:52.630 --> 00:33:56.220
But probabilistically,
you can get close to that.

00:33:56.220 --> 00:33:59.970
And there's usually some
kind of Markov assumption

00:33:59.970 --> 00:34:04.380
that says that the probability
of emitting a token

00:34:04.380 --> 00:34:10.230
given the stuff that came before
it is ordinarily dependent

00:34:10.230 --> 00:34:18.600
only on n previous words
rather than on all of history,

00:34:18.600 --> 00:34:21.659
on everything you've ever
said before in your life.

00:34:24.480 --> 00:34:30.570
And there's a measure
called perplexity,

00:34:30.570 --> 00:34:34.230
which is derived from the entropy of the
probability distribution

00:34:34.230 --> 00:34:36.570
over the predicted words.

00:34:36.570 --> 00:34:39.900
And roughly speaking, it's
the number of likely ways

00:34:39.900 --> 00:34:47.610
that you could continue the
text if all of the possibilities

00:34:47.610 --> 00:34:50.710
were equally likely.
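
NOTE
Editorial sketch: one common way to make the "number of equally likely continuations"
reading precise is perplexity = 2 ** entropy (entropy in bits). The distributions below
are made up; for a real language model, perplexity is usually computed from the
cross-entropy per token on held-out text rather than from a single distribution.
```python
import math
def entropy_bits(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)
def perplexity(probs):
    # 2 ** entropy: the size of a uniform distribution with the same entropy.
    return 2 ** entropy_bits(probs)
uniform = [1 / 8] * 8
skewed = [0.9, 0.05, 0.05]
print(perplexity(uniform))  # 8 equally likely continuations
print(perplexity(skewed))   # well below 3, since one continuation dominates
```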

00:34:50.710 --> 00:34:54.989
So perplexity is often used, for
example, in speech processing.

00:34:57.970 --> 00:34:59.680
We did a study
where we were trying

00:34:59.680 --> 00:35:03.280
to build a speech system that
understood a conversation

00:35:03.280 --> 00:35:05.560
between a doctor and a patient.

00:35:05.560 --> 00:35:08.260
And we ran into real
problems, because we

00:35:08.260 --> 00:35:12.280
were using software that had
been developed to interpret

00:35:12.280 --> 00:35:14.710
dictation by doctors.

00:35:14.710 --> 00:35:16.600
And that was very well trained.

00:35:16.600 --> 00:35:19.990
But it turned out-- we didn't
know this when we started--

00:35:19.990 --> 00:35:24.610
that the language that doctors
use in dictating medical notes

00:35:24.610 --> 00:35:27.490
is pretty straightforward,
pretty simple.

00:35:27.490 --> 00:35:32.730
And so its perplexity
is about nine,

00:35:32.730 --> 00:35:37.050
whereas conversations are much
more free flowing and cover

00:35:37.050 --> 00:35:38.700
many more topics.

00:35:38.700 --> 00:35:42.450
And so its perplexity
is about 73.

00:35:42.450 --> 00:35:46.230
And so the model that works
well for perplexity nine

00:35:46.230 --> 00:35:50.490
doesn't work as well
for perplexity 73.

00:35:50.490 --> 00:35:54.840
And so what this tells you about
the difficulty of accurately

00:35:54.840 --> 00:35:58.320
transcribing speech
is that it's hard.

00:35:58.320 --> 00:35:59.700
It's much harder.

00:35:59.700 --> 00:36:02.930
And that's still not
a solved problem.

00:36:02.930 --> 00:36:06.350
Now, you probably all
know about Zipf's law.

00:36:06.350 --> 00:36:10.790
So if you empirically
just take all the words

00:36:10.790 --> 00:36:15.350
in all the literature of, let's
say, English, what you discover

00:36:15.350 --> 00:36:20.990
is that the n-th word
is about one over n

00:36:20.990 --> 00:36:24.450
as probable as the first word.

00:36:24.450 --> 00:36:28.000
So there is a
long-tailed distribution.

00:36:28.000 --> 00:36:29.710
One thing you should
realize, of course,

00:36:29.710 --> 00:36:33.280
is if you integrate one over
n from zero to infinity,

00:36:33.280 --> 00:36:35.980
it's infinite.

00:36:35.980 --> 00:36:39.700
And that may not be an
inaccurate representation

00:36:39.700 --> 00:36:44.140
of language, because language
is productive and changes.

00:36:44.140 --> 00:36:47.860
And people make up new words
all the time and so on.

00:36:47.860 --> 00:36:49.840
So it may actually be infinite.

00:36:49.840 --> 00:36:54.260
But roughly speaking, there is
a kind of decline like this.
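
NOTE
Editorial sketch: a small numerical illustration of the 1/n decline and of why the
total keeps growing as the vocabulary grows. It assumes an idealized Zipf distribution
over a finite vocabulary; the vocabulary sizes are arbitrary.
```python
import math
def zipf_probabilities(vocab_size):
    # Rank r gets weight 1/r, normalized by the harmonic number H(vocab_size).
    weights = [1 / r for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]
def harmonic(n):
    return sum(1 / r for r in range(1, n + 1))
probs = zipf_probabilities(1000)
print(probs[0] / probs[9])          # the 10th word is about 1/10 as probable as the 1st
for n in (10, 1000, 100000):
    print(n, harmonic(n), math.log(n))  # the sum keeps growing, roughly like ln(n)
```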

00:36:54.260 --> 00:36:56.980
And interestingly,
in the Brown corpus,

00:36:56.980 --> 00:37:01.540
the top 10 words make
up almost a quarter

00:37:01.540 --> 00:37:03.640
of the size of the corpus.

00:37:03.640 --> 00:37:07.500
So you write a lot of thes,
ofs, ands, a's, tos, ins,

00:37:07.500 --> 00:37:14.470
et cetera, and much less
hematemesis, obviously.

00:37:17.460 --> 00:37:19.590
So what about n-gram models?

00:37:19.590 --> 00:37:22.770
Well, remember, if we make
this Markov assumption,

00:37:22.770 --> 00:37:25.530
then all we have to
do is pay attention

00:37:25.530 --> 00:37:27.960
to the last n tokens
before the one

00:37:27.960 --> 00:37:30.310
that we're interested
in predicting.

00:37:30.310 --> 00:37:34.800
And so people have generated
these large corpora of n-grams.

00:37:34.800 --> 00:37:38.670
So for example, somebody,
a couple of decades ago,

00:37:38.670 --> 00:37:41.700
took all of
Shakespeare's writings--

00:37:41.700 --> 00:37:43.320
I think they were
trying to decide

00:37:43.320 --> 00:37:45.780
whether he had
written all his works

00:37:45.780 --> 00:37:49.230
or whether the earl
of somebody or other

00:37:49.230 --> 00:37:52.120
was actually the guy
who wrote Shakespeare.

00:37:52.120 --> 00:37:54.810
You know about this controversy?

00:37:54.810 --> 00:37:56.570
Yeah.

00:37:56.570 --> 00:37:58.190
So that's why they
were doing it.

00:37:58.190 --> 00:38:00.890
But anyway, they
created this corpus.

00:38:00.890 --> 00:38:03.290
And they said--
so Shakespeare had

00:38:03.290 --> 00:38:07.790
a vocabulary of about
30,000 words and about

00:38:07.790 --> 00:38:17.480
300,000 bigrams, out of
844 million possible bigrams.

00:38:17.480 --> 00:38:22.970
So 99.96% of bigrams
were never seen.
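
NOTE
Editorial sketch: a quick arithmetic check of the figures just quoted, assuming the
number of possible bigrams is the vocabulary size squared. Only the two counts come
from the lecture; the variable names are illustrative.
```python
# Figures quoted in the lecture; the vocabulary size is only "about 30,000".
observed_bigrams = 300_000
possible_bigrams = 844_000_000
implied_vocabulary = possible_bigrams ** 0.5   # possible bigrams = vocabulary ** 2
unseen_percent = 100 * (1 - observed_bigrams / possible_bigrams)
print(round(implied_vocabulary))   # roughly 29,000 word types
print(round(unseen_percent, 2))    # 99.96
```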

00:38:22.970 --> 00:38:27.170
So there's a certain regularity
to his production of language.

00:38:27.170 --> 00:38:30.310
Now, Google, of course,
did Shakespeare one better.

00:38:30.310 --> 00:38:34.100
And they said, hmm, we can
take a terabyte corpus--

00:38:34.100 --> 00:38:36.230
this was in 2006.

00:38:36.230 --> 00:38:40.400
I wouldn't be surprised if
it's a petabyte corpus today.

00:38:40.400 --> 00:38:41.540
And they published this.

00:38:41.540 --> 00:38:43.190
They just made it available.

00:38:43.190 --> 00:38:46.520
So there were 13.6
million unique words

00:38:46.520 --> 00:38:51.290
that occurred at least 200
times in this tera-word corpus.

00:38:51.290 --> 00:38:55.490
And there were 1.2 billion
five-word sequences that

00:38:55.490 --> 00:38:57.500
occurred at least 40 times.

00:38:57.500 --> 00:38:59.060
So these are the statistics.

00:38:59.060 --> 00:39:02.090
And if you're interested,
there's a URL.

00:39:02.090 --> 00:39:05.240
And here's a very tiny
part of their database.

00:39:05.240 --> 00:39:11.210
So ceramics collectables
collectibles--

00:39:11.210 --> 00:39:16.550
I don't know-- occurred 55
times in a terabyte of text.

00:39:16.550 --> 00:39:20.670
Ceramics collectibles fine,
ceramics collectibles by,

00:39:20.670 --> 00:39:25.140
pottery, cooking, comma, period,
end of sentence, and, at,

00:39:25.140 --> 00:39:26.490
is, et cetera--

00:39:26.490 --> 00:39:27.780
different number of times.

00:39:27.780 --> 00:39:32.640
Ceramics comes from
occurred 660 times,

00:39:32.640 --> 00:39:35.880
which is a reasonably large
number compared to some

00:39:35.880 --> 00:39:37.920
of its competitors here.

00:39:37.920 --> 00:39:40.890
If you look at
four-grams, you see

00:39:40.890 --> 00:39:44.070
things like serve as the
incoming, blah, blah,

00:39:44.070 --> 00:39:49.500
blah, 92 times; serve
as the index, 223 times;

00:39:49.500 --> 00:39:53.860
serve as the
initial, 5,300 times.

00:39:53.860 --> 00:39:56.730
So you've got all
these statistics.

00:39:56.730 --> 00:40:02.430
And now, given those statistics,
we can then build a generator.

00:40:02.430 --> 00:40:05.940
So we can say, all right.

00:40:05.940 --> 00:40:08.700
Suppose I start with
the token, which

00:40:08.700 --> 00:40:11.400
is the beginning of a
sentence, or the separator

00:40:11.400 --> 00:40:13.180
between sentences.

00:40:13.180 --> 00:40:16.380
And I say sample a
random bigram starting

00:40:16.380 --> 00:40:19.350
with the beginning of
a sentence and a word,

00:40:19.350 --> 00:40:24.240
according to its probability,
and then sample the next bigram

00:40:24.240 --> 00:40:27.670
from that word and
all the other words,

00:40:27.670 --> 00:40:30.630
according to its
probability, and keep

00:40:30.630 --> 00:40:34.920
doing that until you hit
the end of sentence marker.

00:40:34.920 --> 00:40:40.020
So for example, here I'm
generating the sentence,

00:40:40.020 --> 00:40:43.860
I, starts with I,
then followed by want,

00:40:43.860 --> 00:40:47.730
followed by to, followed
by get, followed by Chinese,

00:40:47.730 --> 00:40:51.120
followed by food, followed
by end of sentence.

00:40:51.120 --> 00:40:53.100
So I've just generated,
"I want to get

00:40:53.100 --> 00:40:57.780
Chinese food," which sounds
like a perfectly good sentence.
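
NOTE
Editorial sketch: a minimal bigram generator following the sampling procedure just
described. The three-sentence corpus and the function name are hypothetical stand-ins
for real n-gram counts; <s> and </s> are assumed start and end markers.
```python
import random
from collections import Counter, defaultdict
# Hypothetical toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> i want to get chinese food </s>",
    "<s> i want to eat </s>",
    "<s> i like chinese food </s>",
]
# Count how often each word follows each preceding word.
followers = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
def generate(rng, max_len=50):
    word, out = "<s>", []
    while len(out) < max_len:
        options = followers[word]
        # Sample the next word in proportion to its bigram count.
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return out
print(" ".join(generate(random.Random(0))))
```
Every adjacent pair in the output was seen in the corpus, which is why longer n-grams sound progressively more like the source.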

00:40:57.780 --> 00:40:59.170
So here's what's interesting.

00:40:59.170 --> 00:41:02.130
If you look back again
at the Shakespeare corpus

00:41:02.130 --> 00:41:07.220
and say, if we generated
Shakespeare from unigrams,

00:41:07.220 --> 00:41:09.800
you get stuff like
at the top, "To him

00:41:09.800 --> 00:41:12.110
swallowed confess hear both.

00:41:12.110 --> 00:41:13.540
Which.

00:41:13.540 --> 00:41:19.100
Of save on trail for are ay
device and rote life have."

00:41:19.100 --> 00:41:21.350
It doesn't sound terribly good.

00:41:21.350 --> 00:41:23.240
It's not very grammatical.

00:41:23.240 --> 00:41:30.140
It doesn't have that sort of
Shakespearean English flavor.

00:41:30.140 --> 00:41:34.250
Although, you do have words
like nave and ay and so on that

00:41:34.250 --> 00:41:36.680
are vaguely reminiscent.

00:41:36.680 --> 00:41:38.930
Now, if you go to
bigrams, it starts

00:41:38.930 --> 00:41:40.550
to sound a little better.

00:41:40.550 --> 00:41:41.480
"What means, sir.

00:41:41.480 --> 00:41:43.040
I confess she?

00:41:43.040 --> 00:41:45.450
Then all sorts, he
is trim, captain."

00:41:49.400 --> 00:41:51.060
That doesn't make any sense.

00:41:51.060 --> 00:41:53.630
But it starts to
sound a little better.

00:41:53.630 --> 00:41:56.870
And with trigrams, we get,
"Sweet prince, Falstaff

00:41:56.870 --> 00:41:57.980
shall die.

00:41:57.980 --> 00:42:01.460
Harry of Monmouth," et cetera.

00:42:01.460 --> 00:42:05.540
So this is beginning to
sound a little Shakespearean.

00:42:05.540 --> 00:42:08.220
And if you go to quadrigrams,
you get, "King Henry.

00:42:08.220 --> 00:42:08.720
What?

00:42:08.720 --> 00:42:11.180
I will go seek the
traitor Gloucester.

00:42:11.180 --> 00:42:13.090
Exeunt some of the watch.

00:42:13.090 --> 00:42:17.960
A great banquet
serv'd in," et cetera.

00:42:17.960 --> 00:42:23.090
I mean, when I first saw this,
like 20 years ago or something,

00:42:23.090 --> 00:42:24.320
I was stunned.

00:42:24.320 --> 00:42:26.840
This is actually
generating stuff

00:42:26.840 --> 00:42:30.170
that sounds vaguely
Shakespearean and vaguely

00:42:30.170 --> 00:42:33.570
English-like.

00:42:33.570 --> 00:42:37.070
Here's an example of generating
the Wall Street Journal.

00:42:37.070 --> 00:42:42.410
So from unigrams, "Months
the my and issue of year

00:42:42.410 --> 00:42:45.830
foreign new exchanges
September were recession."

00:42:45.830 --> 00:42:47.600
It's word salad.

00:42:47.600 --> 00:42:50.600
But if you go to trigrams,
"They also point to ninety nine

00:42:50.600 --> 00:42:54.020
point six billion
from two hundred four

00:42:54.020 --> 00:42:57.980
oh six three percent of the
rates of interest stores

00:42:57.980 --> 00:43:00.050
as Mexico and Brazil."

00:43:00.050 --> 00:43:03.620
So you could imagine that this
is some Wall Street Journal

00:43:03.620 --> 00:43:09.080
writer on acid
writing this text.

00:43:09.080 --> 00:43:13.850
Because it has a little bit
of the right kind of flavor.

00:43:13.850 --> 00:43:17.570
So more recently,
people said, well,

00:43:17.570 --> 00:43:22.520
we ought to be able to make use
of this in some systematic way

00:43:22.520 --> 00:43:25.730
to help us with our
language analysis tasks.

00:43:25.730 --> 00:43:31.790
So to me, the first
effort in this direction

00:43:31.790 --> 00:43:35.240
was Word2Vec, which
was Mikolov's approach

00:43:35.240 --> 00:43:36.440
to doing this.

00:43:36.440 --> 00:43:38.540
And he developed two models.

00:43:38.540 --> 00:43:45.260
He said, let's build a
continuous bag-of-words model

00:43:45.260 --> 00:43:47.420
that says what
we're going to use

00:43:47.420 --> 00:43:54.850
is co-occurrence data on a
series of tokens in the text

00:43:54.850 --> 00:43:56.840
that we're trying to model.

00:43:56.840 --> 00:43:59.300
And we're going to
use a neural network

00:43:59.300 --> 00:44:05.250
model to predict the word
from the words around it.

00:44:05.250 --> 00:44:07.730
And in that process,
we're going to use

00:44:07.730 --> 00:44:13.910
the parameters of that neural
network model as a vector.

00:44:13.910 --> 00:44:17.060
And that vector will be the
representation of that word.

00:44:19.590 --> 00:44:21.830
And so what we're
going to find is

00:44:21.830 --> 00:44:26.580
that words that tend to
appear in the same context

00:44:26.580 --> 00:44:29.460
will have similar
representations

00:44:29.460 --> 00:44:31.835
in this high-dimensional vector.

00:44:31.835 --> 00:44:33.210
And by the way,
high-dimensional,

00:44:33.210 --> 00:44:37.670
people typically use like 300
or 500 dimensional vectors.

00:44:37.670 --> 00:44:39.060
So there's a lot of--

00:44:39.060 --> 00:44:40.830
it's a big space.

00:44:40.830 --> 00:44:43.860
And the words are
scattered throughout this.

00:44:43.860 --> 00:44:48.140
But you get this
kind of cohesion,

00:44:48.140 --> 00:44:53.470
where words that are
used in the same context

00:44:53.470 --> 00:44:55.430
appear close to each other.

00:44:55.430 --> 00:44:58.250
And the extrapolation
of that is that if words

00:44:58.250 --> 00:45:00.500
are used in the
same context, maybe

00:45:00.500 --> 00:45:03.890
they share something
about meaning.

00:45:03.890 --> 00:45:06.405
So the other model
is a skip-gram model,

00:45:06.405 --> 00:45:07.780
where you're doing
the prediction

00:45:07.780 --> 00:45:08.920
in the other direction.

00:45:08.920 --> 00:45:13.300
From a word, you're predicting
the words that are around it.

00:45:13.300 --> 00:45:16.330
And again, you are using
a neural network model

00:45:16.330 --> 00:45:17.950
to do that.

00:45:17.950 --> 00:45:20.800
And you use the
parameters of that model

00:45:20.800 --> 00:45:27.050
in order to represent the
word that you're focused on.

00:45:27.050 --> 00:45:31.240
So what came as a surprise
to me is this claim that's

00:45:31.240 --> 00:45:35.830
in his original paper, which
is that not only do you

00:45:35.830 --> 00:45:43.030
get this effect of locality
as corresponding meaning

00:45:43.030 --> 00:45:46.630
but that you get relationships
that are geometrically

00:45:46.630 --> 00:45:50.770
represented in the space
of these embeddings.

00:45:50.770 --> 00:45:53.980
And so what you
see is that if you

00:45:53.980 --> 00:45:58.510
take the encoding of the
word man and the word woman

00:45:58.510 --> 00:46:01.450
and look at the vector
difference between them,

00:46:01.450 --> 00:46:05.530
and then apply that same
vector difference to king,

00:46:05.530 --> 00:46:07.570
you get close to queen.

00:46:07.570 --> 00:46:11.410
And if you apply it to uncle,
you get close to aunt.
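
NOTE
Editorial sketch: the vector-offset idea on hand-made 2-D vectors. These numbers are
invented so that the offset works exactly; learned embeddings only approximate it.
The function names are hypothetical and this is not the Word2Vec training code.
```python
import math
# Hand-made 2-D vectors, purely illustrative; real embeddings are learned and have hundreds of dimensions.
vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.1],
    "queen": [3.0, 1.1],
    "uncle": [2.0, 0.0],
    "aunt":  [2.0, 1.0],
}
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))
def analogy(a, b, c):
    # Apply the offset (b - a) to c, then return the nearest other word by cosine similarity.
    target = [cv + bv - av for av, bv, cv in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
print(analogy("man", "woman", "king"))
print(analogy("man", "woman", "uncle"))
```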

00:46:11.410 --> 00:46:13.630
And so they showed a
number of examples.

00:46:13.630 --> 00:46:15.520
And then people
have studied this.

00:46:15.520 --> 00:46:17.500
It doesn't hold
perfectly well.

00:46:17.500 --> 00:46:21.010
I mean, it's not like we've
solved the semantics problem.

00:46:21.010 --> 00:46:24.040
But it is a genuine
relationship.

00:46:24.040 --> 00:46:25.930
The place where it
doesn't work well

00:46:25.930 --> 00:46:30.460
is when some of these things are
much more frequent than others.

00:46:30.460 --> 00:46:33.970
And so one of the examples
that's often cited

00:46:33.970 --> 00:46:41.420
is if you go, London is to
England as Paris is to France,

00:46:41.420 --> 00:46:43.040
and that one works.

00:46:43.040 --> 00:46:47.950
But then you say as Kuala
Lumpur is to Malaysia,

00:46:47.950 --> 00:46:50.500
and that one doesn't
work so well.

00:46:50.500 --> 00:46:57.310
And then you go, as
Juba or something

00:46:57.310 --> 00:47:01.090
is to whatever country
it's the capital of.

00:47:01.090 --> 00:47:05.140
And since we don't write about
Africa in our newspapers,

00:47:05.140 --> 00:47:07.040
there's very little
data on that.

00:47:07.040 --> 00:47:10.420
And so that doesn't
work so well.

00:47:10.420 --> 00:47:13.150
So there was this
other paper later

00:47:13.150 --> 00:47:16.960
from van der Maaten
and Geoff Hinton,

00:47:16.960 --> 00:47:19.930
where they came up with
a visualization method

00:47:19.930 --> 00:47:22.180
to take these
high-dimensional vectors

00:47:22.180 --> 00:47:25.090
and visualize them
in two dimensions.

00:47:25.090 --> 00:47:28.750
And what you see is that if
you take a bunch of concepts

00:47:28.750 --> 00:47:30.520
that are count concepts--

00:47:30.520 --> 00:47:36.490
so 1/2, 30, 15, 5, 4, 2,
3, several, some, many,

00:47:36.490 --> 00:47:38.530
et cetera--

00:47:38.530 --> 00:47:41.450
there is a geometric
relationship between them.

00:47:41.450 --> 00:47:45.380
So they, in fact, do map to
the same part of the space.

00:47:45.380 --> 00:47:48.970
Similarly, minister, leader,
president, chairman, director,

00:47:48.970 --> 00:47:51.580
spokesman, chief,
head, et cetera

00:47:51.580 --> 00:47:54.420
form a kind of
cluster in the space.

00:47:54.420 --> 00:47:58.540
So there's definitely
something to this.

00:47:58.540 --> 00:48:04.120
I promised you that I would
get back to a different attempt

00:48:04.120 --> 00:48:06.880
to try to take a
core of concepts

00:48:06.880 --> 00:48:09.640
that you want to use
for term-spotting

00:48:09.640 --> 00:48:13.780
and develop an automated way of
enlarging that set of concepts

00:48:13.780 --> 00:48:17.080
in order to give you a
richer vocabulary by which

00:48:17.080 --> 00:48:20.480
to try to identify cases
that you're interested in.

00:48:20.480 --> 00:48:23.480
So this was by some
of my colleagues,

00:48:23.480 --> 00:48:27.310
including Kat, who
you saw on Tuesday.

00:48:27.310 --> 00:48:32.800
And they said,
well, what we'd like

00:48:32.800 --> 00:48:35.770
is a fully automated and
robust, unsupervised feature

00:48:35.770 --> 00:48:38.860
selection method that
leverages only publicly

00:48:38.860 --> 00:48:42.910
available medical knowledge
sources instead of EHR data.

00:48:42.910 --> 00:48:46.690
So the method that David's
group had developed,

00:48:46.690 --> 00:48:49.870
which we talked about
earlier, uses data

00:48:49.870 --> 00:48:51.790
from electronic
health records, which

00:48:51.790 --> 00:48:54.520
means that you move
to different hospitals

00:48:54.520 --> 00:48:56.690
and there may be
different conventions.

00:48:56.690 --> 00:48:58.390
And you might
imagine that you have

00:48:58.390 --> 00:49:03.880
to retrain that sort of method,
whereas here the idea is

00:49:03.880 --> 00:49:06.910
to derive these surrogate
features from knowledge

00:49:06.910 --> 00:49:08.110
sources.

00:49:08.110 --> 00:49:13.330
So unlike that earlier model,
here they built a Word2Vec

00:49:13.330 --> 00:49:17.620
skip-gram model from about 5
million Springer articles--

00:49:17.620 --> 00:49:21.610
so these are published
medical articles--

00:49:21.610 --> 00:49:25.420
to yield 500 dimensional
vectors for each word.

00:49:25.420 --> 00:49:29.800
And then what they did is
they took the concept names

00:49:29.800 --> 00:49:33.130
that they were interested
in and their definitions

00:49:33.130 --> 00:49:38.580
from the UMLS, and
then they summed

00:49:38.580 --> 00:49:42.390
the word vectors for each
of these words, weighted

00:49:42.390 --> 00:49:44.650
by inverse document frequency.

00:49:44.650 --> 00:49:48.485
So it's sort of a
TF-IDF-like approach

00:49:48.485 --> 00:49:51.240
to weight different words.

00:49:51.240 --> 00:49:53.700
And then they went
out and they said, OK,

00:49:53.700 --> 00:49:56.610
for every disease that's
mentioned in Wikipedia,

00:49:56.610 --> 00:49:59.760
Medscape eMedicine, the
Merck Manuals Professional

00:49:59.760 --> 00:50:03.390
Edition, the Mayo Clinic
Diseases and Conditions,

00:50:03.390 --> 00:50:06.120
MedlinePlus Medical
Encyclopedia,

00:50:06.120 --> 00:50:09.330
they used named entity
recognition techniques

00:50:09.330 --> 00:50:15.550
to find all the concepts that
are related to this phenotype.

00:50:15.550 --> 00:50:19.080
So then they said, well,
there's a lot of randomness

00:50:19.080 --> 00:50:22.840
in these sources, and maybe
in our extraction techniques.

00:50:22.840 --> 00:50:25.320
But if we insist that
some concept appear

00:50:25.320 --> 00:50:28.810
in at least three of
these five sources,

00:50:28.810 --> 00:50:32.400
then we can be pretty confident
that it's a relevant concept.

00:50:32.400 --> 00:50:34.480
And so they said,
OK, we'll do that.

00:50:34.480 --> 00:50:37.130
Then they chose
the top k concepts

00:50:37.130 --> 00:50:41.190
whose embedding vectors are
closest by cosine distance

00:50:41.190 --> 00:50:43.020
to the embedding
of this phenotype

00:50:43.020 --> 00:50:44.850
that they've calculated.
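
NOTE
Editor's sketch of the two steps described above: an IDF-weighted sum of word vectors
to embed a phenotype, then the top k candidate concepts by cosine similarity.
The 2-dimensional vectors, IDF weights, and concept names are toy assumptions
(the lecture describes 500-dimensional vectors); this is not the authors' implementation.
```python
import math
def idf_weighted_sum(words, vectors, idf):
    """Sum the vectors of the known words, each scaled by its IDF weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for w in words:
        if w in vectors:
            weight = idf.get(w, 1.0)
            total = [t + weight * x for t, x in zip(total, vectors[w])]
    return total
def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0
def top_k(target, candidates, k):
    """Names of the k candidates most similar to target by cosine similarity."""
    ranked = sorted(candidates, key=lambda name: cosine(target, candidates[name]), reverse=True)
    return ranked[:k]
# Toy data: the common word "the" gets a low IDF weight, so it barely moves the sum
vectors = {"joint": [1.0, 0.0], "pain": [0.8, 0.2], "the": [0.5, 0.5], "rash": [0.0, 1.0]}
idf = {"joint": 2.0, "pain": 1.5, "the": 0.1, "rash": 2.0}
phenotype = idf_weighted_sum(["the", "joint", "pain"], vectors, idf)
candidates = {"arthralgia": [0.9, 0.1], "dermatitis": [0.1, 0.9], "swelling": [0.7, 0.3]}
nearest = top_k(phenotype, candidates, k=2)
print(nearest)
```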

00:50:44.850 --> 00:50:47.280
And they say, OK,
the phenotype is

00:50:47.280 --> 00:50:51.970
going to be a linear combination
of all these related concepts.

00:50:51.970 --> 00:50:55.840
So again, this is a bit
similar to what we saw before.

00:50:55.840 --> 00:50:58.110
But here, instead of
extracting the data

00:50:58.110 --> 00:51:01.110
from electronic medical
records, they're

00:51:01.110 --> 00:51:04.680
extracting it from published
literature and these web

00:51:04.680 --> 00:51:07.260
sources.

00:51:07.260 --> 00:51:16.230
And again, what you see is that
the expert-curated features

00:51:16.230 --> 00:51:22.050
for these five phenotypes,
which are coronary artery

00:51:22.050 --> 00:51:24.180
disease, rheumatoid
arthritis, Crohn's

00:51:24.180 --> 00:51:29.070
disease, ulcerative colitis,
and pediatric pulmonary arterial

00:51:29.070 --> 00:51:37.260
hypertension, they started
with 20 to 50 curated features.

00:51:37.260 --> 00:51:39.150
So these were the
ones that the doctors

00:51:39.150 --> 00:51:44.610
said, OK, these are the
anchors in David's terminology.

00:51:44.610 --> 00:51:51.090
And then they expanded
these to a larger set

00:51:51.090 --> 00:51:56.850
using the technique that I just
described, and then selected

00:51:56.850 --> 00:52:04.515
down to the top n that
were effective in finding

00:52:04.515 --> 00:52:06.360
relevant phenotypes.

00:52:06.360 --> 00:52:13.140
And this is a terrible graph
that summarizes the results.

00:52:13.140 --> 00:52:19.590
But what you're seeing is that
the orange lines are based

00:52:19.590 --> 00:52:22.830
on the expert-curated features.

00:52:22.830 --> 00:52:28.920
This is based on an earlier
version of trying to do this.

00:52:28.920 --> 00:52:33.000
And SEDFE is the technique
that I've just described.

00:52:33.000 --> 00:52:37.410
And what you see is that
the automatic techniques

00:52:37.410 --> 00:52:42.000
for many of these phenotypes
are just about as good

00:52:42.000 --> 00:52:44.760
as the manually curated ones.

00:52:44.760 --> 00:52:47.640
And of course, they require
much less manual curation.

00:52:47.640 --> 00:52:52.980
Because they're using this
automatic learning approach.

00:52:52.980 --> 00:52:56.100
Another interesting
example to return

00:52:56.100 --> 00:52:58.770
to the theme of
de-identification

00:52:58.770 --> 00:53:02.380
is that a couple of my
students, a few years ago,

00:53:02.380 --> 00:53:06.150
built a new de-identifier
that has this rather

00:53:06.150 --> 00:53:08.280
complicated architecture.

00:53:08.280 --> 00:53:13.680
So it starts with a
bi-directional recurrent neural

00:53:13.680 --> 00:53:18.330
network model that
is implemented

00:53:18.330 --> 00:53:23.280
over the character sequences
of words in the medical text.

00:53:23.280 --> 00:53:25.920
So why character sequences?

00:53:25.920 --> 00:53:27.841
Why might those be important?

00:53:33.140 --> 00:53:38.090
Well, consider a misspelled
word, for example.

00:53:38.090 --> 00:53:41.120
Most of the character
sequence is correct.

00:53:41.120 --> 00:53:44.600
There will be a bug in
it at the misspelling.

00:53:44.600 --> 00:53:47.540
Or consider that a
lot of medical terms

00:53:47.540 --> 00:53:50.060
are these compound
terms, where they're

00:53:50.060 --> 00:53:53.120
made up of lots of
pieces that correspond

00:53:53.120 --> 00:53:56.360
to Greek or Latin roots.

00:53:56.360 --> 00:54:00.440
So learning those can
actually be very helpful.

00:54:00.440 --> 00:54:02.990
So you start with that model.

00:54:02.990 --> 00:54:06.110
You then could
concatenate the results

00:54:06.110 --> 00:54:10.250
from both the left-running and
the right-running recurrent

00:54:10.250 --> 00:54:12.140
neural network.

00:54:12.140 --> 00:54:18.095
And concatenate that with
the Word2Vec embedding

00:54:18.095 --> 00:54:20.850
of the whole word.

00:54:20.850 --> 00:54:26.490
And you feed that into another
bi-directional RNN layer.

00:54:26.490 --> 00:54:33.050
And then for each word, you
take the output of those RNNs,

00:54:33.050 --> 00:54:36.650
run them through a feed-forward
neural network in order

00:54:36.650 --> 00:54:38.940
to estimate the prob--

00:54:38.940 --> 00:54:40.310
it's like a softmax.

00:54:40.310 --> 00:54:44.900
And you estimate the probability
of this word belonging

00:54:44.900 --> 00:54:49.280
to a particular category of
personally identifiable health

00:54:49.280 --> 00:54:50.300
information.

00:54:50.300 --> 00:54:51.440
So is it a name?

00:54:51.440 --> 00:54:52.520
Is it an address?

00:54:52.520 --> 00:54:53.570
Is it a phone number?

00:54:53.570 --> 00:54:56.150
Or whatever?

00:54:56.150 --> 00:54:59.480
And then the top layer is a
kind of conditional random

00:54:59.480 --> 00:55:04.970
field-like layer that imposes
a sequential probability

00:55:04.970 --> 00:55:10.490
distribution that says, OK,
if you've seen a name, then

00:55:10.490 --> 00:55:14.220
what's the next most likely
thing that you're going to see?

00:55:14.220 --> 00:55:19.220
And so you combine that with
the probability distributions

00:55:19.220 --> 00:55:24.920
for each word in order to
identify the category of PHI

00:55:24.920 --> 00:55:28.860
or non-PHI for that word.
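
NOTE
Editor's sketch of how a sequence layer can combine per-word label probabilities
with label-to-label transition preferences. It uses plain Viterbi decoding over
probabilities, which is a simplification of a real CRF layer, and all the numbers
and labels below are made up for illustration.
```python
import math
def viterbi(emissions, transitions, labels):
    """Most likely label sequence given per-word label probabilities
    (emissions) and label-to-label transition probabilities."""
    # best[label] = (log-probability, path) of the best sequence ending in label
    best = {lab: (math.log(emissions[0][lab]), [lab]) for lab in labels}
    for probs in emissions[1:]:
        step = {}
        for lab in labels:
            score, path = max(
                (best[prev][0] + math.log(transitions[prev][lab]), best[prev][1])
                for prev in labels
            )
            step[lab] = (score + math.log(probs[lab]), path + [lab])
        best = step
    return max(best.values())[1]
labels = ["NAME", "O"]
# After a NAME token, another NAME token is likely (first name, last name)
transitions = {"NAME": {"NAME": 0.8, "O": 0.2}, "O": {"NAME": 0.3, "O": 0.7}}
# Hypothetical per-word outputs for three words such as "John Smith called"
emissions = [
    {"NAME": 0.90, "O": 0.10},
    {"NAME": 0.48, "O": 0.52},
    {"NAME": 0.05, "O": 0.95},
]
decoded = viterbi(emissions, transitions, labels)
print(decoded)
```
Taken word by word, the second word would be labeled O (0.52 versus 0.48);
the transition preference tips it to NAME because it follows a name.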

00:55:28.860 --> 00:55:31.400
And this did insanely well.

00:55:31.400 --> 00:55:41.000
So optimized by F1 score, we're
up at a precision of 99.2%,

00:55:41.000 --> 00:55:44.270
recall of 99.3%.

00:55:44.270 --> 00:55:51.290
Optimized by recall,
we're up at about 98%, 99%

00:55:51.290 --> 00:55:53.240
for each of them.

00:55:53.240 --> 00:55:55.370
So this is doing quite well.
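
NOTE
Editor's sketch of the arithmetic behind these figures: F1 is the harmonic mean of
precision and recall. The 10,000-token count is a made-up number, used only to show
what a 99.3% recall still leaves behind.
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
f1 = f1_score(0.992, 0.993)
print(round(f1, 4))
# Hypothetical corpus containing 10,000 PHI tokens: how many slip through?
missed = round(10000 * (1 - 0.993))
print(missed)
```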

00:55:55.370 --> 00:56:00.030
Now, there is a non-machine
learning comment to make,

00:56:00.030 --> 00:56:02.570
which is that if you read
the HIPAA law, the HIPAA

00:56:02.570 --> 00:56:05.660
regulations, they
don't say that you

00:56:05.660 --> 00:56:10.400
must get rid of 99%
of the personally

00:56:10.400 --> 00:56:13.760
identifying information in
order to be able to share

00:56:13.760 --> 00:56:15.500
this data for research.

00:56:15.500 --> 00:56:18.761
It says you have to
get rid of all of it.

00:56:18.761 --> 00:56:23.770
So no technique we
know is 100% perfect.

00:56:23.770 --> 00:56:27.840
And so there's a kind of
practical understanding

00:56:27.840 --> 00:56:30.240
among people who
work on this stuff

00:56:30.240 --> 00:56:32.850
that nothing's
going to be perfect.

00:56:32.850 --> 00:56:36.990
And therefore, that you can
get away with a little bit.

00:56:36.990 --> 00:56:42.300
But legally, you're on thin ice.

00:56:42.300 --> 00:56:46.590
So I remember many years ago,
my wife was in law school.

00:56:46.590 --> 00:56:51.600
And I asked her at one point,
so what can people sue you for?

00:56:51.600 --> 00:56:55.640
And she said,
absolutely anything.

00:56:55.640 --> 00:56:57.430
They may not win.

00:56:57.430 --> 00:57:00.180
But they can be a
real pain if you have

00:57:00.180 --> 00:57:02.460
to go defend yourself in court.

00:57:02.460 --> 00:57:04.750
And so this hasn't
played out yet.

00:57:04.750 --> 00:57:08.910
We don't know if a
de-identifier that

00:57:08.910 --> 00:57:13.050
is 99% sensitive
and 99% specific

00:57:13.050 --> 00:57:17.730
will pass muster with people
who agree to release data sets.

00:57:17.730 --> 00:57:21.000
Because they're worried,
too, about winding up

00:57:21.000 --> 00:57:23.700
in the newspaper or
winding up getting sued.

00:57:26.910 --> 00:57:28.810
Last topic for today--

00:57:28.810 --> 00:57:34.980
so if you read this interesting
blog, which, by the way,

00:57:34.980 --> 00:57:39.870
has a very good
tutorial on BERT,

00:57:39.870 --> 00:57:43.290
he says, "The year 2018 has been
an inflection point for machine

00:57:43.290 --> 00:57:47.850
learning models handling
text, or more accurately, NLP.

00:57:47.850 --> 00:57:49.680
Our conceptual
understanding of how

00:57:49.680 --> 00:57:52.770
best to represent words
and sentences in a way

00:57:52.770 --> 00:57:55.710
that best captures underlying
meanings and relationships

00:57:55.710 --> 00:57:57.760
is rapidly evolving."

00:57:57.760 --> 00:58:00.330
And so there are a
whole bunch of new ideas

00:58:00.330 --> 00:58:05.530
that have come about in about
the last year or two years,

00:58:05.530 --> 00:58:10.410
including ELMo, which learns
context-specific embeddings,

00:58:10.410 --> 00:58:13.920
the Transformer architecture,
this BERT approach.

00:58:13.920 --> 00:58:19.470
And then I'll end with just
showing you this gigantic GPT

00:58:19.470 --> 00:58:24.060
model that was developed
by the OpenAI people, which

00:58:24.060 --> 00:58:27.360
does remarkably better
than the stuff I showed you

00:58:27.360 --> 00:58:31.690
before in generating language.

00:58:31.690 --> 00:58:33.160
All right.

00:58:33.160 --> 00:58:36.010
If you look inside
Google Translate,

00:58:36.010 --> 00:58:40.180
at least as of not
long ago, what you find

00:58:40.180 --> 00:58:43.260
is a model like this.

00:58:43.260 --> 00:58:49.470
So it's essentially an LSTM
model that takes input words

00:58:49.470 --> 00:58:53.970
and munges them together
into some representation,

00:58:53.970 --> 00:58:58.980
a high-dimensional vector
representation, that summarizes

00:58:58.980 --> 00:59:03.270
everything that the model
knows about that sentence

00:59:03.270 --> 00:59:06.330
that you've just fed it.

00:59:06.330 --> 00:59:08.550
Obviously, it has to be
a pretty high-dimensional

00:59:08.550 --> 00:59:12.120
representation, because your
sentence could be about almost

00:59:12.120 --> 00:59:13.690
anything.

00:59:13.690 --> 00:59:17.520
And so it's important to
be able to capture all

00:59:17.520 --> 00:59:19.980
that in this representation.

00:59:19.980 --> 00:59:22.170
But basically, at
this point, you

00:59:22.170 --> 00:59:24.340
start generating the output.

00:59:24.340 --> 00:59:27.130
So if you're translating
English to French,

00:59:27.130 --> 00:59:29.310
these are English
words coming in,

00:59:29.310 --> 00:59:32.670
and these are French words
going out, in sort of the way

00:59:32.670 --> 00:59:35.190
I showed you, where we're
generating Shakespeare

00:59:35.190 --> 00:59:39.030
or we're generating Wall
Street Journal text.

00:59:41.910 --> 00:59:45.780
But the critical feature here
is that in the initial version

00:59:45.780 --> 00:59:48.210
of this, everything
that you learned

00:59:48.210 --> 00:59:51.870
about this English sentence
had to be encoded in this one

00:59:51.870 --> 00:59:58.150
vector that got passed from
the encoder into the decoder,

00:59:58.150 --> 01:00:03.720
or from the source language into
the target language generator.

01:00:03.720 --> 01:00:06.930
So then someone came
along and said, hmm--

01:00:06.930 --> 01:00:11.470
someone, namely these
guys, came along and said,

01:00:11.470 --> 01:00:13.440
wouldn't it be nice
if we could provide

01:00:13.440 --> 01:00:17.430
some auxiliary information
to the generator that said,

01:00:17.430 --> 01:00:19.980
hey, which part of
the input sentence

01:00:19.980 --> 01:00:23.120
should you pay attention to?

01:00:23.120 --> 01:00:25.790
And of course, there's
no fixed answer to that.

01:00:25.790 --> 01:00:29.180
I mean, if I'm translating
an arbitrary English sentence

01:00:29.180 --> 01:00:32.840
into an arbitrary French
sentence, I can't say,

01:00:32.840 --> 01:00:36.770
in general, look at the third
word in the English sentence

01:00:36.770 --> 01:00:39.680
when you're generating the third
word in the French sentence.

01:00:39.680 --> 01:00:43.040
Because that may or may
not be true, depending

01:00:43.040 --> 01:00:44.780
on the particular sentence.

01:00:44.780 --> 01:00:46.520
But on the other
hand, the intuition

01:00:46.520 --> 01:00:50.060
is that there is such
a positional dependence

01:00:50.060 --> 01:00:56.030
and a dependence on what the
particular English word was

01:00:56.030 --> 01:01:00.330
that is an important component
of generating the French word.

01:01:00.330 --> 01:01:04.190
And so they created this
idea that in addition

01:01:04.190 --> 01:01:10.340
to passing along
this vector that

01:01:10.340 --> 01:01:13.490
encodes the meaning
of the entire input

01:01:13.490 --> 01:01:18.680
and the previous word that you
had generated in the output,

01:01:18.680 --> 01:01:23.730
in addition, we pass along this
other information that says,

01:01:23.730 --> 01:01:27.320
which of the input words
should we pay attention to?

01:01:27.320 --> 01:01:30.110
And how much attention
should we pay to them?

01:01:30.110 --> 01:01:34.520
And of course, in the
style of these embeddings,

01:01:34.520 --> 01:01:37.520
these are all represented
by high-dimensional vectors,

01:01:37.520 --> 01:01:41.540
high-dimensional real
number vectors that

01:01:41.540 --> 01:01:44.030
get combined with
the other vectors

01:01:44.030 --> 01:01:46.880
in order to produce the output.

01:01:46.880 --> 01:01:53.660
Now, a classical linguist
would look at this and retch.

01:01:53.660 --> 01:01:57.980
Because this looks nothing
like classical linguistics.

01:01:57.980 --> 01:02:04.160
It's just numerology that gets
trained by stochastic gradient

01:02:04.160 --> 01:02:08.240
descent methods in order
to optimize the output.

01:02:08.240 --> 01:02:12.990
But from an engineering point
of view, it works quite well.

01:02:12.990 --> 01:02:16.700
So then for a while, that
was the state of the art.

01:02:16.700 --> 01:02:22.640
And then last year, these
guys, Vaswani et al.

01:02:22.640 --> 01:02:27.920
came along and said,
you know, we now

01:02:27.920 --> 01:02:30.020
have this complicated
architecture,

01:02:30.020 --> 01:02:34.490
where we are doing the
old-style translation where

01:02:34.490 --> 01:02:37.250
we summarize everything
into one vector,

01:02:37.250 --> 01:02:41.690
and then use that to generate
a sequence of outputs.

01:02:41.690 --> 01:02:43.850
And we have this
attention mechanism

01:02:43.850 --> 01:02:47.450
that tells us how
much of various inputs

01:02:47.450 --> 01:02:52.040
to use in generating each
element of the output.

01:02:52.040 --> 01:02:55.050
Is the first of those
actually necessary?

01:02:55.050 --> 01:02:58.040
And so they published this
lovely paper, "Attention

01:02:58.040 --> 01:03:00.740
Is All You Need,"
that says, hey, you

01:03:00.740 --> 01:03:04.280
know that thing that you guys
have added to this translation

01:03:04.280 --> 01:03:05.720
model.

01:03:05.720 --> 01:03:07.790
Not only is it a
useful addition,

01:03:07.790 --> 01:03:12.770
but in fact, it can take the
place of the original model.

01:03:12.770 --> 01:03:16.340
And so the Transformer
is an architecture that

01:03:16.340 --> 01:03:19.280
is the hottest thing
since sliced bread

01:03:19.280 --> 01:03:23.940
at the moment, that says,
OK, here's what we do.

01:03:23.940 --> 01:03:25.580
We take the inputs.

01:03:25.580 --> 01:03:29.400
We calculate some
embedding for them.

01:03:29.400 --> 01:03:31.460
We then want to
retain the position,

01:03:31.460 --> 01:03:35.380
because of course, the sequence
in which the words appear,

01:03:35.380 --> 01:03:36.890
it matters.

01:03:36.890 --> 01:03:39.590
And the positional encoding
is this weird thing

01:03:39.590 --> 01:03:44.230
where it encodes using
sine waves so that--

01:03:44.230 --> 01:03:46.700
it's an orthogonal basis.

01:03:46.700 --> 01:03:49.460
And so it has nice
characteristics.
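
NOTE
Editor's sketch of the sinusoidal positional encoding from the Vaswani et al. paper:
even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow
geometrically with the dimension index. The function name is an illustrative choice.
```python
import math
def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position as a list of d_model numbers."""
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]
for pos in range(3):
    print(pos, [round(x, 3) for x in positional_encoding(pos, 4)])
```
This vector is added to the word embedding, so the same word at two positions
gets two different inputs.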

01:03:49.460 --> 01:03:52.370
And then we run it
into an attention model

01:03:52.370 --> 01:03:54.890
that is essentially
computing self-attention.

01:03:54.890 --> 01:03:58.145
So it's saying what--

01:03:58.145 --> 01:04:02.870
it's like Word2Vec, except
in a more sophisticated way.

01:04:02.870 --> 01:04:06.260
So it's looking at all
the words in the sentence

01:04:06.260 --> 01:04:11.270
and saying, which words is
this word most related to?
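
NOTE
Editor's sketch of scaled dot-product self-attention. In the real Transformer the
queries, keys, and values are learned linear projections of the embeddings; here the
toy embeddings are used directly for all three, which is a simplifying assumption.
```python
import math
def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def self_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of the value vectors
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return outputs, weights
# Three toy word embeddings; words 0 and 2 point in nearly the same direction
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)
print([round(w, 3) for w in weights[0]])
```
Word 0 puts more weight on word 2 than on word 1, because their vectors are more alike.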

01:04:13.890 --> 01:04:17.580
And then, in order to
complicate it some more,

01:04:17.580 --> 01:04:20.280
they say, well, we don't
want just a single notion

01:04:20.280 --> 01:04:21.420
of attention.

01:04:21.420 --> 01:04:25.210
We want multiple
notions of attention.

01:04:25.210 --> 01:04:27.240
So what does that sound like?

01:04:27.240 --> 01:04:30.510
Well, to me, it
sounds a bit like what

01:04:30.510 --> 01:04:34.230
you see in convolutional
neural networks,

01:04:34.230 --> 01:04:39.270
where often when you're
processing an image with a CNN,

01:04:39.270 --> 01:04:42.240
you're not only applying
one filter to the image

01:04:42.240 --> 01:04:45.540
but you're applying a whole
bunch of different filters.

01:04:45.540 --> 01:04:47.820
And because you
initialize them randomly,

01:04:47.820 --> 01:04:50.520
you hope that they
will converge to things

01:04:50.520 --> 01:04:55.370
that actually detect different
interesting properties

01:04:55.370 --> 01:04:56.920
of the image.

01:04:56.920 --> 01:04:58.710
So the same idea here--

01:04:58.710 --> 01:05:00.210
that what they're
doing is they're

01:05:00.210 --> 01:05:06.330
starting with a bunch of these
attention matrices and saying,

01:05:06.330 --> 01:05:07.980
we initialize them randomly.

01:05:07.980 --> 01:05:10.260
They will evolve
into something that

01:05:10.260 --> 01:05:14.860
is most useful for helping us
deal with the overall problem.

01:05:14.860 --> 01:05:17.400
So then they run
this through a series

01:05:17.400 --> 01:05:22.290
of, I think, in Vaswani's paper,
something like six layers that

01:05:22.290 --> 01:05:24.300
are just replicated.

01:05:24.300 --> 01:05:30.510
And there are additional things
like feeding forward the input

01:05:30.510 --> 01:05:36.240
signal in order to add it to
the output signal of the stage,

01:05:36.240 --> 01:05:39.750
and then normalizing,
and then rerunning it,

01:05:39.750 --> 01:05:42.900
and then running it through
a feed-forward network that

01:05:42.900 --> 01:05:47.550
also has a bypass that combines
the input with the output

01:05:47.550 --> 01:05:49.500
of the feed-forward network.

01:05:49.500 --> 01:05:52.890
And then you do this
six times, or n times.

01:05:52.890 --> 01:05:57.260
And that then feeds
into the generator.

01:05:57.260 --> 01:06:02.390
And the generator then uses
a very similar architecture

01:06:02.390 --> 01:06:04.820
to calculate output
probabilities.

01:06:04.820 --> 01:06:09.330
And then it samples from those
in order to generate the text.

01:06:09.330 --> 01:06:12.230
So this is sort of
the contemporary way

01:06:12.230 --> 01:06:16.190
that one can do translation,
using this approach.

01:06:16.190 --> 01:06:19.780
Obviously, I don't have time to
go into all the details of how

01:06:19.780 --> 01:06:21.440
all this is done.

01:06:21.440 --> 01:06:23.960
And I'd probably
do it wrong anyway.

01:06:23.960 --> 01:06:27.710
But you can look at the paper,
which gives a good explanation.

01:06:27.710 --> 01:06:30.590
And that blog that I
pointed to also has

01:06:30.590 --> 01:06:34.670
a pointer to another
blog post by the same guy

01:06:34.670 --> 01:06:39.800
that does a pretty good job
of explaining the Transformer

01:06:39.800 --> 01:06:41.330
architecture.

01:06:41.330 --> 01:06:43.680
It's complicated.

01:06:43.680 --> 01:06:48.200
So what you get out of the
multi-head attention mechanism

01:06:48.200 --> 01:06:49.310
is that--

01:06:49.310 --> 01:06:53.700
here is one attention machine.

01:06:53.700 --> 01:06:58.190
And for example, the colors
here indicate the degree

01:06:58.190 --> 01:07:01.850
to which the encoding
of the word "it"

01:07:01.850 --> 01:07:05.300
depends on the other
words in the sentence.

01:07:05.300 --> 01:07:09.860
And you see that it's focused on
the animal, which makes sense.

01:07:09.860 --> 01:07:14.215
Because "it," in
fact, is referring

01:07:14.215 --> 01:07:17.210
to the animal in this sentence.

01:07:17.210 --> 01:07:21.020
Here they introduce
another encoding.

01:07:21.020 --> 01:07:26.210
And this one focuses on "was
too tired," which is also good.

01:07:26.210 --> 01:07:32.490
Because "it," again, refers to
the thing that was too tired.

01:07:32.490 --> 01:07:34.560
And of course, by
multi-headed, they

01:07:34.560 --> 01:07:37.440
mean that it's doing
this many times.

01:07:37.440 --> 01:07:40.200
And so you're
identifying all kinds

01:07:40.200 --> 01:07:45.930
of different relationships
in the input sentence.

01:07:45.930 --> 01:07:52.380
Well, along the same lines
is this encoding called ELMo.

01:07:52.380 --> 01:07:56.970
People seem to like
Sesame Street characters.

01:07:56.970 --> 01:08:00.090
So ELMo is based on a
bi-directional LSTM.

01:08:00.090 --> 01:08:02.670
So it's an older technology.

01:08:02.670 --> 01:08:06.200
But what it does
is, unlike Word2Vec,

01:08:06.200 --> 01:08:12.000
which built an embedding
for each type--

01:08:12.000 --> 01:08:17.060
so every time the
word "junk" appears,

01:08:17.060 --> 01:08:19.229
it gets the same embedding.

01:08:19.229 --> 01:08:23.510
Here what they're saying is,
hey, take context seriously.

01:08:23.510 --> 01:08:26.540
And we're going to calculate
a different embedding

01:08:26.540 --> 01:08:32.710
for each occurrence
in context of a token.

01:08:32.710 --> 01:08:34.899
And this turns out
to be very good.

01:08:34.899 --> 01:08:38.200
Because it goes part
of the way to solving

01:08:38.200 --> 01:08:41.439
the word-sense
disambiguation problem.

01:08:41.439 --> 01:08:43.580
So this is just an example.

01:08:43.580 --> 01:08:46.899
If you look at the word
"play" in GloVe, which

01:08:46.899 --> 01:08:49.330
is a slightly more
sophisticated variant

01:08:49.330 --> 01:08:53.410
of the Word2Vec approach,
you get playing, game, games,

01:08:53.410 --> 01:08:57.520
played, players, plays, player,
play, football, multiplayer.

01:08:57.520 --> 01:09:00.390
This all seems to
be about games.

01:09:00.390 --> 01:09:02.740
Because probably,
from the literature

01:09:02.740 --> 01:09:06.130
that they got this from,
that's the most common usage

01:09:06.130 --> 01:09:08.350
of the word "play."

01:09:08.350 --> 01:09:13.090
Whereas, using this
bi-directional language model,

01:09:13.090 --> 01:09:16.330
they can separate
out something like,

01:09:16.330 --> 01:09:18.340
"Kieffer, the only
junior in the group,

01:09:18.340 --> 01:09:22.550
was commended for his ability
to hit in the clutch, as well as

01:09:22.550 --> 01:09:24.609
his all-around excellent play."

01:09:24.609 --> 01:09:27.970
So this is presumably
the baseball player.

01:09:27.970 --> 01:09:29.620
And here is, "They
were actors who

01:09:29.620 --> 01:09:33.100
had been handed fat roles
in a successful play."

01:09:33.100 --> 01:09:35.979
So this is a different
meaning of the word play.

01:09:35.979 --> 01:09:40.540
And so this embedding also
has made really important

01:09:40.540 --> 01:09:44.109
contributions to improving the
quality of natural language

01:09:44.109 --> 01:09:47.140
processing by being able
to deal with the fact

01:09:47.140 --> 01:09:50.620
that single words have multiple
meanings not only in English

01:09:50.620 --> 01:09:53.710
but in other languages.

01:09:53.710 --> 01:10:00.120
So after ELMo comes BERT, which
is this Bidirectional Encoder

01:10:00.120 --> 01:10:02.820
Representations
from Transformers.

01:10:02.820 --> 01:10:07.380
So rather than using the LSTM
kind of model that ELMo used,

01:10:07.380 --> 01:10:10.620
these guys say, well,
let's hop on the bandwagon,

01:10:10.620 --> 01:10:14.790
use the Transformer-based
architecture.

01:10:14.790 --> 01:10:18.570
And then they introduced
some interesting tricks.

01:10:18.570 --> 01:10:21.510
So one of the problems
with Transformers

01:10:21.510 --> 01:10:25.320
is if you stack them on
top of each other there

01:10:25.320 --> 01:10:27.930
are many paths from
any of the inputs

01:10:27.930 --> 01:10:31.210
to any of the intermediate
nodes and the outputs.

01:10:31.210 --> 01:10:33.930
And so if you're
doing self-attention,

01:10:33.930 --> 01:10:38.220
and you're trying to figure
out where the output should

01:10:38.220 --> 01:10:42.210
pay attention to the input,
the answer, of course,

01:10:42.210 --> 01:10:45.810
if you're trying
to reconstruct the input

01:10:45.810 --> 01:10:50.700
and the input is present in
your model, is that you will learn

01:10:50.700 --> 01:10:53.250
that the
corresponding word is

01:10:53.250 --> 01:10:55.950
the right word for your output.

01:10:55.950 --> 01:10:58.720
So they have to prevent
that from happening.

01:10:58.720 --> 01:11:02.610
And so the way they do
it is by masking off,

01:11:02.610 --> 01:11:07.590
at each level, some fraction
of the words or of the inputs

01:11:07.590 --> 01:11:09.460
at that level.

01:11:09.460 --> 01:11:11.880
So what this is doing
is it's a little bit

01:11:11.880 --> 01:11:15.810
like the skip-gram model
in Word2Vec, where it's

01:11:15.810 --> 01:11:19.770
trying to predict the
likelihood of some word,

01:11:19.770 --> 01:11:23.100
except it doesn't know
what a significant fraction

01:11:23.100 --> 01:11:24.940
of the words are.

01:11:24.940 --> 01:11:29.910
And so it can't overfit in the
way that I was just suggesting.

01:11:29.910 --> 01:11:32.820
So this turned out
to be a good idea.

01:11:32.820 --> 01:11:34.380
It's more complicated.

01:11:34.380 --> 01:11:37.440
Again, for the details,
you have to read the paper.

01:11:37.440 --> 01:11:41.520
I gave both the Transformer
paper and the BERT paper

01:11:41.520 --> 01:11:44.010
as optional readings for today.

01:11:44.010 --> 01:11:46.380
I meant to give them
as required readings,

01:11:46.380 --> 01:11:47.970
but I didn't do it in time.

01:11:47.970 --> 01:11:50.220
So they're optional.

01:11:50.220 --> 01:11:52.770
But there are a whole
bunch of other tricks.

01:11:52.770 --> 01:11:57.240
So instead of using words,
they actually used word pieces.

01:11:57.240 --> 01:12:03.690
So think about syllables, and
"don't" becomes "do" and apostrophe

01:12:03.690 --> 01:12:06.570
t, and so on.

01:12:06.570 --> 01:12:11.130
And then they discovered
that masking about 15% of the tokens

01:12:11.130 --> 01:12:15.540
seems to work
better than other percentages.

01:12:15.540 --> 01:12:21.720
So those are the hidden tokens
that prevent overfitting.

01:12:21.720 --> 01:12:26.010
And then they do some
other weird stuff.

01:12:26.010 --> 01:12:28.860
Like, instead of
masking a token,

01:12:28.860 --> 01:12:32.790
they will inject random other
words from the vocabulary

01:12:32.790 --> 01:12:36.810
into its place, again,
to prevent overfitting.
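
NOTE
Editor's sketch of the masking scheme. The 15% selection rate is from the lecture;
the 80/10/10 split (mask token, random word, unchanged) is the editor's recollection
of the BERT paper. The function name, seed, and toy vocabulary are illustrative.
```python
import random
def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted tokens, {position: original token to predict})."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # about 85% of positions are left alone
        targets[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # random other word from the vocabulary
        # otherwise the token stays as it is but is still predicted
    return masked, targets
vocab = ["the", "patient", "was", "given", "aspirin", "for", "chest", "pain"]
sentence = ["the", "patient", "was", "given", "aspirin", "for", "chest", "pain"]
masked, targets = mask_tokens(sentence, vocab)
print(masked, targets)
```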

01:12:36.810 --> 01:12:39.720
And then they look at
different tasks like,

01:12:39.720 --> 01:12:43.020
can I predict the next
sentence in a corpus?

01:12:43.020 --> 01:12:44.790
So I read a sentence.

01:12:44.790 --> 01:12:48.330
And the translation is
not into another language.

01:12:48.330 --> 01:12:52.500
But it's predicting what the
next sentence is going to be.

01:12:52.500 --> 01:12:56.880
So they trained it on 800
million words from something

01:12:56.880 --> 01:13:02.430
called the BooksCorpus
and a roughly 2 and 1/2

01:13:02.430 --> 01:13:06.000
billion-word Wikipedia corpus.

01:13:06.000 --> 01:13:07.640
And what they found
was that there

01:13:07.640 --> 01:13:12.360
is an enormous improvement
on a lot of classical tasks.

01:13:12.360 --> 01:13:15.990
So this is a listing of
some of the standard tasks

01:13:15.990 --> 01:13:20.980
for natural language processing,
mostly not in the medical world

01:13:20.980 --> 01:13:24.450
but in the general NLP domain.

01:13:24.450 --> 01:13:32.280
And you see that you get things
like an improvement from 80%.

01:13:32.280 --> 01:13:35.880
Or even the GPT model
that I'll talk about

01:13:35.880 --> 01:13:39.060
in a minute is at 82%.

01:13:39.060 --> 01:13:42.030
They're up to about 86%.

01:13:42.030 --> 01:13:47.470
So a 4% improvement in
this domain is really huge.

01:13:47.470 --> 01:13:50.110
I mean, very often
people publish papers

01:13:50.110 --> 01:13:53.110
showing a 1% improvement.

01:13:53.110 --> 01:13:54.900
And if their corpus
is big enough,

01:13:54.900 --> 01:13:57.190
then it's statistically
significant,

01:13:57.190 --> 01:13:59.020
and therefore publishable.

01:13:59.020 --> 01:14:02.590
But it's not significant in the
ordinary meaning of the term

01:14:02.590 --> 01:14:05.890
significant, if you're
doing 1% better.

01:14:05.890 --> 01:14:08.590
But doing 4% better
is pretty good.

01:14:08.590 --> 01:14:15.370
Here we're going
from like 66% to 72%

01:14:15.370 --> 01:14:17.670
from the earlier
state of the art--

01:14:17.670 --> 01:14:26.410
82 to 91; 93 to 94; 35 to
60 in the CoLA task, the Corpus

01:14:26.410 --> 01:14:28.540
of Linguistic Acceptability.

01:14:28.540 --> 01:14:32.110
So this is asking, I
think, Mechanical Turk

01:14:32.110 --> 01:14:36.550
people, for generated
sentences, is this sentence

01:14:36.550 --> 01:14:39.000
a valid sentence of English?

01:14:39.000 --> 01:14:42.700
And so it's an
interesting benchmark.

01:14:42.700 --> 01:14:47.650
So it's producing really
significant improvements

01:14:47.650 --> 01:14:49.240
all over the place.

01:14:49.240 --> 01:14:50.860
They trained two models of it.

01:14:50.860 --> 01:14:52.750
The base model is
the smaller one.

01:14:52.750 --> 01:14:57.470
The large model just has
more layers and parameters.

01:14:57.470 --> 01:15:01.050
Enormous amount of computation
in doing this training--

01:15:01.050 --> 01:15:04.610
so I've forgotten, it
took them like a month

01:15:04.610 --> 01:15:08.270
on some gigantic
cluster of GPU machines.

01:15:08.270 --> 01:15:11.780
And so it's daunting,
because you can't just

01:15:11.780 --> 01:15:14.000
crank this up on your
laptop and expect

01:15:14.000 --> 01:15:16.018
it to finish in your lifetime.

01:15:20.210 --> 01:15:23.610
The last thing I want to
tell you about is this GPT-2.

01:15:23.610 --> 01:15:26.780
So this is from the
OpenAI Institute,

01:15:26.780 --> 01:15:30.320
which is one of these
philanthropically funded--

01:15:30.320 --> 01:15:33.320
I think, this one,
by Elon Musk--

01:15:33.320 --> 01:15:37.910
research institutes
to advance AI.

01:15:37.910 --> 01:15:42.900
And what they said is, well,
this is all cool, but--

01:15:42.900 --> 01:15:45.260
so they were not using BERT.

01:15:45.260 --> 01:15:49.520
They were using the
Transformer architecture

01:15:49.520 --> 01:15:53.720
but without the same
training style as BERT.

01:15:53.720 --> 01:15:56.780
And they said, the
secret is going

01:15:56.780 --> 01:16:02.930
to be that we're going to apply
this not only to one problem

01:16:02.930 --> 01:16:05.160
but to a whole
bunch of problems.

01:16:05.160 --> 01:16:08.690
So it's a multi-task
learning approach that says,

01:16:08.690 --> 01:16:10.880
we're going to
build a better model

01:16:10.880 --> 01:16:16.000
by trying to solve a bunch of
different tasks simultaneously.

01:16:16.000 --> 01:16:19.950
And so they built
enormous models.

01:16:19.950 --> 01:16:24.180
By the way, the task itself is
given as a sequence of tokens.

01:16:24.180 --> 01:16:26.880
So for example, they
might have a task

01:16:26.880 --> 01:16:31.890
that says translate to French,
English text, French text.

01:16:31.890 --> 01:16:36.780
Or answer the question,
document, question, answer.

01:16:36.780 --> 01:16:43.400
And so the system
not only learns

01:16:43.400 --> 01:16:45.660
how to do whatever
it's supposed to do.

01:16:45.660 --> 01:16:47.990
But it even learns
something about the tasks

01:16:47.990 --> 01:16:52.670
that it's being asked to work
on by encoding these and using

01:16:52.670 --> 01:16:54.890
them as part of its model.

01:16:54.890 --> 01:16:58.070
So they built four
different models.

01:16:58.070 --> 01:17:01.790
Take a look at the bottom one.

01:17:01.790 --> 01:17:09.120
1.5 billion parameters--
this is a large model.

01:17:09.120 --> 01:17:10.860
This is a very large model.

01:17:13.430 --> 01:17:16.610
And so it's a byte-level model.

01:17:16.610 --> 01:17:20.240
So they just said forget
words, because we're trying

01:17:20.240 --> 01:17:21.890
to do this multilingually.

01:17:21.890 --> 01:17:25.020
And so for Chinese,
you want characters.

01:17:25.020 --> 01:17:29.330
And for English, you might
as well take characters also.

01:17:29.330 --> 01:17:32.990
And the system will, in
its 1.5 billion parameters,

01:17:32.990 --> 01:17:37.520
learn all about the sequences of
characters that make up words.

01:17:37.520 --> 01:17:39.590
And it'll be cool.

01:17:39.590 --> 01:17:44.540
And so then they look at a whole
bunch of different challenges.

01:17:44.540 --> 01:17:48.380
And what you see is that the
state of the art before they

01:17:48.380 --> 01:17:54.010
did this on, for example,
the LAMBADA data set

01:17:54.010 --> 01:18:00.130
was that the perplexity of
its predictions was a hundred.

01:18:00.130 --> 01:18:04.300
And with this large model, the
perplexity of its predictions

01:18:04.300 --> 01:18:06.500
is about nine.

01:18:06.500 --> 01:18:10.340
So that means that it's
reduced the uncertainty of what

01:18:10.340 --> 01:18:13.700
to predict next
ridiculously much--

01:18:13.700 --> 01:18:16.280
I mean, by more than
an order of magnitude.

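NOTE
A small worked example of what perplexity measures, using the standard
definition (the exponential of the average negative log probability that the
model assigned to the tokens that actually came next). The probability lists
are made up for illustration; they are not the published evaluation data.
```python
import math
def perplexity(probs):
    # probs: probability the model gave to each token that actually occurred
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))
# A model that is always choosing uniformly among 100 options has perplexity 100.
uniform_100 = perplexity([1 / 100] * 50)
# A model that has narrowed it down to 9 equally likely options has perplexity 9.
uniform_9 = perplexity([1 / 9] * 50)
print(round(uniform_100, 3), round(uniform_9, 3), round(uniform_100 / uniform_9, 1))
```
Going from about 100 to about 9 is a factor of roughly 11, which is why it
counts as more than an order of magnitude.
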
01:18:16.280 --> 01:18:18.920
And you get similar
gains, accuracy going

01:18:18.920 --> 01:18:25.700
from 59% to 63% accuracy on a--

01:18:25.700 --> 01:18:29.480
this is the children's
something-or-other challenge--

01:18:29.480 --> 01:18:31.640
from 85% to 93%--

01:18:31.640 --> 01:18:37.100
so dramatic improvements
almost across the board,

01:18:37.100 --> 01:18:40.160
except for this
particular data set,

01:18:40.160 --> 01:18:42.720
where they did not do well.

01:18:42.720 --> 01:18:47.880
And what really blew
me away is here's

01:18:47.880 --> 01:18:51.660
an application of this
1.5 billion-parameter model

01:18:51.660 --> 01:18:56.730
that they built. So they
said, OK, I give you a prompt,

01:18:56.730 --> 01:18:59.490
like the opening paragraph
of a Wall Street Journal

01:18:59.490 --> 01:19:02.010
article or a Wikipedia article.

01:19:02.010 --> 01:19:07.230
And you complete the article
by using that generator idea

01:19:07.230 --> 01:19:10.980
that I showed you before, that
just uses the language model

01:19:10.980 --> 01:19:14.520
and picks the most
likely word to come next

01:19:14.520 --> 01:19:17.160
and emits that as the next word.

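NOTE
A toy sketch of the generator idea described above: repeatedly pick the most
likely next word and emit it. A bigram count table stands in for the language
model here, which is an assumption for illustration; it is not the Transformer
model discussed in the lecture, and the corpus is made up.
```python
from collections import Counter, defaultdict
def train_bigrams(text):
    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts
def generate(counts, prompt, max_new_words=5):
    # Greedy completion: always append the single most frequent follower.
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never seen with a follower
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)
corpus = "the train left the station then the train left the city"
model = train_bigrams(corpus)
completion = generate(model, "the", max_new_words=3)
print(completion)
```
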
01:19:17.160 --> 01:19:20.490
So here is a prompt that says,
"A train carriage containing

01:19:20.490 --> 01:19:24.270
controlled nuclear materials
was stolen in Cincinnati today.

01:19:24.270 --> 01:19:26.070
Its whereabouts are unknown."

01:19:26.070 --> 01:19:27.450
By the way, this is made up.

01:19:27.450 --> 01:19:31.230
I mean, this is not
a real news article.

01:19:31.230 --> 01:19:34.380
And the system comes
back with a completion

01:19:34.380 --> 01:19:36.900
that says, "The incident
occurred on the downtown train

01:19:36.900 --> 01:19:40.260
line, which runs from
Covington and Ashland stations.

01:19:40.260 --> 01:19:42.300
In an email to
Ohio news outlets,

01:19:42.300 --> 01:19:44.910
the US Department
of Energy said it's

01:19:44.910 --> 01:19:47.250
working with the Federal
Railroad Administration

01:19:47.250 --> 01:19:50.400
to find the thief," et cetera.

01:19:50.400 --> 01:19:53.790
This looks astoundingly good.

01:19:53.790 --> 01:19:56.650
Now, the paper from
which this comes--

01:19:56.650 --> 01:19:59.220
this is actually from a
blog, but they've also

01:19:59.220 --> 01:20:01.560
published a paper about it--

01:20:01.560 --> 01:20:04.590
claims that these examples
are not even cherry-picked.

01:20:04.590 --> 01:20:09.410
If you go to that page and
pick sample 1, 2, 3, 4, 5,

01:20:09.410 --> 01:20:12.810
6, et cetera, you get
different examples

01:20:12.810 --> 01:20:15.270
that they claim are
not cherry-picked.

01:20:15.270 --> 01:20:17.880
And every one of
them is really good.

01:20:17.880 --> 01:20:21.690
I mean, you could imagine
this being an actual article

01:20:21.690 --> 01:20:24.090
about this actual event.

01:20:24.090 --> 01:20:27.520
So somehow or other,
in this enormous model,

01:20:27.520 --> 01:20:30.600
and with this
Transformer technology,

01:20:30.600 --> 01:20:34.510
and with the multi-task
training that they've done,

01:20:34.510 --> 01:20:37.300
they have managed
to capture so much

01:20:37.300 --> 01:20:40.810
of the regularity of
the English language

01:20:40.810 --> 01:20:43.840
that they can generate these
fake news articles based

01:20:43.840 --> 01:20:48.910
on a prompt and make them
look unbelievably realistic.

01:20:48.910 --> 01:20:51.940
Now, interestingly,
they have chosen not

01:20:51.940 --> 01:20:54.400
to release that trained model

01:20:54.400 --> 01:20:57.980
because they're worried that
people will, in fact, do this,

01:20:57.980 --> 01:21:02.260
and that they will generate
fake news articles all the time.

01:21:02.260 --> 01:21:04.360
They've released a
much smaller model

01:21:04.360 --> 01:21:09.010
that is not nearly as good
in terms of its realism.

01:21:09.010 --> 01:21:12.580
So that's the state of the
art in language modeling

01:21:12.580 --> 01:21:13.970
at the moment.

01:21:13.970 --> 01:21:18.520
And as I say, the general domain
is ahead of the medical domain.

01:21:18.520 --> 01:21:20.530
But you can bet
that there are tons

01:21:20.530 --> 01:21:24.040
of people who are sitting
around looking at exactly

01:21:24.040 --> 01:21:27.250
these results and
saying, well, we

01:21:27.250 --> 01:21:29.590
ought to be able to
take advantage of this

01:21:29.590 --> 01:21:33.310
to build much better language
models for the medical domain

01:21:33.310 --> 01:21:36.670
and to exploit them in order
to do phenotyping, in order

01:21:36.670 --> 01:21:41.200
to do entity recognition,
in order to do inference,

01:21:41.200 --> 01:21:43.420
in order to do
question answering,

01:21:43.420 --> 01:21:47.156
in order to do any of
these kinds of topics.

01:21:47.156 --> 01:21:51.030
And I was talking to
Patrick Winston, who

01:21:51.030 --> 01:21:54.660
is one of the good
old-fashioned AI people,

01:21:54.660 --> 01:21:56.970
as he characterizes himself.

01:21:56.970 --> 01:22:00.090
And the thing that's a
little troublesome about this

01:22:00.090 --> 01:22:04.770
is that this technology
has virtually nothing

01:22:04.770 --> 01:22:07.470
to do with anything
that we understand

01:22:07.470 --> 01:22:11.670
about language or about
inference or about question

01:22:11.670 --> 01:22:15.010
answering or about anything.

01:22:15.010 --> 01:22:19.140
And so one is left with
this queasy feeling that,

01:22:19.140 --> 01:22:22.530
here is a wonderful engineering
solution to a whole set

01:22:22.530 --> 01:22:24.870
of problems, but
it's unclear how

01:22:24.870 --> 01:22:29.110
it relates to the original goal
of artificial intelligence,

01:22:29.110 --> 01:22:31.830
which is to understand something
about human intelligence

01:22:31.830 --> 01:22:35.160
by simulating it in a computer.

01:22:35.160 --> 01:22:38.410
Maybe our BCS
friends will discover

01:22:38.410 --> 01:22:42.780
that there are, in fact,
Transformer mechanisms deeply

01:22:42.780 --> 01:22:44.670
buried in our brain.

01:22:44.670 --> 01:22:46.830
But I would be surprised
if that turned out

01:22:46.830 --> 01:22:48.960
to be exactly the case.

01:22:48.960 --> 01:22:52.480
But perhaps there is
something like that going on.

01:22:52.480 --> 01:22:54.930
And so this leaves an
interesting scientific

01:22:54.930 --> 01:22:57.180
conundrum of,
exactly what have we

01:22:57.180 --> 01:23:02.040
learned from this type of very,
very successful model building?

01:23:02.040 --> 01:23:02.760
OK.

01:23:02.760 --> 01:23:03.540
Thank you.

01:23:03.540 --> 01:23:06.590
[APPLAUSE]