WEBVTT

00:00:02.530 --> 00:00:07.220
You've tested positive for a rare and deadly
cancer that afflicts 1 out of 1000 people,

00:00:07.220 --> 00:00:11.980
based on a test that is 99% accurate. What
are the chances that you actually have the

00:00:11.980 --> 00:00:17.560
cancer? By the end of this video, you'll be
able to answer this question!

00:00:17.560 --> 00:00:22.510
This video is part of the Probability and
Statistics video series. Many natural and

00:00:22.510 --> 00:00:28.260
social phenomena are probabilistic in nature.
Engineers, scientists, and policymakers often

00:00:28.260 --> 00:00:31.930
use probability to model and predict system
behavior.

00:00:31.930 --> 00:00:38.469
Hi, my name is Sam Watson, and I'm a graduate
student in mathematics at MIT.

00:00:38.469 --> 00:00:43.679
Before watching this video, you should be
familiar with basic probability vocabulary

00:00:43.679 --> 00:00:46.989
and the definition of conditional probability.

00:00:46.989 --> 00:00:51.219
After watching this video, you'll be able
to: Calculate the conditional probability

00:00:51.219 --> 00:00:56.269
of a given event using tables and trees; and
Understand how conditional probability can

00:00:56.269 --> 00:01:03.269
be used to interpret medical diagnoses.

00:01:04.479 --> 00:01:11.479
Suppose that in front of you are two bowls,
labeled A and B. Each bowl contains five marbles.

00:01:11.600 --> 00:01:18.240
Bowl A has 1 blue and 4 yellow marbles. Bowl
B has 3 blue and 2 yellow marbles.

00:01:18.240 --> 00:01:23.229
Now choose a bowl at random and draw a marble
uniformly at random from it. Based on your

00:01:23.229 --> 00:01:28.200
existing knowledge of probability, how likely
is it that you pick a blue marble? How about

00:01:28.200 --> 00:01:33.109
a yellow marble?

00:01:33.109 --> 00:01:40.109
Out of the 10 marbles you could choose from, 4 are
blue. So the probability of choosing a blue

00:01:55.130 --> 00:01:58.020
marble is 4 out of 10.

00:01:58.020 --> 00:02:03.109
There are 6 yellow marbles out of 10 total,
so the probability of choosing yellow is 6

00:02:03.109 --> 00:02:03.270
out of 10.

00:02:03.270 --> 00:02:04.109
When the number of possible outcomes is finite,
and all outcomes are equally likely, the probability

00:02:04.109 --> 00:02:05.009
of one event happening is the number of favorable
outcomes divided by the total number of possible

00:02:05.009 --> 00:02:05.070
outcomes.

00:02:05.070 --> 00:02:09.470
What if you must draw from Bowl A? What's
the probability of drawing a blue marble,

00:02:09.470 --> 00:02:16.470
given that you draw from Bowl A?

00:02:18.239 --> 00:02:25.239
Let's go back to the table and consider only
Bowl A. Bowl A contains 5 marbles of which

00:02:25.599 --> 00:02:31.300
1 is blue, so the probability of picking a
blue one is 1 in 5.

00:02:31.300 --> 00:02:36.610
Notice the probability has changed. In the
first scenario, the sample space consists

00:02:36.610 --> 00:02:42.329
of all 10 marbles, because we are free to
draw from both bowls.

00:02:42.329 --> 00:02:48.220
In the second scenario, we are restricted
to Bowl A. Our new sample space consists of

00:02:48.220 --> 00:02:55.220
only the five marbles in Bowl A. We ignore
these marbles in Bowl B.

00:02:55.670 --> 00:03:00.909
Restricting our attention to a specific set
of outcomes changes the sample space, and

00:03:00.909 --> 00:03:07.730
can also change the probability of an event.
This new probability is what we call a conditional probability.

00:03:07.730 --> 00:03:08.370
In the previous example, we calculated the
conditional probability of drawing a blue

00:03:08.370 --> 00:03:09.659
marble, given that we draw from Bowl A.

00:03:09.659 --> 00:03:14.510
This is standard notation for conditional
probability. The vertical bar ( | ) is read

00:03:14.510 --> 00:03:21.510
as "given." The event whose probability we want
precedes the bar, and the condition follows

00:03:25.099 --> 00:03:32.099
the bar.

00:03:32.299 --> 00:03:37.439
Now let's flip things around. Suppose someone
picks a marble at random from either bowl

00:03:37.439 --> 00:03:43.939
A or bowl B and reveals to you that the marble
drawn was blue. What is the probability that

00:03:43.939 --> 00:03:46.989
the blue marble came from Bowl A?

00:03:46.989 --> 00:03:52.159
In other words, what's the conditional probability
that the marble was drawn from Bowl A, given

00:03:52.159 --> 00:03:59.159
that it is blue? Pause the video and try to
work this out.

00:04:01.159 --> 00:04:06.400
Going back to the table, because we are dealing
with the condition that the marble is blue,

00:04:06.400 --> 00:04:11.170
the sample space is restricted to the four
blue marbles.

00:04:11.170 --> 00:04:17.290
Of these four blue marbles, one is in Bowl
A, and each is equally likely to be drawn.

00:04:17.290 --> 00:04:22.320
Thus, the conditional probability is 1 out
of 4.

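NOTE
Editorial sketch, not part of the spoken audio. It recomputes the
two-bowl table with exact fractions, assuming each bowl is chosen with
probability 1/2. The names bowls, joint, and p_blue are illustrative.
```python
from fractions import Fraction
# Marble counts per bowl, as stated in the video.
bowls = {"A": {"blue": 1, "yellow": 4}, "B": {"blue": 3, "yellow": 2}}
# Pick a bowl with probability 1/2, then a marble uniformly at random from it.
joint = {
    (bowl, color): Fraction(1, 2) * Fraction(count, sum(counts.values()))
    for bowl, counts in bowls.items()
    for color, count in counts.items()
}
p_blue = sum(p for (bowl, color), p in joint.items() if color == "blue")
p_bowl_a = sum(p for (bowl, color), p in joint.items() if bowl == "A")
p_blue_given_a = joint[("A", "blue")] / p_bowl_a
p_a_given_blue = joint[("A", "blue")] / p_blue
print(p_blue, p_blue_given_a, p_a_given_blue)  # 2/5 1/5 1/4
```
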
00:04:22.320 --> 00:04:27.160
Notice that the probability of picking a blue
marble given that the marble came from Bowl

00:04:27.160 --> 00:04:33.030
A is NOT equal to the probability that the
marble came from Bowl A given that the marble

00:04:33.030 --> 00:04:40.030
was blue. Each has a different condition,
so be careful not to mix them up!

00:04:42.919 --> 00:04:47.650
We've seen how tables can help us organize
our data and visualize changes in the sample

00:04:47.650 --> 00:04:49.060
space.

00:04:49.060 --> 00:04:53.180
Let's look at another tool that is useful
for understanding conditional probabilities

00:04:53.180 --> 00:04:55.870
- a tree diagram.

00:04:55.870 --> 00:05:02.870
Suppose we have a jar containing 5 marbles;
2 are blue and 3 are yellow. If we draw any

00:05:03.220 --> 00:05:08.280
one marble at random, the probability of drawing
a blue marble is 2/5.

00:05:08.280 --> 00:05:14.280
Now, without replacing the first marble, draw
a second marble from the jar. Given that the

00:05:14.280 --> 00:05:20.330
first marble is blue, is the probability of
drawing a second blue marble still 2/5?

00:05:20.330 --> 00:05:27.250
NO, it isn't. Our sample space has changed.
If a blue marble is drawn first, you are left

00:05:27.250 --> 00:05:31.569
with 4 marbles: 1 blue and 3 yellow.

00:05:31.569 --> 00:05:36.130
In other words, if a blue marble is selected
first, the probability that you draw blue

00:05:36.130 --> 00:05:42.580
second is 1/4. And the probability you draw
yellow second is 3/4.

00:05:42.580 --> 00:05:49.580
Now pause the video and determine the probabilities
if the yellow marble is selected first instead.

00:05:54.659 --> 00:06:00.539
If a yellow marble is selected first, you
are left with 2 yellow and 2 blue marbles.

00:06:00.539 --> 00:06:06.389
There is now a 2/4 chance of drawing a blue
marble and a 2/4 chance of drawing a yellow

00:06:06.389 --> 00:06:08.060
marble.

00:06:08.060 --> 00:06:15.060
What we have drawn here is called a tree diagram.
The probability assigned to the second branch

00:06:18.550 --> 00:06:21.660
denotes the conditional probability given
that the first happened.

00:06:21.660 --> 00:06:23.139
Tree diagrams help us to visualize our sample
space and reason out probabilities.

00:06:23.139 --> 00:06:27.500
We can answer questions like "What is the
probability of drawing 2 blue marbles in a

00:06:27.500 --> 00:06:32.550
row?" In other words, what is the probability
of drawing a blue marble first AND a blue

00:06:32.550 --> 00:06:34.479
marble second?

00:06:34.479 --> 00:06:39.880
This event is represented by these two branches
in the tree diagram.

00:06:39.880 --> 00:06:46.880
We have a 2/5 chance followed by a 1/4 chance.
We multiply these to get 2/20, or 1/10. The

00:06:48.849 --> 00:06:53.349
probability of drawing two blue marbles in
a row is 1/10.

00:06:53.349 --> 00:06:58.729
Now you do it. Use the tree diagram to calculate
the probabilities of the other possibilities:

00:06:58.729 --> 00:07:05.729
blue, yellow; yellow, blue; and yellow, yellow.

00:07:10.050 --> 00:07:16.750
The probabilities each work out to 3/10. The
four probabilities add up to a total of 1,

00:07:16.750 --> 00:07:18.800
as they should.

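NOTE
Editorial sketch, not part of the spoken audio. It builds the four
paths of the tree for two draws without replacement from a jar of
2 blue and 3 yellow marbles. The helper draw_two is a hypothetical name.
```python
from fractions import Fraction
def draw_two(blue, yellow):
    """Return P(first, second) for two draws without replacement."""
    counts = {"blue": blue, "yellow": yellow}
    total = blue + yellow
    paths = {}
    for first, n_first in counts.items():
        p_first = Fraction(n_first, total)
        for second, n_second in counts.items():
            # One marble of the first color is gone before the second draw.
            remaining = n_second - (1 if second == first else 0)
            paths[(first, second)] = p_first * Fraction(remaining, total - 1)
    return paths
paths = draw_two(blue=2, yellow=3)
for path, p in paths.items():
    print(path, p)
print(sum(paths.values()))  # 1
```
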
00:07:18.800 --> 00:07:22.599
What if we don't care about the first marble?
We just want to determine the probability

00:07:22.599 --> 00:07:26.330
that the second marble is yellow.

00:07:26.330 --> 00:07:30.699
Because it does not matter whether the first
marble is blue or yellow, we consider both

00:07:30.699 --> 00:07:37.699
the (blue, yellow) and (yellow, yellow) paths.
Adding the probabilities gives us 3/10 + 3/10,

00:07:38.099 --> 00:07:41.139
which works out to 3/5.

00:07:41.139 --> 00:07:45.190
Here's another interesting question. What
is the probability that the first marble drawn

00:07:45.190 --> 00:07:48.819
is blue, given that the second marble drawn
is yellow?

00:07:48.819 --> 00:07:54.050
Intuitively, this seems tricky. Pause the
video and reason through the probability tree

00:07:54.050 --> 00:08:01.050
with a friend.

00:08:01.370 --> 00:08:05.680
Because we are conditioning on the event that
the second marble drawn is yellow, our sample

00:08:05.680 --> 00:08:09.289
space is restricted to these two paths: (blue,
yellow) and (yellow, yellow).

00:08:09.289 --> 00:08:14.690
Of these two paths, only the top one meets
our criterion: that a blue marble is drawn

00:08:14.690 --> 00:08:16.759
first.

00:08:16.759 --> 00:08:21.919
We represent the probability as the ratio of the favorable
path's probability to the total for both paths. Hence,

00:08:21.919 --> 00:08:26.199
the probability that the first marble drawn
is blue, given that the second marble drawn

00:08:26.199 --> 00:08:33.198
is yellow, is 3/10 divided by (3/10 + 3/10),
which works out to 1/2.

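NOTE
Editorial sketch, not part of the spoken audio. It checks the last
two results by hand-entering the path probabilities from the tree.
The variable names are illustrative.
```python
from fractions import Fraction
# Path probabilities read off the tree: P(first) * P(second | first).
p_blue_yellow = Fraction(2, 5) * Fraction(3, 4)
p_yellow_yellow = Fraction(3, 5) * Fraction(2, 4)
# Total probability that the second marble is yellow.
p_second_yellow = p_blue_yellow + p_yellow_yellow
# Restrict to the paths ending in yellow, then take the share that starts blue.
p_first_blue_given_second_yellow = p_blue_yellow / p_second_yellow
print(p_second_yellow, p_first_blue_given_second_yellow)  # 3/5 1/2
```
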
00:08:33.450 --> 00:08:39.000
I hope you appreciate that tree diagrams and
tables make these types of probability problems

00:08:39.000 --> 00:08:46.000
doable without having to memorize any formulas!

00:08:47.210 --> 00:08:51.830
Let's return to our opening question. Recall
that you've tested positive for a cancer that

00:08:51.830 --> 00:08:56.780
afflicts 1 out of 1000 people, based on a
test that is 99% accurate.

00:08:56.780 --> 00:09:02.950
More precisely, out of 100 test results, we
expect about 99 correct results and only 1

00:09:02.950 --> 00:09:05.100
incorrect result.

00:09:05.100 --> 00:09:10.290
Since the test is highly accurate, you might
conclude that the test is unlikely to be wrong,

00:09:10.290 --> 00:09:13.110
and that you most likely have cancer.

00:09:13.110 --> 00:09:19.320
But wait! Let's first use conditional probability
to make sense of our seemingly gloomy diagnosis.

00:09:19.320 --> 00:09:20.850
Now pause the video and determine the probability
that you have the cancer, given that you test

00:09:20.850 --> 00:09:20.940
positive.

00:09:20.940 --> 00:09:24.470
Let's use a tree diagram to help with our
calculations.

00:09:24.470 --> 00:09:30.650
The first branch of the tree represents the
likelihood of cancer in the general population.

00:09:30.650 --> 00:09:36.910
The probability of having the rare cancer
is 1 in 1000, or 0.001. The probability of

00:09:36.910 --> 00:09:41.790
having no cancer is 0.999.

00:09:41.790 --> 00:09:46.260
Let's extend the tree diagram to illustrate
the possible results of the medical test that

00:09:46.260 --> 00:09:49.020
is 99% accurate.

00:09:49.020 --> 00:09:56.020
In the cancer population, 99% will test positive
(correctly), but 1% will test negative (incorrectly).

00:09:57.150 --> 00:10:01.200
These incorrect results are called false negatives.

00:10:01.200 --> 00:10:07.520
In the cancer-free population, 99% will test
negative (correctly), but 1% will test positive

00:10:07.520 --> 00:10:13.590
(incorrectly). These incorrect results are
called false positives.

00:10:13.590 --> 00:10:19.270
Given that you test positive, our sample space
is now restricted to only the population that

00:10:19.270 --> 00:10:24.920
tests positive. This is represented by these
two paths.

00:10:24.920 --> 00:10:30.760
The top path shows the probability you have
the cancer AND test positive. The lower path

00:10:30.760 --> 00:10:37.440
shows the probability that you don't have
cancer AND still test positive.

00:10:37.440 --> 00:10:44.440
The probability that you actually do have the
cancer, given that you test positive, is
(0.001 × 0.99) / ((0.001 × 0.99) + (0.999 × 0.01)),

00:10:55.720 --> 00:11:01.150
which works out to about 0.09 - less than
10%!

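NOTE
Editorial sketch, not part of the spoken audio. It evaluates the two
positive paths of the tree. The parameter names sensitivity and
specificity are assumptions: the video gives a single 99% accuracy
figure, used here for both.
```python
def posterior(prevalence, sensitivity, specificity):
    """P(condition | positive test), from the two positive paths of the tree."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)
p = posterior(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(round(p, 4))  # 0.0902
```
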
00:11:01.150 --> 00:11:06.880
The error rate of the test is only 1 percent,
but the chance that a positive result is wrong is more than

00:11:06.880 --> 00:11:13.550
90%! Chances are pretty good that you do not
actually have cancer, despite the rather accurate

00:11:13.550 --> 00:11:16.760
test. Why is this so?

00:11:16.760 --> 00:11:22.440
The accuracy of the test actually reflects
the conditional probability that one tests

00:11:22.440 --> 00:11:25.070
positive, given that one has cancer.

00:11:25.070 --> 00:11:30.520
But in practice, what you want to know is
the conditional probability that you have

00:11:30.520 --> 00:11:37.520
cancer, given that you test positive! These
probabilities are NOT the same!

00:11:37.520 --> 00:11:42.550
Whenever we take medical tests, or perform
experiments, it is important to understand

00:11:42.550 --> 00:11:47.260
what events our results are conditioned on,
and how that might affect the accuracy of

00:11:47.260 --> 00:11:53.180
our conclusions.

00:11:53.180 --> 00:11:57.180
In this video, you've seen that conditional
probability must be used to understand and

00:11:57.180 --> 00:12:02.810
predict the outcomes of many events. You've
also learned to evaluate and manage conditional

00:12:02.810 --> 00:12:06.830
probabilities using tables and trees.

00:12:06.830 --> 00:12:11.310
We hope that you will now think more carefully
about the probabilities you encounter, and

00:12:11.310 --> 00:12:14.200
consider how conditioning affects their interpretation.