WEBVTT

00:00:09.660 --> 00:00:14.190
In this video, we'll discuss the
meaning of data visualization,

00:00:14.190 --> 00:00:18.020
and why it's often useful to
visualize your data to discover

00:00:18.020 --> 00:00:21.540
hidden trends and properties.

00:00:21.540 --> 00:00:26.320
Data visualization is defined
as a mapping of data properties

00:00:26.320 --> 00:00:28.820
to visual properties.

00:00:28.820 --> 00:00:33.190
Data properties are usually
numerical or categorical,

00:00:33.190 --> 00:00:37.590
like the mean of a variable,
the maximum value of a variable,

00:00:37.590 --> 00:00:41.610
or the number of observations
with a certain property.

00:00:41.610 --> 00:00:45.740
Visual properties can be (x,y)
coordinates to plot points

00:00:45.740 --> 00:00:52.250
on a graph, colors to assign
labels, sizes, shapes, heights,

00:00:52.250 --> 00:00:53.790
etc.

00:00:53.790 --> 00:00:56.950
Both types of properties are
used to better understand

00:00:56.950 --> 00:01:00.540
the data, but in different ways.

00:01:00.540 --> 00:01:03.640
To motivate the need
for data visualization,

00:01:03.640 --> 00:01:08.100
let's look at a famous example
called Anscombe's Quartet.

00:01:08.100 --> 00:01:12.120
Each of these tables corresponds
to a different data set.

00:01:12.120 --> 00:01:17.650
We have four data sets, each
with two variables, x and y.

00:01:17.650 --> 00:01:19.890
Just looking at
the tables of data,

00:01:19.890 --> 00:01:23.370
it's hard to notice
anything special about it.

00:01:23.370 --> 00:01:27.539
It turns out that the mean
and variance of the x variable

00:01:27.539 --> 00:01:30.250
is the same for
all four data sets,

00:01:30.250 --> 00:01:32.960
the mean and variance
of the y variable

00:01:32.960 --> 00:01:35.820
is the same for
all four data sets,

00:01:35.820 --> 00:01:38.950
and the correlation
between x and y, as well as

00:01:38.950 --> 00:01:41.070
the regression
equation to predict y

00:01:41.070 --> 00:01:46.520
from x, is the exact same
for all four data sets.

00:01:46.520 --> 00:01:49.140
So just by looking
at data properties,

00:01:49.140 --> 00:01:53.640
we might conclude that these
data sets are very similar.

00:01:53.640 --> 00:01:57.710
But if we plot the four data
sets, they're very different.

00:01:57.710 --> 00:02:00.090
These plots show
the four data sets,

00:02:00.090 --> 00:02:03.080
with the x variable
on the x-axis,

00:02:03.080 --> 00:02:06.380
and the y variable
on the y-axis.

00:02:06.380 --> 00:02:09.669
Visually, these data
sets look very different.

00:02:09.669 --> 00:02:13.690
But without visualizing them,
we might not have noticed this.

00:02:13.690 --> 00:02:17.130
This is one example of
why visualizing data can

00:02:17.130 --> 00:02:19.970
be very important.

00:02:19.970 --> 00:02:23.150
We'll use the
ggplot2 package in R

00:02:23.150 --> 00:02:26.160
to create data visualizations.

00:02:26.160 --> 00:02:29.500
This package was created
by Hadley Wickham, who

00:02:29.500 --> 00:02:33.570
described ggplot as "a
plotting system for R

00:02:33.570 --> 00:02:36.510
based on the grammar
of graphics, which

00:02:36.510 --> 00:02:40.500
tries to take the good parts
of base and lattice graphics

00:02:40.500 --> 00:02:43.230
and none of the bad parts.

00:02:43.230 --> 00:02:46.070
It takes care of many of
the fiddly details that

00:02:46.070 --> 00:02:49.560
make plotting a hassle
(like drawing legends)

00:02:49.560 --> 00:02:53.740
as well as providing a powerful
model of graphics that makes it

00:02:53.740 --> 00:02:58.860
easy to produce complex
multi-layered graphics."

00:02:58.860 --> 00:03:01.400
So what do we gain
from using ggplot

00:03:01.400 --> 00:03:04.990
over just making plots
using the basic R functions,

00:03:04.990 --> 00:03:08.210
or what's referred to as base R?

00:03:08.210 --> 00:03:12.180
Well, in base R, each
mapping of data properties

00:03:12.180 --> 00:03:15.890
to visual properties is
its own special case.

00:03:15.890 --> 00:03:20.350
When we create a scatter plot,
or a box plot, or a histogram,

00:03:20.350 --> 00:03:24.210
we have to use a completely
different function.

00:03:24.210 --> 00:03:29.010
Additionally, the graphics are
composed of simple elements,

00:03:29.010 --> 00:03:31.100
like points or lines.

00:03:31.100 --> 00:03:36.020
It's challenging to create any
sophisticated visualizations.

00:03:36.020 --> 00:03:40.920
It's also difficult to add
elements to existing plots.

00:03:40.920 --> 00:03:44.540
But in ggplot, the
mapping of data properties

00:03:44.540 --> 00:03:49.370
to visual properties is done by
just adding layers to the plot.

00:03:49.370 --> 00:03:52.960
This makes it much easier to
create sophisticated plots

00:03:52.960 --> 00:03:54.770
and to add to existing plots.

00:03:57.540 --> 00:03:59.470
So what is the
grammar of graphics

00:03:59.470 --> 00:04:01.980
that ggplot is based on?

00:04:01.980 --> 00:04:06.070
All ggplot graphics
consist of three elements.

00:04:06.070 --> 00:04:10.100
The first is data,
in a data frame.

00:04:10.100 --> 00:04:13.040
The second is an
aesthetic mapping,

00:04:13.040 --> 00:04:16.279
which describes how
variables in the data frame

00:04:16.279 --> 00:04:19.500
are mapped to
graphical attributes.

00:04:19.500 --> 00:04:22.770
This is where we'll define
which variables are on the x-

00:04:22.770 --> 00:04:26.270
and y-axes, whether or not
points should be colored

00:04:26.270 --> 00:04:30.430
or shaped by certain
attributes, etc.

00:04:30.430 --> 00:04:34.250
The third element is
which geometric objects

00:04:34.250 --> 00:04:36.770
we want to determine
how the data values

00:04:36.770 --> 00:04:39.150
are rendered graphically.

00:04:39.150 --> 00:04:43.159
This is where we indicate if the
plot should have points, lines,

00:04:43.159 --> 00:04:46.640
bars, boxes, etc.

00:04:46.640 --> 00:04:50.820
In the next video, we'll
load the WHO data into R

00:04:50.820 --> 00:04:55.250
and create some data
visualizations using ggplot.