WEBVTT

00:00:01.000 --> 00:00:08.000
Good morning.
Good morning.

00:00:08.000 --> 00:00:14.000
So, I'd like to pick up where we
left off last time and just finish

00:00:14.000 --> 00:00:20.000
off translation and then step back
and look at how this central dogma

00:00:20.000 --> 00:00:26.000
of DNA is replicated into DNA, is
read into RNA, and is translated

00:00:26.000 --> 00:00:30.000
into protein. Or,
actually, as Francis Crick

00:00:30.000 --> 00:00:33.000
really put it, all
information flow from nucleic

00:00:33.000 --> 00:00:37.000
acid to protein. How that
varies amongst organisms.

00:00:37.000 --> 00:00:40.000
Because first we're going through
it and looking at the absolutely

00:00:40.000 --> 00:00:43.000
common features, DNA
replication, so it's five prime

00:00:43.000 --> 00:00:46.000
to three prime, et
cetera, et cetera,

00:00:46.000 --> 00:00:50.000
transcription, translation. But
in a moment I'd like to turn to

00:00:50.000 --> 00:00:53.000
the variations between
different kinds of organisms.

00:00:53.000 --> 00:00:56.000
But let me briefly finish up, if
I may, the bit about translation

00:00:56.000 --> 00:01:00.000
in general so we can
look at its variation.

00:01:00.000 --> 00:01:06.000
As we talked about last time,
we have a messenger RNA that has

00:01:06.000 --> 00:01:13.000
been transcribed from a specific
region of the chromosome starting at

00:01:13.000 --> 00:01:20.000
a promoter and going to
some stop of transcription.

00:01:20.000 --> 00:01:27.000
And that messenger RNA will
include some particular sequence,

00:01:27.000 --> 00:01:34.000
and I'll copy one here, A-U-A-C-G-A-U-G-A-A-G-A-G-G-C-C-C,

00:01:34.000 --> 00:01:41.000
et cetera, et cetera,
et cetera, out to a UAG.

00:01:41.000 --> 00:01:45.000
And this is the direction
five prime to three prime.

00:01:45.000 --> 00:01:49.000
We'll remember that all nucleic
acid polymerization goes five prime

00:01:49.000 --> 00:01:53.000
to three prime. So, what
happens is the cell begins

00:01:53.000 --> 00:01:57.000
scanning this message. And
it does that by this message

00:01:57.000 --> 00:02:01.000
being exported into the cytoplasm of
the cell. The ribosome coming along

00:02:01.000 --> 00:02:05.000
and glomming onto this
message and scanning on for the

00:02:05.000 --> 00:02:09.000
place to start.
It looks, it looks,

00:02:09.000 --> 00:02:13.000
it looks, it looks, and it
finds the first AUG. Footnote,

00:02:13.000 --> 00:02:18.000
this isn't 100% true. There are
occasional messages that start their

00:02:18.000 --> 00:02:22.000
translation not at an AUG,
and there are even occasional,

00:02:22.000 --> 00:02:27.000
there are even more messages that
don't quite start at the first AUG

00:02:27.000 --> 00:02:31.000
because the ribosome really is
looking for something a little bit

00:02:31.000 --> 00:02:36.000
special, but to a first
order approximation.

00:02:36.000 --> 00:02:40.000
Good enough for the textbooks.
It goes along to the first AUG.

00:02:40.000 --> 00:02:44.000
In reality it's a little
more subtle than that.

00:02:44.000 --> 00:02:48.000
But it starts at the first AUG.
And what it does is it builds a

00:02:48.000 --> 00:02:52.000
protein that corresponds to it
according to a three letter genetic

00:02:52.000 --> 00:02:56.000
code. And you all know the
lookup table. It's in your book.

00:02:56.000 --> 00:03:00.000
AUG, always the first amino
acid put in. A methionine.

00:03:00.000 --> 00:03:04.000
Then AAG. Lysine, I
think. Then arginine.

00:03:04.000 --> 00:03:08.000
Then a proline. Now, I mean
this is this particular sequence.

00:03:08.000 --> 00:03:12.000
Any other sequence would
be different. Et cetera.
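
NOTE
A rough sketch of the scanning and lookup just described, not a full translator. Assumptions: the codon table is deliberately partial (only the codons in this example), the name translate is illustrative, and the trailing UAG stands in for the "et cetera, out to a UAG" part of the message.
```python
# Partial codon table: only the codons needed for the lecture's example.
CODON_TABLE = {"AUG": "Met", "AAG": "Lys", "AGG": "Arg", "CCC": "Pro"}
STOP_CODONS = {"UAA", "UAG", "UGA"}
def translate(mrna):
    """Scan 5' to 3' for the first AUG, then read triplets until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # stop codons add no amino acid
        peptide.append(CODON_TABLE.get(codon, "???"))  # "???" marks codons missing from the partial table
    return peptide
# The sequence copied on the board, with the final UAG appended (assumed placement).
example = "AUACGAUGAAGAGGCCC" + "UAG"
print(translate(example))  # ['Met', 'Lys', 'Arg', 'Pro']
```
The leading AUACG is skipped because, to a first approximation, reading starts at the first AUG.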

00:03:12.000 --> 00:03:17.000
How does it accomplish this
matching between three letters of

00:03:17.000 --> 00:03:21.000
the genetic code? Oh,
and when it gets to UAG,

00:03:21.000 --> 00:03:25.000
that is one of the three signals
for stop, don't put in any more amino

00:03:25.000 --> 00:03:30.000
acids. There are three
such stop signals.

00:03:30.000 --> 00:03:37.000
AUG, sorry, UAG, UGG and
U, oops, what did I just do

00:03:37.000 --> 00:03:44.000
here? Let's get that right.
UAG, UGG and UGA. Those are the

00:03:44.000 --> 00:03:51.000
three stop codons. So, how
many total codons are there?

00:03:51.000 --> 00:03:59.000
64 codons. Three of them
spell stop. 61 of them spell

00:03:59.000 --> 00:04:05.000
specific amino acids. And how
many amino acids are there?

00:04:05.000 --> 00:04:11.000
20. So, the average redundancy
is three. Some are specified by

00:04:11.000 --> 00:04:16.000
multiple codons. The
most extreme is some amino

00:04:16.000 --> 00:04:21.000
acids are specified by as
many as six codons. Did I,

00:04:21.000 --> 00:04:27.000
oh, thank you. Come back down. Of
course. U-A, so it's UAG, right?

00:04:27.000 --> 00:04:32.000
Sorry, UAA and
UGA and UAG.
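
NOTE
The codon arithmetic can be checked with a short sketch. It simply enumerates every three-letter word over A, C, G, U and sets aside the three stop codons; the variable names are illustrative.
```python
from itertools import product
BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}
AMINO_ACIDS = 20
# Every possible triplet of the four RNA bases.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
# Codons that specify an amino acid rather than stop.
sense_codons = [c for c in all_codons if c not in STOP_CODONS]
print(len(all_codons))    # 64
print(len(sense_codons))  # 61
print(round(len(sense_codons) / AMINO_ACIDS, 2))  # 3.05 codons per amino acid on average
```
The average of about three hides the spread: some amino acids have one codon and some have as many as six.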

00:04:32.000 --> 00:04:37.000
Thank you. Very good. All right.
So, now, how does it accomplish

00:04:37.000 --> 00:04:42.000
this feat of taking amino acids,
of taking nucleotide sequence, RNA

00:04:42.000 --> 00:04:47.000
sequence and converting it into
the sequence of amino acids?

00:04:47.000 --> 00:04:52.000
As I mentioned last time, there
was lots of original somewhat

00:04:52.000 --> 00:04:57.000
nutty thinking about some looping
codes that would make the RNA fold

00:04:57.000 --> 00:05:03.000
up in such a way to bind
the amino acids and all that.

00:05:03.000 --> 00:05:12.000
But, as Francis Crick thought up,
there had to be some kind of an

00:05:12.000 --> 00:05:21.000
adapter molecule that would take
the RNA sequence and would somehow

00:05:21.000 --> 00:05:30.000
connect it up to the correct
amino acid, and that was UAC.

00:05:30.000 --> 00:05:35.000
A particular transfer RNA molecule.
And the tRNA molecule is an adapter

00:05:35.000 --> 00:05:40.000
sequence that has three nucleotides
here that match up to the three

00:05:40.000 --> 00:05:45.000
nucleotides of the codon that
we're trying to translate,

00:05:45.000 --> 00:05:50.000
and it has the appropriate amino
acid that's been stuck on the end of

00:05:50.000 --> 00:05:55.000
it. And how does it get there?
How does the right tRNA, the tRNA

00:05:55.000 --> 00:06:00.000
to match this codon have the
right amino acid put on it?

00:06:00.000 --> 00:06:03.000
There's a dedicated enzyme that
recognizes that tRNA and puts on

00:06:03.000 --> 00:06:07.000
that amino acid. It's
aminoacyl-tRNA synthetase.

00:06:07.000 --> 00:06:11.000
It sticks the right amino acid
on the right transfer RNA.

00:06:11.000 --> 00:06:15.000
So, that's how it accomplishes the
physical recognition of these three

00:06:15.000 --> 00:06:18.000
bases and has the right
amino acid attached to it.

00:06:18.000 --> 00:06:22.000
There's an enzymatic machinery
that has all of these tRNAs floating

00:06:22.000 --> 00:06:26.000
around in the cell which can be
used for this translation here.

00:06:26.000 --> 00:06:30.000
How does this actually
happen physically?

00:06:30.000 --> 00:06:35.000
It happens in this vast
machine called the ribosome.

00:06:35.000 --> 00:06:41.000
In the ribosome, if we have,
say, our codon here and we have a

00:06:41.000 --> 00:06:47.000
tRNA that, well, we'll
put that actually in the

00:06:47.000 --> 00:06:53.000
ribosome that, say, has
the first amino acid here,

00:06:53.000 --> 00:06:59.000
methionine, there's a cavity
for this guy and there's a cavity

00:06:59.000 --> 00:07:05.000
for the next guy. And other
tRNAs come into the cell

00:07:05.000 --> 00:07:11.000
carrying their next amino acid.
Maybe it will be here a lysine that

00:07:11.000 --> 00:07:17.000
matches up with the codon and the
anti-codon. And when the right tRNA

00:07:17.000 --> 00:07:23.000
fits in the next cavity over,
the ribosome itself catalyzes a

00:07:23.000 --> 00:07:30.000
peptide bond between
these amino acids.

00:07:30.000 --> 00:07:35.000
Then it chugs over by one, it
translocates by one moving this

00:07:35.000 --> 00:07:40.000
bit of the complex to the left,
and the peptide chain continues to

00:07:40.000 --> 00:07:45.000
grow out this end as each new
codon is moved into position,

00:07:45.000 --> 00:07:50.000
a tRNA comes in bringing the right
amino acid until finally a stop

00:07:50.000 --> 00:07:55.000
codon is hit. And what happens
when you hit a stop codon?

00:07:55.000 --> 00:08:00.000
It stops. And is there
a tRNA for a stop?

00:08:00.000 --> 00:08:02.000
It turns out there's not. There
actually isn't. There's some

00:08:02.000 --> 00:08:05.000
other factor. There's a protein
factor that helps recognize the

00:08:05.000 --> 00:08:08.000
stops. So, that just continues
to chug on. Those of you who are

00:08:08.000 --> 00:08:11.000
computer scientists or
mathematicians will recognize this

00:08:11.000 --> 00:08:14.000
is a two-tape Turing machine.
It is the smallest two-tape Turing

00:08:14.000 --> 00:08:17.000
machine that I know to exist. If
you don't know what that means,

00:08:17.000 --> 00:08:20.000
you can forget about that comment.
In any case, but some of you know

00:08:20.000 --> 00:08:23.000
what that is. So,
that's how it proceeds.

00:08:23.000 --> 00:08:26.000
That is your basic
protein translation.

00:08:26.000 --> 00:08:28.000
And, I must say, what I
really love about this was

00:08:28.000 --> 00:08:31.000
that Francis Crick kind of figured
out what had to happen just on first

00:08:31.000 --> 00:08:34.000
principles and was able to think
through it much more clearly and

00:08:34.000 --> 00:08:37.000
direct people to know what
to look for in the laboratory.

00:08:37.000 --> 00:08:40.000
And if people had not had the
clarity of thinking that Crick

00:08:40.000 --> 00:08:43.000
provided by saying, look,
there's got to be this kind of

00:08:43.000 --> 00:08:46.000
adapter, I don't think they
would have found it as quickly.

00:08:46.000 --> 00:08:49.000
But once he said this is
what you've got to look for,

00:08:49.000 --> 00:08:52.000
golly, it was there. You
can't do that very often,

00:08:52.000 --> 00:08:55.000
but Francis Crick seemed to have
a very good track record of doing

00:08:55.000 --> 00:08:58.000
those things. OK. So,
that was just finishing off

00:08:58.000 --> 00:09:02.000
translation. Now what
I'd like to do is turn to

00:09:02.000 --> 00:09:08.000
variations on the theme as
the major issue for today.

00:09:08.000 --> 00:09:14.000
How does this central dogma, DNA
replicates, is transcribed into

00:09:14.000 --> 00:09:20.000
RNA and is translated into protein,
vary amongst the different kinds of

00:09:20.000 --> 00:09:26.000
organisms that we
might be interested in?

00:09:26.000 --> 00:09:32.000
The kinds of organisms
we might be interested in,

00:09:32.000 --> 00:09:39.000
eukaryotes, prokaryotes,
viruses. Sample eukaryote,

00:09:39.000 --> 00:09:46.000
MIT undergraduate. Prokaryote,
E. coli. And virus, many possible

00:09:46.000 --> 00:09:53.000
viruses. The eukaryotes:
big nucleated cells.

00:09:53.000 --> 00:10:00.000
So, in here we're going to
have our nucleated cells.

00:10:00.000 --> 00:10:04.000
DNA living in there. In
our prokaryotes we have no

00:10:04.000 --> 00:10:09.000
distinct nucleus. The
DNA is not in a distinct

00:10:09.000 --> 00:10:14.000
nucleus, although it's not
entirely freely floating around.

00:10:14.000 --> 00:10:19.000
It tends to be clustered together.
In the virus the nucleic acid

00:10:19.000 --> 00:10:24.000
resides in some kind of a capsid,
some kind of a, it could be a

00:10:24.000 --> 00:10:29.000
protein capsid. There
are some of them that have

00:10:29.000 --> 00:10:34.000
lipid capsids with lipid particles
around them, but some kind of a coat

00:10:34.000 --> 00:10:39.000
around nucleic acid there. Do
they all do exactly the same

00:10:39.000 --> 00:10:44.000
things with regard to DNA
replication, RNA transcription and

00:10:44.000 --> 00:10:49.000
protein translation?
Well, not entirely. So,

00:10:49.000 --> 00:10:54.000
as a way, in a way to reinforce
what we know about these,

00:10:54.000 --> 00:11:00.000
let's look at how they differ.
DNA replication. Eukaryotes.

00:11:00.000 --> 00:11:06.000
What's the structure of one of
your chromosomes? Is it a long line,

00:11:06.000 --> 00:11:12.000
a long linear molecule, or
is it a circular molecule?

00:11:12.000 --> 00:11:18.000
How many of you have linear
chromosomes? How many of you have

00:11:18.000 --> 00:11:24.000
circular chromosomes? I heard
there were some people with

00:11:24.000 --> 00:11:30.000
circular. And how many of you
are unsure about your chromosomes?

00:11:30.000 --> 00:11:38.000
OK. That's good. Well,
then I'm pleased to inform

00:11:38.000 --> 00:11:47.000
you that you have long linear
chromosomes. Every human chromosome

00:11:47.000 --> 00:11:55.000
is a long double-stranded molecule
of DNA. Linear double-stranded DNA.

00:11:55.000 --> 00:12:02.000
They can be extremely long.
You have 23 chromosomes,

00:12:02.000 --> 00:12:08.000
and together they make up three
billion nucleotides of DNA.

00:12:08.000 --> 00:12:13.000
A typical chromosome could be 150
million bases long as an average

00:12:13.000 --> 00:12:19.000
size for a chromosome.
And it's a single connected

00:12:19.000 --> 00:12:24.000
molecule. 150 million bases long in
the human is a typical chromosome.

00:12:24.000 --> 00:12:30.000
One tricky little bit
about replicating DNA.

00:12:30.000 --> 00:12:34.000
Let's just think back to our
little model of replicating DNA.

00:12:34.000 --> 00:12:38.000
Let's come to the chromosome end
here. It's five prime to three

00:12:38.000 --> 00:12:42.000
prime. Five prime to three prime.
We're going to start replicating.

00:12:42.000 --> 00:12:47.000
We're getting to the end
of chromosome number one.

00:12:47.000 --> 00:12:51.000
We've got a primer here, and
the primer is going to be used

00:12:51.000 --> 00:12:55.000
to extend, extend, extend.
We get right to the end.

00:12:55.000 --> 00:13:00.000
That's good. Tell me how
we're going to replicate back.

00:13:00.000 --> 00:13:04.000
We need a little primer to start
it, right? And where's that primer

00:13:04.000 --> 00:13:09.000
going to land? Maybe
over here it will start

00:13:09.000 --> 00:13:14.000
replicating back. Oh,
boy, we haven't done this

00:13:14.000 --> 00:13:19.000
figure. So, what do we have to
do there? So, we need a primer a

00:13:19.000 --> 00:13:24.000
little further back.
OK. But, you know what,

00:13:24.000 --> 00:13:29.000
the chance that we're going
to get that right at the end,

00:13:29.000 --> 00:13:34.000
that we're going to get a primer
exactly at the end is pretty low.

00:13:34.000 --> 00:13:37.000
And if we don't have a
primer exactly at the end,

00:13:37.000 --> 00:13:41.000
what's going to be wrong with
that copy of the chromosome?

00:13:41.000 --> 00:13:45.000
Too short. Now, big deal. So,
it's short by maybe 20 bases.

00:13:45.000 --> 00:13:49.000
But that's just this cell division.
What about next cell division? It

00:13:49.000 --> 00:13:53.000
will be short on average by a little
bit, and then the next cell division

00:13:53.000 --> 00:13:57.000
and the next cell division.
It's actually pretty tricky to

00:13:57.000 --> 00:14:01.000
replicate a linear chromosome
on the lagging strand,

00:14:01.000 --> 00:14:05.000
unless you can land the primer
in exactly the right place,

00:14:05.000 --> 00:14:09.000
which doesn't happen. So,
a special little solution is

00:14:09.000 --> 00:14:15.000
used. The ends of chromosomes
here are called telomeres,

00:14:15.000 --> 00:14:21.000
telo meaning end. These telomeres
have very specific structures.

00:14:21.000 --> 00:14:27.000
In the human they repeat,
T-T-A-G-G-G, again and

00:14:27.000 --> 00:14:32.000
again and again. At the end
of the chromosome there's

00:14:32.000 --> 00:14:36.000
a special enzyme that will come
along and add some extra telomere to

00:14:36.000 --> 00:14:41.000
the chromosome. That,
sorry? Did I say leading

00:14:41.000 --> 00:14:46.000
strand? It's the, oh, yeah,
sorry. It's the lagging,

00:14:46.000 --> 00:14:50.000
sorry. It's the leading strand. No,
no, no, this is the lagging strand.

00:14:50.000 --> 00:14:55.000
This is the leading strand because
it's running along happily not

00:14:55.000 --> 00:15:00.000
having to make a primer. The
Okazaki fragment should be here.

00:15:00.000 --> 00:15:04.000
I'll stick by that.
We'll debate it later.

00:15:04.000 --> 00:15:08.000
Anyway, they, we get the point.
But it's lagging because you've got

00:15:08.000 --> 00:15:12.000
the Okazaki fragments there.
So, anyway, we have a problem of

00:15:12.000 --> 00:15:16.000
replication. And the way the cell
solves it is the actual replication

00:15:16.000 --> 00:15:20.000
is shorter, but since it manages to
stick some repeat at the end of the

00:15:20.000 --> 00:15:24.000
chromosome it adds back some
more T-T-A-G-G-G, T-T-A-G-G-G,

00:15:24.000 --> 00:15:29.000
T-T-A-G-G-G, and it keeps
dynamically adding more.

00:15:29.000 --> 00:15:33.000
What do you think would happen if
you didn't, or what's the enzyme

00:15:33.000 --> 00:15:37.000
that adds telomeres?
Telomerase. Telomerase adds that.
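
NOTE
A toy model of the end-replication problem, under loud assumptions: the 20-base loss per division is the lecture's rough figure, while the starting telomere length and the number of repeats added back are made-up illustrative values, not measurements.
```python
REPEAT = "TTAGGG"        # human telomere repeat
LOSS_PER_DIVISION = 20   # bases lost per division (rough figure from the lecture)
def telomere_length_after(divisions, start_length, repeats_added_per_division=0):
    """Track telomere length in bases across a number of cell divisions."""
    length = start_length
    for _ in range(divisions):
        length = max(0, length - LOSS_PER_DIVISION)          # the copy comes out short
        length += repeats_added_per_division * len(REPEAT)   # repeats added back, if any
    return length
start = 6000  # hypothetical starting telomere length in bases
print(telomere_length_after(100, start))                                # 4000, no repeats added back
print(telomere_length_after(100, start, repeats_added_per_division=4))  # 6400, repeats added back
```
With nothing added back the end erodes steadily; adding a few repeats per division is enough to hold the length or grow it.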

00:15:37.000 --> 00:15:41.000
What cells do you think need
to have active telomerase?

00:15:41.000 --> 00:15:45.000
Rapidly dividing cells would
need to have telomerase.

00:15:45.000 --> 00:15:49.000
Cells that are not rapidly dividing,
cells that have stopped dividing can

00:15:49.000 --> 00:15:53.000
shut off their telomerase.
But if a cell is going to go

00:15:53.000 --> 00:15:57.000
through lots and lots of
cell divisions it's got to,

00:15:57.000 --> 00:16:01.000
it's got to tidy up its telomeres
each time because they're

00:16:01.000 --> 00:16:06.000
getting too short. You've
got to have an enzyme that's

00:16:06.000 --> 00:16:10.000
adding back ends of chromosomes.
What cells do you think

00:16:10.000 --> 00:16:14.000
particularly care about
having telomerase on them?

00:16:14.000 --> 00:16:19.000
Cancers. It turns out that
this is not a trivial point.

00:16:19.000 --> 00:16:23.000
More than 90% of cancers turn
on actively the telomerase gene,

00:16:23.000 --> 00:16:27.000
which would be shut off in
normal cells because the cell is

00:16:27.000 --> 00:16:32.000
not dividing anymore. Part
of becoming a cancer is having

00:16:32.000 --> 00:16:36.000
to turn on this repair mechanism for
the ends, this extension mechanism

00:16:36.000 --> 00:16:40.000
for the ends of your chromosomes.
And so, various people are trying

00:16:40.000 --> 00:16:44.000
to make drugs to inhibit cancers by
inhibiting this telomerase enzyme.

00:16:44.000 --> 00:16:49.000
So, understanding just your linear
replication of chromosomes is a kind

00:16:49.000 --> 00:16:53.000
of useful thing even in
dealing with things like cancer.

00:16:53.000 --> 00:16:57.000
Genome sizes. I mentioned,
how big was the human genome?

00:16:57.000 --> 00:17:02.000
Three times ten to the ninth
bases. The mouse genome?

00:17:02.000 --> 00:17:06.000
It's almost as big, about
2.7 times ten to the ninth

00:17:06.000 --> 00:17:11.000
bases, 2.7 billion bases. The
elephant genome? I actually

00:17:11.000 --> 00:17:15.000
just found this out last week
because we just finished sequencing

00:17:15.000 --> 00:17:20.000
elephant DNA, and I can
now tell you I think it's 3.

00:17:20.000 --> 00:17:25.000
. The dog is 2. times
ten to the ninth.

00:17:25.000 --> 00:17:29.000
Anyway, it's about, for most
mammals it's pretty close

00:17:29.000 --> 00:17:33.000
to three billion bases. And
there is some fluctuation.

00:17:33.000 --> 00:17:37.000
Some are a little bigger.
Some are a little smaller.

00:17:37.000 --> 00:17:41.000
It doesn't scale with
the size of the animal, though,

00:17:41.000 --> 00:17:45.000
because the dog has a smaller genome,
for example, than the mouse does,

00:17:45.000 --> 00:17:48.000
but the elephant is a bit bigger
than us. And check in later in the

00:17:48.000 --> 00:17:52.000
term, I'll tell you about the
aardvark. We should know in a

00:17:52.000 --> 00:17:56.000
little while. But here are, for
example, fruit flies. The fruit

00:17:56.000 --> 00:18:00.000
fly, it has a genome of
two times ten to the eighth.

00:18:00.000 --> 00:18:04.000
I'm giving, I'm being
quite approximate. In fact,

00:18:04.000 --> 00:18:08.000
I'll make it, I'll give you
1.5 times ten to the eighth. 150

00:18:08.000 --> 00:18:12.000
million bases.
Yeast, by contrast,

00:18:12.000 --> 00:18:17.000
has a genome of 1.2 times ten to
the seventh. So, that's 12 million,

00:18:17.000 --> 00:18:21.000
150 million give or take,
and about three billion,

00:18:21.000 --> 00:18:25.000
so 3,000 million. So,
genome sizes can vary quite

00:18:25.000 --> 00:18:30.000
dramatically amongst
different eukaryotes.

00:18:30.000 --> 00:18:36.000
Now, what about prokaryotes?
How do the prokaryotes differ?

00:18:36.000 --> 00:18:43.000
Prokaryotes differ because their
genomes are typically not linear

00:18:43.000 --> 00:18:49.000
chromosomes. The typical
prokaryotic chromosome is a

00:18:49.000 --> 00:18:56.000
double-stranded circle. It's
a double-stranded circular DNA.

00:18:56.000 --> 00:19:02.000
Now, the
double-stranded circular

00:19:02.000 --> 00:19:07.000
DNA doesn't have this problem
of telomeres. You just keep

00:19:07.000 --> 00:19:12.000
replicating around and you get to
the end. So, there you have a much

00:19:12.000 --> 00:19:17.000
simpler replication system than
having to worry about your ends of

00:19:17.000 --> 00:19:22.000
chromosomes. You also
have much smaller genomes.

00:19:22.000 --> 00:19:27.000
The typical prokaryotic genome size,
it's on the order of a few million

00:19:27.000 --> 00:19:31.000
bases. E. coli, 4.6 million
bases. There are, for example,

00:19:31.000 --> 00:19:35.000
mycobacteria, such as the
mycobacteria that cause

00:19:35.000 --> 00:19:39.000
tuberculosis or leprosy,
have on the order of, well,

00:19:39.000 --> 00:19:43.000
actually, not quite them, but other
mycobacteria have on the order of

00:19:43.000 --> 00:19:46.000
about a million bases or so.
Mycoplasma, M. genitalium, has

00:19:46.000 --> 00:19:50.000
actually slightly less
than a million bases.

00:19:50.000 --> 00:19:54.000
So, these are basically
several million bases.

00:19:54.000 --> 00:19:58.000
So, there's a huge
variation in genome size.
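
NOTE
To put the variation in numbers, here is a small sketch using only the approximate genome sizes quoted in this lecture (in bases). The dictionary and function names are illustrative.
```python
# Approximate genome sizes in bases, as quoted in the lecture.
GENOME_SIZES = {
    "human": 3.0e9,
    "mouse": 2.7e9,
    "fruit fly": 1.5e8,
    "yeast": 1.2e7,
    "E. coli": 4.6e6,
}
def fold_difference(larger, smaller):
    """How many times bigger one genome is than another."""
    return GENOME_SIZES[larger] / GENOME_SIZES[smaller]
for name in ("mouse", "fruit fly", "yeast", "E. coli"):
    print(f"human / {name}: {fold_difference('human', name):,.0f}x")
```
On these round figures the human genome comes out several hundred times larger than E. coli's, roughly three orders of magnitude.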

00:19:58.000 --> 00:20:02.000
Your genome is about a
thousand times bigger than E.

00:20:02.000 --> 00:20:07.000
coli's genome. Now, you do actually
have one circular chromosome.

00:20:07.000 --> 00:20:11.000
Do you know what it is? I speak
about the 23 pairs of human

00:20:11.000 --> 00:20:16.000
chromosomes. There's actually
one more human chromosome.

00:20:16.000 --> 00:20:20.000
The mitochondria have their
own chromosome. It's a circle.

00:20:20.000 --> 00:20:25.000
That's very odd that you would have
one chromosome that's a circle that

00:20:25.000 --> 00:20:30.000
looks like a bacterial chromosome.
Do you know why that is?

00:20:30.000 --> 00:20:34.000
The mitochondria arose as a
symbiotic bacterium that became a

00:20:34.000 --> 00:20:38.000
symbiont of eukaryotic cells
about 1. billion years ago.

00:20:38.000 --> 00:20:42.000
It was a bacterium taken up
into another cell, and that's how

00:20:42.000 --> 00:20:46.000
eukaryotes evolved. And
we can even see that little

00:20:46.000 --> 00:20:50.000
signature of it having been a
prokaryote from the fact that it's

00:20:50.000 --> 00:20:54.000
got one of these circular
prokaryotic looking chromosomes.

00:20:54.000 --> 00:20:58.000
Now, it, because it's living in
your cells, has thrown out all sorts

00:20:58.000 --> 00:21:02.000
of genes that it doesn't
need anymore because the main,

00:21:02.000 --> 00:21:06.000
the nucleus supplies
most of the proteins.

00:21:06.000 --> 00:21:10.000
So, your mitochondrial genome
is a circle that's a mere 16,

00:21:10.000 --> 00:21:14.000
00 bases long. It's a very small
circle encoding a very limited

00:21:14.000 --> 00:21:18.000
number of genes,
but it's, in fact,

00:21:18.000 --> 00:21:22.000
the residue of the bacterial
symbiont that led to the formation

00:21:22.000 --> 00:21:26.000
of euks. Now, viruses,
what do viruses have?

00:21:26.000 --> 00:21:30.000
Do they have double-stranded
linear chromosomes? Which is it?

00:21:30.000 --> 00:21:38.000
Is it double-stranded linear DNA or
is it double-stranded circular DNA?

00:21:38.000 --> 00:21:46.000
Circular DNA. So, who votes for
linear? Who votes for circular?

00:21:46.000 --> 00:21:54.000
Who's undecided? Ah, the
undecided are very large here. So,

00:21:54.000 --> 00:22:01.000
the answer is both. Some
viruses have double-stranded

00:22:01.000 --> 00:22:07.000
linear DNA. Some viruses have
double-stranded circular DNA.

00:22:07.000 --> 00:22:14.000
It's worse than that, though.
Some viruses have single-stranded

00:22:14.000 --> 00:22:20.000
linear, circular DNA.
Ha? How does that work?

00:22:20.000 --> 00:22:26.000
Some viruses actually infect
the cell injecting DNA,

00:22:26.000 --> 00:22:32.000
and it's just single-stranded.
As soon as it gets into the cell,

00:22:32.000 --> 00:22:36.000
however, it's replicated to make a
double-stranded DNA which can then

00:22:36.000 --> 00:22:41.000
be transcribed, et
cetera, et cetera,

00:22:41.000 --> 00:22:46.000
et cetera. But it travels around
as a single-stranded piece of DNA.

00:22:46.000 --> 00:22:50.000
And it's actually weirder
than that. Some viruses,

00:22:50.000 --> 00:22:55.000
viruses being very small
can experiment with all

00:22:55.000 --> 00:23:02.000
sorts of things. Some viruses
actually consist not of

00:23:02.000 --> 00:23:10.000
DNA at all but of RNA,
single-stranded RNA. How does it do

00:23:10.000 --> 00:23:18.000
that? So, in other words, in
the capsid there's a single

00:23:18.000 --> 00:23:26.000
strand of RNA. When
it gets into the cell,

00:23:26.000 --> 00:23:32.000
what does it do?
Sorry? It creates DNA.

00:23:32.000 --> 00:23:36.000
How does it create DNA? From
the RNA. How's it going to do

00:23:36.000 --> 00:23:41.000
that? Well, how is it going
to turn itself into DNA?

00:23:41.000 --> 00:23:46.000
It needs an enzyme to do that?
Reverse transcriptase. You would

00:23:46.000 --> 00:23:50.000
like to reverse the transcription
process, and you would like to name

00:23:50.000 --> 00:23:55.000
that reverse transcriptase. And
where are you going to get this

00:23:55.000 --> 00:24:00.000
reverse transcriptase from?
Laying around. Laying around where?

00:24:00.000 --> 00:24:04.000
I mean the cell is just sitting
there with reverse transcriptase

00:24:04.000 --> 00:24:08.000
waiting to obligingly
reverse transcribe this virus?

00:24:08.000 --> 00:24:12.000
Your RNA. So make it how? With
ribosomes. So, in other words,

00:24:12.000 --> 00:24:17.000
if I'm an RNA, why don't
I encode the sequence for

00:24:17.000 --> 00:24:21.000
reverse transcriptase and
actually translate myself.

00:24:21.000 --> 00:24:25.000
So, if you were really clever,
you might decide to put in the

00:24:25.000 --> 00:24:30.000
genetic code for
reverse transcriptase.

00:24:30.000 --> 00:24:36.000
And when that message gets into the
cell, it will first act as an mRNA,

00:24:36.000 --> 00:24:42.000
a messenger RNA, translate, make,
here's the reverse transcriptase

00:24:42.000 --> 00:24:48.000
enzyme, which is then going to go,
and it's going to reverse transcribe

00:24:48.000 --> 00:24:54.000
this thing into,
say, DNA. So, wow.

00:24:54.000 --> 00:25:00.000
Now, that's a good one.
This is a plus-strand virus.

00:25:00.000 --> 00:25:05.000
It encodes its own reverse
transcriptase in its instructions.

00:25:05.000 --> 00:25:10.000
There actually are
minus-strand viruses that don't,

00:25:10.000 --> 00:25:15.000
but what they do is instead in their
own code, in their own package bring

00:25:15.000 --> 00:25:20.000
along a reverse transcriptase.
So, either you can encode your own

00:25:20.000 --> 00:25:25.000
reverse transcriptase or in the
package you can include your own

00:25:25.000 --> 00:25:30.000
reverse transcriptase.
Do you know any viruses?

00:25:30.000 --> 00:25:36.000
And then the reverse transcriptase
is then used to transcribe the DNA,

00:25:36.000 --> 00:25:43.000
the RNA into DNA, and eventually
into a double-stranded DNA which,

00:25:43.000 --> 00:25:49.000
in some of the viruses, can then be
slammed into and inserted into your

00:25:49.000 --> 00:25:56.000
own chromosomes. So, a DNA
copy of the virus can be

00:25:56.000 --> 00:26:03.000
installed into your own chromosomes,
which is somewhat insidious.
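
NOTE
A minimal sketch of the base-pairing logic behind going from RNA to DNA and then to double-stranded DNA. It models only complementarity, not the enzyme itself; the function names and the short example sequence are illustrative.
```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}
def reverse_transcribe(rna):
    """Return the DNA strand complementary to an RNA, written 5' to 3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna))
def second_strand(dna):
    """Return the complementary DNA strand, written 5' to 3'."""
    return "".join(DNA_TO_DNA[base] for base in reversed(dna))
rna = "AUGAAGAGGCCC"
cdna = reverse_transcribe(rna)
print(cdna)                 # GGGCCTCTTCAT
print(second_strand(cdna))  # ATGAAGAGGCCC, the RNA sequence with T in place of U
```
The second DNA strand reproduces the original RNA sequence in DNA letters, which is the double-stranded copy that can then be inserted into a chromosome.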

00:26:03.000 --> 00:26:08.000
What viruses do you know that
do this? HIV. More generally

00:26:08.000 --> 00:26:13.000
retroviruses are the class
of these viruses that can,

00:26:13.000 --> 00:26:19.000
in fact, run this replication
process from RNA to DNA and install

00:26:19.000 --> 00:26:24.000
DNA copies of them in your genome.
And how do you then get the DNA

00:26:24.000 --> 00:26:30.000
copy out of your
genome? You don't.

00:26:30.000 --> 00:26:33.000
It doesn't come out.
Retroviral insertions don't come

00:26:33.000 --> 00:26:37.000
out. That's one of the issues in
dealing with AIDS is once this DNA

00:26:37.000 --> 00:26:40.000
copy is in a cell it's not coming
out. We have no way to remove it.

00:26:40.000 --> 00:26:44.000
We have to make sure that the virus
is shut down by other mechanisms

00:26:44.000 --> 00:26:47.000
that might inhibit its products,
et cetera, but once it's stuck a DNA

00:26:47.000 --> 00:26:51.000
copy into your chromosomes, you
know, there's no way of getting

00:26:51.000 --> 00:26:55.000
it out. So, if we had
to try to inhibit the

00:26:55.000 --> 00:27:00.000
action of the AIDS virus, we
might wish to make inhibitors of

00:27:00.000 --> 00:27:05.000
this aspect of replication,
inhibitors of reverse transcription.

00:27:05.000 --> 00:27:10.000
And, of course, as probably many of
you know, some of the important AIDS

00:27:10.000 --> 00:27:15.000
drugs are reverse transcriptase
inhibitors, very important to

00:27:15.000 --> 00:27:20.000
limiting the replication of the
AIDS virus. And there are many other

00:27:20.000 --> 00:27:25.000
kinds of weirdnesses.
Viruses pretty much explore,

00:27:25.000 --> 00:27:30.000
everything you possibly can do,
viruses come up with ways to do.

00:27:30.000 --> 00:27:35.000
Let's take now the
process of transcription.

00:27:35.000 --> 00:27:40.000
We have replication up there.
Let's look at transcription. And

00:27:40.000 --> 00:27:45.000
this time let's start with
prokaryotes. For the simple aspect

00:27:45.000 --> 00:27:50.000
of transcribing genes, the
prokaryotic genome looks just

00:27:50.000 --> 00:27:55.000
like the simple model I gave you.
There is some kind of a promoter

00:27:55.000 --> 00:28:00.000
that tells RNA polymerase
to come sit down here.

00:28:00.000 --> 00:28:07.000
RNA polymerase hops on, RNA
polymerase begins to copy in RNA,

00:28:07.000 --> 00:28:15.000
and eventually it hits the signal
that says to terminate transcription.

00:28:15.000 --> 00:28:22.000
OK. This is not a stop codon,
which is about translation.

00:28:22.000 --> 00:28:30.000
This is a termination
of transcription.

00:28:30.000 --> 00:28:35.000
And this RNA then goes off.
A perfectly happy thing, a

00:28:35.000 --> 00:28:40.000
messenger RNA, mRNA.
So, there's nothing weird

00:28:40.000 --> 00:28:46.000
about proks compared to the simple
description that we gave before.

00:28:46.000 --> 00:28:51.000
But eukaryotes are different.
There are some funny things that

00:28:51.000 --> 00:28:57.000
happen in the eukaryote. Well,
first off it starts the same.

00:28:57.000 --> 00:29:03.000
There's a promoter. RNA
polymerase sits down there,

00:29:03.000 --> 00:29:09.000
it starts transcribing, it makes
an mRNA, it hits the transcriptional

00:29:09.000 --> 00:29:16.000
termination signal, it
stops, and then this RNA gets

00:29:16.000 --> 00:29:22.000
processed in interesting ways.
The first thing that happens is

00:29:22.000 --> 00:29:29.000
three modifications happen. The
first one is at the five prime

00:29:29.000 --> 00:29:35.000
end, remember five prime to three
prime, a funny modification is put

00:29:35.000 --> 00:29:41.000
on. It's a, if the message, say,
were A-U-C-U-G-G-C et cetera,

00:29:41.000 --> 00:29:47.000
a G triphosphate is put on
backwards. It's actually a methyl G

00:29:47.000 --> 00:29:53.000
triphosphate is put on backwards,
so going in the other direction.

00:29:53.000 --> 00:30:00.000
You have the triphosphate
bond there, a methyl G.

00:30:00.000 --> 00:30:04.000
And the only thing that
you should care about that,

00:30:04.000 --> 00:30:09.000
I don't care if you know the
structure, is that there's a funny

00:30:09.000 --> 00:30:13.000
cap. This thing is called a
cap that is put on this message.

00:30:13.000 --> 00:30:18.000
And that cap is very important
to signaling to the cell this is a

00:30:18.000 --> 00:30:23.000
messenger RNA to be dealt with in a
certain way, to get the ribosome to

00:30:23.000 --> 00:30:27.000
hop on, to get this thing
processed properly, et cetera.

00:30:27.000 --> 00:30:32.000
At the other end of the message
a long string of As is added

00:30:32.000 --> 00:30:37.000
to messenger RNAs. This
long string of As is called,

00:30:37.000 --> 00:30:41.000
very sensibly, a poly A tail.
The poly A tail is added to the

00:30:41.000 --> 00:30:46.000
messenger RNA,
and very often,

00:30:46.000 --> 00:30:51.000
I mean it's, if you wanted to purify
messenger RNAs from your own human

00:30:51.000 --> 00:30:55.000
cells, you can actually use poly T
as a reagent because it turns out,

00:30:55.000 --> 00:31:00.000
because messenger RNAs have
a poly A tail, they'll bind to

00:31:00.000 --> 00:31:04.000
and stick to poly T. So,
people actually purify messenger

00:31:04.000 --> 00:31:08.000
RNAs by binding them to poly
T and they get the poly A tail.

00:31:08.000 --> 00:31:11.000
But it is broadly believed that the
reason for this poly A tail is not

00:31:11.000 --> 00:31:15.000
to make things convenient for
molecular biologists to purify

00:31:15.000 --> 00:31:18.000
messages. To the contrary, it
is an important function for the

00:31:18.000 --> 00:31:22.000
cell. And it turns out that this
is important in regulating the

00:31:22.000 --> 00:31:25.000
stability of messages. If,
in fact, you don't have a poly

00:31:25.000 --> 00:31:29.000
A tail, if you contrive to make the
same message without the poly A tail,

00:31:29.000 --> 00:31:33.000
the message will be
degraded rather rapidly.

00:31:33.000 --> 00:31:36.000
And the lengths of the poly A tails
control aspects of the degradation,

00:31:36.000 --> 00:31:39.000
et cetera. So, in a
complex eukaryotic cell,

00:31:39.000 --> 00:31:43.000
already it has to attach
a little signal at the front,

00:31:43.000 --> 00:31:46.000
some signals at the back that
say, process me in a certain way,

00:31:46.000 --> 00:31:49.000
et cetera, don't degrade me yet.
You could even imagine that this

00:31:49.000 --> 00:31:53.000
poly A tail could serve as a little
bit of a clock for how long that

00:31:53.000 --> 00:31:56.000
message sticks around. It's
not quite that simple but

00:31:56.000 --> 00:31:59.000
there are ways to do it. But
all of these pale in comparison

00:31:59.000 --> 00:32:03.000
to the third way in which eukaryotic
messages differ from prokaryotic

00:32:03.000 --> 00:32:10.000
messages. These small
modifications are,

00:32:10.000 --> 00:32:21.000
as I say, small. The most striking
way in which they differ is that

00:32:21.000 --> 00:32:33.000
only a small portion often
of the gene, here's my gene,

00:32:33.000 --> 00:32:44.000
matters for the protein that
is made. So, my mRNA gets made.

00:32:44.000 --> 00:32:54.000
It includes the whole long sequence.
And then the cell comes along and

00:32:54.000 --> 00:33:05.000
splices this message together.
So, this is the immature RNA.

00:33:05.000 --> 00:33:13.000
It is processed by clipping
out this, clipping out this,

00:33:13.000 --> 00:33:22.000
clipping out this. And what you get
is a splice where the mature message

00:33:22.000 --> 00:33:30.000
throws this stuff out,
splices between here and here,

00:33:30.000 --> 00:33:39.000
splices here, splices here,
splices here, and you get a

00:33:39.000 --> 00:33:47.000
much shorter mRNA. And
this is a mature mRNA.
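
NOTE
Editorial sketch, not part of the lecture: the clipping-and-joining step described above can be illustrated in a few lines of Python. The toy sequence and the exon coordinates are invented for illustration (exons in upper case, introns in lower case); a real cell does not receive the coordinates, it has to find them.
```python
# Hypothetical pre-mRNA: exons in upper case, introns in lower case (illustration only).
PRE_MRNA = "AUGGCC" + "guaaguuucag" + "GAGAAG" + "guccuuaaag" + "UUUUAG"
# Exon coordinates as (start, end), 0-based and end-exclusive, chosen by hand for this toy sequence.
EXONS = [(0, 6), (17, 23), (33, 39)]
def splice(pre_mrna, exons):
    """Join the exon segments in order, discarding everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)
mature = splice(PRE_MRNA, EXONS)
print(mature)                      # exons only, introns removed
print(len(PRE_MRNA), len(mature))  # immature length vs. mature length
```
The mature message is shorter than the immature RNA because only the exon segments survive the join.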

00:33:47.000 --> 00:33:53.000
This splicing is a remarkable
phenomenon. In fact,

00:33:53.000 --> 00:34:00.000
it was discovered by Phil Sharp
here, for which he won a Nobel prize.

00:34:00.000 --> 00:34:04.000
This splicing is a very
complex operation. First off,

00:34:04.000 --> 00:34:08.000
how does, well, actually,
what accomplishes splicing?

00:34:08.000 --> 00:34:12.000
It should be splicase, right?
But it turns out it's not a single

00:34:12.000 --> 00:34:17.000
enzyme. It's a big body of stuff.
So, instead it's the spliceosome, OK.

00:34:17.000 --> 00:34:21.000
Everything is either -ase or
-some or something like that.

00:34:21.000 --> 00:34:25.000
So, it turns out it's the
spliceosome that does that.

00:34:25.000 --> 00:34:30.000
It's just wonderful how all those
names work out. The spliceosome.

00:34:30.000 --> 00:34:36.000
The spliceosome comes along and
splices it. How does the spliceosome

00:34:36.000 --> 00:34:42.000
know how to do this? Well,
there are kind of codes.

00:34:42.000 --> 00:34:48.000
It turns out that there is some
information encoded along in these

00:34:48.000 --> 00:34:54.000
messages. It turns out that
there are, you know, slight biases.

00:34:54.000 --> 00:35:00.000
Typically the sequence just after
where the splice starts here is a GU

00:35:00.000 --> 00:35:06.000
and the sequence here is an AG,
but that's obviously not enough

00:35:06.000 --> 00:35:10.000
information, right? It's not
enough bases of information

00:35:10.000 --> 00:35:14.000
to get this right. And
so there are a few more

00:35:14.000 --> 00:35:18.000
preferences for which bases are used, but
the truth is we don't fully know.

00:35:18.000 --> 00:35:21.000
Our best picture right now
involves some cellular factors that help

00:35:21.000 --> 00:35:25.000
recognize the parts that are
supposed to stay in some sequences

00:35:25.000 --> 00:35:29.000
here. But the truth is we
don't have the simple codes.

00:35:29.000 --> 00:35:33.000
Because if we had the simple codes,
we'd be able to take a long stretch

00:35:33.000 --> 00:35:37.000
of DNA and figure out exactly
where the splices go based on just

00:35:37.000 --> 00:35:42.000
computer analysis. And
we can't do that so well.
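
NOTE
Editorial sketch, not part of the lecture: one way to see why the GU and AG signals alone are "not enough bases of information" is to list every stretch of a sequence that starts with GU and ends with AG. The sequence below and the min_len cutoff are assumptions made up for illustration; this is a naive scan, not a real splice-site predictor.
```python
import re
def candidate_introns(rna, min_len=4):
    """Return every (start, end) span that begins with GU and ends with AG."""
    # Lookahead patterns so that overlapping occurrences are all reported.
    donors = [m.start() for m in re.finditer("(?=GU)", rna)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", rna)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]
rna = "AUGGUAAGUCCAGGUUAGCAGGAG"  # hypothetical 24-base stretch
spans = candidate_introns(rna)
print(len(spans), "candidate GU...AG spans in", len(rna), "bases")
for start, end in spans:
    print(start, end, rna[start:end])
```
Even in a very short sequence, many GU...AG spans qualify, so additional signals are needed to pick the right ones.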

00:35:42.000 --> 00:35:46.000
These bits that stay in are called
exons. The bits that go out are

00:35:46.000 --> 00:35:51.000
called introns. This is
the source of extraordinary

00:35:51.000 --> 00:35:55.000
confusion for students because you
might think that the bits that are

00:35:55.000 --> 00:36:00.000
excised are the
exons, but they're not.

00:36:00.000 --> 00:36:04.000
The bits that stay in are the exons.
Why are they called exons if they

00:36:04.000 --> 00:36:08.000
stay in and ex is a prefix meaning
out? Well, because the introns are

00:36:08.000 --> 00:36:12.000
named because they're intervening
sequences. Once the introns,

00:36:12.000 --> 00:36:17.000
the intervening sequences were named
as intervening sequences or introns,

00:36:17.000 --> 00:36:21.000
you were stuck then having to name
the things that stay in as exons.

00:36:21.000 --> 00:36:25.000
This was all done by a Harvard
professor, don't blame me.

00:36:25.000 --> 00:36:30.000
In any case, a good
friend Harvard professor.

00:36:30.000 --> 00:36:37.000
But, nonetheless, I'm not
sure that this was the best

00:36:37.000 --> 00:36:44.000
way to name them. But
you're stuck with it.

00:36:44.000 --> 00:36:52.000
So, for a typical human gene,
typical human gene, the length of

00:36:52.000 --> 00:36:59.000
the gene itself might be 30,000
bases. But the mature RNA,

00:36:59.000 --> 00:37:07.000
the mature mRNA might be
one and a half, 1,500 bases.

00:37:07.000 --> 00:37:11.000
That's remarkable. Out
of 30,000 letters in the

00:37:11.000 --> 00:37:15.000
initial transcript that is made,
the gene starts at the promoter, and

00:37:15.000 --> 00:37:19.000
the transcription will stop
30,000 bases away. The cell goes

00:37:19.000 --> 00:37:24.000
through the trouble of making
an RNA of 30,000 bases long,

00:37:24.000 --> 00:37:28.000
and then it trims it
down by throwing out 28,500

00:37:28.000 --> 00:37:33.000
of the bases, keeping
only 1,500 bases at the end.

00:37:33.000 --> 00:37:37.000
Now, this may seem profligate but
it ain't nothing compared to some

00:37:37.000 --> 00:37:42.000
extreme cases. The
clotting factor gene,

00:37:42.000 --> 00:37:47.000
the factor 8 gene, the gene that
is mutated in individuals with

00:37:47.000 --> 00:37:52.000
hemophilia, that gene is 200,000
bases long, and it gets spliced

00:37:52.000 --> 00:37:57.000
down to a mere 10,000 bases.
190,000 bases are thrown

00:37:57.000 --> 00:38:02.000
away. But that's
nothing compared to

00:38:02.000 --> 00:38:08.000
Duchenne muscular dystrophy. The
Duchenne muscular dystrophy gene is

00:38:08.000 --> 00:38:13.000
the all time winner. That
gene makes an immature initial

00:38:13.000 --> 00:38:19.000
RNA of 2 million bases. RNA
polymerase hops on at the

00:38:19.000 --> 00:38:24.000
promoter and it gets off at the
end of the Boston Marathon on here 2

00:38:24.000 --> 00:38:30.000
million bases later having made
an RNA of 2 million bases long.

00:38:30.000 --> 00:38:36.000
Calculate the speed of RNA
polymerase and you'll find out that

00:38:36.000 --> 00:38:42.000
it's at it for hours. It
hops on and it stays on for

00:38:42.000 --> 00:38:48.000
hours until it gets to the other end.
And then for all its troubles this

00:38:48.000 --> 00:38:54.000
gene is spliced down to 16,000
bases in the mature message.
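
NOTE
Editorial sketch, not part of the lecture: the arithmetic for the three genes quoted above can be worked through as follows. The transcript and message lengths are the round figures from the lecture. The elongation rate of 40 bases per second is an assumed, illustrative value that the lecture does not give, so the hours printed are only an order-of-magnitude estimate.
```python
# (gene, primary transcript length, mature mRNA length) in bases, round figures from the lecture.
GENES = [
    ("typical human gene", 30_000, 1_500),
    ("factor 8 gene", 200_000, 10_000),
    ("Duchenne muscular dystrophy gene", 2_000_000, 16_000),
]
# ASSUMPTION: illustrative RNA polymerase speed, not a figure from the lecture.
ASSUMED_RATE_BASES_PER_S = 40
def summarize(primary, mature, rate=ASSUMED_RATE_BASES_PER_S):
    """Return (bases discarded, percent of transcript kept, hours to transcribe)."""
    discarded = primary - mature
    percent_kept = 100 * mature / primary
    hours = primary / rate / 3600
    return discarded, percent_kept, hours
for name, primary, mature in GENES:
    discarded, kept, hours = summarize(primary, mature)
    print(f"{name}: discards {discarded:,} bases, keeps {kept:.1f}%, about {hours:.1f} h to transcribe")
```
Under that assumed rate, the 2-million-base transcript takes many hours, which matches the lecture's point that the polymerase is "at it for hours".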

00:38:54.000 --> 00:39:00.000
Yup? How would it increase
the chance of mutations? Yup.

00:39:00.000 --> 00:39:05.000
So, splicing mutations could be a
problem. Some diseases could arise

00:39:05.000 --> 00:39:10.000
from errors in splicing.
Do you think that happens?

00:39:10.000 --> 00:39:15.000
Sure does. There could
be mutations that create,

00:39:15.000 --> 00:39:20.000
that change a splicing, or
mutations that create a new

00:39:20.000 --> 00:39:25.000
splicing, and all of that
could screw up the gene.

00:39:25.000 --> 00:39:30.000
Why do this? What in
the world is going on?

00:39:30.000 --> 00:39:35.000
Just think about the energetic cost.
I mean count up the ATPs involved

00:39:35.000 --> 00:39:40.000
in synthesizing a nucleotide, and
then the ATPs involved in adding

00:39:40.000 --> 00:39:45.000
nucleotides up. You know,
think about this totally

00:39:45.000 --> 00:39:50.000
wasted energy.
What is the point?

00:39:50.000 --> 00:39:55.000
I might be able to encode multiple
proteins with the same gene.

00:39:55.000 --> 00:40:00.000
How would I do that? Ooh,
wouldn't that be clever?

00:40:00.000 --> 00:40:05.000
I might be able to take a single
gene and make a mix and match

00:40:05.000 --> 00:40:10.000
product. It might be, do you
mean like one type of cell

00:40:10.000 --> 00:40:15.000
might splice up that message one
way to produce a certain protein,

00:40:15.000 --> 00:40:20.000
but a different cell type might
splice the same gene another way to

00:40:20.000 --> 00:40:25.000
produce a different protein?
Ooh. So, you're proposing, if I

00:40:25.000 --> 00:40:30.000
understand you correctly,
alternative splicing.

00:40:30.000 --> 00:40:34.000
Alternative splicing could
create multiple proteins,

00:40:34.000 --> 00:40:38.000
multiple distinct proteins. It
might be, for example, that you

00:40:38.000 --> 00:40:42.000
might make one protein that has a
cytoplasmic tail and another protein

00:40:42.000 --> 00:40:46.000
that doesn't have a cytoplasmic
tail or a different tail or,

00:40:46.000 --> 00:40:50.000
or, this is true. This actually
happens. It's very clever.
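
NOTE
Editorial sketch, not part of the lecture: alternative splicing can be pictured as different cell types keeping different subsets of exons from the same transcript. All exon names, sequences, and cell-type labels below are hypothetical and chosen only to make the idea concrete.
```python
# Hypothetical exons of one gene (names and sequences invented for illustration).
EXONS = {
    "core": "AUGGCCGAG",
    "tail_a": "AAGUUU",
    "tail_b": "CCCGGG",
    "stop": "UAG",
}
# Three hypothetical cell types that retain different exons from the same transcript.
SPLICE_FORMS = {
    "cell_type_1": ["core", "tail_a", "stop"],
    "cell_type_2": ["core", "tail_b", "stop"],
    "cell_type_3": ["core", "stop"],
}
def mature_mrna(form):
    """Concatenate, in order, the exons that this splice form retains."""
    return "".join(EXONS[name] for name in SPLICE_FORMS[form])
for form in SPLICE_FORMS:
    print(form, mature_mrna(form))
```
One gene yields three distinct messages here: one with tail A, one with tail B, and one with no tail at all.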

00:40:50.000 --> 00:40:54.000
Anything that can happen
does happen somewhere,

00:40:54.000 --> 00:40:58.000
and it's fairly regularly used.
A typical gene in the human being

00:40:58.000 --> 00:41:02.000
has at least two alternative
splice forms, on average.

00:41:02.000 --> 00:41:05.000
Most, many don't, but there
are some that have large

00:41:05.000 --> 00:41:08.000
numbers. The most extreme
is a gene known in

00:41:08.000 --> 00:41:11.000
Drosophila that has more than a
thousand alternative splice forms.

00:41:11.000 --> 00:41:15.000
How does it know, how does the cell
know whether to splice it one way in

00:41:15.000 --> 00:41:18.000
the liver and one way in a heart or
something? We don't fully know but

00:41:18.000 --> 00:41:21.000
there's machinery and signals people
are trying to work out for that.

00:41:21.000 --> 00:41:25.000
Now, I don't want to confuse
you too much about it.

00:41:25.000 --> 00:41:28.000
You know, mostly, when we give you
a gene, you should think about it

00:41:28.000 --> 00:41:31.000
spliced out introns, exons.
But the truth is it is more

00:41:31.000 --> 00:41:35.000
complicated than that. There
can be alternative splicing

00:41:35.000 --> 00:41:39.000
that allows genes to be
used in multiple ways.

00:41:39.000 --> 00:41:43.000
Sometimes they don't make multiple
proteins. They may splice into

00:41:43.000 --> 00:41:46.000
portions of the mRNA that
are not translated, but,

00:41:46.000 --> 00:41:50.000
there is that, but, boy,
it's a huge amount of overhead

00:41:50.000 --> 00:41:54.000
here just to do that.
Is it justified? Yes?

00:41:54.000 --> 00:41:58.000
That is by computer if I just
gave you the sequence? Not quite.

00:41:58.000 --> 00:42:01.000
Almost. Maybe. Sort of.
It turns out that the

00:42:01.000 --> 00:42:04.000
computer programs for automatically
recognizing them out of the human

00:42:04.000 --> 00:42:07.000
genome are sort of, they're
mediocre, not very good.

00:42:07.000 --> 00:42:09.000
We have some idea of the signals,
and various people have been trying to

00:42:09.000 --> 00:42:12.000
write better and better
algorithms for doing that,

00:42:12.000 --> 00:42:15.000
but the cell knows what it's
doing and we don't fully know,

00:42:15.000 --> 00:42:18.000
as evidenced by the fact that we
can't write a clean computer program

00:42:18.000 --> 00:42:21.000
to do it yet. We need to get
information from the cell or from

00:42:21.000 --> 00:42:24.000
evolution or various other things
like that, and that's the ultimate

00:42:24.000 --> 00:42:27.000
test. If we knew what we were
talking about we'd just be able to

00:42:27.000 --> 00:42:30.000
write a computer program
and splice it out.

00:42:30.000 --> 00:42:34.000
And we don't. There's another
reason why people think these big

00:42:34.000 --> 00:42:38.000
introns and exons, these
big introns are helpful,

00:42:38.000 --> 00:42:42.000
and that is an evolutionary reason.
The evolutionary reason is a little

00:42:42.000 --> 00:42:47.000
bit harder to follow,
but let me try it on you.

00:42:47.000 --> 00:42:51.000
Suppose a random event happens
and a chromosome breaks,

00:42:51.000 --> 00:42:55.000
that happens, and suppose a random
breakage sticks one part of a

00:42:55.000 --> 00:43:00.000
chromosome to some other
part of the chromosome.

00:43:00.000 --> 00:43:04.000
If it lands smack dab in the middle
of the coding sequence of a gene

00:43:04.000 --> 00:43:09.000
that's bad news. But it
turns out that if it lands

00:43:09.000 --> 00:43:13.000
in the introns of two different
genes and sticks them together it

00:43:13.000 --> 00:43:18.000
could make a new gene that would
still work. By having a random

00:43:18.000 --> 00:43:23.000
break between two genes in their
introns and slamming them together,

00:43:23.000 --> 00:43:27.000
you could make a gene that had a
bunch of exons from one gene and a

00:43:27.000 --> 00:43:32.000
bunch of exons from another gene.
And this intervening sequence in the

00:43:32.000 --> 00:43:38.000
middle and it would get spliced up.
Evolution might like that because

00:43:38.000 --> 00:43:44.000
it would be a very easy way for
evolution to build new genes that

00:43:44.000 --> 00:43:49.000
had a portion of one protein
and a portion of another protein.

00:43:49.000 --> 00:43:55.000
This kind of mix and match domain
swapping could be very useful.

00:43:55.000 --> 00:44:01.000
And when we look across genomes,
we see lots and lots of examples of

00:44:01.000 --> 00:44:07.000
genes that have a similar first
half but different second halves.

00:44:07.000 --> 00:44:10.000
Or have some portion in the middle,
a domain that we recognize, that we

00:44:10.000 --> 00:44:13.000
see in multiple proteins. And
so, in fact, an argument for

00:44:13.000 --> 00:44:17.000
why we have all of this intronic DNA,
one that's impossible to prove but

00:44:17.000 --> 00:44:20.000
is an argument, is that from
an evolutionary point of view,

00:44:20.000 --> 00:44:24.000
this allows a great deal
of evolutionary innovation.

00:44:24.000 --> 00:44:27.000
You have to be careful that you say
those organisms that have this extra

00:44:27.000 --> 00:44:31.000
space are able to mix and match and
create more new kinds of combination

00:44:31.000 --> 00:44:34.000
proteins, et cetera, et
cetera, and therefore survived

00:44:34.000 --> 00:44:38.000
better, et cetera,
et cetera, et cetera.

00:44:38.000 --> 00:44:41.000
Why don't bacteria have this?
Sorry? They're not as complicated.

00:44:41.000 --> 00:44:45.000
That's one thought: we can take
a sort of condescending attitude to

00:44:45.000 --> 00:44:48.000
these bacteria.
They're not very,

00:44:48.000 --> 00:44:52.000
they're just not so complicated.
There's another point of view which

00:44:52.000 --> 00:44:56.000
is bacteria are far more
sophisticated than we are because

00:44:56.000 --> 00:45:00.000
they're under incredibly
rigorous evolutionary selection.

00:45:00.000 --> 00:45:04.000
You might argue that if I'm a
bacteria, can I really afford all

00:45:04.000 --> 00:45:08.000
this extra DNA? Now, the
metabolic cost of all that

00:45:08.000 --> 00:45:12.000
extra DNA is huge to a bacteria
which competes on replication.

00:45:12.000 --> 00:45:16.000
It's got to divide every 20 minutes,
and trying to put in all these extra

00:45:16.000 --> 00:45:20.000
bases would be bad news. So,
you might imagine, just to be,

00:45:20.000 --> 00:45:24.000
you know, stand things on its head,
that early life all had introns and

00:45:24.000 --> 00:45:28.000
bacteria, in the process of
competing to be more and more

00:45:28.000 --> 00:45:31.000
efficient got rid of their introns.
There's actually a large camp of

00:45:31.000 --> 00:45:34.000
people who think it went that way,
that early life evolved with introns,

00:45:34.000 --> 00:45:37.000
and then bacteria, in
the pressure to compete,

00:45:37.000 --> 00:45:41.000
got rid of them. And there's
some evidence to support that.

00:45:41.000 --> 00:45:44.000
Bacteria don't have introns.
Small eukaryotes like yeast that

00:45:44.000 --> 00:45:47.000
sort of do compete on
replication have some introns,

00:45:47.000 --> 00:45:50.000
but a small number. There
are only about 250 introns in

00:45:50.000 --> 00:45:53.000
yeast. Only about 5% of the genes
have an intron and they're small.

00:45:53.000 --> 00:45:57.000
Bigger eukaryotes have bigger
introns. And the bigger you get,

00:45:57.000 --> 00:46:00.000
often on average the bigger
the genome sizes are the more

00:46:00.000 --> 00:46:03.000
you can tolerate it.
And so I actually think,

00:46:03.000 --> 00:46:07.000
I actually probably favor this
notion that introns were the

00:46:07.000 --> 00:46:10.000
original state and
they've been gotten rid of.

00:46:10.000 --> 00:46:14.000
And the more pressure you're under
to replicate rapidly the less you

00:46:14.000 --> 00:46:17.000
can tolerate this interesting
and complicated innovation.

00:46:17.000 --> 00:46:21.000
Anyway, that's another
way that things differ.

00:46:21.000 --> 00:46:24.000
And then, finally, viruses
can do it either way.

00:46:24.000 --> 00:46:28.000
Viruses, depending on whether
they are prokaryotic viruses or

00:46:28.000 --> 00:46:31.000
eukaryotic viruses,
are able to replicate,

00:46:31.000 --> 00:46:35.000
are able to either do
or don't have splicing.

00:46:35.000 --> 00:46:41.000
Last topic. Translation.
Here eukaryotes are relatively

00:46:41.000 --> 00:46:48.000
simple. You get a message, you
get a gene, you get an mRNA.

00:46:48.000 --> 00:46:54.000
The mRNA goes to the
ribosome. Here's a ribosome.

00:46:54.000 --> 00:47:01.000
The ribosome goes to the mRNA,
actually, and it starts turning out

00:47:01.000 --> 00:47:09.000
one protein as it chugs along.
Prokaryotes differ in an interesting

00:47:09.000 --> 00:47:18.000
way. I get a promoter here that
is transcribed into my mRNA,

00:47:18.000 --> 00:47:27.000
but it turns out that the mRNA can
encode multiple independent proteins,

00:47:27.000 --> 00:47:36.000
protein one, protein two,
protein three on the same mRNA.

00:47:36.000 --> 00:47:40.000
And a ribosome will hop on
here and synthesize this one.

00:47:40.000 --> 00:47:44.000
A ribosome will hop on here
and synthesize this one,

00:47:44.000 --> 00:47:48.000
and a ribosome will hop on
here and synthesize that one.

00:47:48.000 --> 00:47:52.000
And you have what is called
a polycistronic message.

00:47:52.000 --> 00:47:56.000
Poly, many. Cistronic: cistrons
were an old name for coding

00:47:56.000 --> 00:48:00.000
regions of genes here.
Polycistronic messages.
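
NOTE
Editorial sketch, not part of the lecture: a message that encodes several independent proteins can be pictured as one RNA containing several start-to-stop coding regions. The message below is hypothetical, and the scan is a simplification that treats every AUG as a start; the lecture describes each coding region as having its own ribosome start site, which this sketch does not model.
```python
STOPS = {"UAA", "UAG", "UGA"}
def open_reading_frames(mrna):
    """Return each AUG-to-stop stretch found by scanning the message left to right."""
    orfs, i = [], 0
    while i <= len(mrna) - 3:
        if mrna[i:i + 3] != "AUG":
            i += 1
            continue
        # Read triplets in frame with this AUG until a stop codon appears.
        for j in range(i, len(mrna) - 2, 3):
            if mrna[j:j + 3] in STOPS:
                orfs.append(mrna[i:j + 3])
                i = j + 3  # resume scanning after this coding region
                break
        else:
            i += 1  # no in-frame stop found: not a complete coding region
    return orfs
# Hypothetical message with three short coding regions separated by spacers.
MESSAGE = "GGAUGAAAUAACCAUGCCCGGGUAGCCAUGUUUUGA"
for number, orf in enumerate(open_reading_frames(MESSAGE), start=1):
    print("protein", number, "coding region:", orf)
```
A single transcript here carries three separate coding regions, each read from its own start to its own stop.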

00:48:00.000 --> 00:48:03.000
Why would you want to do that,
have a single mRNA that encodes

00:48:03.000 --> 00:48:06.000
multiple distinct proteins, each
starting with its own ribosome

00:48:06.000 --> 00:48:09.000
start site there?
Efficiency. Maybe,

00:48:09.000 --> 00:48:12.000
in fact, these would be, how
about, oh, this would be clever,

00:48:12.000 --> 00:48:15.000
make them multiple steps
in a biochemical pathway?

00:48:15.000 --> 00:48:18.000
Have them coded on a single
messenger so then you'd only have to

00:48:18.000 --> 00:48:21.000
worry about regulating that
once. If you have the regulatory

00:48:21.000 --> 00:48:24.000
machinery to turn on, you'll
make all the enzymes for the

00:48:24.000 --> 00:48:27.000
pathway. And that's exactly what
bacteria do. They tend to put all

00:48:27.000 --> 00:48:30.000
the enzymes for a pathway on a
single message so when they want to

00:48:30.000 --> 00:48:33.000
call up, let's digest hexose this
morning, they have a whole thing

00:48:33.000 --> 00:48:37.000
that will let them be able
to do that, polycistronic.

00:48:37.000 --> 00:48:41.000
That's because they're small
genomes. They're pressed for space.

00:48:41.000 --> 00:48:46.000
And, because of that, they have
to slam a lot into a single unit.

00:48:46.000 --> 00:48:50.000
And this single unit that has
multiple genes encoded in a single

00:48:50.000 --> 00:48:55.000
message is called an operon,
and we'll talk more about that.

00:48:55.000 --> 00:49:00.000
Last of all viruses. Viruses.
Viruses have very little room.

00:49:00.000 --> 00:49:05.000
Their genomes can be tiny. A
typical virus might have a genome

00:49:05.000 --> 00:49:10.000
of 5,000 bases to 10,000
bases to, in some cases,

00:49:10.000 --> 00:49:15.000
200,000 bases, but it hasn't got a
lot of room. It wants to pack a lot

00:49:15.000 --> 00:49:21.000
of protein coding information in.
And some viruses have come up with

00:49:21.000 --> 00:49:26.000
the most extraordinary way of doing
that. Some viruses have gone to the

00:49:26.000 --> 00:49:32.000
extreme of having RNAs that get made
from them that have a sequence --

00:49:32.000 --> 00:49:39.000
I'm just going to pick up in
the middle of the sequence here.

00:49:39.000 --> 00:49:46.000
A-C-U-A-C-U-A-C-U-A-C-U. You might
decide to read the sequence like

00:49:46.000 --> 00:49:53.000
this, that those are the codons,
and you'd get a certain protein.

00:49:53.000 --> 00:50:00.000
But I might also decide to read
that sequence C-U-A-C-U-A-C-U-A.

00:50:00.000 --> 00:50:04.000
And, of course, I'm giving
this in a repeating form

00:50:04.000 --> 00:50:08.000
because it's easy to note. I
could give you any sequence and I

00:50:08.000 --> 00:50:12.000
could read it in this reading frame,
I could read it in this reading

00:50:12.000 --> 00:50:17.000
frame, or I could read
it as U-A-C-U-A-C-U-A-C.

00:50:17.000 --> 00:50:21.000
In other words, there are
three reading frames that,

00:50:21.000 --> 00:50:25.000
in principle, you could translate
a protein from. In a typical

00:50:25.000 --> 00:50:30.000
prokaryotic gene or eukaryotic
gene only one of those is used.

00:50:30.000 --> 00:50:34.000
You start at the first AUG and
that sets the reading frame.
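
NOTE
Editorial sketch, not part of the lecture: the three reading frames of the repeating ACU example can be listed mechanically. The function name and the convention of numbering frames 0, 1, and 2 by their starting offset are assumptions made for this illustration.
```python
def codons(rna, frame):
    """Split rna into complete triplets, starting at offset frame (0, 1, or 2)."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
rna = "ACUACUACUACU"  # the repeating example from the lecture
for frame in range(3):
    print("frame", frame, codons(rna, frame))
```
The same bases read as ACU codons in one frame, CUA codons in the next, and UAC codons in the third, so each frame would specify a different protein.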

00:50:34.000 --> 00:50:39.000
But some viruses are so pressed for
space and are so clever and are so

00:50:39.000 --> 00:50:43.000
efficient that they make messages
that have tricks that they actually

00:50:43.000 --> 00:50:48.000
use two or, in some cases, all
three reading frames, which is

00:50:48.000 --> 00:50:53.000
an extraordinary packing of
information density into a simple

00:50:53.000 --> 00:50:57.000
message. So, the basic point.
We have a simple model. DNA is

00:50:57.000 --> 00:51:02.000
replicated.
Transcribed into RNA.

00:51:02.000 --> 00:51:06.000
Translated into protein. But
there are a lot of important

00:51:06.000 --> 00:51:11.000
variations between eukaryotes,
prokaryotes and viruses. And

00:51:11.000 --> 00:51:15.000
understanding them can be
useful for treating cancer,

00:51:15.000 --> 00:51:20.000
for treating AIDS, and for
treating viral and bacterial

00:51:20.000 --> 00:51:25.000
infections.
Next time.